http://arxiv.org/abs/2504.08009v3
Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique
2025-04-27T11:14:02Z
This paper addresses emulation algorithms for matrix multiplication. General Matrix-Matrix Multiplication (GEMM), a fundamental operation in the Basic Linear Algebra Subprograms (BLAS), is typically optimized for specific hardware architectures. The Ozaki scheme is a well-established GEMM-based emulation method for matrix multiplication, wherein input matrices are decomposed into several low-precision components to ensure that the resulting matrix product is computed exactly through numerical operations. This study proposes a novel GEMM-based emulation method for matrix multiplication that leverages the Chinese Remainder Theorem. The proposed method inherits the computational efficiency of highly optimized GEMM routines and further enables control over the number of matrix multiplications, which can enhance computational accuracy. We present numerical experiments featuring INT8 Tensor Core operations on GPUs and FP64 arithmetic on CPUs as case studies. The results demonstrate that FP64 emulation using the proposed method achieves performance levels of up to 7.4 to 9.8 TFLOPS on the NVIDIA RTX 4090 and 56.6 to 80.2 TFLOPS on the NVIDIA GH200, exceeding the measured performance of native FP64 arithmetic. Furthermore, for FP64 computations on CPUs, the proposed method achieved up to a 2.3x speedup in emulating quadruple-precision arithmetic compared to the conventional Ozaki scheme.
2025-04-10T02:07:20Z
Katsuhisa Ozaki, Yuki Uchino, Toshiyuki Imamura

http://arxiv.org/abs/2504.17268v1
Parameter Estimation in ODE Models with Certified Polynomial System Solving
2025-04-24T05:53:31Z
We consider dynamical models given by rational ODE systems.
Parameter estimation is an important and challenging task of recovering parameter values from observed data. Recently, a method based on differential algebra and rational interpolation was proposed to express parameter estimation in terms of polynomial system solving. Typically, polynomial system solving is a bottleneck, hence the choice of the polynomial solver is crucial. In this contribution, we compare two polynomial system solvers applied to parameter estimation: the homotopy continuation solver from HomotopyContinuation.jl and our new implementation of a certified solver based on rational univariate representation (RUR) and real root isolation. We show how the new RUR solver can tackle examples that are out of reach for the homotopy methods and vice versa.
2025-04-24T05:53:31Z
3 pages
Alexander Demin, Alexey Ovchinnikov, Fabrice Rouillier

http://arxiv.org/abs/2504.15814v2
Fast Higher-Order Interpolation and Restriction in ExaHyPE Avoiding Non-physical Reflections
2025-04-23T08:43:40Z
Wave equations help us to understand phenomena ranging from earthquakes to tsunamis. These phenomena materialise over very large scales. It would be computationally infeasible to track them over a regular mesh. Yet, since the phenomena are localised, adaptive mesh refinement (AMR) can be used to construct meshes with a higher resolution close to the regions of interest. ExaHyPE is a software engine created to solve wave problems using AMR, and we use it as a baseline to construct our numerical relativity application called ExaGRyPE. To advance the mesh in time, we have to interpolate and restrict along resolution transitions in each and every time step. ExaHyPE's vanilla code version uses a d-linear tensor-product approach. In benchmarks of a stationary black hole this performs slowly and leads to errors in conserved quantities near AMR boundaries.
We therefore introduce a set of higher-order interpolation schemes where the derivatives are calculated at each coarse grid cell to approximate the enclosed fine cells. The resulting methods run faster than the tensor-product approach. Most importantly, when running the stationary black hole simulation using the higher-order methods the errors near the AMR boundaries are removed.
2025-04-22T11:52:58Z
Timothy Stokes, Tobias Weinzierl, Han Zhang, Baojiu Li

http://arxiv.org/abs/2504.13821v1
Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM
2025-04-18T17:51:34Z
This paper presents a performant and portable recursive implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) in Julia for GPUs, two kernels that underlie many linear-algebra algorithms. We restructure TRMM and TRSM so that most work is executed as general matrix-matrix multiplication (GEMM), improving use of the GPU memory hierarchy and reducing latency. Exploiting Julia's multiple dispatch and metaprogramming together with the GPUArrays and KernelAbstractions frameworks, we expose a single hardware-agnostic API that runs on NVIDIA, AMD, and Apple Silicon GPUs. For large matrices the recursive code reaches throughput comparable to vendor libraries such as cuBLAS and rocBLAS, while providing these routines on Apple Silicon for the first time. The entire implementation is only a few hundred lines of code, showing that unified Julia programs can deliver near-vendor performance across heterogeneous architectures.
2025-04-18T17:51:34Z
Vicki Carrica, Maxwell Onyango, Rabab Alomairy, Evelyne Ringoot, James Schloss, Alan Edelman

http://arxiv.org/abs/2504.11708v2
Fast Mixed-Precision Real Evaluation
2025-04-18T16:41:00Z
Evaluating real-valued expressions to high precision is a key building block in computational mathematics, physics, and numerics.
A typical implementation evaluates the whole expression in a uniform precision, doubling that precision until a sufficiently-accurate result is achieved. This is wasteful: usually only a few operations really need to be performed at high precision, and the bulk of the expression could be computed much faster. However, such non-uniform precision assignments have, to date, been impractical to compute. We propose a fast new algorithm for deriving such precision assignments. The algorithm leverages results computed at lower precisions to analytically determine a mixed-precision assignment that will result in a sufficiently-accurate result. Our implementation, Reval, achieves an average speed-up of 1.72x compared to the state-of-the-art Sollya tool, with the speed-up increasing to 5.21x on the most difficult input points. An examination of the precisions used with and without precision tuning shows that the speed-up results from assigning lower precisions for the majority of operations, though additional optimizations enabled by the non-uniform precision assignments also play a role.
2025-04-16T02:12:20Z
Duplicates arxiv:2410.07468
Artem Yadrov, Pavel Panchekha

http://arxiv.org/abs/2504.13236v1
NNTile: a machine learning framework capable of training extremely large GPT language models on a single node
2025-04-17T16:22:32Z
This study presents the NNTile framework for training large deep neural networks in heterogeneous clusters. NNTile is based on the StarPU library, which implements task-based parallelism and schedules all provided tasks onto all available processing units (CPUs and GPUs). This means that a particular operation, necessary to train a large neural network, can be performed on any of the CPU cores or GPU devices, depending on automatic scheduling decisions. Such an approach shifts the burden of deciding where to compute and when to communicate from a human being to an automatic decision maker, whether a simple greedy heuristic or a complex AI-based software.
The performance of the presented tool for training large language models is demonstrated in extensive numerical experiments.
2025-04-17T16:22:32Z
Aleksandr Mikhalev, Aleksandr Katrutsa, Konstantin Sozykin, Ivan Oseledets

http://arxiv.org/abs/2504.12841v1
ALT: A Python Package for Lightweight Feature Representation in Time Series Classification
2025-04-17T10:57:29Z
We introduce ALT, an open-source Python package created for efficient and accurate time series classification (TSC). The package implements the adaptive law-based transformation (ALT) algorithm, which transforms raw time series data into a linearly separable feature space using variable-length shifted time windows. This adaptive approach enhances its predecessor, the linear law-based transformation (LLT), by effectively capturing patterns of varying temporal scales. The software is implemented for scalability, interpretability, and ease of use, achieving state-of-the-art performance with minimal computational overhead. Extensive benchmarking on real-world datasets demonstrates the utility of ALT for diverse TSC tasks in physics and related domains.
2025-04-17T10:57:29Z
16 pages, 4 figures
Machine Learning: Science and Technology (2026)
Balázs P. Halmos, Balázs Hajós, Vince Á. Molnár, Marcell T. Kurbucz, Antal Jakovác
10.1088/2632-2153/ae3e4f

http://arxiv.org/abs/2504.11167v1
Low-Rank SPIKE Framework for Solving Large Sparse Linear Systems with Applications
2025-04-15T13:15:00Z
The SPIKE family of linear system solvers provides parallelism using a block tridiagonal partitioning. Typically, SPIKE-based solvers are applied to banded systems, resulting in structured off-diagonal blocks with non-zero elements restricted to relatively small submatrices comprising the band of the original matrix. In this work, a low-rank SVD-based approximation of the off-diagonal blocks is investigated. This produces a representation which more effectively handles matrices with large, sparse bands.
A set of flexible distributed solvers, the LR-SPIKE variants, is implemented. They are applicable to a wide range of applications -- from use as a "black-box" preconditioner which straightforwardly improves upon the classic Block Jacobi preconditioner, to use as a specialized "approximate direct solver." An investigation of the effectiveness of the new preconditioners for a selection of SuiteSparse matrices is performed, particularly focusing on matrices derived from 3D finite element simulations. In addition, the SPIKE approximate linear system solvers are also paired with the FEAST eigenvalue solver, where they are shown to be particularly effective due to the former's rapid convergence, and the latter's acceptance of loose linear system solver convergence, resulting in a combination which requires very few solver iterations.
2025-04-15T13:15:00Z
26 pages
Braegan S. Spring, Eric Polizzi, Ahmed H. Sameh

http://arxiv.org/abs/2504.10754v1
auto-fpt: Automating Free Probability Theory Calculations for Machine Learning Theory
2025-04-14T23:07:56Z
A large part of modern machine learning theory often involves computing the high-dimensional expected trace of a rational expression of large rectangular random matrices. To symbolically compute such quantities using free probability theory, we introduce auto-fpt, a lightweight Python and SymPy-based tool that can automatically produce a reduced system of fixed-point equations which can be solved for the quantities of interest, and effectively constitutes a theory. We overview the algorithmic ideas underlying auto-fpt and its applications to various interesting problems, such as the high-dimensional error of linearized feed-forward neural networks, recovering well-known results.
We hope that auto-fpt streamlines the majority of calculations involved in high-dimensional analysis, while helping the machine learning community reproduce known and uncover new phenomena.
2025-04-14T23:07:56Z
Work in progress
Arjun Subramonian, Elvis Dohmatob

http://arxiv.org/abs/2311.10700v2
Deriving Algorithms for Triangular Tridiagonalization a Skew-Symmetric Matrix
2025-04-10T15:07:00Z
This paper provides technical details regarding the application of the FLAME methodology to derive algorithms hand in hand with their proofs of correctness for the computation of the $ L T L^T $ decomposition (with and without pivoting) of a skew-symmetric matrix. The approach yields known as well as new algorithms, presented using the FLAME notation, enabling comparing and contrasting. A number of BLAS-like primitives are exposed at the core of the resulting unblocked and blocked algorithms.
2023-11-17T18:44:40Z
28 pages. arXiv admin note: text overlap with arXiv:2411.09859
Robert van de Geijn, Maggie Myers, RuQing G. Xu, Devin Matthews

http://arxiv.org/abs/1909.10051v3
PyIT2FLS: A New Python Toolkit for Interval Type 2 Fuzzy Logic Systems
2025-04-09T07:57:06Z
Fuzzy logic is an accepted and well-developed approach for constructing verbal models. Fuzzy-based methods are becoming more popular as engineers deal with more everyday tasks. This paper presents a new Python toolkit for Interval Type 2 Fuzzy Logic Systems (IT2FLS). Developing software tools is an important issue for facilitating the practical use of theoretical results. There are limited tools for implementing IT2FLSs in Python. The developed PyIT2FLS provides a set of tools for fast and easy modeling of fuzzy systems. This paper includes a brief description of how the developed toolkit can be used. Also, three examples are given showing the usage of the developed toolkit for simulating IT2FLSs. First, a simple rule-based system is developed and its code is presented in the paper.
The second example is the prediction of the Mackey-Glass chaotic time series using IT2FLS. In this example, the Particle Swarm Optimization (PSO) algorithm is used for determining system parameters while minimizing the mean square error. In the last example, an IT2FPID is designed and used for controlling a linear time-delay system. The code for the examples is available on the toolkit's GitHub page: https://github.com/Haghrah/PyIT2FLS. The simulations and their results confirm the ability of the developed toolkit to be used in a wide range of applications.
2019-09-22T17:34:20Z
This work has been published in SoftwareX, Volume 30, May 2025, 102146. https://doi.org/10.1016/j.softx.2025.102146
SoftwareX, Volume 30, May 2025, 102146
Amir Arslan Haghrah, Sehraneh Ghaemi
10.1016/j.softx.2025.102146

http://arxiv.org/abs/2504.06467v1
ZETA: a library for Zonotope-based EsTimation and fAult diagnosis of discrete-time systems
2025-04-08T22:19:11Z
This paper introduces ZETA, a new MATLAB library for Zonotope-based EsTimation and fAult diagnosis of discrete-time systems. It features user-friendly implementations of set representations based on zonotopes, namely zonotopes, constrained zonotopes, and line zonotopes, in addition to a basic implementation of interval arithmetic. This library has capabilities starting from the basic set operations with these sets, including propagations through nonlinear functions using various approximation methods. The features of ZETA allow for reachability analysis and state estimation of discrete-time linear, nonlinear, and descriptor systems, in addition to active fault diagnosis of linear systems. Efficient order reduction methods are also implemented for the respective set representations. Some examples are presented in order to illustrate the functionalities of the new library.
2025-04-08T22:19:11Z
8 pages, 6 figures. Preprint submitted to the 64th IEEE Conference on Decision and Control
Brenner S. Rego, Joseph K. Scott, Davide M. Raimondo, Marco H.
Terra, Guilherme V. Raffo

http://arxiv.org/abs/2304.13099v5
Exponentially Convergent Numerical Method for Abstract Cauchy Problem with Fractional Derivative of Caputo Type
2025-04-07T17:29:15Z
We present an exponentially convergent numerical method to approximate the solution of the Cauchy problem for the inhomogeneous fractional differential equation with an unbounded operator coefficient and Caputo fractional derivative in time. The numerical method is based on the newly obtained solution formula that consolidates the mild solution representations of sub-parabolic, parabolic and sub-hyperbolic equations with sectorial operator coefficient $A$ and non-zero initial data. The involved integral operators are approximated using the sinc-quadrature formulas that are tailored to the spectral parameters of $A$, fractional order $\alpha$ and the smoothness of the first initial condition, as well as to the properties of the equation's right-hand side $f(t)$. The resulting method possesses exponential convergence for positive sectorial $A$, any finite $t$, including $t = 0$ and the whole range $\alpha \in (0,2)$. It is suitable for a practically important case, when no knowledge of $f(t)$ is available outside the considered interval $t \in [0, T]$. The algorithm of the method is capable of multi-level parallelism. We provide numerical examples that confirm the theoretical error estimates.
2023-04-25T19:10:58Z
This version supersedes the official publication (https://www.mdpi.com/2227-7390/11/10/2312) at present time. Several typos were corrected and corrigendum added
Mathematics 11, no. 10: 2312 (2023)
Dmytro Sytnyk, Barbara Wohlmuth
10.3390/math11102312

http://arxiv.org/abs/2407.03483v3
Construct accurate multi-continuum micromorphic homogenisations in multi-D space-time with computer algebra
2025-04-06T23:27:00Z
Homogenisation empowers the efficient macroscale system level prediction of physical scenarios with intricate microscale structures.
Here we develop an innovative, powerful, rigorous and flexible framework for asymptotic homogenisation of dynamics at the \emph{finite} scale separation of real physics, with proven results underpinned by modern dynamical systems theory. The novel systematic approach removes most of the usual assumptions, whether implicit or explicit, of other methodologies. By no longer assuming averages, the methodology constructs so-called multi-continuum or micromorphic homogenisations systematically informed by the microscale physics. The developed framework and approach enable a user to straightforwardly choose and create such homogenisations with clear physical and theoretical support, and of highly controllable accuracy and fidelity.
2024-07-03T20:11:02Z
3rd version
A. J. Roberts

http://arxiv.org/abs/2504.02067v1
A Truncated Newton Method for Optimal Transport
2025-04-02T19:00:24Z
Developing a contemporary optimal transport (OT) solver requires navigating trade-offs among several critical requirements: GPU parallelization, scalability to high-dimensional problems, theoretical convergence guarantees, empirical performance in terms of precision versus runtime, and numerical stability in practice. With these challenges in mind, we introduce a specialized truncated Newton algorithm for entropic-regularized OT. In addition to proving that locally quadratic convergence is possible without assuming a Lipschitz Hessian, we provide strategies to maximally exploit the high rate of local convergence in practice. Our GPU-parallel algorithm exhibits exceptionally favorable runtime performance, achieving high precision orders of magnitude faster than many existing alternatives. This is evidenced by wall-clock time experiments on 24 problem sets (12 datasets $\times$ 2 cost functions).
The scalability of the algorithm is showcased on an extremely large OT problem with $n \approx 10^6$, solved approximately under weak entropic regularization.
2025-04-02T19:00:24Z
Accepted to ICLR 2025
Mete Kemertas, Amir-massoud Farahmand, Allan D. Jepson