http://arxiv.org/abs/2504.08009v3 Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique 2025-04-27T11:14:02Z

This paper addresses emulation algorithms for matrix multiplication. General Matrix-Matrix Multiplication (GEMM), a fundamental operation in the Basic Linear Algebra Subprograms (BLAS), is typically optimized for specific hardware architectures. The Ozaki scheme is a well-established GEMM-based emulation method for matrix multiplication, wherein input matrices are decomposed into several low-precision components to ensure that the resulting matrix product is computed exactly through numerical operations. This study proposes a novel GEMM-based emulation method for matrix multiplication that leverages the Chinese Remainder Theorem. The proposed method inherits the computational efficiency of highly optimized GEMM routines and further enables control over the number of matrix multiplications, which can enhance computational accuracy. We present numerical experiments featuring INT8 Tensor Core operations on GPUs and FP64 arithmetic on CPUs as case studies. The results demonstrate that FP64 emulation using the proposed method achieves performance levels of up to 7.4 to 9.8 TFLOPS on the NVIDIA RTX 4090 and 56.6 to 80.2 TFLOPS on the NVIDIA GH200, exceeding the measured performance of native FP64 arithmetic. Furthermore, for FP64 computations on CPUs, the proposed method achieved up to a 2.3x speedup in emulating quadruple-precision arithmetic compared to the conventional Ozaki scheme.
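As a rough illustration of how the Chinese Remainder Theorem can turn one exact integer matrix product into several small modular products, consider the toy sketch below. It is not the authors' algorithm: the function names, the choice of moduli, and the pure-Python loops are assumptions made for illustration, and the scaling of floating-point inputs to integers is left out entirely. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
from math import prod


def matmul_mod(A, B, m):
    """Integer matrix product with every entry reduced modulo m."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[t] * B[t][j] for t in range(inner)) % m
             for j in range(cols)] for row in A]


def crt_matmul(A, B, moduli):
    """Recover the exact integer product A @ B from its residue products.

    Assumes the moduli are pairwise coprime and that every entry of the
    true product lies strictly inside (-M/2, M/2), where M = prod(moduli).
    """
    M = prod(moduli)
    rows, cols = len(A), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for m in moduli:
        # Reduce the inputs first, so each product only sees values in [0, m).
        Am = [[a % m for a in row] for row in A]
        Bm = [[b % m for b in row] for row in B]
        R = matmul_mod(Am, Bm, m)
        Mi = M // m
        weight = Mi * pow(Mi, -1, m)  # 1 modulo m, 0 modulo the other moduli
        for i in range(rows):
            for j in range(cols):
                C[i][j] = (C[i][j] + weight * R[i][j]) % M
    # Map representatives from [0, M) back to the symmetric range.
    return [[c - M if c > M // 2 else c for c in row] for row in C]


A = [[3, -7], [12, 5]]
B = [[-4, 9], [6, -2]]
print(crt_matmul(A, B, [251, 241, 239]))
```

In this sketch, adding a modulus enlarges M and therefore the range of products that can be recovered exactly. That is one way to read the abstract's remark about controlling the number of matrix multiplications.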

2025-04-10T02:07:20Z Katsuhisa Ozaki Yuki Uchino Toshiyuki Imamura http://arxiv.org/abs/2504.17268v1 Parameter Estimation in ODE Models with Certified Polynomial System Solving 2025-04-24T05:53:31Z

We consider dynamical models given by rational ODE systems. Parameter estimation is an important and challenging task of recovering parameter values from observed data. Recently, a method based on differential algebra and rational interpolation was proposed to express parameter estimation in terms of polynomial system solving. Typically, polynomial system solving is a bottleneck, hence the choice of the polynomial solver is crucial. In this contribution, we compare two polynomial system solvers applied to parameter estimation: homotopy continuation solver from HomotopyContinuation.jl and our new implementation of a certified solver based on rational univariate representation (RUR) and real root isolation. We show how the new RUR solver can tackle examples that are out of reach for the homotopy methods and vice versa.

2025-04-24T05:53:31Z 3 pages Alexander Demin Alexey Ovchinnikov Fabrice Rouillier http://arxiv.org/abs/2504.15814v2 Fast Higher-Order Interpolation and Restriction in ExaHyPE Avoiding Non-physical Reflections 2025-04-23T08:43:40Z

Wave equations help us to understand phenomena ranging from earthquakes to tsunamis. These phenomena materialise over very large scales. It would be computationally infeasible to track them over a regular mesh. Yet, since the phenomena are localised, adaptive mesh refinement (AMR) can be used to construct meshes with a higher resolution close to the regions of interest. ExaHyPE is a software engine created to solve wave problems using AMR, and we use it as a baseline to construct our numerical relativity application called ExaGRyPE. To advance the mesh in time, we have to interpolate and restrict along resolution transitions in each and every time step. ExaHyPE's vanilla code version uses a d-linear tensor-product approach. In benchmarks of a stationary black hole, this performs slowly and leads to errors in conserved quantities near AMR boundaries. We therefore introduce a set of higher-order interpolation schemes where the derivatives are calculated at each coarse grid cell to approximate the enclosed fine cells. The resulting methods run faster than the tensor-product approach. Most importantly, when running the stationary black hole simulation using the higher-order methods, the errors near the AMR boundaries are removed.

2025-04-22T11:52:58Z Timothy Stokes Tobias Weinzierl Han Zhang Baojiu Li http://arxiv.org/abs/2504.13821v1 Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM 2025-04-18T17:51:34Z

This paper presents a performant and portable recursive implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) in Julia for GPUs, two kernels that underlie many linear-algebra algorithms. We restructure TRMM and TRSM so that most work is executed as general matrix-matrix multiplication (GEMM), improving use of the GPU memory hierarchy and reducing latency. Exploiting Julia's multiple dispatch and metaprogramming together with the GPUArrays and KernelAbstractions frameworks, we expose a single hardware-agnostic API that runs on NVIDIA, AMD, and Apple Silicon GPUs. For large matrices the recursive code reaches throughput comparable to vendor libraries such as cuBLAS and rocBLAS, while providing these routines on Apple Silicon for the first time. The entire implementation is only a few hundred lines of code, showing that unified Julia programs can deliver near-vendor performance across heterogeneous architectures.

2025-04-18T17:51:34Z Vicki Carrica Maxwell Onyango Rabab Alomairy Evelyne Ringoot James Schloss Alan Edelman http://arxiv.org/abs/2504.11708v2 Fast Mixed-Precision Real Evaluation 2025-04-18T16:41:00Z

Evaluating real-valued expressions to high precision is a key building block in computational mathematics, physics, and numerics. A typical implementation evaluates the whole expression in a uniform precision, doubling that precision until a sufficiently-accurate result is achieved. This is wasteful: usually only a few operations really need to be performed at high precision, and the bulk of the expression could be computed much faster. However, such non-uniform precision assignments have, to date, been impractical to compute. We propose a fast new algorithm for deriving such precision assignments. The algorithm leverages results computed at lower precisions to analytically determine a mixed-precision assignment that will result in a sufficiently-accurate result. Our implementation, Reval, achieves an average speed-up of 1.72x compared to the state-of-the-art Sollya tool, with the speed-up increasing to 5.21x on the most difficult input points. An examination of the precisions used with and without precision tuning shows that the speed-up results from assigning lower precisions for the majority of operations, though additional optimizations enabled by the non-uniform precision assignments also play a role.
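The uniform-precision baseline described above can be sketched with Python's standard `decimal` module. This is only an illustration of the doubling loop, not Reval or Sollya: the helper names are hypothetical, and the stopping rule (two successive results agreeing to the requested number of digits) is a heuristic assumption rather than a rigorous error bound.

```python
from decimal import Decimal, localcontext


def eval_at(expr, prec):
    """Evaluate expr (a zero-argument callable) using prec significant digits."""
    with localcontext() as ctx:
        ctx.prec = prec
        return expr()


def eval_doubling(expr, digits, start_prec=16, max_prec=4096):
    """Double one uniform working precision until successive results agree.

    Agreement to `digits` significant digits between precision p and 2p is
    a heuristic stopping rule, not a proof of accuracy.
    """
    prec = start_prec
    low = eval_at(expr, prec)
    while prec < max_prec:
        high = eval_at(expr, 2 * prec)
        with localcontext() as ctx:
            ctx.prec = 2 * prec
            err = abs(high - low)
            tol = abs(high) * Decimal(10) ** (-digits)
        if err <= tol:
            return high, 2 * prec
        prec, low = 2 * prec, high
    raise ArithmeticError("precision limit reached without agreement")


def tiny_difference():
    # (1 + 1e-30) - 1 cancels to zero at 16 digits but not at 32 digits.
    return (Decimal(1) + Decimal("1e-30")) - Decimal(1)


value, used_prec = eval_doubling(tiny_difference, digits=10)
print(value, used_prec)
```

Every operation in `tiny_difference` is re-evaluated at the doubled precision, although only the addition needs the extra digits. That uniform cost is the waste the abstract points to.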

2025-04-16T02:12:20Z Duplicates arxiv:2410.07468 Artem Yadrov Pavel Panchekha http://arxiv.org/abs/2504.13236v1 NNTile: a machine learning framework capable of training extremely large GPT language models on a single node 2025-04-17T16:22:32Z

This study presents an NNTile framework for training large deep neural networks in heterogeneous clusters. The NNTile is based on a StarPU library, which implements task-based parallelism and schedules all provided tasks onto all available processing units (CPUs and GPUs). It means that a particular operation, necessary to train a large neural network, can be performed on any of the CPU cores or GPU devices, depending on automatic scheduling decisions. Such an approach shifts the burden of deciding where to compute and when to communicate from a human being to an automatic decision maker, whether a simple greedy heuristic or a complex AI-based software. The performance of the presented tool for training large language models is demonstrated in extensive numerical experiments.

2025-04-17T16:22:32Z Aleksandr Mikhalev Aleksandr Katrutsa Konstantin Sozykin Ivan Oseledets http://arxiv.org/abs/2504.12841v1 ALT: A Python Package for Lightweight Feature Representation in Time Series Classification 2025-04-17T10:57:29Z

We introduce ALT, an open-source Python package created for efficient and accurate time series classification (TSC). The package implements the adaptive law-based transformation (ALT) algorithm, which transforms raw time series data into a linearly separable feature space using variable-length shifted time windows. This adaptive approach enhances its predecessor, the linear law-based transformation (LLT), by effectively capturing patterns of varying temporal scales. The software is implemented for scalability, interpretability, and ease of use, achieving state-of-the-art performance with minimal computational overhead. Extensive benchmarking on real-world datasets demonstrates the utility of ALT for diverse TSC tasks in physics and related domains.

2025-04-17T10:57:29Z 16 pages, 4 figures Machine Learning: Science and Technology (2026) Balázs P. Halmos Balázs Hajós Vince Á. Molnár Marcell T. Kurbucz Antal Jakovác 10.1088/2632-2153/ae3e4f http://arxiv.org/abs/2504.11167v1 Low-Rank SPIKE Framework for Solving Large Sparse Linear Systems with Applications 2025-04-15T13:15:00Z

The SPIKE family of linear system solvers provides parallelism using a block tridiagonal partitioning. Typically SPIKE-based solvers are applied to banded systems, resulting in structured off-diagonal blocks with non-zero elements restricted to relatively small submatrices comprising the band of the original matrix. In this work, a low-rank SVD-based approximation of the off-diagonal blocks is investigated. This produces a representation which more effectively handles matrices with large, sparse bands. A set of flexible distributed solvers, the LR-SPIKE variants, are implemented. These are applicable to a wide range of applications -- from use as a "black-box" preconditioner which straightforwardly improves upon the classic Block Jacobi preconditioner, to use as a specialized "approximate direct solver." An investigation of the effectiveness of the new preconditioners for a selection of SuiteSparse matrices is performed, particularly focusing on matrices derived from 3D finite element simulations. In addition, the SPIKE approximate linear system solvers are also paired with the FEAST eigenvalue solver, where they are shown to be particularly effective due to the former's rapid convergence, and the latter's acceptance of loose linear system solver convergence, resulting in a combination which requires very few solver iterations.

2025-04-15T13:15:00Z 26 pages Braegan S. Spring Eric Polizzi Ahmed H. Sameh http://arxiv.org/abs/2504.10754v1 auto-fpt: Automating Free Probability Theory Calculations for Machine Learning Theory 2025-04-14T23:07:56Z

A large part of modern machine learning theory often involves computing the high-dimensional expected trace of a rational expression of large rectangular random matrices. To symbolically compute such quantities using free probability theory, we introduce auto-fpt, a lightweight Python and SymPy-based tool that can automatically produce a reduced system of fixed-point equations which can be solved for the quantities of interest, and effectively constitutes a theory. We overview the algorithmic ideas underlying auto-fpt and its applications to various interesting problems, such as the high-dimensional error of linearized feed-forward neural networks, recovering well-known results. We hope that auto-fpt streamlines the majority of calculations involved in high-dimensional analysis, while helping the machine learning community reproduce known and uncover new phenomena.

2025-04-14T23:07:56Z Work in progress Arjun Subramonian Elvis Dohmatob http://arxiv.org/abs/2311.10700v2 Deriving Algorithms for Triangular Tridiagonalization a Skew-Symmetric Matrix 2025-04-10T15:07:00Z

This paper provides technical details regarding the application of the FLAME methodology to derive algorithms hand in hand with their proofs of correctness for the computation of the $ L T L^T $ decomposition (with and without pivoting) of a skew-symmetric matrix. The approach yields known as well as new algorithms, presented using the FLAME notation, enabling comparing and contrasting. A number of BLAS-like primitives are exposed at the core of the resulting unblocked and blocked algorithms.

2023-11-17T18:44:40Z 28 pages. arXiv admin note: text overlap with arXiv:2411.09859 Robert van de Geijn Maggie Myers RuQing G. Xu Devin Matthews http://arxiv.org/abs/1909.10051v3 PyIT2FLS: A New Python Toolkit for Interval Type 2 Fuzzy Logic Systems 2025-04-09T07:57:06Z

Fuzzy logic is an accepted and well-developed approach for constructing verbal models. Fuzzy-based methods are becoming more popular as engineers deal with more everyday tasks. This paper presents a new Python toolkit for Interval Type 2 Fuzzy Logic Systems (IT2FLS). Developing software tools is important for facilitating the practical use of theoretical results. There are limited tools for implementing IT2FLSs in Python. The developed PyIT2FLS provides a set of tools for fast and easy modeling of fuzzy systems. This paper includes a brief description of how the developed toolkit can be used. Three examples are also given showing the usage of the developed toolkit for simulating IT2FLSs. First, a simple rule-based system is developed and its code is presented in the paper. The second example is the prediction of the Mackey-Glass chaotic time series using IT2FLS. In this example, the Particle Swarm Optimization (PSO) algorithm is used for determining system parameters while minimizing the mean square error. In the last example, an IT2FPID is designed and used for controlling a linear time-delay system. The code for the examples is available on the toolkit's GitHub page: https://github.com/Haghrah/PyIT2FLS. The simulations and their results confirm that the developed toolkit can be used in a wide range of applications.

2019-09-22T17:34:20Z This work has been published in SoftwareX, Volume 30, May 2025, 102146. https://doi.org/10.1016/j.softx.2025.102146 SoftwareX, Volume 30, May 2025, 102146 Amir Arslan Haghrah Sehraneh Ghaemi 10.1016/j.softx.2025.102146 http://arxiv.org/abs/2504.06467v1 ZETA: a library for Zonotope-based EsTimation and fAult diagnosis of discrete-time systems 2025-04-08T22:19:11Z

This paper introduces ZETA, a new MATLAB library for Zonotope-based EsTimation and fAult diagnosis of discrete-time systems. It features user-friendly implementations of set representations based on zonotopes, namely zonotopes, constrained zonotopes, and line zonotopes, in addition to a basic implementation of interval arithmetic. The library's capabilities range from basic set operations on these sets to propagation through nonlinear functions using various approximation methods. The features of ZETA allow for reachability analysis and state estimation of discrete-time linear, nonlinear, and descriptor systems, in addition to active fault diagnosis of linear systems. Efficient order reduction methods are also implemented for the respective set representations. Some examples are presented in order to illustrate the functionalities of the new library.
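For readers unfamiliar with zonotopes, the basic set operations mentioned above can be sketched in a few lines. The class below is a hypothetical Python illustration of the standard center-and-generator representation. It is not ZETA's MATLAB API, and it does not cover constrained or line zonotopes.

```python
class Zonotope:
    """The set {c + sum_k xi_k * g_k : |xi_k| <= 1}, with center c and generators g_k."""

    def __init__(self, center, generators):
        self.c = list(center)
        self.G = [list(g) for g in generators]  # one vector per generator

    def linear_map(self, A):
        """Image of the set under x -> A x: map the center and each generator."""
        def apply(v):
            return [sum(a * x for a, x in zip(row, v)) for row in A]
        return Zonotope(apply(self.c), [apply(g) for g in self.G])

    def minkowski_sum(self, other):
        """Minkowski sum: add the centers and concatenate the generator lists."""
        center = [a + b for a, b in zip(self.c, other.c)]
        return Zonotope(center, self.G + other.G)

    def interval_hull(self):
        """Tightest axis-aligned box containing the zonotope, as (lo, hi) pairs."""
        radius = [sum(abs(g[i]) for g in self.G) for i in range(len(self.c))]
        return [(ci - ri, ci + ri) for ci, ri in zip(self.c, radius)]


# One reachability step x+ = A x + w for a toy linear system (values are made up).
X = Zonotope([1, 0], [[1, 0], [1, 1]])
W = Zonotope([0, 1], [[0, 2]])
A = [[2, 0], [0, 1]]
X_next = X.linear_map(A).minkowski_sum(W)
print(X_next.interval_hull())
```

Because the Minkowski sum concatenates generators, the generator count grows with every step. That growth is why order reduction methods matter in practice.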

2025-04-08T22:19:11Z 8 pages, 6 figures. Preprint submitted to the 64th IEEE Conference on Decision and Control Brenner S. Rego Joseph K. Scott Davide M. Raimondo Marco H. Terra Guilherme V. Raffo http://arxiv.org/abs/2304.13099v5 Exponentially Convergent Numerical Method for Abstract Cauchy Problem with Fractional Derivative of Caputo Type 2025-04-07T17:29:15Z

We present an exponentially convergent numerical method to approximate the solution of the Cauchy problem for the inhomogeneous fractional differential equation with an unbounded operator coefficient and Caputo fractional derivative in time. The numerical method is based on the newly obtained solution formula that consolidates the mild solution representations of sub-parabolic, parabolic and sub-hyperbolic equations with sectorial operator coefficient $A$ and non-zero initial data. The involved integral operators are approximated using the sinc-quadrature formulas that are tailored to the spectral parameters of $A$, fractional order $\alpha$ and the smoothness of the first initial condition, as well as to the properties of the equation's right-hand side $f(t)$. The resulting method possesses exponential convergence for positive sectorial $A$, any finite $t$, including $t = 0$ and the whole range $\alpha \in (0,2)$. It is suitable for a practically important case, when no knowledge of $f(t)$ is available outside the considered interval $t \in [0, T]$. The algorithm of the method is capable of multi-level parallelism. We provide numerical examples that confirm the theoretical error estimates.

2023-04-25T19:10:58Z This version supersedes the official publication (https://www.mdpi.com/2227-7390/11/10/2312) at present time. Several typos were corrected and corrigendum added Mathematics 11, no. 10: 2312 (2023) Dmytro Sytnyk Barbara Wohlmuth 10.3390/math11102312 http://arxiv.org/abs/2407.03483v3 Construct accurate multi-continuum micromorphic homogenisations in multi-D space-time with computer algebra 2025-04-06T23:27:00Z

Homogenisation empowers the efficient macroscale system level prediction of physical scenarios with intricate microscale structures. Here we develop an innovative, powerful, rigorous and flexible framework for asymptotic homogenisation of dynamics at the \emph{finite} scale separation of real physics, with proven results underpinned by modern dynamical systems theory. The novel systematic approach removes most of the usual assumptions, whether implicit or explicit, of other methodologies. By no longer assuming averages, the methodology constructs so-called multi-continuum or micromorphic homogenisations systematically informed by the microscale physics. The developed framework and approach enable a user to straightforwardly choose and create such homogenisations with clear physical and theoretical support, and of highly controllable accuracy and fidelity.

2024-07-03T20:11:02Z 3rd version A. J. Roberts http://arxiv.org/abs/2504.02067v1 A Truncated Newton Method for Optimal Transport 2025-04-02T19:00:24Z

Developing a contemporary optimal transport (OT) solver requires navigating trade-offs among several critical requirements: GPU parallelization, scalability to high-dimensional problems, theoretical convergence guarantees, empirical performance in terms of precision versus runtime, and numerical stability in practice. With these challenges in mind, we introduce a specialized truncated Newton algorithm for entropic-regularized OT. In addition to proving that locally quadratic convergence is possible without assuming a Lipschitz Hessian, we provide strategies to maximally exploit the high rate of local convergence in practice. Our GPU-parallel algorithm exhibits exceptionally favorable runtime performance, achieving high precision orders of magnitude faster than many existing alternatives. This is evidenced by wall-clock time experiments on 24 problem sets (12 datasets $\times$ 2 cost functions). The scalability of the algorithm is showcased on an extremely large OT problem with $n \approx 10^6$, solved approximately under weak entropic regularization.
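To make the entropic-regularized OT problem concrete, the sketch below solves a tiny instance with the classical Sinkhorn iteration. This is deliberately not the truncated Newton method of the paper. Sinkhorn is swapped in as the simplest well-known solver for the same problem, and the function name, the default `eps`, and the toy data are assumptions for illustration.

```python
import math


def sinkhorn(C, r, c, eps=0.5, iters=500):
    """Approximate entropic OT plan for cost C with row marginals r, column marginals c.

    Alternately rescales rows and columns of K = exp(-C / eps) so that the
    plan P = diag(u) K diag(v) matches the prescribed marginals.
    """
    n, m = len(r), len(c)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Row scaling: enforce the row marginals for the current v.
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column scaling: enforce the column marginals for the updated u.
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


cost = [[0.0, 1.0], [1.0, 0.0]]
rows, cols = [0.7, 0.3], [0.4, 0.6]
plan = sinkhorn(cost, rows, cols)
print(plan)
```

With weak regularization (small `eps`), the entries of `exp(-C / eps)` can underflow to zero in this naive form, so practical implementations usually work in the log domain. That is one instance of the numerical stability concern listed in the abstract.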

2025-04-02T19:00:24Z Accepted to ICLR 2025 Mete Kemertas Amir-massoud Farahmand Allan D. Jepson