http://arxiv.org/abs/2601.23237v1
Applications of QR-based Vector-Valued Rational Approximation
2026-01-30T18:06:46Z
Several applications of the QR-AAA algorithm, a greedy scheme for vector-valued rational approximation, are presented. The focus is on demonstrating the flexibility and practical effectiveness of QR-AAA in a variety of computational settings, including Stokes flow computation, multivariate rational approximation, function extension, the development of novel quadrature methods and near-field approximation in the boundary element method.
2026-01-30T18:06:46Z
Simon Dirckx

http://arxiv.org/abs/2208.14314v4
Cardinal Optimizer (COPT) User Guide
2026-01-30T03:43:15Z
Cardinal Optimizer is a high-performance mathematical programming solver for efficiently solving large-scale optimization problems. This documentation provides a basic introduction to the Cardinal Optimizer.
2022-08-30T14:52:44Z
Dongdong Ge, Qi Huangfu, Zizhuo Wang, Jian Wu, Yinyu Ye

http://arxiv.org/abs/2601.22200v1
Adaptive Benign Overfitting (ABO): Overparameterized RLS for Online Learning in Non-stationary Time-series
2026-01-29T15:58:01Z
Overparameterized models have recently challenged conventional learning theory by exhibiting improved generalization beyond the interpolation limit, a phenomenon known as benign overfitting. This work introduces Adaptive Benign Overfitting (ABO), extending the recursive least-squares (RLS) framework to this regime through a numerically stable formulation based on orthogonal-triangular updates. A QR-based exponentially weighted RLS (QR-EWRLS) algorithm is introduced, combining random Fourier feature mappings with forgetting-factor regularization to enable online adaptation under non-stationary conditions. The orthogonal decomposition prevents the numerical divergence associated with covariance-form RLS while retaining adaptability to evolving data distributions.
Experiments on nonlinear synthetic time series confirm that the proposed approach maintains bounded residuals and stable condition numbers while reproducing the double-descent behavior characteristic of overparameterized models. Applications to forecasting foreign exchange and electricity demand show that ABO is highly accurate (comparable to baseline kernel methods) while achieving speed improvements of between 20 and 40 percent. The results provide a unified view linking adaptive filtering, kernel approximation, and benign overfitting within a stable online learning framework.
2026-01-29T15:58:01Z
32 pages, 3 figures, 10 tables
Luis Ontaneda Mijares, Nick Firoozye

http://arxiv.org/abs/2405.03456v2
Floating Point Compression of Hierarchical Matrix Formats and its Impact on Matrix-Vector Multiplication
2026-01-28T20:08:27Z
Matrix-vector multiplication forms the basis of many iterative solution algorithms and as such is also an important algorithm for hierarchical matrices, which are used to represent dense data in an optimized form by applying low-rank compression. However, due to its low computational intensity, the performance of matrix-vector multiplication is typically limited by the available memory bandwidth on parallel systems. With floating point compression, the memory footprint can be optimized, which reduces the stress on the memory subsystem and thereby increases performance. We examine the compression of different hierarchical matrix formats and how it can be used to speed up the corresponding matrix-vector multiplication.
2024-05-06T13:29:14Z
Ronald Kriemann

http://arxiv.org/abs/2602.00125v1
MiniTensor: A Lightweight, High-Performance Tensor Operations Library
2026-01-27T21:21:59Z
We present MiniTensor, an open-source tensor operations library that focuses on minimalism, correctness, and performance. MiniTensor exposes a familiar PyTorch-like Python API while executing performance-critical code in a Rust engine.
The core supports dense $n$-dimensional tensors, broadcasting, reductions, matrix multiplication, reverse-mode automatic differentiation, a compact set of neural network layers, and standard optimizers. In this paper, we describe the design of MiniTensor's architecture, including its efficient memory management, dynamic computation graph for gradients, and integration with Python via PyO3. We also compare the install footprint with PyTorch and TensorFlow to demonstrate that MiniTensor achieves a package size of only a few megabytes, several orders of magnitude smaller than mainstream frameworks, while preserving the essentials needed for research and development on CPUs. The repository can be found at https://github.com/neuralsorcerer/minitensor
2026-01-27T21:21:59Z
Soumyadip Sarkar

http://arxiv.org/abs/2601.17708v1
High-Order Mesh r-Adaptivity with Tangential Relaxation and Guaranteed Mesh Validity
2026-01-25T05:42:25Z
High-order meshes are crucial for achieving optimal convergence rates in curvilinear domains, preserving symmetry, and aligning with key flow features in moving mesh simulations, but their quality is challenging to control. In prior work, we have developed techniques based on the Target-Matrix Optimization Paradigm (TMOP) to adapt a given high-order mesh to the geometry and solution of the partial differential equation (PDE). Here, we extend this framework to address two key gaps in the literature for high-order mesh r-adaptivity. First, we introduce tangential relaxation on curved surfaces using solely the discrete mesh representation, eliminating the need for access to the underlying geometry (e.g., a CAD model). Second, we ensure a continuously positive Jacobian determinant throughout the domain. This determinant positivity is essential for using the high-order mesh resulting from r-adaptivity with arbitrary quadrature schemes in simulations.
The proposed approach is demonstrated to be robust using a variety of numerical experiments.
2026-01-25T05:42:25Z
13 pages, 10 figures
Ketan Mittal, Veselin Dobrev, Tzanio Kolev, Vladimir Tomov

http://arxiv.org/abs/2602.00075v1
Dimensional Peeking for Low-Variance Gradients in Zeroth-Order Discrete Optimization via Simulation
2026-01-21T04:23:06Z
Gradient-based optimization methods are commonly used to identify local optima in high-dimensional spaces. When derivatives cannot be evaluated directly, stochastic estimators can provide approximate gradients. However, these estimators' perturbation-based sampling of the objective function introduces variance that can lead to slow convergence. In this paper, we present dimensional peeking, a variance reduction method for gradient estimation in discrete optimization via simulation. By lifting the sampling granularity from scalar values to classes of values that follow the same control flow path, we increase the information gathered per simulation evaluation. Our derivation from an established smoothed gradient estimator shows that the method does not introduce any bias. We present an implementation via a custom numerical data type to transparently carry out dimensional peeking over C++ programs. Variance reductions by factors of up to 7.9 are observed for three simulation-based optimization problems with high-dimensional input. The optimization progress compared to three meta-heuristics shows that dimensional peeking increases the competitiveness of zeroth-order optimization for discrete and non-convex simulations.
2026-01-21T04:23:06Z
Accepted at ACM SIGSIM PADS 2026
Philipp Andelfinger, Wentong Cai

http://arxiv.org/abs/2601.12220v1
Canonicalization of Batched Einstein Summations for Tuning Retrieval
2026-01-18T01:50:28Z
We present an algorithm for normalizing \emph{Batched Einstein Summation}
expressions by mapping mathematically equivalent formulations to a unique
normal form. Batches of einsums with the same Einstein notation that exhibit
substantial data reuse appear frequently in finite element methods (FEM),
numerical linear algebra, and computational chemistry. To effectively exploit
this temporal locality for high performance, we consider groups of einsums in
batched form.
Representations of equivalent batched einsums may differ due to index
renaming, permutations within the batch, and the commutativity and
associativity of the multiplication operation. The lack of a canonical
representation hinders the reuse of optimization and tuning knowledge in
software systems. To this end, we develop a novel encoding of batched einsums
as colored graphs and apply graph canonicalization to derive a normal form.
In addition to the canonicalization algorithm, we propose a representation of
einsums using functional array operands and provide a strategy to transfer
transformations operating on the normal form to \emph{functional batched
einsums} that exhibit the same normal form, which is crucial for fusing
surrounding computations for memory-bound einsums. We evaluate our approach against JAX,
and observe a geomean speedup of $4.7\times$ for einsums from the TCCG
benchmark suite and an FEM solver.
2026-01-18T01:50:28Z
Kaushik Kulkarni, Andreas Klöckner

http://arxiv.org/abs/2601.17028v1
PALMA: A Lightweight Tropical Algebra Library for ARM-Based Embedded Systems
2026-01-17T23:42:52Z
Tropical algebra, including max-plus, min-plus, and related idempotent semirings, provides a unifying framework in which many optimization problems that are nonlinear in classical algebra become linear. This property makes tropical methods particularly well suited for shortest paths, scheduling, throughput analysis, and discrete event systems. Despite their theoretical maturity and practical relevance, existing tropical algebra implementations primarily target desktop or server environments and remain largely inaccessible on resource-constrained embedded platforms, where such optimization problems are most acute. We present PALMA (Parallel Algebra Library for Max-plus Applications), a lightweight, dependency-free C library that brings tropical linear algebra to ARM-based embedded systems. PALMA implements a generic semiring abstraction with SIMD-accelerated kernels, enabling a single computational framework to support shortest paths, bottleneck paths, reachability, scheduling, and throughput analysis. The library supports five tropical semirings, dense and sparse (CSR) representations, tropical closure, and spectral analysis via maximum cycle mean computation. We evaluate PALMA on a Raspberry Pi 4 and demonstrate peak performance of 2,274 MOPS, speedups of up to 11.9 times over classical Bellman-Ford for single-source shortest paths, and sub-10 microsecond scheduling solves for real-time control workloads. Case studies in UAV control, IoT routing, and manufacturing systems show that tropical algebra enables efficient, predictable, and unified optimization directly on embedded hardware.
PALMA is released as open-source software under the MIT license.
2026-01-17T23:42:52Z
Open-source software available at https://github.com/ReFractals/palma
Gnankan Landry Regis N'guessan

http://arxiv.org/abs/2601.10828v1
High-Order Lie Derivatives from Taylor Series in the ADTAYL Package
2026-01-15T20:01:32Z
High-order Lie derivatives are essential in nonlinear systems analysis. If done symbolically, their evaluation becomes increasingly expensive as the order increases. We present a compact and efficient numerical approach for computing Lie derivatives of scalar, vector, and covector fields using the MATLAB ADTAYL package. The method exploits a fact noted by Röbenack: that these derivatives coincide, up to factorial scaling, with the Taylor coefficients of expressions built from a Taylor expansion about a trajectory point and, when required, the associated variational matrix. Computational results for a gantry crane model demonstrate orders-of-magnitude speedups over symbolic evaluation using the MATLAB Symbolic Math Toolbox.
2026-01-15T20:01:32Z
16 pages
Nedialko S. Nedialkov, John D. Pryce

http://arxiv.org/abs/2601.07827v1
Tensor Algebra Processing Primitives (TAPP): Towards a Standard for Tensor Operations
2026-01-12T18:58:31Z
To address the absence of a universal standard interface for tensor operations, we introduce the Tensor Algebra Processing Primitives (TAPP), a C-based interface designed to decouple the application layer from hardware-specific implementations. We provide a mathematical formulation of tensor contractions and a reference implementation to ensure correctness and facilitate the validation of optimized kernels. Developed through community consensus involving academic and industrial stakeholders, TAPP aims to enable performance portability and resolve dependency challenges.
The viability of the standard is demonstrated through successful integrations with the TBLIS and cuTENSOR libraries, as well as the DIRAC quantum chemistry package.
2026-01-12T18:58:31Z
45 pages, 5 figures
Jan Brandejs, Niklas Hörnblad, Edward F. Valeev, Alexander Heinecke, Jeff Hammond, Devin Matthews, Paolo Bientinesi

http://arxiv.org/abs/2410.21231v2
$\texttt{skwdro}$: a library for Wasserstein distributionally robust machine learning
2026-01-09T15:45:40Z
We present skwdro, a Python library for training robust machine learning models. The library is based on distributionally robust optimization using Wasserstein distances, popular in optimal transport and machine learning. The goal of the library is to make the training of robust models easier for a wide audience by proposing a wrapper for PyTorch modules, enabling robustification of a model's loss with minimal code changes. It also provides scikit-learn-compatible estimators for some popular objectives. The core of the implementation relies on an entropic smoothing of the original robust objective, in order to ensure maximal model flexibility. The library is available at https://github.com/iutzeler/skwdro and the documentation at https://skwdro.readthedocs.io.
2024-10-28T17:16:00Z
7 pages, 2 figures
Florian Vincent, Waïss Azizian, Franck Iutzeler, Jérôme Malick

http://arxiv.org/abs/2601.04901v1
Rigorous numerical computation of the Stokes multipliers for linear differential equations with single level one
2026-01-08T12:56:55Z
We describe a practical algorithm for computing the Stokes multipliers of a linear differential equation with polynomial coefficients at an irregular singular point of single level one. The algorithm follows a classical approach based on Borel summation and numerical ODE solving, but avoids a large amount of redundant work compared to a direct implementation. It applies to differential equations of arbitrary order, with no genericity assumption, and is suited to high-precision computations.
In addition, we present an open-source implementation of this algorithm in the SageMath computer algebra system and illustrate its use with several examples. Our implementation supports arbitrary-precision computations and automatically provides rigorous error bounds. The article assumes minimal prior knowledge of the asymptotic theory of meromorphic differential equations and provides an elementary introduction to the linear Stokes phenomenon that may be of independent interest.
2026-01-08T12:56:55Z
Michèle Loday-Richaud (LMO), Marc Mezzarobba (LIX), Pascal Remy (LMV)

http://arxiv.org/abs/2601.02506v1
Star Formation in Galaxy Collisions: Dependence on Impact Velocity and Gas Mass of Galaxies in GADGET-4 Simulations
2026-01-05T19:22:40Z
This work investigates variations in the star formation rate during galaxy collisions when the initial conditions of velocity and gas mass are altered. For this purpose, hydrodynamic simulations were performed using the GADGET-4 code, with initial conditions generated by the Galstep and SnapshotJoiner programs. Systems of two galaxies on a head-on collision course were modeled with relative initial velocities ranging from 100 km/s to 1000 km/s, considering two scenarios: the first with identical galaxies, and the second with galaxies of different sizes. In simulations of systems with higher initial relative velocities, both scenarios showed more intense peaks in the star formation rate, triggered by the first contact of the collision, followed by a strong decline caused by gas dispersion. In contrast, for systems with lower initial velocities, mergers between galaxies were observed, leading to multiple peaks in the star formation rate. A greater initial distance between galaxies has also been linked to whether or not the galaxy system merges, since it implies longer timescales for gravitational action, which leads to higher relative velocities at the moment of collision.
Furthermore, the star formation rate in galaxies was found to have a clear dependence on initial gas content. Overall, our results show that the relative impact velocity, the initial distance between the galaxies, and the gas content are important parameters for analyzing the star formation rate in colliding galaxies.
2026-01-05T19:22:40Z
17 pages and 17 figures. Keywords: star formation; galaxy mergers; hydrodynamic simulations; GADGET-4
Gustavo Neves Pereira, Paulo Laerte Natti

http://arxiv.org/abs/2504.02117v2
Vectorized Parallel in Time methods for low-order discretizations with application to Porous Media problems
2026-01-05T17:26:45Z
High-order methods have shown great potential to overcome performance issues of simulations of partial differential equations (PDEs) on modern hardware; still, many users stick to low-order, matrix-based simulations, in particular in porous media applications. Heterogeneous coefficients and low regularity of the solution are reasons not to employ high-order discretizations. We present a new approach for the simulation of instationary PDEs that partially mitigates the performance problems. By reformulating the original problem, we derive a parallel-in-time integrator that increases the arithmetic intensity and introduces additional structure into the problem. It thereby helps accelerate matrix-based simulations on modern hardware architectures. Based on a system for multiple time steps, we formulate a matrix equation that can be solved using vectorized solvers such as block Krylov methods. The structure of this approach makes it applicable to a wide range of linear and nonlinear problems.
In our numerical experiments, we present some first results for three different PDEs: a linear convection-diffusion equation, a nonlinear diffusion-reaction equation, and a realistic example based on Richards' equation.
2025-04-02T20:26:22Z
Christian Engwer, Alexander Schell, Nils-Arne Dreier