https://arxiv.org/api/CZ7cjcRlIVrpsrClNnoUNVnNXAQ
2026-06-21T08:11:29Z
26646015

http://arxiv.org/abs/2605.08793v1
cuRegOT: A GPU-Accelerated Solver for Entropic-Regularized Optimal Transport
2026-05-09T08:27:39Z
Optimal transport (OT) has emerged as a fundamental tool in modern machine learning, yet its computational cost remains a significant bottleneck for large-scale applications. While harnessing the massive parallelism of modern GPU hardware is critical for efficiency, the de facto standard Sinkhorn algorithm, despite its ease of parallelization, often suffers from slow convergence in challenging problems. More recently, the sparse-plus-low-rank quasi-Newton method offers a balance between convergence rate and per-iteration complexity; however, its efficiency on GPUs is severely hindered by the serial nature of sparse matrix symbolic analysis and irregular memory access patterns. To bridge this gap, we present cuRegOT, a high-performance GPU solver tailored for entropic-regularized OT. We introduce a suite of algorithmic and architectural optimizations, including an amortized symbolic analysis strategy to mitigate CPU bottlenecks, an asynchronous Sinkhorn iterates generation mechanism, and a fused kernel for bandwidth-efficient gradient evaluation. These strategies are backed by rigorous theoretical guarantees ensuring algorithmic convergence. Extensive numerical experiments demonstrate that cuRegOT achieves significant speedups over state-of-the-art GPU-based solvers across a variety of benchmark tasks.
2026-05-09T08:27:39Z
Yixuan Qiu

http://arxiv.org/abs/2605.08497v1
MeTime: An R package for reproducible longitudinal metabolomics data analysis
2026-05-08T21:23:57Z
MeTime is an open-source R package for reproducible analysis of longitudinal metabolomics data. It builds upon a central S4 container, metime_analyser, that stores multiple datasets, associated metadata and analysis outputs, enabling unified handling of complex longitudinal studies.
Analyses are constructed by piping modular functions, beginning with data transformations (mod_), followed by calculations (calc_), and optional meta-analysis (meta_), so entire workflows remain transparent and easy to modify. MeTime wraps numerous existing methods within a consistent interface, including sample and metabolite distributions, correlation and distance matrices, dimensionality reduction (PCA, UMAP, tSNE), random forest imputation and feature selection via Boruta, eigenmetabolites and WGCNA based clustering, conservation index analysis, regression models (linear, mixed effects, and generalized additive), and partial correlation networks. By retaining all intermediate results and provenance within the container, MeTime facilitates iterative exploration and ensures reproducible reporting via automatically generated HTML and PDF outputs. Comprehensive user guides, case studies and reference documentation accompany the package, making MeTime a versatile platform for longitudinal omics workflows.
2026-05-08T21:23:57Z
For Supplementary Information and Supplementary Data see https://hmgubox2.helmholtz-munich.de/index.php/s/MeTimeSupplement
Bharadwaj Marella, Patrick Weinisch, Lara Vehovec, Vinh Tran, Josef J Bless, Yacoub A. Njipouombe Nsangou, Gabi Kastenmueller, Matthias Arnold

http://arxiv.org/abs/2506.11277v3
Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic
2026-05-08T08:08:41Z
Ootomo, Ozaki, and Yokota [Int. J. High Perform. Comput. Appl., 38 (2024), p. 297-313] have proposed a strategy to recast a floating-point matrix multiplication in terms of integer matrix products. The factors A and B are split into integer slices, the product of these slices is computed exactly, and AB is approximated by accumulating these integer products in floating-point arithmetic. This technique is particularly well suited to mixed-precision matrix multiply-accumulate units with integer support, such as the NVIDIA tensor cores or the AMD matrix cores.
The number of slices allows for performance-accuracy tradeoffs: more slices yield better accuracy but require more multiplications, which in turn reduce performance. We propose an inexpensive way to estimate the minimum number of multiplications needed to achieve a prescribed level of accuracy. Our error analysis shows that the algorithm may become inaccurate (or inefficient) if rows of A or columns of B are badly scaled. We perform a range of numerical experiments, both in simulation and on the latest NVIDIA GPUs, that confirm the analysis and illustrate strengths and weaknesses of the algorithm.
2025-06-12T20:33:50Z
Ahmad Abdelfattah, Jack Dongarra, Massimiliano Fasi, Mantas Mikaitis, Françoise Tisseur

http://arxiv.org/abs/2603.14926v2
Acceleration of multi-component multiple-precision arithmetic with branch-free algorithms and SIMD vectorization
2026-05-07T03:57:53Z
Multiple-precision floating-point branch-free algorithms can significantly accelerate multi-component arithmetic implemented by combining hardware-based binary64 and binary32, particularly for triple- and quadruple-precision computations. In this study, we achieved benchmark results on x86 and ARM CPU platforms to quantify the accelerations achieved in linear computations and polynomial evaluation by integrating these algorithms.
2026-03-16T07:29:54Z
Tomonori Kouya

http://arxiv.org/abs/2605.05395v1
Differentiable Parameter Optimization for DAEs with State-Dependent Events
2026-05-06T19:27:24Z
Differential-algebraic equations (DAEs) with state-dependent events arise in systems whose continuous dynamics are constrained by algebraic equations and interrupted by mode changes, switching logic, impacts, or state reinitializations. Gradient-based parameter learning for such systems is challenging because algebraic variables are implicitly defined, event times depend on the parameters, and reset maps introduce discontinuities. This paper studies differentiable parameter optimization for semi-explicit DAEs with events.
We formulate the learning problem as a constrained least-squares problem with DAE dynamics, algebraic constraints, guard equations, and reset maps. We then develop two complementary gradient-computation strategies. The first is an automatic-differentiation-through-simulation method that solves algebraic variables inside the vector field, differentiates the algebraic solve using the implicit function theorem, and handles events through segmented differentiable integration. The second is an explicit discrete-adjoint method that represents the forward simulation as an event-split residual system and computes gradients by solving for the Lagrange multipliers of smooth-segment and event residuals. The formulation clarifies that residual terms in the adjoint method are equality constraints, not heuristic penalties. We compare the two approaches in terms of gradient interpretation, event-time handling, implementation complexity, and local validity. Both methods provide gradients for the event path selected by the forward simulation and are valid under fixed event ordering and transversal guard crossings.
2026-05-06T19:27:24Z
Ion Matei, Maksym Zhenirovskyy, Anthony Wong

http://arxiv.org/abs/2605.05099v1
Randompack: Cross-Platform Reproducible Random Number Generation and Distribution Sampling
2026-05-06T16:35:08Z
A C library for random number generation, Randompack, is presented. The library implements several modern random number generators (engines), including xoshiro256, PCG64, Philox, ranlux++, and sfc64; 14 continuous distributions including uniform, normal, exponential, gamma, beta, and multivariate normal; raw bit streams, bounded integers, permutations, and sampling without replacement. The engine and the distribution layers are separated so any engine can be used with any distribution. Benchmarks show that Randompack is faster overall than competing libraries, with speedup factors ranging from about 1 to 15 depending on engine, distribution, interface, and platform.
A distinguishing feature is reproducibility: with the same seeds Randompack gives compatible results across programming languages, computers, CPU architectures, and compilers. The library includes comprehensive support for parallel simulation. It is accompanied by a comprehensive test suite, benchmarking programs, and example programs. Interfaces to Fortran, Python, Julia, and R have been implemented; their benchmark results are included, although their design and implementation are otherwise outside the scope of the article. Unlike other available C libraries with comparable scope, Randompack is permissively licensed under the MIT license, and it is open source and publicly available through GitHub and conda-forge.
2026-05-06T16:35:08Z
19 pages
Kristján Jónasson

http://arxiv.org/abs/2605.04629v1
CombOL: a Library for Practical Enumeration and Boltzmann Sampling of Combinatorial Classes
2026-05-06T08:16:38Z
We present CombOL (Combinatorial Objects Library), an open-source library for the enumeration and Boltzmann sampling of combinatorial classes. Classes can be specified by a concise string syntax, and may depend on an arbitrary number of parameters. CombOL automatically derives the associated generating functions, enabling the generation of counting sequences and the compilation of Boltzmann samplers. The library supports exact and approximate-size Boltzmann rejection sampling with automatic parameter tuning to target specific sizes. In addition to implementing established methods, CombOL contributes a novel early-rejection scheme, as well as guaranteed statistical correctness by dynamically increasing the numerical precision, eliminating bias due to floating-point rounding errors. Through the Python interface, sampled structures can be mapped to application-specific objects, enabling direct sampling of domain objects such as graphs, chemical structure representations, or other complex data types. CombOL is available from PyPI as 'combol' (pypi.org/project/combol).
The source code is available at gitlab.com/casbjorn/combol.
2026-05-06T08:16:38Z
10 pages, 2 figures. Submitted to ICMS (International Congress on Mathematical Software) 2026
Casper Asbjørn Eriksen, Daniel Merkle

http://arxiv.org/abs/2506.21654v2
Experience converting a large mathematical software package written in C++ to C++20 modules
2026-05-06T02:23:51Z
Mathematical software has traditionally been built in the form of "packages" that build on each other. A substantial fraction of these packages is written in C++ and, as a consequence, the interface of a package is described in the form of header files that downstream packages and applications can then #include. C++ has inherited this approach towards exporting interfaces from C, but the approach is clunky, unreliable, and slow. As a consequence, C++20 has introduced a "module" system in which packages explicitly export declarations and code that compilers then store in machine-readable form and that downstream users can "import" -- a system in line with what many other programming languages have used for decades.
Herein, I explore how one can convert large mathematical software packages written in C++ to this system, using the deal.II finite element library with its around 800,000 lines of code as an example. I describe an approach that allows providing both header-based and module-based interfaces from the same code base, discuss the challenges one encounters, and how modules actually work in practice in a variety of technical and human metrics. The results show that with a non-trivial, but also not prohibitive effort, the conversion to modules is possible, resulting in a reduction in compile time for the converted library itself; on the other hand, for downstream projects, compile times show no clear trend. I end with thoughts about long-term strategies for converting the entire ecosystem of mathematical software over the coming years or decades.
2025-06-26T17:38:33Z
Wolfgang Bangerth

http://arxiv.org/abs/2411.09859v3
Performant Tridiagonal Factorization of Skew-Symmetric Matrices
2026-05-04T19:36:18Z
The factorization of skew-symmetric matrices is a critically understudied area of dense linear algebra, particularly in comparison to that of general and symmetric matrices. While some algorithms can be adapted from the symmetric case, the cost of algorithms can be reduced by exploiting skew-symmetry. This work examines the factorization of a skew-symmetric matrix $X$ into its $LTL^T$ decomposition, where $L$ is unit lower triangular and $T$ is tridiagonal. This is also known as a triangular tridiagonalization. This operation is a means for computing the determinant of $X$ as the square of the (cheaply-computed) Pfaffian of the skew-symmetric tridiagonal matrix $T$ as well as for solving systems of equations, across fields such as quantum electronic structure and machine learning. Its application also often requires pivoting in order to improve numerical stability.
We compare and contrast previously-published algorithms with those systematically derived using the FLAME methodology. Performant parallel CPU implementations are achieved by fusing operations at multiple levels in order to reduce memory traffic overhead. A key factor is the employment of new capabilities of the BLAS-like Library Instantiation Software (BLIS) framework, which now supports casting level-2 and level-3 BLAS-like operations by leveraging its gemm and other kernels, hierarchical parallelism, and cache blocking. A prototype, concise C++ API facilitates the translation of correct-by-construction algorithms into correct code. Experiments verify that the resulting implementations greatly exceed the performance of previous work.
2024-11-15T00:37:31Z
Ishna Satyarth, Chao Yin, Devin A. Matthews, Maggie Myers, Robert van de Geijn, RuQing G. Xu

http://arxiv.org/abs/2605.02593v1
Gradient Boosted Risk Scores
2026-05-04T13:44:23Z
Risk scores are an interpretable and actionable class of machine learning models with applications in medicine, insurance, and risk management. Unlike most computational methods, risk scores are designed to be computed by a human by attributing points to a data sample based on a limited set of criteria. The most common approaches for generating risk scores use linear regressions to estimate the effect of selected variables. We propose a simple and effective approach towards building compact and predictive risk scores. We provide an algorithm based on gradient boosting that is capable of modeling nonlinear effects, along with a C++ implementation with Python and R bindings.
Through extensive empirical evaluation on twelve tabular datasets spanning regression, classification, and time-to-event tasks, we show that our method achieves competitive predictive performance while producing substantially more compact scores than regression-based alternatives, with 60% fewer rules for classification tasks and 16% fewer rules for time-to-event tasks on average, compared to AutoScore.
2026-05-04T13:44:23Z
Costa Georgantas, Jonas Richiardi

http://arxiv.org/abs/2605.02554v1
Interprocess Communication of Algebraic Data
2026-05-04T13:02:06Z
We discuss implementation details of OSCAR's serialization framework, highlighting the design decisions that allow the fine tuning of serialization methods for specific use cases.
In particular, we show how the mrdi file format can be used for distributed computing.
2026-05-04T13:02:06Z
Antony Della Vecchia

http://arxiv.org/abs/2605.02252v1
Exact Higher-Order Derivatives for SE(3) via Analytical/AD Methods
2026-05-04T05:55:59Z
Fast prototyping of new SE(3) estimation objectives remains awkward in practice. Modern Lie-group frameworks -- GTSAM, manif, Sophus, SymForce, Ceres -- target first-order workloads through different code-generation and automatic-differentiation strategies, each optimized for a particular seam between hand-derived geometry and generic differentiation. The remaining gap is a compact, AD-safe path from these first-order primitives to exact Hessians, observed-information matrices, and higher-order derivative tensors: the quantities needed for exact Newton steps, observed-information covariance estimates, and covariance correction.
This paper presents a hybrid analytical/AD recipe for SE(3) negative log-likelihoods. The practitioner writes the NLL gradient once, generic over a scalar type, and places the analytical/AD seam at the point-action interface y = Tx. Closed-form Lie-group Jacobians are used up to this interface; AD is applied only beyond it. The same source is then instantiated with ordinary floating-point scalars for gradients, vector-seeded dual numbers for exact Hessians in a single forward-mode pass, and nested dual numbers for higher-order derivative tensors. On a representative 6-DoF, 5-landmark SE(3) NLL, the advocated seeded-Hessian path is approximately 5x faster than finite-differencing the AD gradient on this benchmark while matching a nested-AD oracle to machine precision. The implementation adds roughly 70 lines of analytical-Jacobian code over an AD-only baseline. We also identify and fix a removable singularity in the standard SO(3)/SE(3) scalar basis that would otherwise produce NaNs at the origin under seeded AD, and we audit which Lie-group derivative tensors require this stabilized basis. The result is a practical path from rapidly written SE(3) objectives to exact higher-order derivatives, with predictable runtime and no finite-difference tuning.
2026-05-04T05:55:59Z
7 pages, 1 table. Companion code available at https://github.com/sigmapointlabs/se3_ad_recipes
Frank O. Kuehnel

http://arxiv.org/abs/2605.02966v1
QBalance: A Reproducible Multi-Objective Workflow for Quantum Compilation, Noise Suppression, and Error-Mitigation Strategy Selection
2026-05-03T09:28:48Z
Near-term quantum workloads are shaped by coupled compilation and execution choices: qubit layout, routing, basis translation, gate suppression, measurement mitigation, shot budget, and artifact reproducibility. This paper analyzes QBalance, a Python workflow library for dataset-level selection among quantum compilation, noise-suppression, and error-mitigation strategies built on the Qiskit ecosystem.
The contribution is formulated as a finite multi-objective strategy-selection problem over circuits, backends, and transformation policies. The manuscript derives the implemented weighted objective, non-dominated selection rule, survival-product error proxy, Bayesian linear candidate-ordering surrogate, and distributional diagnostics. It also positions the system relative to established work on Qiskit pass-manager compilation, SABRE-style routing, randomized compiling, dynamical decoupling, zero-noise extrapolation, matrix-free measurement mitigation, circuit cutting, and Thompson sampling. The analysis shows that QBalance provides a reproducible orchestration and artifact model for quantum workflow studies. It also establishes precise limitations: the current bandit mechanism orders candidates but does not reduce the number of candidate evaluations, the custom layout heuristic is greedy and only partially topology-aware, the implemented ZNE helper is parity-centered, and the cutting integration is a hook rather than a full reconstruction pipeline.
2026-05-03T09:28:48Z
Soumyadip Sarkar

http://arxiv.org/abs/2506.12718v3
Permutation-Avoiding FFT-Based Convolution
2026-04-29T14:09:22Z
Fast Fourier Transform (FFT) libraries are widely used for evaluating discrete convolutions. Most FFT implementations follow some variant of the Cooley-Tukey framework, in which the transform is decomposed into butterfly operations and index-reversal permutations. While butterfly operations dominate the floating-point operation count, the memory access patterns induced by index-reversal permutations significantly degrade the FFT's arithmetic intensity. When performing discrete convolution, the three sets of index-reversal permutations which occur in FFT-based implementations using Cooley-Tukey frameworks cancel out, thus paving the way to implementations free of any permutation.
To the best of our knowledge, such permutation-free variants of FFT-based discrete convolution are not commonly used in practice, making such kernels worth investigating. Here, we look into such permutation-avoiding convolution procedures for multi-dimensional cases within a general radix Cooley-Tukey framework. We perform numerical experiments to benchmark the algorithms presented against state-of-the-art FFT-based convolution implementations. Our results suggest that developers of FFT libraries should consider supporting permutation-avoiding convolution kernels.
2025-06-15T04:50:02Z
43 pages, 22 tables, 2 figures, 22 algorithms
Nicolas Venkovic, Hartwig Anzt

http://arxiv.org/abs/2505.18441v2
DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces
2026-04-29T10:08:45Z
Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling high-dimensional transformer embeddings requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, which is known to be NP-hard. It is therefore interesting to understand whether this approach is sufficient to find good solutions to the dictionary learning problem or if a more sophisticated algorithm could find better solutions. In this work, we propose Double-Batch KSVD (DB-KSVD), a scalable dictionary learning algorithm that adapts the classic KSVD algorithm. DB-KSVD is informed by the rich theoretical foundations of KSVD but scales to datasets with millions of samples and thousands of dimensions. We demonstrate the efficacy of DB-KSVD by disentangling text embeddings of the Gemma-2-2B and Pythia-160M models and evaluating on six metrics from the SAEBench benchmark, where we achieve competitive results when compared to established approaches based on SAEs.
We further show similar results when disentangling image embeddings obtained from the DINOv2-S and DINOv2-B models, solidifying our findings. By matching SAE performance with an entirely different optimization approach, our results suggest that (i) SAEs do find strong solutions to the dictionary learning problem and (ii) traditional optimization approaches can be scaled to the required problem sizes, offering a promising avenue for further research. We make an implementation of DB-KSVD available at https://github.com/romeov/ksvd.jl.
2025-05-24T00:32:50Z
8 pages + 10 pages appendix. Updated with additional vision transformer experiments
Romeo Valentin, Sydney M. Katz, Vincent Vanhoucke, Mykel J. Kochenderfer