https://arxiv.org/api/LoE62j1Ly8S3amLS3riIWK4jwgQ
Updated: 2026-06-17T08:25:24Z

http://arxiv.org/abs/2605.16058v1
Title: High-Performance Star-M SVD for Big Data Compression
Updated: 2026-05-15T15:24:03Z
Abstract: In the era of big data, effectively compressing large datasets while performing complex mathematical operations is crucial. Tensor-based decomposition methods have shown superior compression capabilities with minimal loss of accuracy compared to traditional matrix methods. Under the star-M tensor framework, tensors can be decomposed in a matrix-mimetic way, including using the star-M SVD. This tensor SVD has optimality guarantees and has shown exceptional performance on specific types of data, but software implementations have been mostly limited to productivity-oriented languages. In this work, we present our development of a shared-memory parallel, high-performance solution designed to efficiently implement the underlying algorithms. This software will enable optimal compression of extensive scientific datasets, paving the way for enhanced data analysis and insights.
Published: 2026-05-15T15:24:03Z
Authors: Md Taufique Hussain, Grey Ballard, Aditya Devarakonda, Srinivas Eswar, Naman Pesricha, Vishwas Rao

http://arxiv.org/abs/2605.15547v1
Title: Correctly Rounded Functions For Vector Applications: A Performance Study
Updated: 2026-05-15T02:39:39Z
Abstract: Following recent interest in correctly rounded math library functions (as currently recommended by the IEEE 754 standard), we have designed several SIMD algorithms for one-input single precision functions and integrated them into our CPU math library; these will form the core of the first correctly rounded vector math library, to be available to users in mid-2026. To take advantage of the cross-platform bitwise reproducibility afforded by correct rounding, we adapted and evaluated a few SIMD implementations on graphics processing units (GPUs).
In addition, we designed and evaluated proof-of-concept SIMD implementations of two correctly rounded double precision functions.
Published: 2026-05-15T02:39:39Z
Comment: 5 pages, 2 figures, 4 tables
Authors: Cristina Anderson, Marius Cornea, Andrey Stepin, Mihai Tudor Panu

http://arxiv.org/abs/2605.13736v1
Title: Porting the Nonlinear Optimization Library HiOp to Accelerator-Based Hardware Architectures
Updated: 2026-05-13T16:13:32Z
Abstract: While interior point methods have been the centerpiece of nonlinear programming tools used in science and engineering, their reliance on linear solvers that can tackle sparse symmetric indefinite and highly ill-conditioned problems made it difficult to implement them effectively on hardware accelerators. At this time, there are few sparse linear solvers that can be used in this context. Here, we present a novel formulation of an interior point method implemented in our HiOp library, which is designed to be able to run entirely on hardware accelerators. This formulation avoids dependence on sparse solvers altogether, which is achieved by compressing the underlying sparse linear problem into a dense one of manageable size. We demonstrate feasibility of this approach and provide a baseline for future interior point method implementations on hardware accelerators. Our investigation is motivated by problems arising in optimal power flow analysis in power systems engineering and our approach is tailored to the broad class of problems arising in that important domain. We also demonstrate utility of modern programming models based on performance portability libraries, namely, Umpire and RAJA. We discuss trade-offs between performance, portability and development cost in the solution space for this nonlinear optimization problem. As a result of this research, we demonstrate for the first time that interior point methods for sparse problems can be efficiently realized on modern computing systems where more than 90% of processing power is in GPUs.
Published: 2026-05-13T16:13:32Z
Authors: Slaven Peles, Kalyan S. Perumalla, Maksudul Alam, Asher J. Mancinelli, R. Cameron Rutherford, Jake Ryan, Cosmin G. Petra

http://arxiv.org/abs/2605.13607v1
Title: Ergodicity Library: A Python Toolkit for Stochastic-Process Simulation, Time-Average Diagnostics, and Agent-Based Experiments
Updated: 2026-05-13T14:41:28Z
Abstract: ergodicity is an open-source Python library for computational work on stochastic dynamics, with particular emphasis on non-ergodicity, time-average behavior, heavy-tailed processes, and decision making under uncertainty. The package brings together three layers that are often split across ad hoc scripts: process definitions and simulators, analysis and fitting tools, and agent-based experimentation. This article documents the implemented software rather than presenting new stochastic theory. We describe the package architecture, the supported process families, the analysis workflow, and the practical boundaries of the current implementation. We also provide fully reproducible examples covering heavy-tailed ensemble spread, multiplicative Lévy growth diagnostics, adaptive memory mean reversion, preasymptotic fluctuation analysis, and partial stochastic differential equation simulation. The package is positioned as an integration layer on top of the scientific Python stack, reducing the amount of glue code required to move from process specification to diagnostics and comparative experiments.
Published: 2026-05-13T14:41:28Z
Authors: Ihor Kendiukhov

http://arxiv.org/abs/2605.12443v1
Title: Basilisk and Docker for Reproducible GN&C Simulation: A Workflow Reference
Updated: 2026-05-12T17:37:37Z
Abstract: Basilisk is an open-source astrodynamics simulation framework widely used for spacecraft guidance, navigation, and control (GN&C) research and development. Despite its flexibility and computational capabilities, configuring Basilisk consistently across heterogeneous development environments presents practical challenges due to dependency management, operating system compatibility, and software configuration requirements.
This paper presents a Docker-based containerization workflow for Basilisk that encapsulates the complete build environment, dependencies, and simulation infrastructure within a portable container image. The workflow is demonstrated through a progression of simulation scenarios of increasing complexity, from standalone orbital dynamics scripts to BSKSim-based attitude dynamics and control simulations with Monte Carlo analysis. The BSKSim class hierarchy, dynamics model architecture, flight software implementation, and scenario execution patterns are described in detail. The presented workflow provides a self-contained implementation reference for GN&C engineers and researchers seeking reproducible and portable Basilisk simulation environments. This work expands upon a workshop presentation delivered at the 46th Rocky Mountain AAS GN&C Conference, February 2024, available at https://doi.org/10.5281/zenodo.15008785.
Published: 2026-05-12T17:37:37Z
Comment: 21 pages, 8 figures
Authors: Anubhav Gupta

http://arxiv.org/abs/2605.12583v1
Title: QuPort: Topology-, Port-, and Congestion-Aware Compilation for Modular Multi-QPU Quantum Systems
Updated: 2026-05-12T17:12:30Z
Abstract: Modular quantum processors require a compiler to reason about two resources at the same time: local device connectivity and communication across QPUs. A mapping that is acceptable on a single coupling graph may be unsuitable for a modular machine if it creates excessive cross-QPU traffic, concentrates that traffic on a small number of interconnect links, or assigns many boundary qubits to a QPU with few communication ports. This paper presents QuPort, a Python and Qiskit-based compilation framework that studies this setting through an explicit three-level model: a weighted logical interaction graph, a directed physical coupling map, and an undirected QPU-level interconnect graph. The main partitioning method, TPCCAP, optimizes the implemented objective formed by weighted cut distance, communication-port overflow, and routed link-load congestion.
The framework also includes heavy-edge clustering, balanced greedy partitioning, simulated-annealing refinement, communication-port-aware layout, extraction of remote two-qubit operations, local-only routing of per-QPU circuits, and topology-aware schedule estimation. The model is a compiler-level abstraction. It does not claim a calibrated hardware runtime or an implementation of a physical remote-gate protocol.
Published: 2026-05-12T17:12:30Z
Authors: Soumyadip Sarkar, Subhasree Bhattacharjee

http://arxiv.org/abs/2605.10678v1
Title: A Performance-Portable, Massively Parallel Distributed Nonuniform FFT
Updated: 2026-05-11T14:56:41Z
Abstract: The nonuniform fast Fourier transform (NUFFT) enables spectral methods for problems with irregularly spaced samples, with applications in medical imaging, molecular dynamics, and kinetic plasma simulations. Existing implementations are limited to shared-memory execution, restricting problem sizes to what fits on a single node. We present the first distributed, performance-portable NUFFT for heterogeneous supercomputers. Our Kokkos-based implementation runs without modification on NVIDIA and AMD GPUs. We develop multiple spreading and interpolation kernels optimized for different accuracy requirements and architectures. Our spreading kernels match or exceed the single-GPU throughput of the state-of-the-art CUDA-based NUFFT library cuFINUFFT at production particle densities, while our Kokkos-based implementation additionally supports AMD GPUs. Strong scaling experiments on Alps (NVIDIA GH200), JUWELS Booster (NVIDIA A100), and LUMI (AMD MI250X) demonstrate scaling up to 1024 GPUs. At scale, the distributed FFT is a significant part of the total runtime, making higher NUFFT accuracy less expensive.
We apply the method to massively parallel Particle-in-Fourier simulations of Landau damping with up to $1024^3$ Fourier modes and 8.6 billion particles on Alps, JUWELS, and LUMI, demonstrating that distributed NUFFTs enable kinetic plasma simulations at resolutions previously inaccessible to spectral particle methods.
Published: 2026-05-11T14:56:41Z
Comment: Accepted in The Platform for Advanced Scientific Computing (PASC26) conference proceedings
Authors: Paul Fischill, Andreas Adelmann, Sriramkrishnan Muralikrishnan

http://arxiv.org/abs/2605.10573v1
Title: A Riemannian quasi-Newton algorithm for optimization with Euclidean bounds
Updated: 2026-05-11T13:44:01Z
Abstract: We propose a Riemannian limited-memory BFGS method for optimization problems with Euclidean bounds. The method combines a limited-memory quasi-Newton update in the tangent space with a Riemannian adaptation of the generalized Cauchy point strategy from classical L-BFGS-B, enabling efficient handling of Euclidean bounds while exploiting the geometric structure of the optimization domain. This setting is important in several applications, including covariance matrix estimation with bounded variance, neuroimaging, EEG signal classification, and other signal processing or computer-vision tasks that couple manifold variables with constrained Euclidean parameters.
We provide a generic algorithmic framework and an implementation of the algorithm in the Manopt.jl library. Numerical experiments on benchmark problems indicate only minor reduction in performance on Euclidean problems compared to the classical L-BFGS-B method, while outperforming interior-point methods. Furthermore, the algorithm was tested on two mixed manifold and bounded Euclidean problems: amplitude-limited blind source separation with Gaussianity penalization and bounded-variance maximum likelihood common principal components analysis. The proposed method outperforms existing methods by several orders of magnitude.
Published: 2026-05-11T13:44:01Z
Authors: Mateusz Baran, Ronny Bergmann, Patryk Przybysz

http://arxiv.org/abs/2605.08793v1
Title: cuRegOT: A GPU-Accelerated Solver for Entropic-Regularized Optimal Transport
Updated: 2026-05-09T08:27:39Z
Abstract: Optimal transport (OT) has emerged as a fundamental tool in modern machine learning, yet its computational cost remains a significant bottleneck for large-scale applications. While harnessing the massive parallelism of modern GPU hardware is critical for efficiency, the de facto standard Sinkhorn algorithm, despite its ease of parallelization, often suffers from slow convergence in challenging problems. More recently, the sparse-plus-low-rank quasi-Newton method offers a balance between convergence rate and per-iteration complexity; however, its efficiency on GPUs is severely hindered by the serial nature of sparse matrix symbolic analysis and irregular memory access patterns. To bridge this gap, we present cuRegOT, a high-performance GPU solver tailored for entropic-regularized OT. We introduce a suite of algorithmic and architectural optimizations, including an amortized symbolic analysis strategy to mitigate CPU bottlenecks, an asynchronous Sinkhorn iterates generation mechanism, and a fused kernel for bandwidth-efficient gradient evaluation. These strategies are backed by rigorous theoretical guarantees ensuring algorithmic convergence.
Extensive numerical experiments demonstrate that cuRegOT achieves significant speedups over state-of-the-art GPU-based solvers across a variety of benchmark tasks.
Published: 2026-05-09T08:27:39Z
Authors: Yixuan Qiu

http://arxiv.org/abs/2605.08497v1
Title: MeTime: An R package for reproducible longitudinal metabolomics data analysis
Updated: 2026-05-08T21:23:57Z
Abstract: MeTime is an open-source R package for reproducible analysis of longitudinal metabolomics data. It builds upon a central S4 container, metime_analyser, that stores multiple datasets, associated metadata and analysis outputs, enabling unified handling of complex longitudinal studies. Analyses are constructed by piping modular functions, beginning with data transformations (mod_), followed by calculations (calc_), and optional meta-analysis (meta_), so entire workflows remain transparent and easy to modify. MeTime wraps numerous existing methods within a consistent interface, including sample and metabolite distributions, correlation and distance matrices, dimensionality reduction (PCA, UMAP, t-SNE), random forest imputation and feature selection via Boruta, eigenmetabolites and WGCNA-based clustering, conservation index analysis, regression models (linear, mixed effects, and generalized additive), and partial correlation networks. By retaining all intermediate results and provenance within the container, MeTime facilitates iterative exploration and ensures reproducible reporting via automatically generated HTML and PDF outputs. Comprehensive user guides, case studies and reference documentation accompany the package, making MeTime a versatile platform for longitudinal omics workflows.
Published: 2026-05-08T21:23:57Z
Comment: For Supplementary Information and Supplementary Data see https://hmgubox2.helmholtz-munich.de/index.php/s/MeTimeSupplement
Authors: Bharadwaj Marella, Patrick Weinisch, Lara Vehovec, Vinh Tran, Josef J Bless, Yacoub A. Njipouombe Nsangou, Gabi Kastenmueller, Matthias Arnold

http://arxiv.org/abs/2506.11277v3
Title: Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic
Updated: 2026-05-08T08:08:41Z
Abstract: Ootomo, Ozaki, and Yokota [Int. J. High Perform. Comput. Appl., 38 (2024), pp. 297-313] have proposed a strategy to recast a floating-point matrix multiplication in terms of integer matrix products. The factors A and B are split into integer slices, the product of these slices is computed exactly, and AB is approximated by accumulating these integer products in floating-point arithmetic. This technique is particularly well suited to mixed-precision matrix multiply-accumulate units with integer support, such as the NVIDIA tensor cores or the AMD matrix cores. The number of slices allows for performance-accuracy tradeoffs: more slices yield better accuracy but require more multiplications, which in turn reduce performance. We propose an inexpensive way to estimate the minimum number of multiplications needed to achieve a prescribed level of accuracy. Our error analysis shows that the algorithm may become inaccurate (or inefficient) if rows of A or columns of B are badly scaled. We perform a range of numerical experiments, both in simulation and on the latest NVIDIA GPUs, that confirm the analysis and illustrate strengths and weaknesses of the algorithm.
Published: 2025-06-12T20:33:50Z
Authors: Ahmad Abdelfattah, Jack Dongarra, Massimiliano Fasi, Mantas Mikaitis, Françoise Tisseur

http://arxiv.org/abs/2603.14926v2
Title: Acceleration of multi-component multiple-precision arithmetic with branch-free algorithms and SIMD vectorization
Updated: 2026-05-07T03:57:53Z
Abstract: Multiple-precision floating-point branch-free algorithms can significantly accelerate multi-component arithmetic implemented by combining hardware-based binary64 and binary32, particularly for triple- and quadruple-precision computations.
In this study, we obtained benchmark results on x86 and ARM CPU platforms to quantify the accelerations achieved in linear computations and polynomial evaluation by integrating these algorithms.
Published: 2026-03-16T07:29:54Z
Authors: Tomonori Kouya

http://arxiv.org/abs/2605.05395v1
Title: Differentiable Parameter Optimization for DAEs with State-Dependent Events
Updated: 2026-05-06T19:27:24Z
Abstract: Differential-algebraic equations (DAEs) with state-dependent events arise in systems whose continuous dynamics are constrained by algebraic equations and interrupted by mode changes, switching logic, impacts, or state reinitializations. Gradient-based parameter learning for such systems is challenging because algebraic variables are implicitly defined, event times depend on the parameters, and reset maps introduce discontinuities. This paper studies differentiable parameter optimization for semi-explicit DAEs with events. We formulate the learning problem as a constrained least-squares problem with DAE dynamics, algebraic constraints, guard equations, and reset maps. We then develop two complementary gradient-computation strategies. The first is an automatic-differentiation-through-simulation method that solves algebraic variables inside the vector field, differentiates the algebraic solve using the implicit function theorem, and handles events through segmented differentiable integration. The second is an explicit discrete-adjoint method that represents the forward simulation as an event-split residual system and computes gradients by solving for the Lagrange multipliers of smooth-segment and event residuals. The formulation clarifies that residual terms in the adjoint method are equality constraints, not heuristic penalties. We compare the two approaches in terms of gradient interpretation, event-time handling, implementation complexity, and local validity.
Both methods provide gradients for the event path selected by the forward simulation and are valid under fixed event ordering and transversal guard crossings.
Published: 2026-05-06T19:27:24Z
Authors: Ion Matei, Maksym Zhenirovskyy, Anthony Wong

http://arxiv.org/abs/2605.05099v1
Title: Randompack: Cross-Platform Reproducible Random Number Generation and Distribution Sampling
Updated: 2026-05-06T16:35:08Z
Abstract: A C library for random number generation, Randompack, is presented. The library implements several modern random number generators (engines), including xoshiro256, PCG64, Philox, ranlux++, and sfc64; 14 continuous distributions including uniform, normal, exponential, gamma, beta, and multivariate normal; raw bit streams, bounded integers, permutations, and sampling without replacement. The engine and the distribution layers are separated so any engine can be used with any distribution. Benchmarks show that Randompack is faster overall than competing libraries, with speedup factors ranging from about 1 to 15 depending on engine, distribution, interface, and platform. A distinguishing feature is reproducibility: with the same seeds Randompack gives compatible results across programming languages, computers, CPU architectures, and compilers. The library includes comprehensive support for parallel simulation. It is accompanied by a comprehensive test suite, benchmarking programs, and example programs. Interfaces to Fortran, Python, Julia, and R have been implemented; their benchmark results are included, although their design and implementation are otherwise outside the scope of the article.
Unlike other available C libraries with comparable scope, Randompack is permissively licensed under the MIT license, and it is open source and publicly available through GitHub and conda-forge.
Published: 2026-05-06T16:35:08Z
Comment: 19 pages
Authors: Kristján Jónasson

http://arxiv.org/abs/2605.04629v1
Title: CombOL: a Library for Practical Enumeration and Boltzmann Sampling of Combinatorial Classes
Updated: 2026-05-06T08:16:38Z
Abstract: We present CombOL (Combinatorial Objects Library), an open-source library for the enumeration and Boltzmann sampling of combinatorial classes. Classes can be specified by a concise string syntax, and may depend on an arbitrary number of parameters. CombOL automatically derives the associated generating functions, enabling the generation of counting sequences and the compilation of Boltzmann samplers. The library supports exact and approximate-size Boltzmann rejection sampling with automatic parameter tuning to target specific sizes. In addition to implementing established methods, CombOL contributes a novel early-rejection scheme, as well as guaranteed statistical correctness by dynamically increasing the numerical precision, eliminating bias due to floating-point rounding errors. Through the Python interface, sampled structures can be mapped to application-specific objects, enabling direct sampling of domain objects such as graphs, chemical structure representations, or other complex data types. CombOL is available from PyPI as 'combol' (pypi.org/project/combol). The source code is available at gitlab.com/casbjorn/combol.
Published: 2026-05-06T08:16:38Z
Comment: 10 pages, 2 figures. Submitted to ICMS (International Congress on Mathematical Software) 2026
Authors: Casper Asbjørn Eriksen, Daniel Merkle
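
Note on the CombOL entry: the approximate-size Boltzmann rejection sampling mentioned in that abstract can be illustrated on a single hand-picked class. The sketch below is plain standard-library Python and does not use or imitate the CombOL API; the function names (tree_gf, tune_parameter, sample_word, sample_approx_size) and the choice of class are assumptions made for illustration only. The class is binary trees T = Z + Z*T*T, where size counts all nodes, with generating function T(x) = (1 - sqrt(1 - 4x^2)) / (2x) and expected Boltzmann size 1 / sqrt(1 - 4x^2). The size cap used here is the ordinary "stop once the upper bound is exceeded" trick, not the early-rejection scheme the paper contributes, and ordinary double precision is used, so the precision guarantees described in the abstract are not reproduced.

```python
import math
import random


def tree_gf(x):
    """Generating function T(x) of binary trees T = Z + Z*T*T, for 0 < x <= 1/2."""
    return (1.0 - math.sqrt(1.0 - 4.0 * x * x)) / (2.0 * x)


def tune_parameter(target_size):
    """Choose x so that the expected Boltzmann size equals target_size (> 1).

    For this class E[size] = 1 / sqrt(1 - 4 x^2), which inverts in closed form.
    """
    return 0.5 * math.sqrt(1.0 - 1.0 / target_size ** 2)


def sample_word(x, rng, cap):
    """Draw one tree as a preorder word of 'L' (leaf) and 'N' (internal node).

    Returns None as soon as the tree is known to have more than cap nodes.
    """
    p_leaf = x / tree_gf(x)  # Boltzmann probability that a node is a leaf
    pending = 1              # nodes that still have to be generated
    word = []
    while pending:
        if len(word) == cap:
            return None      # at least one more node is due: too large
        pending -= 1
        if rng.random() < p_leaf:
            word.append("L")
        else:
            word.append("N")
            pending += 2     # an internal node has exactly two children
    return "".join(word)


def sample_approx_size(target_size, tolerance, rng):
    """Rejection sampling: retry until the size lies within the tolerance window."""
    lo = math.ceil(target_size * (1.0 - tolerance))
    hi = math.floor(target_size * (1.0 + tolerance))
    x = tune_parameter(target_size)
    while True:
        word = sample_word(x, rng, cap=hi)
        if word is not None and len(word) >= lo:
            return word


if __name__ == "__main__":
    tree = sample_approx_size(101, 0.2, random.Random(0))
    print(len(tree), tree[:40])
```

Because every tree of a given size has the same probability under the Boltzmann model, accepting only sizes inside the window keeps the sample uniform among trees of whatever size is returned. The preorder word determines the tree uniquely, so it can be mapped to an application object afterwards.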