https://arxiv.org/api/b0dIu1iAomI/TDXg7cYDxNoW1uM 2026-06-22T00:35:57Z 2664 270 15 http://arxiv.org/abs/2508.17493v1 Easy Acceleration with Distributed Arrays 2025-08-24T19:05:52Z High level programming languages and GPU accelerators are powerful enablers for a wide range of applications. Achieving scalable vertical (within a compute node), horizontal (across compute nodes), and temporal (over different generations of hardware) performance while retaining productivity requires effective abstractions. Distributed arrays are one such abstraction that enables high level programming to achieve highly scalable performance. Distributed arrays achieve this performance by deriving parallelism from data locality, which naturally leads to high memory bandwidth efficiency. This paper explores distributed array performance using the STREAM memory bandwidth benchmark on a variety of hardware. Scalable performance is demonstrated within and across CPU cores, CPU nodes, and GPU nodes. Horizontal scaling across multiple nodes was linear. The hardware used spans decades and allows a direct comparison of hardware improvements for memory bandwidth over this time range; showing a 10x increase in CPU core bandwidth over 20 years, 100x increase in CPU node bandwidth over 20 years, and 5x increase in GPU node bandwidth over 5 years. Running on hundreds of MIT SuperCloud nodes simultaneously achieved a sustained bandwidth $>$1 PB/s. 
2025-08-24T19:05:52Z 8 pages, 4 figures, 2 tables, 2 algorithm listings, 2 code listings, to appear in IEEE HPEC 2025 Jeremy Kepner Chansup Byun LaToya Anderson William Arcand David Bestor William Bergeron Alex Bonn Daniel Burrill Vijay Gadepally Ryan Haney Michael Houle Matthew Hubbell Hayden Jananthan Michael Jones Piotr Luszczek Lauren Milechin Guillermo Morales Julie Mullen Andrew Prout Albert Reuther Antonio Rosa Charles Yee Peter Michaleas 10.1109/HPEC67600.2025.11196478 http://arxiv.org/abs/2508.15951v1 A User Manual for cuHALLaR: A GPU Accelerated Low-Rank Semidefinite Programming Solver 2025-08-21T20:45:01Z We present a Julia-based interface to the precompiled HALLaR and cuHALLaR binaries for large-scale semidefinite programs (SDPs). Both solvers are established as fast and numerically stable, and accept problem data in formats compatible with SDPA and a new enhanced data format taking advantage of Hybrid Sparse Low-Rank (HSLR) structure. The interface allows users to load custom data files, configure solver options, and execute experiments directly from Julia. A collection of example problems is included, including the SDP relaxations of the Matrix Completion and Maximum Stable Set problems. 2025-08-21T20:45:01Z Jacob Aguirre Diego Cifuentes Vincent Guigues Renato D. C. Monteiro Victor Hugo Nascimento Arnesh Sujanani http://arxiv.org/abs/2508.13867v1 OpenLB-UQ: An Uncertainty Quantification Framework for Incompressible Fluid Flow Simulations 2025-08-19T14:32:10Z Uncertainty quantification (UQ) is crucial in computational fluid dynamics to assess the reliability and robustness of simulations, given the uncertainties in input parameters. OpenLB is an open-source lattice Boltzmann method library designed for efficient and extensible simulations of complex fluid dynamics on high-performance computers. In this work, we leverage the efficiency of OpenLB for large-scale flow sampling with a dedicated and integrated UQ module. 
To this end, we focus on non-intrusive stochastic collocation methods based on generalized polynomial chaos and Monte Carlo sampling. The OpenLB-UQ framework is extensively validated in convergence tests with respect to statistical metrics and sample efficiency using selected benchmark cases, including two-dimensional Taylor--Green vortex flows with up to four-dimensional uncertainty and a flow past a cylinder. Our results confirm the expected convergence rates and show promising scalability, demonstrating robust statistical accuracy as well as computational efficiency. OpenLB-UQ enhances the capability of the OpenLB library, offering researchers a scalable framework for UQ in incompressible fluid flow simulations and beyond. 2025-08-19T14:32:10Z Mingliang Zhong Adrian Kummerländer Shota Ito Mathias J. Krause Martin Frank Stephan Simonis http://arxiv.org/abs/2508.13656v1 AutoMPC: A Code Generator for MPC-based Automated Driving 2025-08-19T09:04:43Z Model Predictive Control (MPC) is a powerful technique to control nonlinear, multi-input multi-output systems subject to input and state constraints. It is now a standard tool for trajectory tracking control of automated vehicles. As such it has been used in many research and development projects. However, MPC faces several challenges to be integrated into industrial production vehicles. The most important ones are its high computational demands and the complexity of implementation. The software package AutoMPC aims to address both of these challenges. It builds on a robustified version of an active set algorithm for Nonlinear MPC. The algorithm is embedded into a framework for vehicle trajectory tracking, which makes it easy to use, yet highly customizable. Automatic code generation transforms the selections into a standalone, computationally efficient C-code file with static memory allocation. 
As such it can be readily deployed on a wide range of embedded platforms, e.g., based on Matlab/Simulink or Robot Operating System (ROS). Compared to a previous version of the code, the vehicle model and the numerical integration method can be manually specified, besides basic algorithm parameters. All of this information and all specifications are directly baked into the generated C-code. The algorithm is suitable for driving scenarios at low or high speeds, even drifting, and supports direction changes. Multiple simulation scenarios show the versatility and effectiveness of the AutoMPC code, with the guarantee of a feasible solution, a high degree of robustness, and computational efficiency. 2025-08-19T09:04:43Z Technical Documentation Georg Schildbach Jasper Pflughaupt http://arxiv.org/abs/2508.13615v1 PennyLane-Lightning MPI: A massively scalable quantum circuit simulator based on distributed computing in CPU clusters 2025-08-19T08:23:16Z Quantum circuit simulations play a critical role in bridging the gap between theoretical quantum algorithms and their practical realization on physical quantum hardware, yet they face computational challenges due to the exponential growth of quantum state spaces with increasing qubit size. This work presents PennyLane-Lightning MPI, an MPI-based extension of the PennyLane-Lightning suite, developed to enable scalable quantum circuit simulations through parallelization of quantum state vectors and gate operations across distributed-memory systems. The core of this implementation is an index-dependent, gate-specific parallelization strategy, which fully exploits the characteristics of individual gates as well as the locality of computation associated with qubit indices in partitioned state vectors. 
Benchmarking tests with single gates and well-designed quantum circuits show that the present method offers advantages in performance over general methods based on unitary matrix operations and exhibits excellent scalability, supporting simulations of up to 41 qubits with hundreds of thousands of parallel processes. Being equipped with a Python plug-in for seamless integration into the PennyLane framework, this work contributes to extending the PennyLane ecosystem by enabling high-performance quantum simulations in standard multi-core CPU clusters with no library-specific requirements, providing a back-end resource for the cloud-based service framework of quantum computing that is under development in the Republic of Korea. 2025-08-19T08:23:16Z 22 pages, 6 figures, 1 listing Ji-Hoon Kang Hoon Ryu http://arxiv.org/abs/2407.05261v2 Disciplined Geodesically Convex Programming 2025-08-19T03:00:14Z Convex programming plays a fundamental role in machine learning, data science, and engineering. Testing convexity structure in nonlinear programs relies on verifying the convexity of objectives and constraints. Grant et al. (2006) introduced a framework, Disciplined Convex Programming (DCP), for automating this verification task for a wide range of convex functions that can be decomposed into basic convex functions (atoms) using convexity-preserving compositions and transformations (rules). Here, we extend this framework to functions defined on manifolds with non-positive curvature (Hadamard manifolds) by introducing Disciplined Geodesically Convex Programming (DGCP). In particular, this allows for verifying a broader range of convexity notions. For instance, many notable instances of statistical estimators and matrix-valued (sub)routines in machine learning applications are Euclidean non-convex, but exhibit geodesic convexity through a more general Riemannian lens. 
To define the DGCP framework, we determine convexity-preserving compositions and transformations for geodesically convex functions on general Hadamard manifolds, as well as for the special case of symmetric positive definite matrices, a common setting in matrix-valued optimization. For the latter, we also define a basic set of atoms. Our paper is accompanied by a Julia package SymbolicAnalysis.jl, which provides functionality for testing and certifying DGCP-compliant expressions. Our library interfaces with manifold optimization software, which allows for directly solving verified geodesically convex programs. 2024-07-07T05:13:51Z Andrew Cheng Vaibhav Dixit Melanie Weber http://arxiv.org/abs/2503.03156v3 Dimensionality reduction for homological stability and global structure preservation 2025-08-17T20:20:22Z We propose a new dimensionality reduction toolkit designed to address some of the challenges faced by traditional methods like UMAP and tSNE such as loss of global structure and computational efficiency. Built on the JAX framework, DiRe leverages modern hardware acceleration to provide an efficient, scalable, and interpretable solution for visualizing complex data structures, and for quantitative analysis of lower-dimensional embeddings. The toolkit shows considerable promise in preserving both local and global structures within the data as compared to state-of-the-art UMAP and tSNE implementations. This makes it suitable for a wide range of applications in machine learning, bio-informatics, and data science. 
2025-03-05T03:56:01Z 22 pages, 12 figures Github repository available at https://github.com/sashakolpakov/dire-jax Package available on PyPi https://pypi.org/project/dire-jax/ Alexander Kolpakov Igor Rivin http://arxiv.org/abs/2410.21050v2 Matrix-by-matrix multiplication algorithm with $O(N^2log_2N)$ computational complexity for variable precision arithmetic 2025-08-17T14:59:09Z We show that, assuming the availability of a processor with variable precision arithmetic, we can compute matrix-by-matrix multiplications in $O(N^2log_2N)$ computational complexity. We replace the standard matrix-by-matrix multiplications $\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22}\end{bmatrix}\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22}\end{bmatrix}=\begin{bmatrix} A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22} \\ A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22}\end{bmatrix}$ by $\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22}\end{bmatrix}\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22}\end{bmatrix}=\Bigl\lfloor\begin{bmatrix} (A_{11}+εA_{12})(B_{11}+1/εB_{21}) & (A_{11}+εA_{12})(B_{12}+1/εB_{22}) \\ (A_{21}+εA_{22})(B_{11}+1/εB_{21}) &(A_{21}+εA_{22})(B_{12}+1/εB_{22})\end{bmatrix} \Bigr\rfloor \% \frac{1}ε$ where $\lfloor \rfloor$ denotes the floor, and $\%$ denotes the modulo operators. We reduce the number of block matrix-by-matrix multiplications from 8 to 4, keeping the number of additions equal to 4, and additionally introducing 4 multiplications of block matrices by $ε$ or $\frac{1}ε$, and 4 floor and 4 modulo operations. The resulting computational complexity for two matrices of size $N\times N$ can be estimated from the recursive equation $T(N)=4(N/2)^2$ (multiplication of a matrix by $ε$ and $1/ε$) plus $4(N/2)^2$ (additions of two matrices) plus $2N^2$ (floor and modulo) plus $4T(N/2)$ (four recursive calls) as $O(N^2log_2N)$. These multiplications of matrix blocks by a number scale like $O((N/2)^2)$. 
We also present a MATLAB code using \emph{vpa} variable precision arithmetic emulator that can multiply matrices of size $N\times N$ using $(4log_2N+1)N^2$ vpa operations. This emulator uses $O(N)$ digits to run our algorithm. 2024-10-28T14:06:12Z 20 pages, 2 tables, 1 figure Maciej Paszyński http://arxiv.org/abs/1811.04035v2 A Search for Good Pseudo-random Number Generators : Survey and Empirical Studies 2025-08-17T14:44:16Z This paper targets to search so-called \emph{good} generators by doing a brief survey over the generators developed in the history of pseudo-random number generators (PRNGs), verify their claims and rank them based on strong empirical tests in same platforms. To do this, the genre of PRNGs developed so far are explored and classified into three groups -- linear congruential generator based, linear feedback shift register based and cellular automata based. From each group, the well-known widely used generators which claimed themselves to be `\emph{good}' are chosen. Overall $30$ PRNGs are selected in this way on which two types of empirical testing are done -- blind statistical tests with Diehard battery of tests, battery \emph{rabbit} of TestU01 library and NIST statistical test-suite as well as graphical tests (lattice test and space-time diagram test). Finally, the selected PRNGs are divided into $24$ groups and are ranked according to their overall performance in all empirical tests. 2018-11-03T07:32:23Z Copyright of this paper is the property of Elsevier. Please cite this paper as : Kamalika Bhattacharjee and Sukanta Das. 
A search for good pseudo-random number generators: Survey and empirical studies, Computer Science Review, 45: 100471, 2022, ISSN 1574-0137, https://doi.org/10.1016/j.cosrev.2022.100471 Computer Science Review Volume 45, August 2022, 100471 Kamalika Bhattacharjee Sukanta Das 10.1016/j.cosrev.2022.100471 http://arxiv.org/abs/2509.21321v1 QUBOLite: A lightweigth Python toolkit for QUBO 2025-08-15T13:11:36Z We present QUBOLite, a Python package for the creation, manipulation, analysis, and solution of Quadratic Unconstrained Binary Optimization (QUBO) instances. Built as a thin wrapper around NumPy arrays, QUBOLite combines efficient numerical operations with high-level abstractions for tasks ranging from instance generation to preprocessing, analysis, and solving strategies, both exact and approximate. The package includes implementations of the QPRO+ algorithm by Glover et al. for identifying strong persistencies, dynamic range reduction heuristics, and an expressive system for partial assignments (clamping) enabling implicit variable assignment. The package is available on GitHub and the official Python package repository. 2025-08-15T13:11:36Z Sascha Mücke Thore Gerlach Nico Piatkowski Lukas Theißinger http://arxiv.org/abs/2508.10125v1 Concepts for Composing Finite Element Function Space Bases 2025-08-13T18:40:07Z Finite Element discretizations of coupled multi-physics partial differential equation models require the handling of composed function spaces. In this paper we discuss software concepts and abstractions to handle the composition of function spaces, based on a representation of product spaces as trees of simpler bases. 
From this description, many different numberings of degrees of freedom by multi-indices can be derived in a natural way, making it possible to adapt the function spaces to very different data layouts, so that the finite element code can be used directly with very different linear algebra codes, different data structures, and different algebraic solvers. A recurring example throughout the paper is the stationary Stokes equation with Taylor--Hood elements as these are naturally formulated as product spaces and highlight why different storage patterns are desirable. In the second half of the paper we discuss a particular realization of most of these concepts in the dune-functions module, as part of the DUNE ecosystem. 2025-08-13T18:40:07Z arXiv admin note: substantial text overlap with arXiv:1806.09545 Christian Engwer Carsten Gräser Steffen Müthing Simon Praetorius Oliver Sander http://arxiv.org/abs/2501.00279v4 Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement 2025-08-13T06:13:36Z BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic computing capabilities and are highly suited for BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex codes or legacy codes, even for BLAS-heavy applications. While various tools exist to automatically offload BLAS to GPUs, they are often impractical due to the high costs associated with mandatory data transfers. The advent of unified memory architectures in recent GPU designs, such as the NVIDIA Grace-Hopper, allows cache-coherent memory access across all types of memory for both CPU and GPU, potentially eliminating the bottlenecks faced in conventional architectures. This breakthrough paves the way for innovative application developments and porting strategies. 
Building on our preliminary work demonstrating the potential of automatic *gemm offload, this paper extends the framework to all level-3 BLAS operations and introduces SCILIB-Accel, a novel tool for automatic BLAS offload. SCILIB-Accel leverages the memory coherency in Grace-Hopper and introduces a Device First-Use data movement policy inspired by the OpenMP First-Touch approach in multi-socket CPU programming, minimizing CPU-GPU data transfers for typical scientific computing codes. Additionally, utilizing dynamic binary instrumentation, the tool intercepts BLAS symbols directly from a CPU binary, requiring no code modifications or recompilation. SCILIB-Accel has been evaluated using multiple quantum physics codes on up to a few hundred GPU nodes, yielding promising speedups. Notably, for the LSMS method in the MuST suite, a 3x speedup was achieved on Grace-Hopper compared to Grace-Grace. 2024-12-31T05:24:30Z Junjie Li http://arxiv.org/abs/2507.21932v2 Large-Scale Linear Energy System Optimization: A Systematic Review on Parallelization Strategies via Decomposition 2025-08-08T16:40:05Z As renewable energy integration, sector coupling, and spatiotemporal detail increase, energy system optimization models grow in size and complexity, often pushing solvers to their performance limits. This systematic review explores parallelization strategies that can address these challenges. We first propose a classification scheme for linear energy system optimization models, covering their analytical focus, mathematical structure, and scope. We then review parallel decomposition methods, finding that while many offer performance benefits, no single approach is universally superior. The lack of standardized benchmark suites further complicates comparison. To address this, we recommend essential criteria for future benchmarks and minimum reporting standards. 
We also survey available software tools for parallel decomposition, including modular frameworks and algorithmic abstractions. Though centered on energy system models, our insights extend to the broader operations research field. 2025-07-29T15:47:16Z 25 pages, 4 figures, 6 tables Lars Hadidi Forschungszentrum Jülich GmbH Leonard Göke ETH Zurich Maximilian Hoffmann Forschungszentrum Jülich GmbH Mario Klostermeier RPTU Kaiserslautern-Landau Shima Sasanpour DLR German Aerospace Center Tim Varelmann Bluebird Optimization Vassilios Yfantis RPTU Kaiserslautern-Landau Jochen Linßen Forschungszentrum Jülich GmbH Detlef Stolten Forschungszentrum Jülich GmbH RWTH Aachen University Jann M. Weinand Forschungszentrum Jülich GmbH http://arxiv.org/abs/2508.06339v1 Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision 2025-08-08T14:14:13Z This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has increased even more in large-scale machine learning pipelines, including large language models (LLMs), where it enables low-rank adaptation (LoRA). The implemented algorithm is based on the classic two-stage QR reduction, consisting of successive matrix reduction to band form and bidiagonal form. Our implementation leverages Julia's multiple dispatch and metaprogramming capabilities, integrating with the GPUArrays and KernelAbstractions frameworks to provide a unified type and hardware-agnostic function. It supports diverse GPU architectures and data types, and is, to our knowledge, the first GPU-accelerated singular value implementation to support Apple Metal GPUs and half precision. 
Performance results on multiple GPU backends and data types demonstrate that portability does not require sacrificing performance: the unified function outperforms most linear algebra libraries (MAGMA, SLATE, rocSOLVER, oneMKL) for matrix sizes larger than 1024x1024, and achieves 80%-90% of the performance of cuSOLVER for large matrices. 2025-08-08T14:14:13Z 12 pages, 6 figures, 4 tables Evelyne Ringoot Rabab Alomairy Valentin Churavy Alan Edelman 10.1145/3754598.3754667 http://arxiv.org/abs/2508.05371v1 Adding complex numbers to expression template algorithmic differentiation tools 2025-08-07T13:16:29Z Operator overloading algorithmic differentiation (AD) tools are usually only developed for floating-point values. Algorithmic optimization for, e.g., linear systems solvers or matrix-matrix multiplications are often introduced via external functions or manual function specializations. Complex numbers can be viewed as aggregates of two floating-point values on which specialized operations are applied. Typically, these operations can be handled by the regular floating-point operations from the AD tool. Nevertheless, adding the complex number operations to the expression template framework of modern operator overloading AD tools has several benefits. The internal computations of a complex number operation are hidden, and the complex operations do not decompose into single operations. This leads to a smaller memory footprint of the recorded tape and faster gradient computation times. We will discuss these problems, analyze how complex numbers can be integrated into modern operator overloading AD tools, demonstrate an implementation in CoDiPack, and show performance results on a synthetic test case. 2025-08-07T13:16:29Z 18 pages, 3 figures Max Sagebaum Nicolas R. Gauger