http://arxiv.org/abs/2508.17493v1
Easy Acceleration with Distributed Arrays
Updated: 2025-08-24T19:05:52Z
High level programming languages and GPU accelerators are powerful enablers for a wide range of applications. Achieving scalable vertical (within a compute node), horizontal (across compute nodes), and temporal (over different generations of hardware) performance while retaining productivity requires effective abstractions. Distributed arrays are one such abstraction that enables high level programming to achieve highly scalable performance. Distributed arrays achieve this performance by deriving parallelism from data locality, which naturally leads to high memory bandwidth efficiency. This paper explores distributed array performance using the STREAM memory bandwidth benchmark on a variety of hardware. Scalable performance is demonstrated within and across CPU cores, CPU nodes, and GPU nodes. Horizontal scaling across multiple nodes was linear. The hardware used spans decades and allows a direct comparison of hardware improvements for memory bandwidth over this time range, showing a 10x increase in CPU core bandwidth over 20 years, 100x increase in CPU node bandwidth over 20 years, and 5x increase in GPU node bandwidth over 5 years.
Running on hundreds of MIT SuperCloud nodes simultaneously achieved a sustained bandwidth $>$1 PB/s.
Published: 2025-08-24T19:05:52Z
Comments: 8 pages, 4 figures, 2 tables, 2 algorithm listings, 2 code listings, to appear in IEEE HPEC 2025
Authors: Jeremy Kepner, Chansup Byun, LaToya Anderson, William Arcand, David Bestor, William Bergeron, Alex Bonn, Daniel Burrill, Vijay Gadepally, Ryan Haney, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Piotr Luszczek, Lauren Milechin, Guillermo Morales, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee, Peter Michaleas
DOI: 10.1109/HPEC67600.2025.11196478

http://arxiv.org/abs/2508.15951v1
A User Manual for cuHALLaR: A GPU Accelerated Low-Rank Semidefinite Programming Solver
Updated: 2025-08-21T20:45:01Z
We present a Julia-based interface to the precompiled HALLaR and cuHALLaR binaries for large-scale semidefinite programs (SDPs). Both solvers are established as fast and numerically stable, and accept problem data in formats compatible with SDPA and a new enhanced data format taking advantage of Hybrid Sparse Low-Rank (HSLR) structure. The interface allows users to load custom data files, configure solver options, and execute experiments directly from Julia. A collection of example problems is included, including the SDP relaxations of the Matrix Completion and Maximum Stable Set problems.
Published: 2025-08-21T20:45:01Z
Authors: Jacob Aguirre, Diego Cifuentes, Vincent Guigues, Renato D. C. Monteiro, Victor Hugo Nascimento, Arnesh Sujanani

http://arxiv.org/abs/2508.13867v1
OpenLB-UQ: An Uncertainty Quantification Framework for Incompressible Fluid Flow Simulations
Updated: 2025-08-19T14:32:10Z
Uncertainty quantification (UQ) is crucial in computational fluid dynamics to assess the reliability and robustness of simulations, given the uncertainties in input parameters. OpenLB is an open-source lattice Boltzmann method library designed for efficient and extensible simulations of complex fluid dynamics on high-performance computers.
In this work, we leverage the efficiency of OpenLB for large-scale flow sampling with a dedicated and integrated UQ module. To this end, we focus on non-intrusive stochastic collocation methods based on generalized polynomial chaos and Monte Carlo sampling. The OpenLB-UQ framework is extensively validated in convergence tests with respect to statistical metrics and sample efficiency using selected benchmark cases, including two-dimensional Taylor--Green vortex flows with up to four-dimensional uncertainty and a flow past a cylinder. Our results confirm the expected convergence rates and show promising scalability, demonstrating robust statistical accuracy as well as computational efficiency. OpenLB-UQ enhances the capability of the OpenLB library, offering researchers a scalable framework for UQ in incompressible fluid flow simulations and beyond.
Published: 2025-08-19T14:32:10Z
Authors: Mingliang Zhong, Adrian Kummerländer, Shota Ito, Mathias J. Krause, Martin Frank, Stephan Simonis

http://arxiv.org/abs/2508.13656v1
AutoMPC: A Code Generator for MPC-based Automated Driving
Updated: 2025-08-19T09:04:43Z
Model Predictive Control (MPC) is a powerful technique to control nonlinear, multi-input multi-output systems subject to input and state constraints. It is now a standard tool for trajectory tracking control of automated vehicles. As such it has been used in many research and development projects. However, MPC faces several challenges to be integrated into industrial production vehicles. The most important ones are its high computational demands and the complexity of implementation. The software package AutoMPC aims to address both of these challenges. It builds on a robustified version of an active set algorithm for Nonlinear MPC. The algorithm is embedded into a framework for vehicle trajectory tracking, which makes it easy to use, yet highly customizable. Automatic code generation transforms the selections into a standalone, computationally efficient C-code file with static memory allocation.
As such it can be readily deployed on a wide range of embedded platforms, e.g., based on Matlab/Simulink or Robot Operating System (ROS). Compared to a previous version of the code, the vehicle model and the numerical integration method can be manually specified, besides basic algorithm parameters. All of this information and all specifications are directly baked into the generated C-code. The algorithm is suitable for driving scenarios at low or high speeds, even drifting, and supports direction changes. Multiple simulation scenarios show the versatility and effectiveness of the AutoMPC code, with the guarantee of a feasible solution, a high degree of robustness, and computational efficiency.
Published: 2025-08-19T09:04:43Z
Comments: Technical Documentation
Authors: Georg Schildbach, Jasper Pflughaupt

http://arxiv.org/abs/2508.13615v1
PennyLane-Lightning MPI: A massively scalable quantum circuit simulator based on distributed computing in CPU clusters
Updated: 2025-08-19T08:23:16Z
Quantum circuit simulations play a critical role in bridging the gap between theoretical quantum algorithms and their practical realization on physical quantum hardware, yet they face computational challenges due to the exponential growth of quantum state spaces with increasing qubit size. This work presents PennyLane-Lightning MPI, an MPI-based extension of the PennyLane-Lightning suite, developed to enable scalable quantum circuit simulations through parallelization of quantum state vectors and gate operations across distributed-memory systems. The core of this implementation is an index-dependent, gate-specific parallelization strategy, which fully exploits the characteristics of individual gates as well as the locality of computation associated with qubit indices in partitioned state vectors.
Benchmarking tests with single gates and well-designed quantum circuits show that the present method offers advantages in performance over general methods based on unitary matrix operations and exhibits excellent scalability, supporting simulations of up to 41 qubits with hundreds of thousands of parallel processes. Being equipped with a Python plug-in for seamless integration into the PennyLane framework, this work contributes to extending the PennyLane ecosystem by enabling high-performance quantum simulations in standard multi-core CPU clusters with no library-specific requirements, providing a back-end resource for the cloud-based service framework of quantum computing that is under development in the Republic of Korea.
Published: 2025-08-19T08:23:16Z
Comments: 22 pages, 6 figures, 1 listing
Authors: Ji-Hoon Kang, Hoon Ryu

http://arxiv.org/abs/2407.05261v2
Disciplined Geodesically Convex Programming
Updated: 2025-08-19T03:00:14Z
Convex programming plays a fundamental role in machine learning, data science, and engineering. Testing convexity structure in nonlinear programs relies on verifying the convexity of objectives and constraints. Grant et al. (2006) introduced a framework, Disciplined Convex Programming (DCP), for automating this verification task for a wide range of convex functions that can be decomposed into basic convex functions (atoms) using convexity-preserving compositions and transformations (rules). Here, we extend this framework to functions defined on manifolds with non-positive curvature (Hadamard manifolds) by introducing Disciplined Geodesically Convex Programming (DGCP). In particular, this allows for verifying a broader range of convexity notions. For instance, many notable instances of statistical estimators and matrix-valued (sub)routines in machine learning applications are Euclidean non-convex, but exhibit geodesic convexity through a more general Riemannian lens.
To define the DGCP framework, we determine convexity-preserving compositions and transformations for geodesically convex functions on general Hadamard manifolds, as well as for the special case of symmetric positive definite matrices, a common setting in matrix-valued optimization. For the latter, we also define a basic set of atoms. Our paper is accompanied by a Julia package SymbolicAnalysis.jl, which provides functionality for testing and certifying DGCP-compliant expressions. Our library interfaces with manifold optimization software, which allows for directly solving verified geodesically convex programs.
Published: 2024-07-07T05:13:51Z
Authors: Andrew Cheng, Vaibhav Dixit, Melanie Weber

http://arxiv.org/abs/2503.03156v3
Dimensionality reduction for homological stability and global structure preservation
Updated: 2025-08-17T20:20:22Z
We propose a new dimensionality reduction toolkit designed to address some of the challenges faced by traditional methods like UMAP and tSNE such as loss of global structure and computational efficiency. Built on the JAX framework, DiRe leverages modern hardware acceleration to provide an efficient, scalable, and interpretable solution for visualizing complex data structures, and for quantitative analysis of lower-dimensional embeddings. The toolkit shows considerable promise in preserving both local and global structures within the data as compared to state-of-the-art UMAP and tSNE implementations.
This makes it suitable for a wide range of applications in machine learning, bio-informatics, and data science.
Published: 2025-03-05T03:56:01Z
Comments: 22 pages, 12 figures. GitHub repository available at https://github.com/sashakolpakov/dire-jax; package available on PyPI at https://pypi.org/project/dire-jax/
Authors: Alexander Kolpakov, Igor Rivin

http://arxiv.org/abs/2410.21050v2
Matrix-by-matrix multiplication algorithm with $O(N^2\log_2 N)$ computational complexity for variable precision arithmetic
Updated: 2025-08-17T14:59:09Z
We show that, assuming the availability of a processor with variable precision arithmetic, we can compute matrix-by-matrix multiplications in $O(N^2\log_2 N)$ computational complexity. We replace the standard matrix-by-matrix multiplication $\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22}\end{bmatrix}\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22}\end{bmatrix}=\begin{bmatrix} A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22} \\ A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22}\end{bmatrix}$ by $\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22}\end{bmatrix}\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22}\end{bmatrix}=\Bigl\lfloor\begin{bmatrix} (A_{11}+\epsilon A_{12})(B_{11}+\tfrac{1}{\epsilon}B_{21}) & (A_{11}+\epsilon A_{12})(B_{12}+\tfrac{1}{\epsilon}B_{22}) \\ (A_{21}+\epsilon A_{22})(B_{11}+\tfrac{1}{\epsilon}B_{21}) & (A_{21}+\epsilon A_{22})(B_{12}+\tfrac{1}{\epsilon}B_{22})\end{bmatrix} \Bigr\rfloor \% \frac{1}{\epsilon}$ where $\lfloor \rfloor$ denotes the floor operator and $\%$ denotes the modulo operator. We reduce the number of block matrix-by-matrix multiplications from 8 to 4, keeping the number of additions equal to 4, and additionally introducing 4 multiplications of block matrices by $\epsilon$ or $\frac{1}{\epsilon}$, and 4 floor and 4 modulo operations. The resulting computational complexity for two matrices of size $N\times N$ can be estimated from the recursive equation $T(N)=4(N/2)^2$ (multiplication of a matrix by $\epsilon$ and $1/\epsilon$) plus $4(N/2)^2$ (additions of two matrices) plus $2N^2$ (floor and modulo) plus $4T(N/2)$ (four recursive calls) as $O(N^2\log_2 N)$.
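The floor/modulo identity in this abstract can be checked by hand on a tiny case. The following is a minimal sketch, not the authors' MATLAB code: it applies one level of the formula with scalar blocks and uses Python's exact `Fraction` type in place of variable precision arithmetic. The function name `eps_product_2x2` is hypothetical, and the conditions on the entries and on $\epsilon$ listed in the docstring are assumptions of this sketch that the abstract does not state.

```python
from fractions import Fraction
from math import floor


def eps_product_2x2(A, B, eps):
    """One level of the floor/modulo formula with scalar blocks.

    Assumptions of this sketch (not stated in the abstract): entries are
    non-negative integers, 1/eps is an integer larger than every entry of
    the product A*B, and eps * A[i][1] * B[0][j] < 1 for all i, j.
    """
    inv = 1 / eps  # exact, since eps is a Fraction
    # Four scaled additions: A_i1 + eps*A_i2 and B_1j + (1/eps)*B_2j.
    left = [A[i][0] + eps * A[i][1] for i in range(2)]
    right = [B[0][j] + inv * B[1][j] for j in range(2)]
    # Four products, each followed by a floor and a modulo by 1/eps.
    return [[int(floor(left[i] * right[j]) % inv) for j in range(2)]
            for i in range(2)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = eps_product_2x2(A, B, Fraction(1, 1000))
print(C)  # [[19, 22], [43, 50]], the ordinary product of A and B
```

For the top-left entry, $(1+2\epsilon)(5+7/\epsilon) = 7019.01$ with $\epsilon = 1/1000$: the floor discards the $\epsilon$-scaled cross term and the modulo discards the $1/\epsilon$-scaled one, leaving $1\cdot 5 + 2\cdot 7 = 19$.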
These multiplications of matrix blocks by a number scale like $O((N/2)^2)$. We also present a MATLAB code using the \emph{vpa} variable precision arithmetic emulator that can multiply matrices of size $N\times N$ using $(4\log_2 N+1)N^2$ vpa operations. This emulator uses $O(N)$ digits to run our algorithm.
Published: 2024-10-28T14:06:12Z
Comments: 20 pages, 2 tables, 1 figure
Authors: Maciej Paszyński

http://arxiv.org/abs/1811.04035v2
A Search for Good Pseudo-random Number Generators: Survey and Empirical Studies
Updated: 2025-08-17T14:44:16Z
This paper searches for so-called \emph{good} generators by briefly surveying the generators developed over the history of pseudo-random number generators (PRNGs), verifying their claims, and ranking them based on strong empirical tests on the same platform. To do this, the PRNGs developed so far are explored and classified into three groups -- linear congruential generator based, linear feedback shift register based and cellular automata based. From each group, the well-known, widely used generators that are claimed to be `\emph{good}' are chosen. Overall, $30$ PRNGs are selected in this way, on which two types of empirical testing are performed -- blind statistical tests (the Diehard battery of tests, the \emph{rabbit} battery of the TestU01 library, and the NIST statistical test suite) and graphical tests (lattice test and space-time diagram test). Finally, the selected PRNGs are divided into $24$ groups and are ranked according to their overall performance in all empirical tests.
Published: 2018-11-03T07:32:23Z
Comments: Copyright of this paper is the property of Elsevier. Please cite this paper as: Kamalika Bhattacharjee and Sukanta Das.
A search for good pseudo-random number generators: Survey and empirical studies, Computer Science Review, 45: 100471, 2022, ISSN 1574-0137, https://doi.org/10.1016/j.cosrev.2022.100471
Journal reference: Computer Science Review Volume 45, August 2022, 100471
Authors: Kamalika Bhattacharjee, Sukanta Das
DOI: 10.1016/j.cosrev.2022.100471

http://arxiv.org/abs/2509.21321v1
QUBOLite: A lightweight Python toolkit for QUBO
Updated: 2025-08-15T13:11:36Z
We present QUBOLite, a Python package for the creation, manipulation, analysis, and solution of Quadratic Unconstrained Binary Optimization (QUBO) instances. Built as a thin wrapper around NumPy arrays, QUBOLite combines efficient numerical operations with high-level abstractions for tasks ranging from instance generation to preprocessing, analysis, and solving strategies, both exact and approximate. The package includes implementations of the QPRO+ algorithm by Glover et al. for identifying strong persistencies, dynamic range reduction heuristics, and an expressive system for partial assignments (clamping) enabling implicit variable assignment. The package is available on GitHub and the official Python package repository.
Published: 2025-08-15T13:11:36Z
Authors: Sascha Mücke, Thore Gerlach, Nico Piatkowski, Lukas Theißinger

http://arxiv.org/abs/2508.10125v1
Concepts for Composing Finite Element Function Space Bases
Updated: 2025-08-13T18:40:07Z
Finite Element discretizations of coupled multi-physics partial differential equation models require the handling of composed function spaces. In this paper we discuss software concepts and abstractions to handle the composition of function spaces, based on a representation of product spaces as trees of simpler bases.
From this description, many different numberings of degrees of freedom by multi-indices can be derived in a natural way. This allows the function spaces to be adapted to very different data layouts, which makes it possible to use the finite element code directly with very different linear algebra codes, data structures, and algebraic solvers.
A recurring example throughout the paper is the stationary Stokes equation with Taylor--Hood elements as these are naturally formulated as product spaces and highlight why different storage patterns are desirable.
In the second half of the paper we discuss a particular realization of most of these concepts in the dune-functions module, as part of the DUNE ecosystem.
Published: 2025-08-13T18:40:07Z
Comments: arXiv admin note: substantial text overlap with arXiv:1806.09545
Authors: Christian Engwer, Carsten Gräser, Steffen Müthing, Simon Praetorius, Oliver Sander

http://arxiv.org/abs/2501.00279v4
Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement
Updated: 2025-08-13T06:13:36Z
BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic computing capabilities and are highly suited for BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex codes or legacy codes, even for BLAS-heavy applications. While various tools exist to automatically offload BLAS to GPUs, they are often impractical due to the high costs associated with mandatory data transfers. The advent of unified memory architectures in recent GPU designs, such as the NVIDIA Grace-Hopper, allows cache-coherent memory access across all types of memory for both CPU and GPU, potentially eliminating the bottlenecks faced in conventional architectures. This breakthrough paves the way for innovative application developments and porting strategies. Building on our preliminary work demonstrating the potential of automatic *gemm offload, this paper extends the framework to all level-3 BLAS operations and introduces SCILIB-Accel, a novel tool for automatic BLAS offload. SCILIB-Accel leverages the memory coherency in Grace-Hopper and introduces a Device First-Use data movement policy inspired by the OpenMP First-Touch approach in multi-socket CPU programming, minimizing CPU-GPU data transfers for typical scientific computing codes.
Additionally, utilizing dynamic binary instrumentation, the tool intercepts BLAS symbols directly from a CPU binary, requiring no code modifications or recompilation. SCILIB-Accel has been evaluated using multiple quantum physics codes on up to a few hundred GPU nodes, yielding promising speedups. Notably, for the LSMS method in the MuST suite, a 3x speedup was achieved on Grace-Hopper compared to Grace-Grace.
Published: 2024-12-31T05:24:30Z
Authors: Junjie Li

http://arxiv.org/abs/2507.21932v2
Large-Scale Linear Energy System Optimization: A Systematic Review on Parallelization Strategies via Decomposition
Updated: 2025-08-08T16:40:05Z
As renewable energy integration, sector coupling, and spatiotemporal detail increase, energy system optimization models grow in size and complexity, often pushing solvers to their performance limits. This systematic review explores parallelization strategies that can address these challenges. We first propose a classification scheme for linear energy system optimization models, covering their analytical focus, mathematical structure, and scope. We then review parallel decomposition methods, finding that while many offer performance benefits, no single approach is universally superior. The lack of standardized benchmark suites further complicates comparison. To address this, we recommend essential criteria for future benchmarks and minimum reporting standards. We also survey available software tools for parallel decomposition, including modular frameworks and algorithmic abstractions.
Though centered on energy system models, our insights extend to the broader operations research field.
Published: 2025-07-29T15:47:16Z
Comments: 25 pages, 4 figures, 6 tables
Authors: Lars Hadidi (Forschungszentrum Jülich GmbH), Leonard Göke (ETH Zurich), Maximilian Hoffmann (Forschungszentrum Jülich GmbH), Mario Klostermeier (RPTU Kaiserslautern-Landau), Shima Sasanpour (DLR German Aerospace Center), Tim Varelmann (Bluebird Optimization), Vassilios Yfantis (RPTU Kaiserslautern-Landau), Jochen Linßen (Forschungszentrum Jülich GmbH), Detlef Stolten (Forschungszentrum Jülich GmbH; RWTH Aachen University), Jann M. Weinand (Forschungszentrum Jülich GmbH)

http://arxiv.org/abs/2508.06339v1
Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
Updated: 2025-08-08T14:14:13Z
This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has increased even more in large-scale machine learning pipelines, including large language models (LLMs), where it enables low-rank adaptation (LoRA). The implemented algorithm is based on the classic two-stage QR reduction, consisting of successive matrix reduction to band form and bidiagonal form. Our implementation leverages Julia's multiple dispatch and metaprogramming capabilities, integrating with the GPUArrays and KernelAbstractions frameworks to provide a unified type and hardware-agnostic function. It supports diverse GPU architectures and data types, and is, to our knowledge, the first GPU-accelerated singular value implementation to support Apple Metal GPUs and half precision.
Performance results on multiple GPU backends and data types demonstrate that portability does not require sacrificing performance: the unified function outperforms most linear algebra libraries (MAGMA, SLATE, rocSOLVER, oneMKL) for matrix sizes larger than 1024x1024, and achieves 80%-90% of the performance of cuSOLVER for large matrices.
Published: 2025-08-08T14:14:13Z
Comments: 12 pages, 6 figures, 4 tables
Authors: Evelyne Ringoot, Rabab Alomairy, Valentin Churavy, Alan Edelman
DOI: 10.1145/3754598.3754667

http://arxiv.org/abs/2508.05371v1
Adding complex numbers to expression template algorithmic differentiation tools
Updated: 2025-08-07T13:16:29Z
Operator overloading algorithmic differentiation (AD) tools are usually only developed for floating-point values. Algorithmic optimizations for, e.g., linear system solvers or matrix-matrix multiplications are often introduced via external functions or manual function specializations. Complex numbers can be viewed as aggregates of two floating-point values on which specialized operations are applied. Typically, these operations can be handled by the regular floating-point operations from the AD tool. Nevertheless, adding the complex number operations to the expression template framework of modern operator overloading AD tools has several benefits. The internal computations of a complex number operation are hidden, and the complex operations do not decompose into single operations. This leads to a smaller memory footprint of the recorded tape and faster gradient computation times. We will discuss these problems, analyze how complex numbers can be integrated into modern operator overloading AD tools, demonstrate an implementation in CoDiPack, and show performance results on a synthetic test case.
Published: 2025-08-07T13:16:29Z
Comments: 18 pages, 3 figures
Authors: Max Sagebaum, Nicolas R. Gauger