http://arxiv.org/abs/2309.05331v3 Integrating Odeint Time Stepping into OpenFPM for Distributed and GPU Accelerated Numerical Solvers 2025-10-02T13:28:31Z

We present a software implementation integrating the time-integration library Odeint from Boost with the OpenFPM framework for scalable scientific computing. This enables compact and scalable codes for multi-stage, multi-step, and adaptive explicit time integration on distributed-memory parallel computers and on Graphics Processing Units (GPUs). The present implementation is based on extending OpenFPM's metaprogramming system to Odeint data types. This makes the time-integration methods from Odeint available in a concise template-expression language for numerical simulations distributed and parallelized using OpenFPM. We benchmark the present software for exponential and sigmoidal dynamics and present application examples to the 3D Gray-Scott reaction-diffusion problem and the "dam break" problem from fluid mechanics. We find a strong-scaling efficiency of 80% on up to 512 CPU cores and a five-fold speedup on a single GPU.

2023-09-11T09:26:37Z Abhinav Singh Landfried Kraatz Serhii Yaskovets Pietro Incardona Ivo F. Sbalzarini http://arxiv.org/abs/2510.01785v1 cuHPX: GPU-Accelerated Differentiable Spherical Harmonic Transforms on HEALPix Grids 2025-10-02T08:22:58Z

HEALPix (Hierarchical Equal Area isoLatitude Pixelization) is a widely adopted spherical grid system in astrophysics, cosmology, and Earth sciences. Its equal-area, iso-latitude structure makes it particularly well-suited for large-scale data analysis on the sphere. However, implementing high-performance spherical harmonic transforms (SHTs) on HEALPix grids remains challenging due to irregular pixel geometry, latitude-dependent alignments, and the demands for high-resolution transforms at scale. In this work, we present cuHPX, an optimized CUDA library that provides functionality for spherical harmonic analysis and related utilities on HEALPix grids. Beyond delivering substantial performance improvements, cuHPX ensures high numerical accuracy, analytic gradients for integration with deep learning frameworks, out-of-core memory-efficient optimization, and flexible regridding between HEALPix, equiangular, and other common spherical grid formats. Through evaluation, we show that cuHPX achieves rapid spectral convergence and delivers over 20 times speedup compared to existing libraries, while maintaining numerical consistency. By combining accuracy, scalability, and differentiability, cuHPX enables a broad range of applications in climate science, astrophysics, and machine learning, effectively bridging optimized GPU kernels with scientific workflows.

2025-10-02T08:22:58Z Xiaopo Cheng Akshay Subramaniam Shixun Wu Noah Brenowitz http://arxiv.org/abs/2510.01495v1 Improving Runtime Performance of Tensor Computations using Rust From Python 2025-10-01T22:14:17Z

In this work, we investigate improving the runtime performance of key computational kernels in the Python Tensor Toolbox (pyttb), a package for analyzing tensor data across a wide variety of applications. Recent runtime performance improvements have been demonstrated using Rust, a compiled language, from Python via extension modules leveraging the Python C API -- e.g., web applications, data parsing, data validation, etc. Using this same approach, we study the runtime performance of key tensor kernels of increasing complexity, from simple kernels involving sums of products over data accessed through single and nested loops to more advanced tensor multiplication kernels that are key in low-rank tensor decomposition and tensor regression algorithms. In numerical experiments involving synthetically generated tensor data of various sizes and these tensor kernels, we demonstrate consistent improvements in runtime performance when using Rust from Python over 1) using Python alone, 2) using Python and the Numba just-in-time Python compiler (for loop-based kernels), and 3) using the NumPy Python package for scientific computing (for pyttb kernels).

2025-10-01T22:14:17Z 12 pages, 4 figures Kimmie Harding Daniel M. Dunlavy http://arxiv.org/abs/2510.00649v1 Provably Optimal Quantum Circuits with Mixed-Integer Programming 2025-10-01T08:25:43Z

We present a depth-aware optimization framework for quantum circuit compilation that unifies provable optimality with scalable heuristics. For exact synthesis of a target unitary, we formulate a mixed-integer linear program (MILP) that linearly handles global-phase equivalence and uses explicit parallel scheduling variables to certify depth-optimal solutions for small-to-medium circuits. Domain-specific valid constraints, including identity ordering, commuting-gate pruning, short-sequence redundancy cuts, and Hermitian-conjugate linkages, significantly accelerate branch-and-bound, yielding speedups up to 43x on standard benchmarks. The framework supports hardware-aware objectives for both fault-tolerant (e.g. T-count) and NISQ-era (e.g. entangling gates) devices. For approximate synthesis, we propose three objectives: (i) exact, but non-convex, phase-invariant fidelity maximization; (ii) a linear surrogate that maximizes the real trace overlap, yielding a tight lower bound to fidelity; and (iii) a convex quadratic function that minimizes the circuit's Frobenius error. To scale beyond exact MILP, we propose a novel rolling-horizon optimization (RHO) that rolls primarily in time, caps the number of active qubits, and enforces per-qubit closure while globally optimizing windowed segments. This preserves local context, reduces the Hilbert-space dimension, and enables iterative improvements without ancillas. On a 142-gate seed circuit, RHO yields 116 gates, an 18.3% reduction from the seed, while avoiding the trade-off between myopic passes and long run times. Empirically, our exact compilation framework achieves certified depth-optimal circuits on standard targets, high-fidelity Fibonacci-anyon weaves, and a 36% gate-count reduction on multi-body parity circuits. All methods are in the open-source QuantumCircuitOpt, providing a single framework that bridges exact certification and scalable synthesis.

2025-10-01T08:25:43Z Harsha Nagarajan Zsolt Szabó http://arxiv.org/abs/2509.24089v1 Systematic Alias Sampling: an efficient and low-variance way to sample from a discrete distribution 2025-09-28T21:50:18Z

In this paper we combine the Alias method with the concept of systematic sampling, a method commonly used in particle filters for efficient low-variance resampling. The proposed method allows very fast sampling from a discrete distribution: drawing k samples is up to an order of magnitude faster than binary search from the cumulative distribution function (cdf) or inversion methods used in many libraries. The produced empirical distribution function is evaluated using a modified Cramér-Von Mises goodness-of-fit statistic, showing that the method compares very favourably to multinomial sampling. As continuous distributions can often be approximated with discrete ones, the proposed method can be used as a very general way to efficiently produce random samples for particle filter proposal distributions, e.g. for motion models in robotics.
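
The combination described above can be illustrated with a minimal Python sketch. It assumes a standard Vose-style alias table and replaces the k independent uniform draws with k evenly spaced points sharing one random offset, as in systematic resampling. The function names `build_alias_table` and `systematic_alias_sample` are hypothetical, and the sketch is not necessarily the authors' exact scheme.

```python
import random


def build_alias_table(weights):
    """Build an alias table (Vose's method) for non-negative weights."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # mean of scaled weights is 1
    prob = [0.0] * n
    alias = list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]      # bin s keeps its own mass ...
        alias[s] = l             # ... and borrows the rest from l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:      # leftovers fill their bin completely
        prob[i] = 1.0
    return prob, alias


def systematic_alias_sample(prob, alias, k, rng=random):
    """Draw k samples using evenly spaced points with one random offset."""
    n = len(prob)
    offset = rng.random()        # the only random number that is drawn
    samples = []
    for i in range(k):
        x = (offset + i) / k * n     # point in [0, n)
        j = min(int(x), n - 1)       # bin index (guard against rounding)
        frac = x - j                 # position inside the bin, in [0, 1)
        samples.append(j if frac < prob[j] else alias[j])
    return samples
```

Because the points are evenly spaced, the count of each outcome in this sketch can deviate from its expected value by only a few samples, which is the low-variance property that systematic sampling is used for.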

2025-09-28T21:50:18Z ACM Transactions on Mathematical Software, Vol. 43, No. 3, Article 18, pp. 1-17, August 2016 Ilari Vallivaara Katja Poikselkä Pauli Rikula Juha Röning 10.1145/2935745 http://arxiv.org/abs/2510.02346v1 Analyzing Computational Approaches for Differential Equations: A Study of MATLAB, Mathematica, and Maple 2025-09-27T11:07:03Z

Differential equations are fundamental to modeling dynamic systems in physics, engineering, biology, and economics. While analytical solutions are ideal, most real-world problems necessitate numerical approaches. This study conducts a detailed comparative analysis of three leading computational software packages (MATLAB, Mathematica, and Maple) in solving various differential equations, including ordinary differential equations (ODEs), partial differential equations (PDEs), and systems of differential equations. The evaluation criteria are syntax and usability (ease of implementation), solution accuracy (compared to analytical solutions), computational efficiency (execution time and resource usage), visualization capabilities (quality and flexibility of graphical outputs), and specialized features (unique tools for specific problem types). Benchmark problems are solved across all three platforms, followed by a discussion on their respective strengths, weaknesses, and ideal use cases. The paper concludes with recommendations for selecting the most suitable software based on problem requirements.

2025-09-27T11:07:03Z Arhonefe Joseph Ogethakpo Ignatius Nkonyeasua Njoseh 10.5281/zenodo.17068759 http://arxiv.org/abs/2509.20534v2 Efficient Symbolic Computation via Hash Consing 2025-09-26T01:51:02Z

Symbolic computation systems suffer from memory inefficiencies due to redundant storage of structurally identical subexpressions, commonly known as expression swell, which degrades performance in both classical computer algebra and emerging AI-driven mathematical reasoning tools. In this paper, we present the first integration of hash consing into JuliaSymbolics, a high-performance symbolic toolkit in Julia, by employing a global weak-reference hash table that canonicalizes expressions and eliminates duplication. This approach reduces memory consumption and accelerates key operations such as differentiation, simplification, and code generation, while seamlessly integrating with Julia's metaprogramming and just-in-time compilation infrastructure. Benchmark evaluations across different computational domains reveal substantial improvements: symbolic computations are accelerated by up to 3.2 times, memory usage is reduced by up to 2 times, code generation is up to 5 times faster, function compilation up to 10 times faster, and numerical evaluation up to 100 times faster for larger models. While certain workloads with fewer duplicate unknown-variable expressions show more modest gains or even slight overhead in initial computation stages, downstream processing consistently benefits significantly. These findings underscore the importance of hash consing in scaling symbolic computation and pave the way for future work integrating hash consing with e-graphs for enhanced equivalence-aware expression sharing in AI-driven pipelines.
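
The core idea of hash consing with a global weak-reference table can be sketched in a few lines of Python. This is an illustration of the general technique only, not the JuliaSymbolics implementation; the class `Expr` and its constructor `make` are hypothetical names.

```python
import weakref


class Expr:
    """Immutable expression node. Build instances via Expr.make only."""

    __slots__ = ("op", "args", "__weakref__")

    # Global table mapping (op, args) to the canonical node. Values are
    # held weakly, so unused expressions can still be garbage collected.
    _table = weakref.WeakValueDictionary()

    def __init__(self, op, args):
        self.op = op
        self.args = args

    @classmethod
    def make(cls, op, *args):
        # Children are already canonical, so their default identity-based
        # hash and equality are enough to detect structural duplicates.
        key = (op, args)
        node = cls._table.get(key)
        if node is None:
            node = cls(op, args)
            cls._table[key] = node
        return node
```

With this sketch, building `x + y` twice, for example `Expr.make("+", Expr.make("sym", "x"), Expr.make("sym", "y"))`, returns the same object both times, so structurally identical subexpressions are stored once and equality checks reduce to pointer comparisons.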

2025-09-24T20:06:56Z Bowen Zhu Aayush Sabharwal Songchen Tan Yingbo Ma Alan Edelman Christopher Rackauckas http://arxiv.org/abs/2509.21037v1 Utilizing Sparsity in the GPU-accelerated Assembly of Schur Complement Matrices in Domain Decomposition Methods 2025-09-25T11:44:05Z

Schur complement matrices emerge in many domain decomposition methods that can solve complex engineering problems using supercomputers. Today, as most of the high-performance clusters' performance lies in GPUs, these methods should also be accelerated. Typically, the offloaded components are the explicitly assembled dense Schur complement matrices used later in the iterative solver for multiplication with a vector. As the explicit assembly is expensive, it represents a significant overhead associated with this approach to acceleration. It has already been shown that the overhead can be minimized by assembling the Schur complements directly on the GPU. This paper shows that the GPU assembly can be further improved by wisely utilizing the sparsity of the input matrices. In the context of FETI methods, we achieved a speedup of 5.1 in the GPU section of the code and 3.3 for the whole assembly, making the acceleration beneficial from as few as 10 iterations.

2025-09-25T11:44:05Z 12 pages (originally 10 pages without references), 10 figures, submitted to SC25 conference Jakub Homola Ondřej Meca Lubomír Říha Tomáš Brzobohatý 10.1145/3712285.3759904 http://arxiv.org/abs/2508.00441v3 DGEMM without FP64 Arithmetic - Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme 2025-09-25T11:27:31Z

As the demand for AI computation rapidly increases, more hardware is being developed to efficiently perform the low-precision matrix multiplications required by such workloads. However, these operations are generally not directly applicable to scientific computations due to accuracy requirements. The Ozaki scheme - an accurate matrix multiplication method proposed by Ozaki et al. in 2012 - enables FP64 matrix multiplication (DGEMM) using low-precision matrix multiplication units, such as FP16 Tensor Cores. This approach has since been extended to utilize integer arithmetic, offering lower computational cost compared to floating-point-based implementations. In fact, it has achieved higher performance than hardware FP64 operations on GPUs equipped with fast INT8 Tensor Cores designed for AI workloads. However, recent trends in AI-oriented processors have shifted toward improving the performance of low-precision floating-point operations, such as FP8, rather than integer operations. Motivated by this shift, this study revisits the use of low-precision floating-point operations in the Ozaki scheme. Specifically, we explore the use of FP8 Tensor Cores. In addition, for processors that support very slow or no hardware-based FP64 operations, we also consider FP64 arithmetic emulation based on integer arithmetic. This completely eliminates hardware FP64 instructions. Furthermore, we explore the use of blocking in the inner-product dimension to accelerate FP16-based implementations. We demonstrate the effectiveness of these methods by evaluating the performance on an NVIDIA RTX Blackwell architecture GPU.

2025-08-01T08:58:00Z Daichi Mukunoki http://arxiv.org/abs/2509.20776v1 Distributed-memory Algorithms for Sparse Matrix Permutation, Extraction, and Assignment 2025-09-25T06:00:42Z

We present scalable distributed-memory algorithms for sparse matrix permutation, extraction, and assignment. Our methods follow an Identify-Exchange-Build (IEB) strategy where each process identifies the local nonzeros to be sent, exchanges the required data, and then builds its local submatrix from the received elements. This approach reduces communication compared to SpGEMM-based methods in distributed memory. By employing synchronization-free multithreaded algorithms, we further accelerate local computations, achieving substantially better performance than existing libraries such as CombBLAS and PETSc. We design efficient software for these operations and evaluate their performance on two university clusters and the Perlmutter supercomputer. Our experiments span a variety of application scenarios, including matrix permutation for load balancing, matrix reordering, subgraph extraction, and streaming graph applications. In all cases, we compare our algorithms against CombBLAS, the most comprehensive distributed library for these operations, and, in some scenarios, against PETSc. Overall, this work provides a comprehensive study of algorithms, software implementations, experimental evaluations, and applications for sparse matrix permutation, extraction, and assignment.

2025-09-25T06:00:42Z 32 pages Elaheh Hassani Md Taufique Hussain Ariful Azad http://arxiv.org/abs/2509.18829v1 Piecewise: Flexible piecewise functions for fast integral transforms in Julia 2025-09-23T09:15:36Z

A piecewise function of a real variable x returns a value computed from a rule that can be different in each interval of the values of x. The Julia module Piecewise provides an implementation of piecewise functions, where the user is free to choose the rules. A mechanism allows for fitting a piecewise function made of user-defined formulas to a real function of a real variable. With appropriately chosen formulas, various integral transforms of the piecewise function become directly available without relying on quadratures. The module Piecewise defines seven formulas that enable the fast calculation of the moments of the piecewise function. The module PiecewiseHilbert supplements these formulas with methods enabling a fast Hilbert transform. The module PiecewiseLorentz extends some of these formulas to enable what we call a Lorentz transform.
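
To illustrate why suitable formulas make integral transforms available without quadrature, here is a minimal Python sketch restricted to polynomial pieces, whose moments have a closed form. It does not reproduce the API of the Julia module Piecewise; the class `PiecewisePoly` and its methods are hypothetical, and polynomial pieces are an assumption chosen for simplicity.

```python
from bisect import bisect_right


class PiecewisePoly:
    """Piecewise polynomial that is zero outside [breaks[0], breaks[-1])."""

    def __init__(self, breaks, coeffs):
        # breaks: sorted points x0 < x1 < ... < xm
        # coeffs[i]: polynomial coefficients (lowest order first)
        #            valid on the interval [x_i, x_{i+1})
        if len(breaks) != len(coeffs) + 1:
            raise ValueError("need exactly one coefficient list per interval")
        self.breaks = list(breaks)
        self.coeffs = [list(c) for c in coeffs]

    def __call__(self, x):
        if not (self.breaks[0] <= x < self.breaks[-1]):
            return 0.0
        i = bisect_right(self.breaks, x) - 1  # interval containing x
        return sum(c * x**k for k, c in enumerate(self.coeffs[i]))

    def moment(self, n):
        """Integral of x**n * f(x) over the support, in closed form."""
        total = 0.0
        for a, b, cs in zip(self.breaks, self.breaks[1:], self.coeffs):
            for k, c in enumerate(cs):
                p = n + k + 1
                # Integral of c * x**(n + k) from a to b.
                total += c * (b**p - a**p) / p
        return total
```

For example, the triangle function `PiecewisePoly([-1, 0, 1], [[1, 1], [1, -1]])` has zeroth moment 1 and second moment 1/6, both obtained from the antiderivative of each piece rather than from numerical integration.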

2025-09-23T09:15:36Z Journal of Open Source Software 10, 8329 (2025) Christophe Berthod 10.21105/joss.08329 http://arxiv.org/abs/2508.11385v2 Bandicoot: A Templated C++ Library for GPU Linear Algebra 2025-09-22T11:36:58Z

We introduce the Bandicoot C++ library for linear algebra and scientific computing on GPUs, overviewing its user interface and performance characteristics, as well as the technical details of its internal design. Bandicoot is the GPU-enabled counterpart to the well-known Armadillo C++ linear algebra library, aiming to allow users to take advantage of GPU-accelerated computation for their existing codebases without significant changes. Exploiting similar internal template meta-programming techniques that Armadillo uses, Bandicoot is able to provide compile-time optimisation of mathematical expressions within user code, leading to more efficient execution. Empirical evaluations show that Bandicoot can provide significant speedups over Armadillo-based CPU-only computation. Bandicoot is available at https://coot.sourceforge.io and is distributed as open-source software under the permissive Apache 2.0 license.

2025-08-15T10:37:26Z extended and revised version of arXiv:2308.03120 Ryan R. Curtin Marcus Edel Conrad Sanderson http://arxiv.org/abs/2509.16081v1 Software Development Aspects of Integrating Linear Algebra Libraries 2025-09-19T15:27:45Z

Many scientific discoveries are made through, or aided by, the use of simulation software. These sophisticated software applications are not built from the ground up; instead, they rely on smaller parts for specific use cases, usually from domains unfamiliar to the application scientists. The software library Ginkgo is one of these building blocks to handle sparse numerical linear algebra on different platforms. By using Ginkgo, applications are able to ease the transition to modern systems, and speed up their simulations through faster numerical linear algebra routines. This paper discusses the challenges and benefits for application software in adopting Ginkgo. It will present examples from different domains, such as CFD, power grid simulation, and electro-cardiophysiology. For these cases, the impact of the integrations on the application code is discussed from a software engineering standpoint, and in particular, the approaches taken by Ginkgo and the applications to enable sustainable software development are highlighted.

2025-09-19T15:27:45Z 16 pages, 2 figures Marcel Koch Tobias Ribizel Pratik Nayak Fritz Göbel Gregor Olenik Terry Cojean http://arxiv.org/abs/2509.14211v1 Julia GraphBLAS with Nonblocking Execution 2025-09-17T17:41:17Z

From the beginning, the GraphBLAS were designed for "nonblocking execution"; i.e., calls to GraphBLAS methods return as soon as the arguments to the methods are validated and define a directed acyclic graph (DAG) of GraphBLAS operations. This lets GraphBLAS implementations fuse functions, elide unneeded objects, exploit parallelism, and apply any additional DAG-preserving transformations. GraphBLAS implementations exist that utilize nonblocking execution but with limited scope. In this paper, we describe our work to implement GraphBLAS with support for aggressive nonblocking execution. We show how features of the Julia programming language greatly simplify implementation of nonblocking execution. This is work-in-progress sufficient to show the potential for nonblocking execution and is limited to GraphBLAS methods required to support PageRank.
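
The nonblocking model can be illustrated with a small Python sketch in which each call only records a node of the DAG and work happens when a result is requested. This is a toy illustration of deferred execution, not the GraphBLAS API or the Julia implementation described here; `Lazy`, `apply`, `ewise_add`, and `reduce_sum` are hypothetical names, and no fusion or parallelism is attempted.

```python
class Lazy:
    """A node in a DAG of deferred operations."""

    def __init__(self, fn, *deps):
        self.fn = fn          # operation to run once inputs are available
        self.deps = deps      # inputs: Lazy nodes or plain values
        self._value = None
        self._done = False

    def wait(self):
        """Force evaluation of this node and everything it depends on."""
        if not self._done:
            args = [d.wait() if isinstance(d, Lazy) else d for d in self.deps]
            self._value = self.fn(*args)
            self._done = True  # memoize so shared nodes run only once
        return self._value


def apply(f, v):
    """Deferred elementwise application of f to vector v."""
    return Lazy(lambda xs: [f(x) for x in xs], v)


def ewise_add(a, b):
    """Deferred elementwise addition of two vectors."""
    return Lazy(lambda xs, ys: [x + y for x, y in zip(xs, ys)], a, b)


def reduce_sum(v):
    """Deferred reduction of a vector to a scalar."""
    return Lazy(sum, v)
```

In this sketch, `reduce_sum(ewise_add(v, v))` returns immediately with a handle, and nothing is computed until `wait()` is called on it. A real implementation would use the recorded DAG to fuse operations or skip intermediate objects before executing.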

2025-09-17T17:41:17Z Pascal Costanza Timothy G. Mattson Raye Kimmerer Benjamin Brock 10.1109/HPEC67600.2025.11196654 http://arxiv.org/abs/2508.16592v2 Performance measurements of modern Fortran MPI applications with Score-P 2025-09-17T11:10:48Z

Version 3.0 of the Message-Passing Interface (MPI) standard, released in 2012, introduced a new set of language bindings for Fortran 2008. By making use of modern language features and the enhanced interoperability with C, there was finally a type-safe and standard-conforming method to call MPI from Fortran. This highly recommended use mpi_f08 language binding has since then been widely adopted among developers of modern Fortran applications. However, tool support for the F08 bindings is still lacking almost a decade later, forcing users to fall back on the less safe and less convenient interfaces. Full support for the F08 bindings was added to the performance measurement infrastructure Score-P by implementing MPI wrappers in Fortran. Wrappers cover the latest MPI standard version 4.1 in its entirety, matching the features of the C wrappers. By implementing the wrappers in modern Fortran, we can provide full support for MPI procedures passing attributes, info objects, or callbacks. The implementation is regularly tested under the MPICH test suite. The new F08 wrappers were already used by two fluid dynamics simulation codes -- Neko, a spectral finite-element code derived from Nek5000, and EPIC (Elliptical Parcel-In-Cell) -- to successfully generate performance measurements. In this work, we additionally present our design considerations and sketch out the implementation, discussing the challenges we faced in the process. The key component of the implementation is a code generator that produces approximately 50k lines of MPI wrapper code to be used by Score-P, relying on the Python pympistandard module to provide programmatic access to the extracted data from the MPI standard.

2025-08-08T11:34:20Z Gregor Corbin