http://arxiv.org/abs/2508.05020v1 Task-Based Programming for Adaptive Mesh Refinement in Compressible Flow Simulations 2025-08-07T04:14:42Z

High-order solvers for compressible flows are vital in scientific applications. Adaptive mesh refinement (AMR) is a key technique for reducing computational cost by concentrating resolution in regions of interest. In this work, we develop an AMR-based numerical solver using Regent, a high-level programming language for the Legion programming model. We address several challenges associated with implementing AMR in Regent. These include dynamic data structures for patch refinement/coarsening, mesh validity enforcement, and reducing task launch overhead via task fusion. Experimental results show that task fusion achieves an 18x speedup, while automated GPU kernel generation via simple annotations yields a 9.7x speedup for the targeted kernel. We demonstrate our approach through simulations of two canonical compressible flow problems governed by the Euler equations.

2025-08-07T04:14:42Z Anjiang Wei Hang Song Mert Hidayetoglu Elliott Slaughter Sanjiva K. Lele Alex Aiken http://arxiv.org/abs/2508.04077v1 The Ubiquitous Sparse Matrix-Matrix Products 2025-08-06T04:26:52Z

Multiplication of a sparse matrix with another (dense or sparse) matrix is a fundamental operation that captures the computational patterns of many data science applications, including but not limited to graph algorithms, sparsely connected neural networks, graph neural networks, clustering, and many-to-many comparisons of biological sequencing data. In many application scenarios, the matrix multiplication takes place on an arbitrary algebraic semiring where the scalar operations are overloaded with user-defined functions with certain properties, or on a more general heterogeneous algebra where even the domains of the input matrices can be different. Here, we provide a unifying treatment of the sparse matrix-matrix operation and its rich application space including machine learning, computational biology and chemistry, graph algorithms, and scientific computing.
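To make the semiring idea concrete, here is a minimal sketch of a sparse matrix-matrix product in which the scalar "add" and "multiply" are passed in as functions. The dict-of-dicts storage and the names `spgemm`, `add`, `mul`, and `zero` are assumptions chosen for illustration; they are not taken from the paper or from any particular library.

```python
import math

def spgemm(A, B, add, mul, zero):
    """Multiply sparse matrices A and B over a user-supplied semiring.

    Matrices are stored as dict-of-dicts: row -> {col: value}.
    `add` and `mul` are the semiring operations; `zero` is the additive
    identity (the implicit value of entries that are not stored).
    """
    C = {}
    for i, row in A.items():
        acc = {}  # accumulator for output row i
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = add(acc.get(j, zero), mul(a_ik, b_kj))
        if acc:
            C[i] = acc
    return C

# A small weighted directed graph as a sparse adjacency matrix.
G = {0: {1: 4.0, 2: 1.0}, 1: {3: 1.0}, 2: {1: 2.0}}

# (min, +) semiring: cheapest paths that use exactly two edges.
two_hop = spgemm(G, G, add=min, mul=lambda x, y: x + y, zero=math.inf)

# Ordinary (+, *) semiring: the usual matrix product.
plain = spgemm(G, G, add=lambda x, y: x + y, mul=lambda x, y: x * y, zero=0.0)
```

The same loop nest serves both products; only the scalar operations change, which is the sense in which the operations are "overloaded".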

2025-08-06T04:26:52Z Aydın Buluç http://arxiv.org/abs/2508.04740v1 MissMecha: An All-in-One Python Package for Studying Missing Data Mechanisms 2025-08-06T02:40:45Z

Incomplete data is a persistent challenge in real-world datasets, often governed by complex and unobservable missing mechanisms. Simulating missingness has become a standard approach for understanding its impact on learning and analysis. However, existing tools are fragmented, mechanism-limited, and typically focus only on numerical variables, overlooking the heterogeneous nature of real-world tabular data. We present MissMecha, an open-source Python toolkit for simulating, visualizing, and evaluating missing data under MCAR, MAR, and MNAR assumptions. MissMecha supports both numerical and categorical features, enabling mechanism-aware studies across mixed-type tabular datasets. It includes visual diagnostics, MCAR testing utilities, and type-aware imputation evaluation metrics. Designed to support data quality research, benchmarking, and education, MissMecha offers a unified platform for researchers and practitioners working with incomplete data.
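The three mechanisms differ only in what the probability of a value going missing is allowed to depend on. The sketch below illustrates that distinction with the standard library alone; it does not use the MissMecha API, and the function `make_missing`, its parameters, and the median-threshold rule are all hypothetical choices made for this example.

```python
import random

def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of `rows` (a list of (x, y) pairs) with some y set to None.

    MCAR: every y is dropped with the same probability.
    MAR:  the drop probability depends only on the always-observed x.
    MNAR: the drop probability depends on the value of y itself.
    """
    rng = random.Random(seed)
    xs = sorted(x for x, _ in rows)
    ys = sorted(y for _, y in rows)
    x_med, y_med = xs[len(xs) // 2], ys[len(ys) // 2]
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = 2 * rate if x >= x_med else 0.0
        elif mechanism == "MNAR":
            p = 2 * rate if y >= y_med else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((x, None if rng.random() < p else y))
    return out

# Toy data: x = 0..19, y is a permutation of 0..19.
data = [(i, (i * 7) % 20) for i in range(20)]
mcar = make_missing(data, "MCAR")
mar = make_missing(data, "MAR")
mnar = make_missing(data, "MNAR")
```

Under MNAR the cause of missingness is exactly the information that is lost, which is why it cannot be diagnosed from the observed data alone.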

2025-08-06T02:40:45Z Youran Zhou Mohamed Reda Bouadjenek Sunil Aryal http://arxiv.org/abs/2501.14613v2 Improved algorithms and novel applications of the FrankWolfe.jl library 2025-08-05T10:02:14Z

Frank-Wolfe (FW) algorithms have emerged as an essential class of methods for constrained optimization, especially on large-scale problems. In this paper, we summarize the algorithmic design choices and progress made in the last years of the development of FrankWolfe.jl, a Julia package gathering high-performance implementations of state-of-the-art FW variants. We review key use cases of the library in the recent literature, which match its original dual purpose: first, becoming the de-facto toolbox for practitioners applying FW methods to their problem, and second, offering a modular ecosystem to algorithm designers who experiment with their own variants and implementations of algorithmic blocks. Finally, we demonstrate the performance of several FW variants on important problem classes in several experiments, which we curated in a separate repository for continuous benchmarking.
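For readers unfamiliar with the method, the following is a minimal sketch of the vanilla Frank-Wolfe iteration on the probability simplex, written in plain Python. It is not FrankWolfe.jl code and does not reflect that package's API; the function name `frank_wolfe_simplex`, the quadratic test objective, and the open-loop step size 2/(t+2) are assumptions chosen for illustration.

```python
def frank_wolfe_simplex(grad, n, iters=200):
    """Minimize a smooth convex function over the probability simplex.

    Each step calls the linear minimization oracle (LMO), which for the
    simplex is the vertex e_i with the smallest gradient entry, and then
    moves toward that vertex with the open-loop step size 2 / (t + 2).
    No projection is needed: iterates stay feasible by construction.
    """
    x = [1.0 / n] * n
    for t in range(iters):
        g = grad(x)
        i = min(range(n), key=lambda k: g[k])  # LMO: best vertex
        gamma = 2.0 / (t + 2.0)
        x = [(1.0 - gamma) * xk for xk in x]   # convex combination ...
        x[i] += gamma                          # ... with vertex e_i
    return x

# Example objective: f(x) = ||x - b||^2, whose minimizer on the simplex is b.
b = [0.1, 0.6, 0.3]

def grad_f(x):
    return [2.0 * (xk - bk) for xk, bk in zip(x, b)]

x_fw = frank_wolfe_simplex(grad_f, n=3)
f_val = sum((xk - bk) ** 2 for xk, bk in zip(x_fw, b))
```

The LMO is the "algorithmic block" that changes with the feasible set; swapping it out is what lets the same loop handle other constraint sets.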

2025-01-24T16:32:28Z ACM Trans. Math. Softw. 51(4), Article 29, 33 pages (2025) Mathieu Besançon Sébastien Designolle Jannis Halbey Deborah Hendrych Dominik Kuzinowicz Sebastian Pokutta Hannah Troppens Daniel Viladrich Herrmannsdoerfer Elias Wirth 10.1145/3765626 http://arxiv.org/abs/2508.00269v2 chipfiring: A Python Package for Efficient Mathematical Analysis of Chip-Firing Games on Multigraphs 2025-08-04T00:39:17Z

This paper presents `chipfiring`, a comprehensive Python package for the mathematical analysis of chip-firing games on finite graphs. The package provides a robust toolkit for defining graphs and chip configurations (divisors), performing chip-firing operations, and analyzing fundamental properties such as winnability, linear equivalence, and divisor rank. We detail the core components of the library, including its object-oriented graph and divisor implementations, integrated Laplacian matrix computations, and an efficient implementation of Dhar's algorithm for determining the solvability of the dollar game. The `chipfiring` package is designed for researchers and students in graph theory, combinatorics, and algebraic geometry, providing essential algorithms and data structures for exploring these rich mathematical models. We describe the library's architecture, illustrate its usage with comprehensive examples, and highlight its specialized contributions compared to general-purpose graph libraries.
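As a rough illustration of the ideas named above, here is a small self-contained sketch of set firing and of Dhar's burning process used to compute a q-reduced divisor. It does not use the `chipfiring` package; the adjacency format (vertex -> {neighbor: edge multiplicity}) and the function names `fire`, `dhar_unburnt`, and `q_reduce` are hypothetical, and the sketch assumes the input divisor is already nonnegative at every vertex other than q.

```python
def fire(adj, D, S):
    """Fire every vertex in S once: each sends one chip along each edge."""
    D = dict(D)
    for u in S:
        for v, m in adj[u].items():
            D[u] -= m
            D[v] += m
    return D

def dhar_unburnt(adj, D, q):
    """Run Dhar's burning process from q and return the unburnt vertices.

    A vertex burns when the number of edges joining it to burnt vertices
    exceeds its chip count.
    """
    burnt = {q}
    changed = True
    while changed:
        changed = False
        for v in adj:
            if v in burnt:
                continue
            burning_edges = sum(m for u, m in adj[v].items() if u in burnt)
            if burning_edges > D[v]:
                burnt.add(v)
                changed = True
    return set(adj) - burnt

def q_reduce(adj, D, q):
    """Fire unburnt sets until the whole graph burns.

    Assumes D[v] >= 0 for every v != q. The dollar game is winnable
    exactly when the returned divisor is also nonnegative at q.
    """
    while True:
        S = dhar_unburnt(adj, D, q)
        if not S:
            return D
        D = fire(adj, D, S)

# Triangle graph with simple edges.
triangle = {"a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1}}

reduced = q_reduce(triangle, {"a": -1, "b": 2, "c": 0}, q="a")
stuck = q_reduce(triangle, {"a": -1, "b": 1, "c": 0}, q="a")
```

In the first configuration the debt at `a` is cleared after vertex `b` fires once; in the second the whole graph burns immediately while `a` is still in debt, so that configuration is not winnable.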

2025-08-01T02:29:20Z Dhyey Dharmendrakumar Mavani Tairan Ji Nathan Pflueger http://arxiv.org/abs/2508.00015v1 Extended Abstract: Partial-encapsulate and Its Support for Floating-point Operations in ACL2 2025-07-25T07:07:40Z

We illustrate the power of partial-encapsulate, showing how it is used in the implementation of floating-point operations in ACL2.

2025-07-25T07:07:40Z In Proceedings ACL2 2025, arXiv:2507.18567 EPTCS 423, 2025, pp. 56-59 Matt Kaufmann J Strother Moore 10.4204/EPTCS.423.6 http://arxiv.org/abs/2507.18268v1 Building an Accelerated OpenFOAM Proof-of-Concept Application using Modern C++ 2025-07-24T10:12:00Z

The modern trend in High-Performance Computing (HPC) involves the use of accelerators such as Graphics Processing Units (GPUs) alongside Central Processing Units (CPUs) to speed up numerical operations in various applications. Leading manufacturers such as NVIDIA, Intel, and AMD are constantly advancing these architectures, augmenting them with features such as mixed precision, enhanced memory hierarchies, and specialised accelerator silicon blocks (e.g., Tensor Cores on GPU or AMX/SME engines on CPU) to enhance compute performance. At the same time, significant efforts in software development are aimed at optimizing the use of these innovations, seeking to improve usability and accessibility. This work contributes to the state-of-the-art of OpenFOAM development by presenting a working Proof-Of-Concept application built using modern ISO C++ parallel constructs. This approach, combined with an appropriate compiler runtime stack, like the one provided by the NVIDIA HPC SDK, makes it possible to accelerate well-defined kernels, allowing multi-core execution and GPU offloading using a single codebase. The study demonstrates that it is possible to increase the performance of the OpenFOAM laplacianFoam application by offloading the computations on NVIDIA GPUs using the C++ parallel construct.

2025-07-24T10:12:00Z Giulio Malenza Giovanni Stabile Filippo Spiga Robert Birke Marco Aldinucci http://arxiv.org/abs/2507.13204v1 Performance Portable Gradient Computations Using Source Transformation 2025-07-17T15:15:25Z

Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and nonlinear solvers. Automatic differentiation (AD) is a powerful technique for evaluating such derivatives, and in recent years, has been integrated into programming environments such as JAX, PyTorch, and TensorFlow to support derivative computations needed for training of machine learning models, resulting in widespread use of these technologies. The C++ language has become the de facto standard for scientific computing due to numerous factors, yet language complexity has made the adoption of AD technologies for C++ difficult, hampering the incorporation of powerful differentiable programming approaches into C++ scientific simulations. This is exacerbated by the increasing emergence of architectures such as GPUs, which have limited memory capabilities and require massive thread-level concurrency. Portable scientific codes rely on domain-specific programming models such as Kokkos, making AD for such codes even more complex. In this paper, we will investigate source transformation-based automatic differentiation using Clad to automatically generate portable and efficient gradient computations of Kokkos-based code. We discuss the modifications of Clad required to differentiate Kokkos abstractions. We will illustrate the feasibility of our proposed strategy by comparing the wall-clock time of the generated gradient code with the wall-clock time of the input function on different cutting-edge GPU architectures such as NVIDIA H100, AMD MI250x, and Intel Ponte Vecchio GPU. For these three architectures and for the considered example, evaluating up to 10 000 entries of the gradient only took up to 2.17x the wall-clock time of evaluating the input function.

2025-07-17T15:15:25Z Kim Liegeois Brian Kelley Eric Phipps Sivasankaran Rajamanickam Vassil Vassilev http://arxiv.org/abs/2408.10743v2 Fast Algorithms and Implementations for Computing the Minimum Distance of Quantum Codes 2025-07-16T10:41:14Z

The distance of a stabilizer quantum code is a very important feature since it determines the number of errors that can be detected and corrected. We present three new fast algorithms and implementations for computing the symplectic distance of the associated classical code. Our new algorithms are based on the Brouwer-Zimmermann algorithm. Our experimental study shows that these new implementations are much faster than current state-of-the-art licensed implementations on single-core processors, multicore processors, and shared-memory multiprocessors. In the most computationally demanding cases, the performance gain in computational time can be larger than one order of magnitude. The experimental study also shows good scalability on shared-memory parallel architectures.

2024-08-20T11:24:30Z 14 pages, 7 figures, 1 table ACM Transactions on Quantum Computing, vol. 7, no. 2, June 2026, article 12 Fernando Hernando Gregorio Quintana-Ortí Markus Grassl 10.1145/3795877 http://arxiv.org/abs/2506.12629v2 The Software Landscape for the Density Matrix Renormalization Group 2025-07-09T08:10:15Z

The density matrix renormalization group (DMRG) algorithm is a cornerstone computational method for studying quantum many-body systems, renowned for its accuracy and adaptability. Despite DMRG's broad applicability across fields such as materials science, quantum chemistry, and quantum computing, numerous independent implementations have been developed. This survey maps the rapidly expanding DMRG software landscape, providing a comprehensive comparison of features among 35 existing packages. We found significant overlap in features among the packages when comparing key aspects, such as parallelism strategies for high-performance computing and symmetry-adapted formulations that enhance efficiency. This overlap suggests opportunities for modularization of common operations, including tensor operations, symmetry representations, and eigensolvers, as the packages are mostly independent and share few third-party library dependencies where functionality is factored out. More widespread modularization and standardization would result in reduced duplication of efforts and improved interoperability. We believe that the proliferation of packages and the current lack of standard interfaces and modularity are more social than technical. We aim to raise awareness of existing packages, guide researchers in finding a suitable package for their needs, and help developers identify opportunities for collaboration, modularity standardization, and optimization. Ultimately, this work emphasizes the value of greater cohesion and modularity, which would benefit DMRG software, allowing these powerful algorithms to tackle more complex and ambitious problems.

2025-06-14T21:12:16Z [v2] Added two more packages Per Sehlstedt Jan Brandejs Paolo Bientinesi Lars Karlsson 10.1016/j.cpc.2026.110136 http://arxiv.org/abs/2507.04697v1 Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation 2025-07-07T06:33:59Z

Generative AI technology based on Large Language Models (LLMs) has been developed and applied to assist with or automatically generate program code. In this paper, we evaluate the capability of existing general LLMs for Basic Linear Algebra Subprograms (BLAS) code generation for CPUs. We use two LLMs provided by OpenAI: GPT-4.1, a Generative Pre-trained Transformer (GPT) model, and o4-mini, one of the o-series of reasoning models. Both were released in April 2025. For the routines from level-1 to level-3 BLAS, we tried to generate (1) C code without optimization from the routine name only, (2) C code with basic performance optimizations (thread parallelization, SIMD vectorization, and cache blocking) from the routine name only, and (3) C code with basic performance optimizations based on Fortran reference code. As a result, we found that correct code can be generated in many cases even when only the routine name is given. We also confirmed that thread parallelization with OpenMP, SIMD vectorization, and cache blocking can be implemented to some extent, and that the code is faster than the reference code.

2025-07-07T06:33:59Z 8 pages, 6 tables Daichi Mukunoki Shun-ichiro Hayashi Tetsuya Hoshino Takahiro Katagiri http://arxiv.org/abs/2404.05563v2 Predefined Software Environment Runtimes As A Measure For Reproducibility 2025-07-03T07:52:29Z

As part of the Mathematical Research Data Initiative (MaRDI), we have developed a way to preserve a software package in an easy-to-deploy and easy-to-use sandbox environment we call a "runtime", via a program we developed called MaPS: MaRDI Packaging System. The program relies on Linux user namespaces to isolate a library environment from the host system, making the sandboxed software reproducible on other systems with minimal effort. Moreover, an overlay filesystem makes local edits persistent. This project will aid reproducibility efforts of research papers, both mathematical and from other disciplines. As a proof of concept, we provide runtimes for the OSCAR Computer Algebra System, polymake software for research in polyhedral geometry, and VIBRANT Virus Identification By iteRative ANnoTation. The software is in a prerelease state: the interface for creating, deploying, and executing runtimes is final, and an interface for easily publishing runtimes is under active development. We thus propose publishing predefined, distributable software environment runtimes along with research papers in an effort to make research with software-based results reproducible.

2024-04-08T14:36:45Z Aaruni Kaushik 10.1007/978-3-031-64529-7_26 http://arxiv.org/abs/2507.01917v1 PDE-Constrained High-Order Mesh Optimization 2025-07-02T17:26:33Z

We present a novel framework for PDE-constrained $r$-adaptivity of high-order meshes. The proposed method formulates mesh movement as an optimization problem, with an objective function defined as a convex combination of a mesh quality metric and a measure of the accuracy of the PDE solution obtained via finite element discretization. The proposed formulation achieves optimized, well-defined high-order meshes by integrating mesh quality control, PDE solution accuracy, and robust gradient regularization. We adopt the Target-Matrix Optimization Paradigm to control geometric properties across the mesh, independent of the PDE of interest. To incorporate the accuracy of the PDE solution, we introduce error measures that control the finite element discretization error. The implicit dependence of these error measures on the mesh nodal positions is accurately captured by adjoint sensitivity analysis. Additionally, a convolution-based gradient regularization strategy is used to ensure stable and effective adaptation of high-order meshes. We demonstrate that the proposed framework can improve mesh quality and reduce the error by up to 10 times for the solution of Poisson and linear elasto-static problems. The approach is general with respect to the dimensionality, the order of the mesh, the types of mesh elements, and can be applied to any PDE that admits well-defined adjoint operators.

2025-07-02T17:26:33Z 22 pages, 14 figures, 3 tables Tzanio Kolev Boyan Lazarov Ketan Mittal Mathias Schmidt Vladimir Tomov http://arxiv.org/abs/2309.00465v2 A FAIR File Format for Mathematical Software 2025-07-02T12:03:19Z

We describe a generic JSON based file format which is suitable for computations in computer algebra. This is implemented in the computer algebra system OSCAR, but we also indicate how it can be used in a different context.

2023-09-01T14:03:44Z Antony Della Vecchia Michael Joswig Benjamin Lorenz http://arxiv.org/abs/2503.17405v2 Efficiently Vectorized MCMC on Modern Accelerators 2025-07-02T11:02:59Z

With the advent of automatic vectorization tools (e.g., JAX's $\texttt{vmap}$), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem -- loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids synchronization overheads when vectorizing with tools like $\texttt{vmap}$ by using the framework of finite state machines (FSMs). Using a simplified model, we derive an exact theoretical form of the obtainable speed-ups using our approach, and use it to make principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, demonstrating speed-ups of up to an order of magnitude in experiments.
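The synchronization problem can be illustrated with a toy cost model, sketched below. This is not the paper's model or implementation: the per-sample work distribution, the function names `step_counts`, `lockstep_cost`, and `barrier_free_cost`, and the assumption that cost equals the number of inner-loop steps are all simplifications introduced here.

```python
import random

def step_counts(n_chains, n_samples, seed=0):
    """Hypothetical work per sample: inner-loop steps needed by each chain."""
    rng = random.Random(seed)
    return [[1 + int(rng.expovariate(0.5)) for _ in range(n_samples)]
            for _ in range(n_chains)]

def lockstep_cost(counts):
    """Naive vectorization: for every sample, all chains wait for the
    slowest one, so the cost is the sum over samples of the per-sample max."""
    return sum(max(per_sample) for per_sample in zip(*counts))

def barrier_free_cost(counts):
    """No per-sample barrier: each chain proceeds at its own pace, so the
    cost is set by the chain with the most total work."""
    return max(sum(chain) for chain in counts)

counts = step_counts(n_chains=64, n_samples=1000)
naive = lockstep_cost(counts)
ideal = barrier_free_cost(counts)
ratio = naive / ideal  # idealized speed-up from removing the barrier
```

Because a maximum of sums never exceeds a sum of maxima, `ideal` is at most `naive`; the gap grows with the number of chains and with the variability of the per-sample work.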

2025-03-20T16:07:14Z Hugh Dance Pierre Glaser Peter Orbanz Ryan Adams