http://arxiv.org/abs/2510.14891v1 A Performance Portable Matrix Free Dense MTTKRP in GenTen 2025-10-16T17:10:03Z

We extend the GenTen tensor decomposition package by introducing an accelerated dense matricized tensor times Khatri-Rao product (MTTKRP), the workhorse kernel for canonical polyadic (CP) tensor decompositions, that is portable and performant on modern CPU and GPU architectures. In contrast to the state-of-the-art matrix multiply based MTTKRP kernels used by Tensor Toolbox, TensorLy, etc., that explicitly form Khatri-Rao matrices, we develop a matrix-free element-wise parallelization approach whose memory cost grows with the rank R like the sum of the tensor shape O(R(n+m+k)), compared to matrix-based methods whose memory cost grows like the product of the tensor shape O(R(mnk)). For the largest problem we study, a rank 2000 MTTKRP, the smaller growth rate yields a matrix-free memory cost of just 2% of the matrix-based methods, a 50x improvement. In practice, the reduced memory impact means our matrix-free MTTKRP can compute a rank 2000 tensor decomposition on a single NVIDIA H100 instead of six H100s using a matrix-based MTTKRP. We also compare our optimized matrix-free MTTKRP to baseline matrix-free implementations on different devices, showing a 3x single-device speedup on an Intel 8480+ CPU and an 11x speedup on an H100 GPU. In addition to numerical results, we provide fine-grained performance models for an ideal multi-level cache machine, compare analytical performance predictions to empirical results, and provide a motivated heuristic for selecting an algorithmic hyperparameter.
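
The contrast between the two MTTKRP styles can be illustrated with a minimal pure-Python sketch for a dense 3-way tensor. This is not GenTen's kernel: the function names, the nested-list storage, and the row ordering `j * k + l` of the Khatri-Rao matrix are assumptions made for illustration only.

```python
def mttkrp_matrix_free(X, B, C):
    """Mode-1 MTTKRP: M[i][r] = sum over j, l of X[i][j][l] * B[j][r] * C[l][r].

    Only the factor matrices B (n x R) and C (k x R) are read; no
    intermediate Khatri-Rao matrix is stored.
    """
    m, n, k = len(X), len(X[0]), len(X[0][0])
    R = len(B[0])
    M = [[0.0] * R for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for l in range(k):
                x = X[i][j][l]
                for r in range(R):
                    M[i][r] += x * B[j][r] * C[l][r]
    return M


def mttkrp_matrix_based(X, B, C):
    """Same result, but via an explicit Khatri-Rao matrix with n * k * R entries."""
    m, n, k = len(X), len(X[0]), len(X[0][0])
    R = len(B[0])
    # Khatri-Rao product of B and C: row (j * k + l) holds B[j][r] * C[l][r].
    K = [[B[j][r] * C[l][r] for r in range(R)]
         for j in range(n) for l in range(k)]
    # Mode-1 unfolding of X with the matching column order (j * k + l).
    X1 = [[X[i][j][l] for j in range(n) for l in range(k)] for i in range(m)]
    return [[sum(X1[i][p] * K[p][r] for p in range(n * k)) for r in range(R)]
            for i in range(m)]
```

In this sketch the explicit matrix `K` holds `n * k * R` entries, while the matrix-free loop touches only the `(n + k) * R` entries of the factors. For n = k = 100 and R = 10 that is 100,000 entries versus 2,000.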

2025-10-16T17:10:03Z 10 pages, 5 figures, 4 tables, for implementation see https://github.com/sandialabs/GenTen Gabriel Kosmacher Eric T. Phipps Sivasankaran Rajamanickam http://arxiv.org/abs/2510.13536v1 Sparse Iterative Solvers Using High-Precision Arithmetic with Quasi Multi-Word Algorithms 2025-10-15T13:31:50Z

To obtain accurate results in numerical computation, high-precision arithmetic is a straightforward approach. However, most processors lack hardware support for floating-point formats beyond double precision (FP64). Double-word arithmetic (Dekker 1971) extends precision by using standard floating-point operations to represent numbers with twice the mantissa length. Building on this concept, various multi-word arithmetic methods have been proposed to further increase precision by combining additional words. Simplified variants, known as quasi algorithms, have also been introduced, which trade a certain loss of accuracy for reduced computational cost. In this study, we investigate the performance of quasi algorithms for double- and triple-word arithmetic in sparse iterative solvers based on the Conjugate Gradient method, and compare them with both non-quasi algorithms and standard FP64. We evaluate execution time on an x86 processor, the number of iterations to convergence, and solution accuracy. Although quasi algorithms require appropriate normalization to preserve accuracy - without it, convergence cannot be achieved - they can still reduce runtime when normalization is applied correctly, while maintaining accuracy comparable to full multi-word algorithms. In particular, quasi triple-word arithmetic can yield more accurate solutions without significantly increasing execution time relative to double-word arithmetic and its quasi variant. Furthermore, for certain problems, a reduction in iteration count contributes to additional speedup. Thus, quasi triple-word arithmetic can serve as a compelling alternative to conventional double-word arithmetic in sparse iterative solvers.
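
As a rough illustration of the double-word idea, the sketch below stores a value as an unevaluated sum `hi + lo` of two FP64 numbers and adds a double to it using error-free transformations. It shows a standard (non-quasi) double-word addition with renormalization, not the quasi algorithms evaluated in the paper; the function names are chosen here for illustration.

```python
from fractions import Fraction


def two_sum(a, b):
    """Error-free addition: returns (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def fast_two_sum(a, b):
    """Error-free addition that assumes |a| >= |b|; used here to renormalize."""
    s = a + b
    e = b - (s - a)
    return s, e


def dw_add_double(x, y):
    """Add the double y to the double-word number x = (hi, lo)."""
    xh, xl = x
    sh, sl = two_sum(xh, y)
    v = xl + sl
    return fast_two_sum(sh, v)  # renormalize so that hi carries the leading bits


def dw_to_fraction(x):
    """Exact rational value of a double-word number, for checking results."""
    return Fraction(x[0]) + Fraction(x[1])
```

Adding `1e-16` to `1.0` ten times in plain FP64 leaves the accumulator at `1.0`, because each increment is below half a unit in the last place. The double-word accumulator keeps the increments in its low word and ends close to `1 + 1e-15`.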

2025-10-15T13:31:50Z Daichi Mukunoki Katsuhisa Ozaki http://arxiv.org/abs/2510.13427v1 Verification Challenges in Sparse Matrix Vector Multiplication in High Performance Computing: Part I 2025-10-15T11:24:28Z

Sparse matrix vector multiplication (SpMV) is a fundamental kernel in scientific codes that rely on iterative solvers. In this first part of our work, we present both a sequential and a basic MPI-parallel implementation of SpMV, aiming to provide a challenge problem for the scientific software verification community. The implementations are described in the context of the PETSc library.
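
For orientation, a sequential SpMV over a compressed sparse row (CSR) matrix can be sketched in a few lines of plain Python. This is not PETSc code; the array names `row_ptr`, `col_idx`, and `vals` are conventional CSR names assumed here. The `spmv_row_block` helper is a hypothetical stand-in for the rows owned by one process in a row-partitioned parallel version, with no actual MPI communication.

```python
def spmv_row_block(row_ptr, col_idx, vals, x, row_start, row_end):
    """Compute y[row_start:row_end] for y = A @ x, with A stored in CSR form."""
    y = []
    for i in range(row_start, row_end):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[p] * x[col_idx[p]]
        y.append(acc)
    return y


def spmv_csr(row_ptr, col_idx, vals, x):
    """Sequential SpMV over all rows."""
    return spmv_row_block(row_ptr, col_idx, vals, x, 0, len(row_ptr) - 1)


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, vals, [1.0, 2.0, 3.0])  # [5.0, 6.0, 19.0]
```

Concatenating the results of `spmv_row_block` over disjoint row ranges reproduces the sequential result, which is the kind of equivalence a verification effort would want to establish.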

2025-10-15T11:24:28Z In Proceedings VSS 2025, arXiv:2510.12314 EPTCS 432, 2025, pp. 98-105 Junchao Zhang Argonne National Laboratory 10.4204/EPTCS.432.11 http://arxiv.org/abs/2510.13425v1 Specification and Verification for Climate Modeling: Formalization Leading to Impactful Tooling 2025-10-15T11:24:00Z

Earth System Models (ESMs) are critical for understanding past climates and projecting future scenarios. However, the complexity of these models, which include large code bases, a wide community of developers, and diverse computational platforms, poses significant challenges for software quality assurance. The increasing adoption of GPUs and heterogeneous architectures further complicates verification efforts. Traditional verification methods often rely on bitwise reproducibility, which is not always feasible, particularly under new compilers or hardware. Manual expert evaluation, on the other hand, is subjective and time-consuming. Formal methods offer a mathematically rigorous alternative, yet their application in ESM development has been limited due to the lack of climate model-specific representations and tools. Here, we advocate for the broader adoption of formal methods in climate modeling. In particular, we identify key aspects of ESMs that are well suited to formal specification and introduce abstraction approaches for a tailored framework. To demonstrate this approach, we present a case study using the CIVL model checker to formally verify a bug fix in an ocean mixing parameterization scheme. Our goal is to develop accessible, domain-specific formal tools that enhance model confidence and support more efficient and reliable ESM development.

2025-10-15T11:24:00Z In Proceedings VSS 2025, arXiv:2510.12314 EPTCS 432, 2025, pp. 60-75 Alper Altuntas Allison H. Baker John Baugh Ganesh Gopalakrishnan Stephen F. Siegel 10.4204/EPTCS.432.8 http://arxiv.org/abs/2510.13423v1 Towards Richer Challenge Problems for Scientific Computing Correctness 2025-10-15T11:23:18Z

Correctness in scientific computing (SC) is gaining increasing attention in the formal methods (FM) and programming languages (PL) community. Existing PL/FM verification techniques struggle with the complexities of realistic SC applications. Part of the problem is a lack of a common understanding between the SC and PL/FM communities of machine-verifiable correctness challenges and dimensions of correctness in SC applications. To address this gap, we call for specialized challenge problems to inform the development and evaluation of FM/PL verification techniques for correctness in SC. These specialized challenges are intended to augment existing problems studied by FM/PL researchers for general programs to ensure the needs of SC applications can be met. We propose several dimensions of correctness relevant to scientific computing, and discuss some guidelines and criteria for designing challenge problems to evaluate correctness in scientific computing.

2025-10-15T11:23:18Z In Proceedings VSS 2025, arXiv:2510.12314 EPTCS 432, 2025, pp. 19-26 Matthew Sottile Mohit Tekriwal John Sarracino 10.4204/EPTCS.432.4 http://arxiv.org/abs/2510.11753v1 An Effective Method for Solving a Class of Transcendental Diophantine Equations 2025-10-12T16:22:29Z

This paper investigates the exponential Diophantine equation of the form $a^x+b=c^y$, where $a, b, c$ are given positive integers with $a,c \ge 2$, and $x,y$ are positive integer unknowns. We define this form as a "Type-I transcendental diophantine equation." A general solution to this problem remains an open question; however, the ABC conjecture implies that the number of solutions for any such equation is finite. This work introduces and implements an effective algorithm designed to solve these equations. The method first computes a strict upper bound for potential solutions given the parameters $(a, b, c)$ and then identifies all solutions via finite enumeration. While the universal termination of this algorithm is not theoretically guaranteed, its heuristic-based design has proven effective and reliable in large-scale numerical experiments. Crucially, for each instance it successfully solves, the algorithm is capable of generating a rigorous mathematical proof of the solution's completeness.
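
The finite-enumeration step can be sketched as follows, assuming an upper bound on $x$ is already available. The bound `x_max` is a hypothetical user-supplied parameter; the sketch does not reproduce the paper's bound computation or its proof generation.

```python
def solve_bounded(a, b, c, x_max):
    """List all (x, y) with 1 <= x <= x_max, y >= 1 and a**x + b == c**y.

    Assumes a, c >= 2 and b >= 1. Python integers are arbitrary precision,
    so the comparison is exact.
    """
    solutions = []
    for x in range(1, x_max + 1):
        target = a**x + b
        y, power = 1, c
        while power < target:  # smallest power of c that reaches the target
            power *= c
            y += 1
        if power == target:
            solutions.append((x, y))
    return solutions
```

For example, `solve_bounded(2, 1, 3, 60)` finds the two solutions of $2^x + 1 = 3^y$, namely $(x, y) = (1, 1)$ and $(3, 2)$. Completeness within the bound follows from the enumeration; completeness overall requires a proven bound, which is the part the paper's method supplies.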

2025-10-12T16:22:29Z Zeyu Cai http://arxiv.org/abs/2502.14256v2 QMCPy: A Python Software for Randomized Low-Discrepancy Sequences, Quasi-Monte Carlo, and Fast Kernel Methods 2025-10-10T22:45:22Z

Low-discrepancy (LD) sequences have been extensively used as efficient experimental designs across many scientific disciplines. QMCPy (https://qmcsoftware.github.io/QMCSoftware/) is an accessible Python library which provides a unified implementation of randomized LD sequences, automatic variable transformations, adaptive Quasi-Monte Carlo error estimation algorithms, and fast kernel methods. This article focuses on recent updates to QMCPy which broaden support for randomized LD sequences and add new tools to enable fast kernel methods using LD sequences. Specifically, we give a unified description of the supported LD lattices, digital nets, and Halton point sets, along with randomization options including random permutations / shifts, linear matrix scrambling (LMS), and nested uniform scrambling (NUS). We also support higher-order digital nets, higher-order scrambling with LMS or NUS, and Halton scrambling with LMS or NUS. For fast kernel methods, we provide shift-invariant (SI) and digitally-shift-invariant (DSI) kernels, including a new set of higher-order smoothness DSI kernels. When SI and DSI kernels are respectively paired with n LD lattice and digital net points, the resulting Gram matrices permit multiplication and inversion at only O(n log n) cost. These fast operations utilize QMCPy's implementation of the fast Fourier transform in bit-reversed order (FFTBR), inverse FFTBR (IFFTBR), and fast Walsh--Hadamard transform (FWHT).
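
To make the notion of a Halton point set concrete, here is a minimal unrandomized construction in plain Python. It does not use or mirror the QMCPy API; the function names and the default bases are assumptions for illustration, and none of the scrambling options described above are included.

```python
def radical_inverse(i, base):
    """Reflect the base-`base` digits of the integer i about the radix point."""
    inv = 0.0
    f = 1.0 / base
    while i > 0:
        inv += (i % base) * f
        i //= base
        f /= base
    return inv


def halton(n, bases=(2, 3)):
    """First n points of the Halton sequence, one prime base per dimension."""
    return [[radical_inverse(i, b) for b in bases] for i in range(n)]


points = halton(4)
# First coordinates (base 2): 0.0, 0.5, 0.25, 0.75
```

Each coordinate uses a different prime base, so successive points fill the unit cube progressively more evenly than independent uniform samples typically do.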

2025-02-20T04:45:12Z 29 pages, 3 figures, submitted to ACM TOMS Aleksei G Sorokin http://arxiv.org/abs/2507.03812v2 Memory- and compute-optimized geometric multigrid GMGPolar for curvilinear coordinate representations -- Applications to fusion plasma 2025-10-10T16:08:26Z

Tokamak fusion reactors are actively studied as a means of realizing energy production from plasma fusion. However, due to the substantial cost and time required to construct fusion reactors and run physical experiments, numerical experiments are indispensable for understanding plasma physics inside tokamaks, supporting the design and engineering phase, and optimizing future reactor designs. Geometric multigrid methods are optimal solvers for many problems that arise from the discretization of partial differential equations. It has been shown that the multigrid solver GMGPolar solves the 2D gyrokinetic Poisson equation in linear complexity and with only small memory requirements compared to other state-of-the-art solvers. In this paper, we present a completely refactored and object-oriented version of GMGPolar which offers two different matrix-free implementations. Among other things, we leverage the Sherman-Morrison formula to solve cyclic tridiagonal systems from circular line solvers without additional fill-in and we apply reordering to optimize cache access of circular and radial smoothing operations. With the Give approach, memory requirements are further reduced and speedups of four to seven are obtained for usual test cases. For the Take approach, speedups of 16 to 18 can be attained. In an additional experimental setup that uses GMGPolar as a preconditioner for conjugate gradients, this speedup could even be increased to factors between 25 and 37.
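
The Sherman-Morrison treatment of a cyclic tridiagonal system can be sketched as below. This is the textbook construction, not GMGPolar's implementation: the cyclic matrix is written as a tridiagonal matrix plus a rank-one correction, so only two ordinary tridiagonal solves are needed and no fill-in occurs. The storage convention (corner entries kept in `a[0]` and `c[-1]`) and the choice `gamma = -b[0]` are assumptions of this sketch, which also requires at least three unknowns and no pivoting.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system (sub-diagonal a, diagonal b, super-diagonal c).

    a[0] and c[-1] are ignored for the solution.
    """
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_cyclic(a, b, c, d):
    """Solve A x = d, where A is tridiagonal plus corners A[0][n-1] = a[0], A[n-1][0] = c[-1]."""
    n = len(d)
    beta, alpha = a[0], c[-1]
    gamma = -b[0]
    # A = T + u v^T with u = (gamma, 0, ..., 0, alpha), v = (1, 0, ..., 0, beta / gamma).
    bb = list(b)
    bb[0] = b[0] - gamma
    bb[-1] = b[-1] - alpha * beta / gamma
    u = [0.0] * n
    u[0], u[-1] = gamma, alpha
    y = thomas(a, bb, c, d)  # T y = d
    z = thomas(a, bb, c, u)  # T z = u
    # Sherman-Morrison: x = y - z * (v . y) / (1 + v . z)
    factor = (y[0] + beta * y[-1] / gamma) / (1.0 + z[0] + beta * z[-1] / gamma)
    return [yi - factor * zi for yi, zi in zip(y, z)]
```

As a usage example, a periodic system with diagonal 4 and off-diagonals -1 on five unknowns can be solved with `solve_cyclic([-1.0] * 5, [4.0] * 5, [-1.0] * 5, rhs)`.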

2025-07-04T21:09:51Z 29 pages, 11 figures, 5 tables Julian Litz Philippe Leleux Carola Kruse Joscha Gedicke Martin J. Kühn 10.1016/j.cam.2025.117308 http://arxiv.org/abs/2510.08230v1 pyGinkgo: A Sparse Linear Algebra Operator Framework for Python 2025-10-09T13:55:51Z

Sparse linear algebra is a cornerstone of many scientific computing and machine learning applications. Python has become a popular choice for these applications due to its simplicity and ease of use. Yet high performance sparse kernels in Python remain limited in functionality, especially on modern CPU and GPU architectures. We present pyGinkgo, a lightweight and Pythonic interface to the Ginkgo library, offering high-performance sparse linear algebra support with platform portability across CUDA, HIP, and OpenMP backends. pyGinkgo bridges the gap between high-performance C++ backends and Python usability by exposing Ginkgo's capabilities via Pybind11 and a NumPy and PyTorch compatible interface. We benchmark pyGinkgo's performance against state-of-the-art Python libraries including SciPy, CuPy, PyTorch, and TensorFlow. Results across hardware from different vendors demonstrate that pyGinkgo consistently outperforms existing Python tools in both sparse matrix vector (SpMV) product and iterative solver performance, while maintaining performance parity with native Ginkgo C++ code. Our work positions pyGinkgo as a compelling backend for sparse machine learning models and scientific workflows.

2025-10-09T13:55:51Z Accepted for publication at the 54th International Conference on Parallel Processing (ICPP'25) Keshvi Tuteja Gregor Olenik Roman Mishchuk Yu-Hsiang Tsai Markus Götz Achim Streit Hartwig Anzt Charlotte Debus 10.1145/3754598.3754648 http://arxiv.org/abs/2504.03628v2 Improving Interoperability in Scientific Computing via MaRDI Open Interfaces 2025-10-08T15:17:44Z

MaRDI Open Interfaces is a software project aimed at improving reuse and interoperability in Scientific Computing by alleviating the difficulties of crossing boundaries between different programming languages, in which numerical packages are usually implemented, and of switching between multiple implementations of the same mathematical problem. The software consists of a set of formal interface specifications for common Scientific Computing tasks, as well as a set of loosely coupled libraries that facilitate implementing these interfaces or adapting existing implementations for multiple programming languages and handle data marshalling automatically without sacrificing performance, enabling users to use different implementations without significant coding effort. The software has high reuse potential because it aims to solve general numerical problems.

2025-04-04T17:45:54Z 6 figures Dmitry I. Kabanov Mathematics Münster, University of Münster, Germany Stephan Rave Mathematics Münster, University of Münster, Germany Mario Ohlberger Mathematics Münster, University of Münster, Germany 10.5334/jors.569 http://arxiv.org/abs/2508.21593v2 Growing Mathlib: maintenance of a large scale mathematical library 2025-10-07T17:39:13Z

The Lean mathematical library Mathlib is one of the fastest-growing libraries of formalised mathematics. We describe various strategies to manage this growth, while allowing for change and avoiding maintainer overload. This includes dealing with breaking changes via a deprecation system, using code quality analysis tools (linters) to provide direct user feedback about common pitfalls, speeding up compilation times through conscious library (re-)design, dealing with technical debt as well as writing custom tooling to help with the review and triage of new contributions.

2025-08-29T12:49:58Z 21 pages, 1 figure. To appear at Conference on Intelligent Computer Mathematics (CICM) 2025; in v2: minor copy-edits, added one more related reference Anne Baanen Matthew Robert Ballard Johan Commelin Bryan Gin-ge Chen Michael Rothgang Damiano Testa http://arxiv.org/abs/2510.05282v1 SHarmonic: A fast and accurate implementation of spherical harmonics for electronic-structure calculations 2025-10-06T18:52:58Z

The authors present SHarmonic, a new implementation of the spherical harmonics targeted for electronic-structure calculations. Their approach is to use explicit formulas for the harmonics written in terms of normalized Cartesian coordinates. This approach results in a code that is as precise as other implementations while being at least one order of magnitude more computationally efficient. The library can run on graphics processing units (GPUs) as well, achieving an additional order of magnitude in execution speed. This new implementation is simple to use and is provided under an open source license; it can be readily used by other codes to avoid the error-prone and cumbersome implementation of the spherical harmonics.
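
The idea of explicit formulas in normalized Cartesian coordinates can be illustrated for the real spherical harmonics with l = 1 and l = 2. The sketch below uses one common sign and ordering convention (m from -l to l); it is not taken from the SHarmonic code, whose conventions and interface may differ, and the function names are assumptions.

```python
import math


def _normalize(x, y, z):
    """Scale (x, y, z) to unit length; the point must not be the origin."""
    r = math.sqrt(x * x + y * y + z * z)
    return x / r, y / r, z / r


def real_sph_harm_l1(x, y, z):
    """Real spherical harmonics for l = 1, ordered m = -1, 0, 1."""
    x, y, z = _normalize(x, y, z)
    c = math.sqrt(3.0 / (4.0 * math.pi))
    return [c * y, c * z, c * x]


def real_sph_harm_l2(x, y, z):
    """Real spherical harmonics for l = 2, ordered m = -2, ..., 2."""
    x, y, z = _normalize(x, y, z)
    c = 0.5 * math.sqrt(15.0 / math.pi)
    c0 = 0.25 * math.sqrt(5.0 / math.pi)
    return [c * x * y, c * y * z, c0 * (3.0 * z * z - 1.0),
            c * x * z, 0.5 * c * (x * x - y * y)]
```

No trigonometric functions or recursions are needed, only products of the normalized coordinates. A simple sanity check is that the squares for a fixed l sum to (2l + 1) / (4π) at every point.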

2025-10-06T18:52:58Z 13 pages, 2 figures. The code can be downloaded from https://gitlab.com/npneq/sharmonic Xavier Andrade Jacopo Simoni Yuan Ping Tadashi Ogitsu Alfredo A. Correa http://arxiv.org/abs/2510.03421v1 Optimizing and benchmarking the computation of the permanent of general matrices 2025-10-03T18:32:04Z

Evaluating the permanent of a matrix is a fundamental computation that emerges in many domains, including traditional fields like computational complexity theory, graph theory, many-body quantum theory and emerging disciplines like machine learning and quantum computing. While conceptually simple, evaluating the permanent is extremely challenging: no polynomial-time algorithm is available (unless $\textsc{P} = \textsc{NP}$). To the best of our knowledge there is no publicly available software that automatically uses the most efficient algorithm for computing the permanent. In this work we designed, developed, and investigated the performance of our software package which evaluates the permanent of an arbitrary rectangular matrix, supporting three algorithms generally regarded as the fastest while giving the exact solution (the straightforward combinatoric algorithm, the Ryser algorithm, and the Glynn algorithm) and, optionally, automatically switching to the optimal algorithm based on the type and dimensionality of the input matrix. To do this, we developed an extension of the Glynn algorithm to rectangular matrices. Our free and open-source software package is distributed via GitHub, at https://github.com/theochem/matrix-permanent.
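
Two of the algorithms named above can be sketched for square matrices in plain Python: the direct sum over permutations and Ryser's inclusion-exclusion formula. This is an illustrative sketch, not the API of the matrix-permanent package, and it does not cover the rectangular case or the Glynn algorithm.

```python
from itertools import combinations, permutations


def permanent_naive(A):
    """Sum over all permutations of the products A[i][sigma(i)]; O(n! * n) work."""
    n = len(A)
    total = 0
    for sigma in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= A[i][sigma[i]]
        total += prod
    return total


def permanent_ryser(A):
    """Ryser's formula: sum over nonempty column subsets S of
    (-1)**(n - |S|) * product over rows of the row sum restricted to S."""
    n = len(A)
    total = 0
    for size in range(1, n + 1):
        sign = (-1) ** (n - size)
        for cols in combinations(range(n), size):
            prod = 1
            for row in A:
                prod *= sum(row[j] for j in cols)
            total += sign * prod
    return total
```

For `[[1, 2], [3, 4]]` both functions return 10 (that is, 1·4 + 2·3). This plain form of Ryser's formula visits all 2^n - 1 column subsets; Gray-code ordering of the subsets, not shown here, reduces the work per subset.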

2025-10-03T18:32:04Z Cassandra Masschelein Michelle Richer Paul W. Ayers http://arxiv.org/abs/2510.02998v1 Valid Inequalities for Mixed Integer Bilevel Linear Optimization Problems 2025-10-03T13:38:25Z

Despite the success of branch-and-cut methods for solving mixed integer bilevel linear optimization problems (MIBLPs) in practice, there are still gaps in both the theory and practice surrounding these methods. In the first part of this paper, we lay out a basic theory of valid inequalities and cutting-plane methods for MIBLPs that parallels the existing theory for mixed integer linear optimization problems (MILPs). We provide a general scheme for classifying valid inequalities and illustrate how the known classes of valid inequalities fit into this categorization, as well as generalizing several existing classes. In the second part of the paper, we assess the computational effectiveness of these valid inequalities and discuss the myriad challenges that arise in integrating methods of dynamically generating inequalities valid for MIBLPs into branch-and-cut algorithms originally designed for solving MILPs. Although branch-and-cut methods for solving MIBLPs are in principle straightforward generalizations of those used for MILPs, there are subtle but important differences and there remain many unanswered questions regarding how to suitably modify control mechanisms and other algorithmic details in order to ensure performance in the MIBLP setting. We demonstrate that performance of version 1.2 of the open-source solver MibS was substantially improved over that of version 1.1 through a variety of improvements to the previous implementation.

2025-10-03T13:38:25Z Sahar Tahernejad Ted K. Ralphs http://arxiv.org/abs/2510.02948v1 Progressive Bound Strengthening via Doubly Nonnegative Cutting Planes for Nonconvex Quadratic Programs 2025-10-03T12:43:44Z

We introduce a cutting-plane framework for nonconvex quadratic programs (QPs) that progressively tightens convex relaxations. Our approach leverages the doubly nonnegative (DNN) relaxation to compute strong lower bounds and generate separating cuts, which are iteratively added to improve the relaxation. We establish that, at any Karush-Kuhn-Tucker (KKT) point satisfying a second-order sufficient condition, a valid cut can be obtained by solving a linear semidefinite program (SDP), and we devise a finite-termination local search procedure to identify such points. Extensive computational experiments on both benchmark and synthetic instances demonstrate that our approach yields tighter bounds and consistently outperforms leading commercial and academic solvers in terms of efficiency, robustness, and scalability. Notably, on a standard desktop, our algorithm reduces the relative optimality gap to 0.01% on 138 out of 140 instances of dimension 100 within one hour, without resorting to branch-and-bound.

2025-10-03T12:43:44Z Zheng Qu Defeng Sun Jintao Xu