http://arxiv.org/abs/2603.15920v1 DiFVM: A Vectorized Graph-Based Finite Volume Solver for Differentiable CFD on Unstructured Meshes 2026-03-16T21:14:18Z Differentiable programming has emerged as a structural prerequisite for gradient-based inverse problems and end-to-end hybrid physics--machine learning in computational fluid dynamics. However, existing differentiable CFD platforms are confined to structured Cartesian grids, excluding the geometrically complex domains where body-conforming unstructured discretizations are indispensable. We present DiFVM, the first GPU-accelerated, end-to-end differentiable finite-volume CFD solver operating natively on unstructured polyhedral meshes. The key enabling insight is a structural isomorphism between finite-volume discretization and graph neural network message-passing: by reformulating all FVM operators as static scatter/gather primitives on the mesh connectivity graph, DiFVM transforms irregular unstructured connectivity into a first-class GPU data structure. All operations are implemented in JAX/XLA, providing just-in-time compilation, operator fusion, and automatic differentiation through the complete simulation pipeline. Differentiable Windkessel outlet boundary conditions are provided for cardiovascular applications, and DiFVM accepts standard OpenFOAM case directories without modification for seamless adoption in existing workflows. Forward validation across benchmarks spanning canonical flows to patient-specific hemodynamics demonstrates close agreement with OpenFOAM, and end-to-end differentiability is demonstrated through inference of Windkessel parameters from sparse observations. DiFVM bridges the critical gap between differentiable programming and unstructured-mesh CFD, enabling gradient-based inverse problems and physics-integrated machine learning on complex engineering geometries.
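The scatter/gather formulation mentioned in the DiFVM abstract can be illustrated with a minimal sketch. This is not DiFVM code: the function names, the `owner`/`neighbour` face arrays (borrowed from the OpenFOAM mesh convention), and the per-face coefficient `coeff` are assumptions made for illustration, and plain Python lists stand in for JAX arrays so the example runs without dependencies. Each interior face gathers the values of its two adjacent cells, computes one flux, and scatters it back with opposite signs.

```python
def gather(cell_values, index):
    """Gather: read one cell value per face through a face-to-cell index."""
    return [cell_values[i] for i in index]


def scatter_add(n_cells, index, values):
    """Scatter-add: accumulate one value per face into its target cell."""
    out = [0.0] * n_cells
    for i, v in zip(index, values):
        out[i] += v
    return out


def diffusion_residual(phi, owner, neighbour, coeff):
    """Net diffusive flux into each cell of a face-based mesh graph.

    owner[f] and neighbour[f] are the two cells sharing interior face f.
    coeff[f] is a hypothetical per-face coefficient (diffusivity times
    face area divided by cell-centre distance).
    """
    phi_o = gather(phi, owner)
    phi_n = gather(phi, neighbour)
    # Flux through face f, counted as positive when it enters the owner cell.
    flux = [c * (pn - po) for c, po, pn in zip(coeff, phi_o, phi_n)]
    into_owner = scatter_add(len(phi), owner, flux)
    # The same flux leaves the neighbour cell, hence the sign flip.
    into_neighbour = scatter_add(len(phi), neighbour, [-f for f in flux])
    return [a + b for a, b in zip(into_owner, into_neighbour)]


if __name__ == "__main__":
    # Three cells in a row joined by two interior faces.
    print(diffusion_residual([0.0, 1.0, 4.0], [0, 1], [1, 2], [1.0, 1.0]))
```

Because every flux is added to one cell and subtracted from another, the residuals sum to zero, which is the discrete conservation property of finite-volume schemes. In JAX the same pattern would typically be written with array indexing for the gather and `x.at[index].add(values)` for the scatter-add, with the connectivity arrays held fixed across time steps.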
2026-03-16T21:14:18Z 44 pages, 13 figures Pan Du Yongqi Li Mingqi Xu Jian-Xun Wang http://arxiv.org/abs/2603.14926v1 Acceleration of multi-component multiple-precision arithmetic with branch-free algorithms and SIMD vectorization 2026-03-16T07:29:54Z Multiple-precision floating-point branch-free algorithms can significantly accelerate multi-component arithmetic implemented by combining hardware-based binary64 and binary32, particularly for triple- and quadruple-precision computations. In this study, we report benchmark results on x86 and ARM CPU platforms that quantify the accelerations obtained in linear computations and polynomial evaluation by integrating these algorithms. 2026-03-16T07:29:54Z Tomonori Kouya http://arxiv.org/abs/2603.14103v1 Scorio.jl: A Julia package for ranking stochastic responses 2026-03-14T20:12:56Z Scorio.jl is a Julia package for evaluating and ranking systems from repeated responses to shared tasks. It provides a common tensor-based interface for direct score-based, pairwise, psychometric, voting, graph, and listwise methods, so the same benchmark can be analyzed under multiple ranking assumptions. We describe the package design, position it relative to existing Julia tools, and report pilot experiments on synthetic rank recovery, stability under limited trials, and runtime scaling. 2026-03-14T20:12:56Z Mohsen Hariri Michael Hinczewski Vipin Chaudhary http://arxiv.org/abs/2603.14040v1 Pyroclast: A Modular High-Performance Python Solver for Geodynamics 2026-03-14T17:23:57Z This monograph presents the design, implementation, and evaluation of Pyroclast, a modular high-performance Python framework for large-scale geodynamic simulations. Pyroclast addresses limitations of legacy geodynamics solvers, often implemented in monolithic Fortran, C++, or C codebases with limited GPU support and extensibility, by combining modern numerical methods, hardware-accelerated execution, and a flexible object-oriented architecture.
Designed for distributed and GPU-accelerated environments, Pyroclast provides an accessible and efficient platform for simulating mantle convection and lithospheric deformation using the marker-in-cell method and a matrix-free finite difference discretization. The work focuses on a scalable two-dimensional viscous mechanical solver that forms the computational core for future visco-elasto-plastic models. The solver includes a stress-conservative staggered grid discretization of the incompressible Stokes equations, a matrix-free geometric multigrid solver, Krylov and quasi-Newton methods, and MPI-based domain decomposition for distributed execution. Benchmarks evaluate performance and scalability. Shared-memory tests show strong scaling of the Stokes solver and demonstrate a 5-10x speedup on NVIDIA A100 GPUs compared to a multi-core CPU baseline. Distributed advection benchmarks show near-ideal weak scaling up to 896 CPU cores across seven compute nodes. These results demonstrate that Pyroclast achieves high performance while remaining accessible through a high-level Python interface. The framework also provides a blueprint for modernizing legacy geodynamics codes. Its modular architecture and Python-native implementation lower the barrier to entry while enabling interoperability with modern machine learning libraries, enabling hybrid physics-based and data-driven workflows. 2026-03-14T17:23:57Z 138 pages. Research monograph describing the Pyroclast geodynamics solver Marcel Ferrari http://arxiv.org/abs/2503.08126v2 Trilinos: Enabling Scientific Computing Across Diverse Hardware Architectures at Scale 2026-03-12T19:47:46Z Trilinos is a community-developed, open-source software framework that facilitates building large-scale, complex, multiscale, multiphysics simulation code bases for scientific and engineering problems. 
Since the Trilinos framework has undergone substantial changes to support new applications and new hardware architectures, this document is an update to ``An Overview of the Trilinos project'' by Heroux et al. (ACM Transactions on Mathematical Software, 31(3):397-423, 2005). It describes the design of Trilinos, introduces its new organization in product areas, and highlights established and new features available in Trilinos. Particular focus is put on the modernized software stack based on the Kokkos ecosystem to deliver performance portability across heterogeneous hardware architectures. This paper also outlines the organization of the Trilinos community and the contribution model to help onboard interested users and contributors. 2025-03-11T07:44:20Z 32 pages, 1 figure Matthias Mayr Alexander Heinlein Christian Glusa Siva Rajamanickam Maarten Arnst Roscoe Bartlett Luc Berger-Vergiat Erik Boman Karen Devine Graham Harper Michael Heroux Mark Hoemmen Jonathan Hu Brian Kelley Kyungjoo Kim Drew P. Kouri Paul Kuberry Kim Liegeois Curtis C. Ober Roger Pawlowski Carl Pearson Mauro Perego Eric Phipps Denis Ridzal Nathan V. Roberts Christopher Siefert Heidi Thornquist Romin Tomasetti Christian R. Trott Raymond S. Tuminaro James M. Willenbring Michael M. Wolf Ichitaro Yamazaki http://arxiv.org/abs/2603.10599v1 Self-Scaled Broyden Family of Quasi-Newton Methods in JAX 2026-03-11T09:53:11Z We present a JAX implementation of the Self-Scaled Broyden family of quasi-Newton methods, fully compatible with JAX and building on the Optimistix~\cite{rader_optimistix_2024} optimisation library. The implementation includes BFGS, DFP, Broyden and their Self-Scaled variants (SSBFGS, SSDFP, SSBroyden), together with a Zoom line search satisfying the strong Wolfe conditions. This is a short technical note, not a research paper, as it does not claim any novel contribution; its purpose is to document the implementation and ease the adoption of these optimisers within the JAX community.
The code is available at https://github.com/IvanBioli/ssbroyden_optimistix.git. 2026-03-11T09:53:11Z Ivan Bioli Mikel Mendibe Abarrategi http://arxiv.org/abs/2510.14964v2 Efficient and Flexible Multirate Temporal Adaptivity 2026-03-10T13:22:42Z In this work we present two new families of multirate time step adaptivity controllers that are designed to work with embedded multirate infinitesimal (MRI) time integration methods for adapting time steps when solving problems with multiple time scales. We compare these controllers against competing approaches on two benchmark problems, showing that the proposed methods offer dramatically improved performance and flexibility. The combination of embedded MRI methods and the proposed controllers enables adaptive simulations of problems with a potentially arbitrary number of time scales, achieving high accuracy while maintaining low computational cost. Additionally, we introduce a new set of embeddings for the family of explicit multirate exponential Runge--Kutta (MERK) methods of orders 2 through 5, resulting in the first-ever fifth-order embedded MRI method. Finally, we compare the performance of a wide range of embedded MRI methods on our benchmark problems to provide guidance on how to select an appropriate MRI method and multirate controller. 2025-10-16T17:59:16Z Daniel R. Reynolds Sylvia Amihere Dashon Mitchell Vu Thai Luan http://arxiv.org/abs/2603.09038v1 Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores 2026-03-10T00:12:47Z Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing.
While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2$\times$ performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system. The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting. 2026-03-10T00:12:47Z Jiqun Tu Ian Karlin John Camier Veselin Dobrev Tzanio Kolev Stefan Henneking Omar Ghattas http://arxiv.org/abs/2603.08957v1 Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation 2026-03-09T21:43:39Z A \emph{tensor-relational} computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. 
An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case \texttt{EinSum}, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case \texttt{EinSum} so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally. 2026-03-09T21:43:39Z Yuxin Tang Zhiyuan Xin Zhimin Ding Xinyu Yao Daniel Bourgeois Tirthak Patel Chris Jermaine http://arxiv.org/abs/2603.07850v1 A Lock-Free, Fully GPU-Resident Architecture for the Verification of Goldbach's Conjecture 2026-03-08T23:58:47Z We present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double-sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency. In this work, we migrate the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path. To fully leverage heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, enabling $99.7$% parallel efficiency at 2 GPUs and $98.6$% at $4$ GPUs. We further implement strict mathematical overflow guards guaranteeing the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$. 
On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at N = $10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in $36.5$ seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in $133.5$ seconds on a four-GPU system. All code is open-source and reproducible on commodity hardware. 2026-03-08T23:58:47Z 14 pages, 4 figures, 3 tables. The presented work details a major architectural overhaul: migration of the segmented sieve to GPU L1 shared memory and the implementation of a lock-free multi-GPU work pool. Source code available at: https://github.com/isaac-6/goldbach-gpu Isaac Llorente-Saguer http://arxiv.org/abs/2511.00292v2 Numerically stable evaluation of closed-form expressions for eigenvalues of $3 \times 3$ matrices 2026-03-05T12:50:39Z Trigonometric formulas for eigenvalues of $3 \times 3$ matrices that build on Cardano's and Viète's work on algebraic solutions of the cubic are numerically unstable for matrices with repeated eigenvalues. This work presents numerically stable, closed-form evaluation of eigenvalues of real, diagonalizable $3 \times 3$ matrices via four invariants: the trace $I_1$, the deviatoric invariants $J_2$ and $J_3$, and the discriminant $Δ$. We analyze the conditioning of these invariants and derive tight forward error bounds. For $J_2$ we propose an algorithm and prove its accuracy. We benchmark all invariants and the resulting eigenvalue formulas, relating observed forward errors to the derived bounds. In particular, we show that, for the special case of matrices with a well-conditioned eigenbasis, the newly proposed algorithms have errors within the forward stability bounds. Performance benchmarks show that the proposed algorithm is approximately ten times faster than the highly optimized LAPACK library for a challenging test case, while maintaining comparable accuracy. 2025-10-31T22:20:28Z 24 pages. 
Numer Algor (2026) Michal Habera Andreas Zilian 10.1007/s11075-026-02328-5 http://arxiv.org/abs/2602.10878v2 Simple generators of rational function fields 2026-03-05T12:22:31Z Consider a subfield of the field of rational functions in several indeterminates. We present an algorithm that, given a set of generators of such a subfield, finds a simple generating set. We provide an implementation of the algorithm and show that it improves upon the state of the art both in efficiency and the quality of the results. Furthermore, we demonstrate the utility of simplified generators through several case studies from different application domains, such as structural parameter identifiability. The main algorithmic novelties include performing only partial Gröbner basis computation via sparse interpolation and efficient search for polynomials of a fixed degree in a subfield of the rational function field. 2026-02-11T14:07:00Z Alexander Demin Gleb Pogudin http://arxiv.org/abs/2602.18023v2 Observer-robust energy condition verification for warp drive spacetimes 2026-03-03T15:26:27Z We present \textbf{warpax}, an open-source, GPU-accelerated Python toolkit for observer-robust energy condition analysis of warp drive spacetimes. Instead of sampling a finite set of observer directions, \textbf{warpax} performs continuous, gradient-based optimization on the timelike observer manifold, parameterized by rapidity and boost direction and informed by the Hawking-Ellis classification. At Type~I stress-energy points, which account for more than $96\%$ of grid points across tested metrics, energy-condition satisfaction is determined \emph{exactly} through an algebraic eigenvalue check, independent of observer search or rapidity caps. At non-Type~I points, the optimizer provides rapidity-capped diagnostics. Stress-energy tensors are computed from the ADM metric via forward-mode automatic differentiation, eliminating finite-difference error.
Geodesic integration with tidal-force and blueshift analysis is included. We evaluate five warp drive metrics (Alcubierre, Lentz, Van~Den~Broeck, Natário, Rodal) and a warp shell metric as a numerical stress test. For the Rodal metric, Eulerian-frame analysis misses violations at more than $28\%$ of grid points for the dominant energy condition and more than $15\%$ for the weak energy condition. Even when the violation region is correctly identified, observer optimization shows that violation severity can be orders of magnitude larger; for example, for Alcubierre the weak energy condition reaches $\sim 9\times 10^{4}$ at rapidity cap $ζ_{\max}=5$, scaling as $e^{2ζ_{\max}}$, where the cap is an analysis hyperparameter. These results show that single-frame evaluation can substantially underestimate both the spatial extent and magnitude of energy-condition violations. \textbf{warpax} is available at https://github.com/anindex/warpax. 2026-02-20T06:37:44Z 31 pages, 15 figures, 12 tables, submitted to Classical and Quantum Gravity An T. Le http://arxiv.org/abs/2603.02298v1 CuTe Layout Representation and Algebra 2026-03-02T18:31:12Z Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be correctly propagated through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors.
CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations, enabling the representation of complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion -- that enables sophisticated layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables powerful compile-time reasoning about layout properties and the expression of generic tensor transformations. In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions. CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL. 2026-03-02T18:31:12Z Cris Cecka http://arxiv.org/abs/2603.02621v1 GoldbachGPU: An Open Source GPU-Accelerated Framework for Verification of Goldbach's Conjecture 2026-03-02T15:51:57Z We present GoldbachGPU, an open-source framework for large-scale computational verification of Goldbach's conjecture using commodity GPU hardware. Prior GPU-based approaches reported a hard memory ceiling near 10^11 due to monolithic prime-table allocation. 
We show that this limitation is architectural rather than fundamental: a dense bit-packed prime representation provides a 16x reduction in memory footprint, and a segmented double-sieve design removes the VRAM ceiling entirely. By inverting the verification loop and combining a GPU fast-path with a multi-phase primality oracle, the framework achieves exhaustive verification up to 10^12 on a single NVIDIA RTX 3070 (8 GB VRAM), with no counterexamples found. Each segment requires 14 MB of VRAM, yielding O(N) wall-clock time and O(1) memory in N. A rigorous CPU fallback guarantees mathematical completeness, though it was never invoked in practice. An arbitrary-precision checker using GMP and OpenMP extends single-number verification to 10^10000 via a synchronised batch-search strategy. The segmented architecture also exhibits clean multi-GPU scaling on data-centre hardware (tested on 8 x H100). All code is open-source, documented, and reproducible on both commodity and high-end hardware. 2026-03-02T15:51:57Z 11 pages, 7 tables, 2 figures. Accompanies the v1.1.0 release of GoldbachGPU (Zenodo DOI: https://zenodo.org/records/18837081) Isaac Llorente-Saguer
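The segmented verification idea described in the GoldbachGPU abstract can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the GoldbachGPU implementation: the function names, the `segment_size` and `small_bound` parameters, and the use of byte flags instead of bit-packing are all hypothetical choices made for readability. For each window of even numbers, only the primality flags of that window (plus a short margin below it) are held in memory, and each even n is tested against a small table of primes p until n - p is found to be prime.

```python
from math import isqrt


def simple_sieve(limit):
    """Return all primes <= limit with a classic sieve of Eratosthenes."""
    if limit < 2:
        return []
    flags = bytearray([1]) * (limit + 1)
    flags[0] = flags[1] = 0
    for p in range(2, isqrt(limit) + 1):
        if flags[p]:
            count = len(range(p * p, limit + 1, p))
            flags[p * p :: p] = bytearray(count)
    return [i for i, f in enumerate(flags) if f]


def segment_flags(lo, hi, base_primes):
    """Flags for the window [lo, hi): flags[i] == 1 iff lo + i is prime.

    base_primes must contain every prime p with p * p < hi.
    """
    flags = bytearray([1]) * (hi - lo)
    for v in (0, 1):  # 0 and 1 are not prime
        if lo <= v < hi:
            flags[v - lo] = 0
    for p in base_primes:
        if p * p >= hi:
            break
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            flags[m - lo] = 0
    return flags


def verify_goldbach(limit, segment_size=1 << 16, small_bound=1000):
    """Check every even n in [4, limit] for a partition n = p + q.

    Only primes p <= small_bound are tried. Even numbers with no such
    partition are returned; a complete checker would hand them to a
    slower fallback.
    """
    base_primes = simple_sieve(max(isqrt(limit), small_bound))
    small_primes = [p for p in base_primes if p <= small_bound]
    unresolved = []
    lo = 4
    while lo <= limit:
        hi = min(lo + segment_size, limit + 1)
        win_lo = max(0, lo - small_bound)  # n - p can fall below lo
        flags = segment_flags(win_lo, hi, base_primes)
        for n in range(lo + (lo & 1), hi, 2):
            found = False
            for p in small_primes:
                if 2 * p > n:
                    break
                if flags[n - p - win_lo]:
                    found = True
                    break
            if not found:
                unresolved.append(n)
        lo = hi
    return unresolved


if __name__ == "__main__":
    print(verify_goldbach(20000))
```

Memory per window is proportional to `segment_size + small_bound` regardless of `limit`, which mirrors the motivation for segmenting. Unlike the constant-memory design the abstract describes, this sketch also keeps a table of base primes up to the square root of `limit`.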