https://arxiv.org/api/0sItsfNY+OZ6Ii0JgxdByBrdpvk
2026-03-18T10:13:41Z

http://arxiv.org/abs/2603.15920v1
DiFVM: A Vectorized Graph-Based Finite Volume Solver for Differentiable CFD on Unstructured Meshes
2026-03-16T21:14:18Z
Differentiable programming has emerged as a structural prerequisite for gradient-based inverse problems and end-to-end hybrid physics--machine learning in computational fluid dynamics. However, existing differentiable CFD platforms are confined to structured Cartesian grids, excluding the geometrically complex domains where body-conforming unstructured discretizations are indispensable. We present DiFVM, the first GPU-accelerated, end-to-end differentiable finite-volume CFD solver operating natively on unstructured polyhedral meshes. The key enabling insight is a structural isomorphism between finite-volume discretization and graph neural network message-passing: by reformulating all FVM operators as static scatter/gather primitives on the mesh connectivity graph, DiFVM transforms irregular unstructured connectivity into a first-class GPU data structure. All operations are implemented in JAX/XLA, providing just-in-time compilation, operator fusion, and automatic differentiation through the complete simulation pipeline. Differentiable Windkessel outlet boundary conditions are provided for cardiovascular applications, and DiFVM accepts standard OpenFOAM case directories without modification for seamless adoption in existing workflows. Forward validation across benchmarks spanning canonical flows to patient-specific hemodynamics demonstrates close agreement with OpenFOAM, and end-to-end differentiability is demonstrated through inference of Windkessel parameters from sparse observations.
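The scatter/gather formulation described in the DiFVM abstract can be illustrated with a minimal sketch. It uses plain Python lists rather than JAX, and it assumes a face-list mesh in which each internal face stores an owner cell and a neighbour cell. The function name `diffusion_residual` and its arguments are hypothetical and are not taken from DiFVM.

```python
def diffusion_residual(phi, owner, neighbour, coeff):
    """Net diffusive outflow per cell for a mesh stored as a list of internal faces.

    phi       -- one scalar value per cell
    owner     -- owner cell index of each face
    neighbour -- neighbour cell index of each face
    coeff     -- per-face diffusion coefficient (assumed to already include
                 face area and cell-centre distance)
    """
    # Gather: read the two cell values adjacent to every face and form the
    # flux that flows from the owner cell to the neighbour cell.
    flux = [c * (phi[o] - phi[n]) for o, n, c in zip(owner, neighbour, coeff)]

    # Scatter: each face flux leaves its owner cell and enters its neighbour
    # cell, so the same value is accumulated with opposite signs.
    out = [0.0] * len(phi)
    for o, n, f in zip(owner, neighbour, flux):
        out[o] += f
        out[n] -= f
    return out


# Three cells in a row, connected by two faces.
residual = diffusion_residual([1.0, 0.0, 0.0], [0, 1], [1, 2], [1.0, 1.0])
```

Because every face contributes equal and opposite amounts to its two cells, the residual sums to zero over the mesh, which is the discrete conservation property of a finite-volume scheme. In an array framework the Python loop would typically become a single indexed scatter-add.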
DiFVM bridges the critical gap between differentiable programming and unstructured-mesh CFD, enabling gradient-based inverse problems and physics-integrated machine learning on complex engineering geometries.
2026-03-16T21:14:18Z
44 pages, 13 figures
Pan Du, Yongqi Li, Mingqi Xu, Jian-Xun Wang

http://arxiv.org/abs/2603.14926v1
Acceleration of multi-component multiple-precision arithmetic with branch-free algorithms and SIMD vectorization
2026-03-16T07:29:54Z
Multiple-precision floating-point branch-free algorithms can significantly accelerate multi-component arithmetic implemented by combining hardware-based binary64 and binary32, particularly for triple- and quadruple-precision computations. In this study, we ran benchmarks on x86 and ARM CPU platforms to quantify the speedups that these algorithms deliver in linear computations and polynomial evaluation.
2026-03-16T07:29:54Z
Tomonori Kouya

http://arxiv.org/abs/2603.14103v1
Scorio.jl: A Julia package for ranking stochastic responses
2026-03-14T20:12:56Z
Scorio.jl is a Julia package for evaluating and ranking systems from repeated responses to shared tasks. It provides a common tensor-based interface for direct score-based, pairwise, psychometric, voting, graph, and listwise methods, so the same benchmark can be analyzed under multiple ranking assumptions. We describe the package design, position it relative to existing Julia tools, and report pilot experiments on synthetic rank recovery, stability under limited trials, and runtime scaling.
2026-03-14T20:12:56Z
Mohsen Hariri, Michael Hinczewski, Vipin Chaudhary

http://arxiv.org/abs/2603.14040v1
Pyroclast: A Modular High-Performance Python Solver for Geodynamics
2026-03-14T17:23:57Z
This monograph presents the design, implementation, and evaluation of Pyroclast, a modular high-performance Python framework for large-scale geodynamic simulations.
Pyroclast addresses limitations of legacy geodynamics solvers, often implemented in monolithic Fortran, C++, or C codebases with limited GPU support and extensibility, by combining modern numerical methods, hardware-accelerated execution, and a flexible object-oriented architecture. Designed for distributed and GPU-accelerated environments, Pyroclast provides an accessible and efficient platform for simulating mantle convection and lithospheric deformation using the marker-in-cell method and a matrix-free finite difference discretization. The work focuses on a scalable two-dimensional viscous mechanical solver that forms the computational core for future visco-elasto-plastic models. The solver includes a stress-conservative staggered grid discretization of the incompressible Stokes equations, a matrix-free geometric multigrid solver, Krylov and quasi-Newton methods, and MPI-based domain decomposition for distributed execution. Benchmarks evaluate performance and scalability. Shared-memory tests show strong scaling of the Stokes solver and demonstrate a 5-10x speedup on NVIDIA A100 GPUs compared to a multi-core CPU baseline. Distributed advection benchmarks show near-ideal weak scaling up to 896 CPU cores across seven compute nodes. These results demonstrate that Pyroclast achieves high performance while remaining accessible through a high-level Python interface. The framework also provides a blueprint for modernizing legacy geodynamics codes. Its modular architecture and Python-native implementation lower the barrier to entry and allow interoperability with modern machine learning libraries, enabling hybrid physics-based and data-driven workflows.
2026-03-14T17:23:57Z
138 pages. Research monograph describing the Pyroclast geodynamics solver
Marcel Ferrari

http://arxiv.org/abs/2503.08126v2
Trilinos: Enabling Scientific Computing Across Diverse Hardware Architectures at Scale
2026-03-12T19:47:46Z
Trilinos is a community-developed, open-source software framework that facilitates building large-scale, complex, multiscale, multiphysics simulation code bases for scientific and engineering problems. Since the Trilinos framework has undergone substantial changes to support new applications and new hardware architectures, this document is an update to ``An Overview of the Trilinos project'' by Heroux et al. (ACM Transactions on Mathematical Software, 31(3):397-423, 2005). It describes the design of Trilinos, introduces its new organization in product areas, and highlights established and new features available in Trilinos. Particular focus is put on the modernized software stack based on the Kokkos ecosystem to deliver performance portability across heterogeneous hardware architectures. This paper also outlines the organization of the Trilinos community and the contribution model to help onboard interested users and contributors.
2025-03-11T07:44:20Z
32 pages, 1 figure
Matthias Mayr, Alexander Heinlein, Christian Glusa, Siva Rajamanickam, Maarten Arnst, Roscoe Bartlett, Luc Berger-Vergiat, Erik Boman, Karen Devine, Graham Harper, Michael Heroux, Mark Hoemmen, Jonathan Hu, Brian Kelley, Kyungjoo Kim, Drew P. Kouri, Paul Kuberry, Kim Liegeois, Curtis C. Ober, Roger Pawlowski, Carl Pearson, Mauro Perego, Eric Phipps, Denis Ridzal, Nathan V. Roberts, Christopher Siefert, Heidi Thornquist, Romin Tomasetti, Christian R. Trott, Raymond S. Tuminaro, James M. Willenbring, Michael M. Wolf, Ichitaro Yamazaki

http://arxiv.org/abs/2603.10599v1
Self-Scaled Broyden Family of Quasi-Newton Methods in JAX
2026-03-11T09:53:11Z
We present a JAX implementation of the Self-Scaled Broyden family of quasi-Newton methods, fully compatible with JAX and building on the Optimistix~\cite{rader_optimistix_2024} optimisation library.
The implementation includes BFGS, DFP, Broyden and their Self-Scaled variants (SSBFGS, SSDFP, SSBroyden), together with a Zoom line search satisfying the strong Wolfe conditions. This is a short technical note, not a research paper, as it does not claim any novel contribution; its purpose is to document the implementation and ease the adoption of these optimisers within the JAX community. The code is available at https://github.com/IvanBioli/ssbroyden_optimistix.git.
2026-03-11T09:53:11Z
Ivan Bioli, Mikel Mendibe Abarrategi

http://arxiv.org/abs/2510.14964v2
Efficient and Flexible Multirate Temporal Adaptivity
2026-03-10T13:22:42Z
In this work we present two new families of multirate time step adaptivity controllers that are designed to work with embedded multirate infinitesimal (MRI) time integration methods for adapting time steps when solving problems with multiple time scales. We compare these controllers against competing approaches on two benchmark problems, showing that the proposed methods offer dramatically improved performance and flexibility. The combination of embedded MRI methods and the proposed controllers enables adaptive simulations of problems with a potentially arbitrary number of time scales, achieving high accuracy while maintaining low computational cost. Additionally, we introduce a new set of embeddings for the family of explicit multirate exponential Runge--Kutta (MERK) methods of orders 2 through 5, resulting in the first-ever fifth-order embedded MRI method. Finally, we compare the performance of a wide range of embedded MRI methods on our benchmark problems to provide guidance on how to select an appropriate MRI method and multirate controller.
2025-10-16T17:59:16Z
Daniel R. Reynolds, Sylvia Amihere, Dashon Mitchell, Vu Thai Luan

http://arxiv.org/abs/2603.09038v1
Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores
2026-03-10T00:12:47Z
Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2$\times$ performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system.
The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.
2026-03-10T00:12:47Z
Jiqun Tu, Ian Karlin, John Camier, Veselin Dobrev, Tzanio Kolev, Stefan Henneking, Omar Ghattas

http://arxiv.org/abs/2603.08957v1
Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation
2026-03-09T21:43:39Z
A \emph{tensor-relational} computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case \texttt{EinSum}, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case \texttt{EinSum} so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.
2026-03-09T21:43:39Z
Yuxin Tang, Zhiyuan Xin, Zhimin Ding, Xinyu Yao, Daniel Bourgeois, Tirthak Patel, Chris Jermaine

http://arxiv.org/abs/2603.07850v1
A Lock-Free, Fully GPU-Resident Architecture for the Verification of Goldbach's Conjecture
2026-03-08T23:58:47Z
We present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double-sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency.
In this work, we migrate the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path. To fully leverage heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, enabling $99.7$% parallel efficiency at 2 GPUs and $98.6$% at $4$ GPUs. We further implement strict mathematical overflow guards guaranteeing the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$. On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at N = $10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in $36.5$ seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in $133.5$ seconds on a four-GPU system. All code is open-source and reproducible on commodity hardware.
2026-03-08T23:58:47Z
14 pages, 4 figures, 3 tables. The presented work details a major architectural overhaul: migration of the segmented sieve to GPU L1 shared memory and the implementation of a lock-free multi-GPU work pool. Source code available at: https://github.com/isaac-6/goldbach-gpu
Isaac Llorente-Saguer

http://arxiv.org/abs/2511.00292v2
Numerically stable evaluation of closed-form expressions for eigenvalues of $3 \times 3$ matrices
2026-03-05T12:50:39Z
Trigonometric formulas for eigenvalues of $3 \times 3$ matrices that build on Cardano's and Viète's work on algebraic solutions of the cubic are numerically unstable for matrices with repeated eigenvalues. This work presents numerically stable, closed-form evaluation of eigenvalues of real, diagonalizable $3 \times 3$ matrices via four invariants: the trace $I_1$, the deviatoric invariants $J_2$ and $J_3$, and the discriminant $Δ$.
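For orientation, the classical trigonometric formula that the abstract above calls unstable can be sketched as follows. This is the textbook baseline, not the paper's stabilized algorithm. It is restricted here to real symmetric matrices as a simplifying assumption, and the function name `symmetric_eigenvalues_3x3` is hypothetical.

```python
import math


def symmetric_eigenvalues_3x3(a):
    """Eigenvalues of a real symmetric 3x3 matrix (nested lists), largest first.

    Classical trigonometric solution of the characteristic cubic; accuracy
    degrades when eigenvalues coincide or nearly coincide.
    """
    i1 = a[0][0] + a[1][1] + a[2][2]          # trace I1
    mean = i1 / 3.0
    # Deviatoric part S = A - (I1 / 3) * Identity.
    s = [[a[r][c] - (mean if r == c else 0.0) for c in range(3)]
         for r in range(3)]
    # J2 = tr(S^2) / 2 and J3 = det(S).
    j2 = 0.5 * sum(s[r][c] * s[r][c] for r in range(3) for c in range(3))
    j3 = (s[0][0] * (s[1][1] * s[2][2] - s[1][2] * s[2][1])
          - s[0][1] * (s[1][0] * s[2][2] - s[1][2] * s[2][0])
          + s[0][2] * (s[1][0] * s[2][1] - s[1][1] * s[2][0]))
    if j2 == 0.0:
        # S vanishes, so A is a multiple of the identity.
        return [mean, mean, mean]
    radius = math.sqrt(j2 / 3.0)
    # cos(3 * phi) = J3 / (2 * radius^3), clamped against rounding overshoot.
    cos3 = max(-1.0, min(1.0, j3 / (2.0 * radius ** 3)))
    phi = math.acos(cos3) / 3.0
    return [mean + 2.0 * radius * math.cos(phi - 2.0 * math.pi * k / 3.0)
            for k in range(3)]


eigs = symmetric_eigenvalues_3x3([[1.0, 0.0, 0.0],
                                  [0.0, 2.0, 0.0],
                                  [0.0, 0.0, 3.0]])
```

When two eigenvalues coincide, the argument of `acos` approaches plus or minus one, where the inverse cosine is ill-conditioned. That is one way to see why the naive formula loses accuracy for matrices with repeated eigenvalues.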
We analyze the conditioning of these invariants and derive tight forward error bounds. For $J_2$ we propose an algorithm and prove its accuracy. We benchmark all invariants and the resulting eigenvalue formulas, relating observed forward errors to the derived bounds. In particular, we show that, for the special case of matrices with a well-conditioned eigenbasis, the newly proposed algorithms have errors within the forward stability bounds. Performance benchmarks show that the proposed algorithm is approximately ten times faster than the highly optimized LAPACK library for a challenging test case, while maintaining comparable accuracy.
2025-10-31T22:20:28Z
24 pages. Numer Algor (2026)
Michal Habera, Andreas Zilian
10.1007/s11075-026-02328-5

http://arxiv.org/abs/2602.10878v2
Simple generators of rational function fields
2026-03-05T12:22:31Z
Consider a subfield of the field of rational functions in several indeterminates. We present an algorithm that, given a set of generators of such a subfield, finds a simple generating set. We provide an implementation of the algorithm and show that it improves upon the state of the art both in efficiency and the quality of the results. Furthermore, we demonstrate the utility of simplified generators through several case studies from different application domains, such as structural parameter identifiability. The main algorithmic novelties include performing only partial Gröbner basis computation via sparse interpolation and efficient search for polynomials of a fixed degree in a subfield of the rational function field.
2026-02-11T14:07:00Z
Alexander Demin, Gleb Pogudin

http://arxiv.org/abs/2602.18023v2
Observer-robust energy condition verification for warp drive spacetimes
2026-03-03T15:26:27Z
We present \textbf{warpax}, an open-source, GPU-accelerated Python toolkit for observer-robust energy condition analysis of warp drive spacetimes.
Instead of sampling a finite set of observer directions, \textbf{warpax} performs continuous, gradient-based optimization on the timelike observer manifold, parameterized by rapidity and boost direction and informed by the Hawking-Ellis classification. At Type~I stress-energy points, which account for more than $96%$ of grid points across tested metrics, energy-condition satisfaction is determined \emph{exactly} through an algebraic eigenvalue check, independent of observer search or rapidity caps. At non-Type~I points, the optimizer provides rapidity-capped diagnostics. Stress-energy tensors are computed from the ADM metric via forward-mode automatic differentiation, eliminating finite-difference error. Geodesic integration with tidal-force and blueshift analysis is included.
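The algebraic check mentioned for Type I points can be illustrated with the standard textbook inequalities for a stress-energy tensor with energy density rho and principal pressures p_i. This is a minimal sketch under that assumption. The function name `type_i_energy_conditions` and the `tol` parameter are hypothetical, and whether warpax organises its check this way is not stated in the abstract.

```python
def type_i_energy_conditions(rho, pressures, tol=0.0):
    """Pointwise energy conditions for a Hawking-Ellis Type I stress-energy tensor.

    rho       -- energy density (eigenvalue belonging to the timelike eigenvector)
    pressures -- the three principal pressures
    tol       -- slack for floating-point noise (hypothetical parameter)
    """
    # Null: rho + p_i >= 0 for every principal direction.
    nec = all(rho + p >= -tol for p in pressures)
    # Weak: null condition plus non-negative energy density.
    wec = nec and rho >= -tol
    # Strong: null condition plus rho + sum(p_i) >= 0.
    sec = nec and rho + sum(pressures) >= -tol
    # Dominant: rho >= |p_i| for every principal direction.
    dec = rho >= -tol and all(abs(p) <= rho + tol for p in pressures)
    return {"NEC": nec, "WEC": wec, "SEC": sec, "DEC": dec}


dust = type_i_energy_conditions(1.0, (0.0, 0.0, 0.0))
negative_energy = type_i_energy_conditions(-1.0, (0.0, 0.0, 0.0))
```

Because the inequalities involve only the eigenvalues, the verdict holds for every timelike observer at that point, which is why no observer search or rapidity cap is needed there.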
We evaluate five warp drive metrics (Alcubierre, Lentz, Van~Den~Broeck, Natário, Rodal) and a warp shell metric as a numerical stress test. For the Rodal metric, Eulerian-frame analysis misses violations at more than $28%$ of grid points for the dominant energy condition and more than $15%$ for the weak energy condition. Even when the violation region is correctly identified, observer optimization shows that violation severity can be orders of magnitude larger; for example, for Alcubierre the weak energy condition reaches $\sim 9\times 10^{4}$ at rapidity cap $ζ_{\max}=5$, scaling as $e^{2ζ_{\max}}$, where the cap is an analysis hyperparameter. These results show that single-frame evaluation can substantially underestimate both the spatial extent and magnitude of energy-condition violations. \textbf{warpax} is available at https://github.com/anindex/warpax.
2026-02-20T06:37:44Z
31 pages, 15 figures, 12 tables, submitted to Classical and Quantum Gravity
An T. Le

http://arxiv.org/abs/2603.02298v1
CuTe Layout Representation and Algebra
2026-03-02T18:31:12Z
Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be correctly propagated through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors.
CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations, enabling the representation of complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion -- that enables sophisticated layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables powerful compile-time reasoning about layout properties and the expression of generic tensor transformations.
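The idea of a layout as a map from a logical index to a memory offset, built from nested shapes and strides, can be sketched in a few lines. The sketch assumes the colexicographic convention (leftmost mode varies fastest) for turning a linear index into a coordinate. The helper names `flatten` and `layout_offset` are hypothetical and do not belong to the CuTe API.

```python
def flatten(mode):
    """Flatten a nested tuple of integers into a flat list, left to right."""
    if isinstance(mode, int):
        return [mode]
    flat = []
    for sub in mode:
        flat.extend(flatten(sub))
    return flat


def layout_offset(shape, stride, index):
    """Map a linear index to a memory offset through a (shape, stride) pair.

    shape and stride are nested tuples with the same nesting structure.
    The index is split into per-mode coordinates with the leftmost mode
    varying fastest, and the offset is the inner product of those
    coordinates with the strides.
    """
    offset = 0
    for extent, step in zip(flatten(shape), flatten(stride)):
        index, coord = divmod(index, extent)
        offset += coord * step
    return offset


# Flat layouts: a 4x3 array stored column-major and row-major.
col_major = [layout_offset((4, 3), (1, 4), i) for i in range(12)]
row_major = [layout_offset((4, 3), (3, 1), i) for i in range(12)]

# Hierarchical layout: the second mode is itself split into a (2, 2) sub-shape.
nested = [layout_offset((4, (2, 2)), (2, (1, 8)), i) for i in range(16)]
```

A flat shape/stride pair is the special case with no nesting, so the same function covers both the traditional representation and the hierarchical extension. The nested example interleaves two columns in memory, a pattern that a single flat stride per dimension cannot express.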
In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions.
CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL.
2026-03-02T18:31:12Z
Cris Cecka

http://arxiv.org/abs/2603.02621v1
GoldbachGPU: An Open Source GPU-Accelerated Framework for Verification of Goldbach's Conjecture
2026-03-02T15:51:57Z
We present GoldbachGPU, an open-source framework for large-scale computational verification of Goldbach's conjecture using commodity GPU hardware. Prior GPU-based approaches reported a hard memory ceiling near 10^11 due to monolithic prime-table allocation. We show that this limitation is architectural rather than fundamental: a dense bit-packed prime representation provides a 16x reduction in memory footprint, and a segmented double-sieve design removes the VRAM ceiling entirely. By inverting the verification loop and combining a GPU fast-path with a multi-phase primality oracle, the framework achieves exhaustive verification up to 10^12 on a single NVIDIA RTX 3070 (8 GB VRAM), with no counterexamples found. Each segment requires 14 MB of VRAM, yielding O(N) wall-clock time and O(1) memory in N. A rigorous CPU fallback guarantees mathematical completeness, though it was never invoked in practice. An arbitrary-precision checker using GMP and OpenMP extends single-number verification to 10^10000 via a synchronised batch-search strategy. The segmented architecture also exhibits clean multi-GPU scaling on data-centre hardware (tested on 8 x H100). All code is open-source, documented, and reproducible on both commodity and high-end hardware.
2026-03-02T15:51:57Z
11 pages, 7 tables, 2 figures. Accompanies the v1.1.0 release of GoldbachGPU (Zenodo DOI: https://zenodo.org/records/18837081)
Isaac Llorente-Saguer
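The segmented verification strategy described in the two Goldbach abstracts can be illustrated on the CPU with a small sketch. It assumes one plausible reading of the design: sieve a sliding window instead of one monolithic table, try a short list of small primes as the fast path, and fall back to a slow exhaustive check only if that fails. All names (`small_primes`, `sieve_window`, `slow_check`, `verify_goldbach`) and the `segment` and `fast_bound` parameters are hypothetical and are not taken from GoldbachGPU.

```python
import math


def small_primes(limit):
    """Sieve of Eratosthenes: list of all primes <= limit (limit >= 2)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(limit) + 1):
        if flags[i]:
            flags[i * i::i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i, f in enumerate(flags) if f]


def sieve_window(lo, hi, base_primes):
    """Primality flags for the integers in [lo, hi); flags[k] describes lo + k.

    base_primes must be sorted and contain every prime p with p * p < hi.
    """
    flags = bytearray([1]) * (hi - lo)
    for p in base_primes:
        if p * p >= hi:
            break
        first = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(first, hi, p):
            flags[m - lo] = 0
    for k in range(max(0, 2 - lo)):  # 0 and 1 are not prime
        flags[k] = 0
    return flags


def slow_check(n):
    """Exhaustive fallback: does any prime pair sum to n?"""
    def is_prime(m):
        return m >= 2 and all(m % d for d in range(2, math.isqrt(m) + 1))
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))


def verify_goldbach(limit, segment=4096, fast_bound=500):
    """Check every even n in [4, limit]; return (counterexample or None, fallbacks)."""
    base = small_primes(max(math.isqrt(limit) + 1, fast_bound))
    fast = [p for p in base if p <= fast_bound]
    fallbacks = 0
    for a in range(4, limit + 1, segment):      # segment is even, so a stays even
        b = min(a + segment, limit + 1)
        lo = max(0, a - fast_bound)             # window must also cover n - p
        flags = sieve_window(lo, b, base)
        for n in range(a, b, 2):
            # Fast path: n = p + q with p a small prime and q inside the window.
            if any(p <= n - 2 and flags[n - p - lo] for p in fast):
                continue
            fallbacks += 1
            if not slow_check(n):
                return n, fallbacks
    return None, fallbacks


counterexample, fallbacks = verify_goldbach(10_000)
```

Memory use depends on `segment` and `fast_bound` but not on `limit`, which mirrors the constant-memory property claimed for the segmented design. The sketch makes no attempt at bit packing or parallel execution.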