http://arxiv.org/abs/2603.15920v1 DiFVM: A Vectorized Graph-Based Finite Volume Solver for Differentiable CFD on Unstructured Meshes 2026-03-16T21:14:18Z

Differentiable programming has emerged as a structural prerequisite for gradient-based inverse problems and end-to-end hybrid physics-machine learning in computational fluid dynamics. However, existing differentiable CFD platforms are confined to structured Cartesian grids, excluding the geometrically complex domains where body-conforming unstructured discretizations are indispensable. We present DiFVM, the first GPU-accelerated, end-to-end differentiable finite-volume CFD solver operating natively on unstructured polyhedral meshes. The key enabling insight is a structural isomorphism between finite-volume discretization and graph neural network message-passing: by reformulating all FVM operators as static scatter/gather primitives on the mesh connectivity graph, DiFVM transforms irregular unstructured connectivity into a first-class GPU data structure. All operations are implemented in JAX/XLA, providing just-in-time compilation, operator fusion, and automatic differentiation through the complete simulation pipeline. Differentiable Windkessel outlet boundary conditions are provided for cardiovascular applications, and DiFVM accepts standard OpenFOAM case directories without modification for seamless adoption in existing workflows. Forward validation across benchmarks spanning canonical flows to patient-specific hemodynamics demonstrates close agreement with OpenFOAM, and end-to-end differentiability is demonstrated through inference of Windkessel parameters from sparse observations. DiFVM bridges the critical gap between differentiable programming and unstructured-mesh CFD, enabling gradient-based inverse problems and physics-integrated machine learning on complex engineering geometries.
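
The scatter/gather view of a finite-volume operator can be illustrated with a minimal sketch in plain Python. This is not DiFVM code: the toy mesh (four cells in a row, three internal faces), the arithmetic-mean face interpolation, and the names `owner`, `neighbour`, `face_gather`, and `cell_scatter` are all assumptions made for illustration.

```python
# Hypothetical toy mesh: 4 cells in a row joined by 3 internal faces.
# owner[f] and neighbour[f] are the two cells that share face f.
owner = [0, 1, 2]
neighbour = [1, 2, 3]


def face_gather(cell_values, owner, neighbour):
    """Gather step: read both adjacent cell values for every face and average them."""
    return [0.5 * (cell_values[o] + cell_values[n]) for o, n in zip(owner, neighbour)]


def cell_scatter(face_flux, owner, neighbour, n_cells):
    """Scatter step: accumulate signed face fluxes back into the cells."""
    out = [0.0] * n_cells
    for f, flux in enumerate(face_flux):
        out[owner[f]] += flux      # flux leaves the owner cell
        out[neighbour[f]] -= flux  # and enters the neighbour cell
    return out


phi = [1.0, 2.0, 4.0, 8.0]        # cell-centred scalar (illustrative values)
velocity = 1.0                     # assumed uniform face-normal velocity
face_phi = face_gather(phi, owner, neighbour)
flux = [velocity * v for v in face_phi]
net_outflow = cell_scatter(flux, owner, neighbour, len(phi))
```

Because every internal face adds to one cell exactly what it subtracts from the other, the entries of `net_outflow` sum to zero, which is the discrete conservation property. In an array framework such as JAX, the scatter loop would typically be written with a segment-sum or indexed-add primitive so that it can be compiled and differentiated; that mapping is an assumption about how one might port this sketch, not a description of DiFVM's internals.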

2026-03-16T21:14:18Z 44 pages, 13 figures Pan Du Yongqi Li Mingqi Xu Jian-Xun Wang http://arxiv.org/abs/2603.14103v1 Scorio.jl: A Julia package for ranking stochastic responses 2026-03-14T20:12:56Z

Scorio.jl is a Julia package for evaluating and ranking systems from repeated responses to shared tasks. It provides a common tensor-based interface for direct score-based, pairwise, psychometric, voting, graph, and listwise methods, so the same benchmark can be analyzed under multiple ranking assumptions. We describe the package design, position it relative to existing Julia tools, and report pilot experiments on synthetic rank recovery, stability under limited trials, and runtime scaling.

2026-03-14T20:12:56Z Mohsen Hariri Michael Hinczewski Vipin Chaudhary http://arxiv.org/abs/2603.14040v1 Pyroclast: A Modular High-Performance Python Solver for Geodynamics 2026-03-14T17:23:57Z

This monograph presents the design, implementation, and evaluation of Pyroclast, a modular high-performance Python framework for large-scale geodynamic simulations. Pyroclast addresses limitations of legacy geodynamics solvers, often implemented in monolithic Fortran, C++, or C codebases with limited GPU support and extensibility, by combining modern numerical methods, hardware-accelerated execution, and a flexible object-oriented architecture. Designed for distributed and GPU-accelerated environments, Pyroclast provides an accessible and efficient platform for simulating mantle convection and lithospheric deformation using the marker-in-cell method and a matrix-free finite difference discretization. The work focuses on a scalable two-dimensional viscous mechanical solver that forms the computational core for future visco-elasto-plastic models. The solver includes a stress-conservative staggered grid discretization of the incompressible Stokes equations, a matrix-free geometric multigrid solver, Krylov and quasi-Newton methods, and MPI-based domain decomposition for distributed execution. Benchmarks evaluate performance and scalability. Shared-memory tests show strong scaling of the Stokes solver and demonstrate a 5-10x speedup on NVIDIA A100 GPUs compared to a multi-core CPU baseline. Distributed advection benchmarks show near-ideal weak scaling up to 896 CPU cores across seven compute nodes. These results demonstrate that Pyroclast achieves high performance while remaining accessible through a high-level Python interface. The framework also provides a blueprint for modernizing legacy geodynamics codes. Its modular architecture and Python-native implementation lower the barrier to entry and allow interoperability with modern machine learning libraries, which in turn supports hybrid physics-based and data-driven workflows.

2026-03-14T17:23:57Z 138 pages. Research monograph describing the Pyroclast geodynamics solver Marcel Ferrari http://arxiv.org/abs/2503.08126v2 Trilinos: Enabling Scientific Computing Across Diverse Hardware Architectures at Scale 2026-03-12T19:47:46Z

Trilinos is a community-developed, open-source software framework that facilitates building large-scale, complex, multiscale, multiphysics simulation code bases for scientific and engineering problems. Since the Trilinos framework has undergone substantial changes to support new applications and new hardware architectures, this document is an update to "An Overview of the Trilinos project" by Heroux et al. (ACM Transactions on Mathematical Software, 31(3):397-423, 2005). It describes the design of Trilinos, introduces its new organization in product areas, and highlights established and new features available in Trilinos. Particular focus is put on the modernized software stack based on the Kokkos ecosystem to deliver performance portability across heterogeneous hardware architectures. This paper also outlines the organization of the Trilinos community and the contribution model to help onboard interested users and contributors.

2025-03-11T07:44:20Z 32 pages, 1 figure Matthias Mayr Alexander Heinlein Christian Glusa Siva Rajamanickam Maarten Arnst Roscoe Bartlett Luc Berger-Vergiat Erik Boman Karen Devine Graham Harper Michael Heroux Mark Hoemmen Jonathan Hu Brian Kelley Kyungjoo Kim Drew P. Kouri Paul Kuberry Kim Liegeois Curtis C. Ober Roger Pawlowski Carl Pearson Mauro Perego Eric Phipps Denis Ridzal Nathan V. Roberts Christopher Siefert Heidi Thornquist Romin Tomasetti Christian R. Trott Raymond S. Tuminaro James M. Willenbring Michael M. Wolf Ichitaro Yamazaki http://arxiv.org/abs/2603.10599v1 Self-Scaled Broyden Family of Quasi-Newton Methods in JAX 2026-03-11T09:53:11Z

We present a JAX implementation of the Self-Scaled Broyden family of quasi-Newton methods, fully compatible with JAX and building on the Optimistix \cite{rader_optimistix_2024} optimisation library. The implementation includes BFGS, DFP, Broyden and their Self-Scaled variants (SSBFGS, SSDFP, SSBroyden), together with a Zoom line search satisfying the strong Wolfe conditions. This is a short technical note, not a research paper, as it does not claim any novel contribution; its purpose is to document the implementation and ease the adoption of these optimisers within the JAX community. The code is available at https://github.com/IvanBioli/ssbroyden_optimistix.git.
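
For readers unfamiliar with the strong Wolfe conditions mentioned above, the following sketch checks them for a given step length. It is written in plain Python rather than JAX and does not use the Optimistix API; the function name `satisfies_strong_wolfe` and the default constants `c1 = 1e-4` and `c2 = 0.9` are assumptions (common textbook choices), not values taken from the note.

```python
def satisfies_strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Return True if step length alpha along direction p satisfies both conditions.

    f     : objective, maps a list of floats to a float
    grad  : gradient of f, maps a list of floats to a list of floats
    x, p  : current point and search direction (lists of equal length)
    c1,c2 : assumed constants with 0 < c1 < c2 < 1
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    slope0 = dot(grad(x), p)       # directional derivative at x (negative for descent)
    slope1 = dot(grad(x_new), p)   # directional derivative at the trial point

    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = abs(slope1) <= c2 * abs(slope0)
    return sufficient_decrease and curvature


def quadratic(x):
    return sum(xi * xi for xi in x)


def quadratic_grad(x):
    return [2.0 * xi for xi in x]


x0 = [1.0, 1.0]
direction = [-2.0, -2.0]           # steepest-descent direction at x0
good_step = satisfies_strong_wolfe(quadratic, quadratic_grad, x0, direction, 0.5)
tiny_step = satisfies_strong_wolfe(quadratic, quadratic_grad, x0, direction, 1e-3)
long_step = satisfies_strong_wolfe(quadratic, quadratic_grad, x0, direction, 1.0)
```

On this quadratic, the step 0.5 lands on the minimizer and passes both tests, the tiny step fails the curvature condition, and the full step fails sufficient decrease. A zoom line search brackets and shrinks an interval until a step passing this check is found.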

2026-03-11T09:53:11Z Ivan Bioli Mikel Mendibe Abarrategi http://arxiv.org/abs/2510.14964v2 Efficient and Flexible Multirate Temporal Adaptivity 2026-03-10T13:22:42Z

In this work we present two new families of multirate time step adaptivity controllers that are designed to work with embedded multirate infinitesimal (MRI) time integration methods for adapting time steps when solving problems with multiple time scales. We compare these controllers against competing approaches on two benchmark problems, showing that the proposed methods offer dramatically improved performance and flexibility. The combination of embedded MRI methods and the proposed controllers enables adaptive simulations of problems with a potentially arbitrary number of time scales, achieving high accuracy while maintaining low computational cost. Additionally, we introduce a new set of embeddings for the family of explicit multirate exponential Runge-Kutta (MERK) methods of orders 2 through 5, resulting in the first-ever fifth-order embedded MRI method. Finally, we compare the performance of a wide range of embedded MRI methods on our benchmark problems to provide guidance on how to select an appropriate MRI method and multirate controller.

2025-10-16T17:59:16Z Daniel R. Reynolds Sylvia Amihere Dashon Mitchell Vu Thai Luan http://arxiv.org/abs/2603.08957v1 Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation 2026-03-09T21:43:39Z

A tensor-relational computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case EinSum, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case EinSum so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.
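
One way to picture a tensor-relational computation is a block-sparse matrix product in which each relation tuple carries a small dense block. The sketch below is an assumption-laden illustration in plain Python, not the paper's EinSum notation or system: relations are dictionaries keyed by block coordinates, missing keys stand for zero blocks, and `block_matmul` plays the role of the dense kernel.

```python
from collections import defaultdict


def block_matmul(a, b):
    """Dense kernel: multiply two small blocks stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block_add(a, b):
    """Elementwise sum of two blocks of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def relational_matmul(A, B):
    """Compute C[i, j] = sum over k of A[i, k] @ B[k, j] as join + group-by.

    A and B map (row_block, col_block) -> dense block; absent keys are zero blocks.
    """
    # Index B by its join key k, as a hash join would.
    b_by_k = defaultdict(list)
    for (k, j), blk in B.items():
        b_by_k[k].append((j, blk))

    C = {}
    for (i, k), a_blk in A.items():
        for j, b_blk in b_by_k.get(k, []):           # join on the shared index k
            prod = block_matmul(a_blk, b_blk)        # dense work inside the tuple
            C[(i, j)] = block_add(C[(i, j)], prod) if (i, j) in C else prod  # aggregate
    return C


A = {
    (0, 0): [[1, 0], [0, 1]],
    (0, 1): [[1, 1], [0, 0]],
    (1, 1): [[2, 0], [0, 2]],
}
B = {
    (0, 0): [[1, 2], [3, 4]],
    (1, 0): [[1, 1], [1, 1]],
}
C = relational_matmul(A, B)
```

Zero blocks never appear as tuples, so they cost nothing in the join, while all arithmetic happens inside `block_matmul`. This is the division of labour the abstract describes: sparsity handled relationally, dense work handled by numerical kernels.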

2026-03-09T21:43:39Z Yuxin Tang Zhiyuan Xin Zhimin Ding Xinyu Yao Daniel Bourgeois Tirthak Patel Chris Jermaine http://arxiv.org/abs/2603.07850v1 A Lock-Free, Fully GPU-Resident Architecture for the Verification of Goldbach's Conjecture 2026-03-08T23:58:47Z

We present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double-sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency. In this work, we migrate the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path. To fully leverage heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, enabling 99.7% parallel efficiency at 2 GPUs and 98.6% at 4 GPUs. We further implement strict mathematical overflow guards guaranteeing the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$. On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at $N = 10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in 36.5 seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in 133.5 seconds on a four-GPU system. All code is open-source and reproducible on commodity hardware.
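
The contrast between static partitioning and dynamic segment claiming can be sketched on the CPU with threads. This is only an analogy under stated assumptions: Python has no user-level atomic fetch-and-add, so a `threading.Lock` stands in for the atomic operation (the sketch is therefore not lock-free), and the names `SegmentPool`, `claim`, `worker`, and `run` are hypothetical.

```python
import threading


class SegmentPool:
    """Hands out segment indices one at a time from a shared counter.

    The lock is a stand-in for a hardware atomic fetch-and-add; a real
    lock-free pool would increment the counter atomically instead.
    """

    def __init__(self, n_segments):
        self._next = 0
        self._n = n_segments
        self._lock = threading.Lock()

    def claim(self):
        """Return the next unclaimed segment index, or None when the pool is empty."""
        with self._lock:
            if self._next >= self._n:
                return None
            seg = self._next
            self._next += 1
            return seg


def worker(pool, claimed):
    """Keep claiming segments until none remain; fast workers simply claim more."""
    while True:
        seg = pool.claim()
        if seg is None:
            return
        claimed.append(seg)  # placeholder for "sieve and verify segment seg"


def run(n_segments, n_workers):
    pool = SegmentPool(n_segments)
    per_worker = [[] for _ in range(n_workers)]
    threads = [
        threading.Thread(target=worker, args=(pool, per_worker[w]))
        for w in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return per_worker


per_worker = run(n_segments=100, n_workers=4)
all_claimed = sorted(seg for claimed in per_worker for seg in claimed)
```

Every segment is processed exactly once, but how many each worker handles is decided at run time. That is why this scheme suits heterogeneous devices: a faster device returns to the counter more often and takes on a larger share of the work.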

2026-03-08T23:58:47Z 14 pages, 4 figures, 3 tables. The presented work details a major architectural overhaul: migration of the segmented sieve to GPU L1 shared memory and the implementation of a lock-free multi-GPU work pool. Source code available at: https://github.com/isaac-6/goldbach-gpu Isaac Llorente-Saguer http://arxiv.org/abs/2511.00292v2 Numerically stable evaluation of closed-form expressions for eigenvalues of $3 \times 3$ matrices 2026-03-05T12:50:39Z

Trigonometric formulas for eigenvalues of $3 \times 3$ matrices that build on Cardano's and Viète's work on algebraic solutions of the cubic are numerically unstable for matrices with repeated eigenvalues. This work presents numerically stable, closed-form evaluation of eigenvalues of real, diagonalizable $3 \times 3$ matrices via four invariants: the trace $I_1$, the deviatoric invariants $J_2$ and $J_3$, and the discriminant $\Delta$. We analyze the conditioning of these invariants and derive tight forward error bounds. For $J_2$ we propose an algorithm and prove its accuracy. We benchmark all invariants and the resulting eigenvalue formulas, relating observed forward errors to the derived bounds. In particular, we show that, for the special case of matrices with a well-conditioned eigenbasis, the newly proposed algorithms have errors within the forward stability bounds. Performance benchmarks show that the proposed algorithm is approximately ten times faster than the highly optimized LAPACK library for a challenging test case, while maintaining comparable accuracy.
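
For context, the classical trigonometric formula that the abstract calls unstable can be sketched as follows. This is the textbook baseline, not the stable algorithm proposed in the paper, and it is restricted here to real symmetric matrices as a simplifying assumption. With $s = \lambda - I_1/3$, the deviatoric characteristic polynomial is $s^3 - J_2 s - J_3 = 0$, and the substitution $s = 2\sqrt{J_2/3}\,\cos\theta$ gives $\cos 3\theta = \tfrac{3\sqrt{3}}{2}\, J_3 / J_2^{3/2}$. The function name `eig_sym3_trig` is hypothetical.

```python
import math


def eig_sym3_trig(A):
    """Eigenvalues (ascending) of a real symmetric 3x3 matrix given as a list of rows.

    Textbook trigonometric formula; accuracy degrades when eigenvalues
    are (nearly) repeated, which is the problem the paper addresses.
    """
    m = (A[0][0] + A[1][1] + A[2][2]) / 3.0                  # I1 / 3
    # Deviatoric part S = A - m * I
    S = [[A[i][j] - (m if i == j else 0.0) for j in range(3)] for i in range(3)]
    # J2 = tr(S^2) / 2 and J3 = det(S)
    J2 = 0.5 * sum(S[i][j] * S[j][i] for i in range(3) for j in range(3))
    J3 = (S[0][0] * (S[1][1] * S[2][2] - S[1][2] * S[2][1])
          - S[0][1] * (S[1][0] * S[2][2] - S[1][2] * S[2][0])
          + S[0][2] * (S[1][0] * S[2][1] - S[1][1] * S[2][0]))
    if J2 <= 0.0:
        return [m, m, m]                                     # triple eigenvalue
    r = 2.0 * math.sqrt(J2 / 3.0)
    # Clamp to [-1, 1] because rounding can push the ratio slightly outside.
    c = max(-1.0, min(1.0, 1.5 * math.sqrt(3.0) * J3 / J2 ** 1.5))
    theta = math.acos(c) / 3.0
    return sorted(m + r * math.cos(theta - 2.0 * math.pi * k / 3.0) for k in range(3))


eigs_diag = eig_sym3_trig([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
eigs_block = eig_sym3_trig([[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 5.0]])
```

The clamp before `acos` and the `J2 <= 0` guard hint at where trouble starts: as two eigenvalues approach each other, the ratio $J_3 / J_2^{3/2}$ approaches $\pm 2/(3\sqrt{3})$, so the `acos` argument approaches $\pm 1$, where `acos` is ill-conditioned and small rounding errors in the invariants are amplified.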

2025-10-31T22:20:28Z 24 pages. Numer Algor (2026) Michal Habera Andreas Zilian 10.1007/s11075-026-02328-5 http://arxiv.org/abs/2602.10878v2 Simple generators of rational function fields 2026-03-05T12:22:31Z

Consider a subfield of the field of rational functions in several indeterminates. We present an algorithm that, given a set of generators of such a subfield, finds a simple generating set. We provide an implementation of the algorithm and show that it improves upon the state of the art both in efficiency and the quality of the results. Furthermore, we demonstrate the utility of simplified generators through several case studies from different application domains, such as structural parameter identifiability. The main algorithmic novelties include performing only partial Gröbner basis computation via sparse interpolation and efficient search for polynomials of a fixed degree in a subfield of the rational function field.

2026-02-11T14:07:00Z Alexander Demin Gleb Pogudin http://arxiv.org/abs/2603.02298v1 CuTe Layout Representation and Algebra 2026-03-02T18:31:12Z

Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be correctly propagated through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors. CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations, enabling the representation of complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion -- that enables sophisticated layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables powerful compile-time reasoning about layout properties and the expression of generic tensor transformations. In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions. CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL.
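
A layout in the sense described above can be read as a function from a logical index to a memory offset, defined by a shape and a matching stride. The sketch below is a simplified plain-Python illustration, not CuTe or CUTLASS code. It assumes a colexicographic convention (the leftmost mode varies fastest) and handles nesting by flattening; `flatten` and `layout_offset` are hypothetical names.

```python
def flatten(t):
    """Flatten a nested tuple of ints into a flat tuple, left to right."""
    if isinstance(t, int):
        return (t,)
    return tuple(x for sub in t for x in flatten(sub))


def layout_offset(index, shape, stride):
    """Map a 1-D logical index to a memory offset for a shape:stride pair.

    shape and stride are ints or identically nested tuples of ints.
    The index is decomposed mode by mode (leftmost mode fastest, an
    assumed convention) and each coordinate is multiplied by its stride.
    """
    offset = 0
    for extent, step in zip(flatten(shape), flatten(stride)):
        offset += (index % extent) * step   # coordinate in this mode times its stride
        index //= extent                    # move on to the next mode
    return offset


# Flat 2x3 layouts: column-major and row-major storage of the same logical shape.
col_major = [layout_offset(i, (2, 3), (1, 2)) for i in range(6)]
row_major = [layout_offset(i, (2, 3), (3, 1)) for i in range(6)]

# A hierarchical layout: the second mode is itself split into two sub-modes.
nested = [layout_offset(i, (2, (2, 2)), (4, (2, 1))) for i in range(8)]
```

The nested example shows why a hierarchical shape is more expressive than a flat one: a single logical mode can be spread over several strides, which a flat shape and stride pair of the same rank cannot describe.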

2026-03-02T18:31:12Z Cris Cecka http://arxiv.org/abs/2603.02621v1 GoldbachGPU: An Open Source GPU-Accelerated Framework for Verification of Goldbach's Conjecture 2026-03-02T15:51:57Z

We present GoldbachGPU, an open-source framework for large-scale computational verification of Goldbach's conjecture using commodity GPU hardware. Prior GPU-based approaches reported a hard memory ceiling near 10^11 due to monolithic prime-table allocation. We show that this limitation is architectural rather than fundamental: a dense bit-packed prime representation provides a 16x reduction in memory footprint, and a segmented double-sieve design removes the VRAM ceiling entirely. By inverting the verification loop and combining a GPU fast-path with a multi-phase primality oracle, the framework achieves exhaustive verification up to 10^12 on a single NVIDIA RTX 3070 (8 GB VRAM), with no counterexamples found. Each segment requires 14 MB of VRAM, yielding O(N) wall-clock time and O(1) memory in N. A rigorous CPU fallback guarantees mathematical completeness, though it was never invoked in practice. An arbitrary-precision checker using GMP and OpenMP extends single-number verification to 10^10000 via a synchronised batch-search strategy. The segmented architecture also exhibits clean multi-GPU scaling on data-centre hardware (tested on 8 x H100). All code is open-source, documented, and reproducible on both commodity and high-end hardware.
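
A bit-packed prime table and a brute-force Goldbach check can be sketched in a few lines of plain Python. This is not GoldbachGPU code. It assumes that the packing stores one bit per odd number, which would account for a 16x saving over one byte per integer, but the abstract does not spell out the encoding. The names `odd_prime_bits`, `is_prime`, and `verify_goldbach` are hypothetical, and the sketch is unsegmented and CPU-only.

```python
def odd_prime_bits(limit):
    """Sieve of Eratosthenes over odd numbers only; bit i represents 2*i + 1.

    Assumes limit >= 4. A set bit means "prime".
    """
    n_bits = (limit + 1) // 2
    bits = bytearray(b"\xff") * ((n_bits + 7) // 8)
    bits[0] &= 0xFE                                   # the number 1 is not prime
    i = 1                                             # bit 1 is the number 3
    while (2 * i + 1) ** 2 <= limit:
        if bits[i >> 3] >> (i & 7) & 1:
            p = 2 * i + 1
            for m in range(p * p, limit + 1, 2 * p):  # odd multiples of p only
                j = m >> 1
                bits[j >> 3] &= ~(1 << (j & 7)) & 0xFF
        i += 1
    return bits


def is_prime(bits, n):
    """Primality lookup for 0 <= n <= limit used to build bits."""
    if n == 2:
        return True
    if n < 2 or n % 2 == 0:
        return False
    i = n >> 1
    return bool(bits[i >> 3] >> (i & 7) & 1)


def verify_goldbach(limit):
    """Return the first even n in [4, limit] with no prime pair, or None if all pass."""
    bits = odd_prime_bits(limit)
    for n in range(4, limit + 1, 2):
        if not any(is_prime(bits, p) and is_prime(bits, n - p)
                   for p in range(2, n // 2 + 1)):
            return n
    return None


counterexample = verify_goldbach(2000)
table_bytes = len(odd_prime_bits(1000))
primes_below_100 = sum(is_prime(odd_prime_bits(100), n) for n in range(101))
```

For a limit of 1000 the table occupies 63 bytes, against 1001 bytes for one byte per integer, which is where a factor of about 16 would come from under the odd-only assumption. A segmented design would build such a table for one window at a time instead of for the whole range.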

2026-03-02T15:51:57Z 11 pages, 7 tables, 2 figures. Accompanies the v1.1.0 release of GoldbachGPU (Zenodo DOI: https://zenodo.org/records/18837081) Isaac Llorente-Saguer http://arxiv.org/abs/2603.00880v1 A natural language framework for non-conforming hybrid polytopal methods in Gridap.jl 2026-03-01T03:04:07Z

Hybrid finite element methods such as hybridizable discontinuous Galerkin, hybrid high-order and weak Galerkin have emerged as powerful techniques for solving partial differential equations on general polytopal meshes. Despite their diverse mathematical origins, these methods share a common computational structure involving hybrid discrete spaces, local projection operators and static condensation. This work presents a comprehensive framework for implementing such methods within the Gridap finite element library. We introduce new abstractions for polytopal mesh representation using graph-based structures, broken polynomial spaces on arbitrary mesh entities, patch-based local assembly for cell-wise linear systems, high-level local operator construction and automated static condensation. These abstractions enable concise implementations of hybrid methods while maintaining computational efficiency through Julia's just-in-time compilation and Gridap's lazy evaluation strategies. We demonstrate the framework through implementations of several non-conforming polytopal methods for the Poisson problem, linear elasticity, incompressible Stokes flow and optimal control on polytopal meshes.

2026-03-01T03:04:07Z 25 pages, 8 figures, 13 listings Jordi Manyer Jai Tushar Santiago Badia http://arxiv.org/abs/2603.00214v1 Agentic Scientific Simulation: Execution-Grounded Model Construction and Reconstruction 2026-02-27T15:42:05Z

LLM agents are increasingly used for code generation, but physics-based simulation poses a deeper challenge: natural-language descriptions of simulation models are inherently underspecified, and different admissible resolutions of implicit choices produce physically valid but scientifically distinct configurations. Without explicit detection and resolution of these ambiguities, neither the correctness of the result nor its reproducibility from the original description can be assured. This paper investigates agentic scientific simulation, where model construction is organized as an execution-grounded interpret-act-validate loop and the simulator serves as the authoritative arbiter of physical validity rather than merely a runtime. We present JutulGPT, a reference implementation built on the fully differentiable Julia-based reservoir simulator JutulDarcy. The agent combines structured retrieval of documentation and examples with code synthesis, static analysis, execution, and systematic interpretation of solver diagnostics. Underspecified modelling choices are detected explicitly and resolved either autonomously (with logged assumptions) or through targeted user queries. The results demonstrate that agent-mediated model construction can be grounded in simulator validation, while also revealing a structural limitation: choices resolved tacitly through simulator defaults are invisible to the assumption log and to any downstream representation. A secondary experiment with autonomous reconstruction of a reference model from progressively abstract textual descriptions shows that reconstruction variability exposes latent degrees of freedom in simulation descriptions and provides a practical methodology for auditing reproducibility. All code, prompts, and agent logs are publicly available.

2026-02-27T15:42:05Z Knut-Andreas Lie Olav Møyner Elling Svee Jakob Torben http://arxiv.org/abs/2602.23551v1 Hyper-reduction methods for accelerating nonlinear finite element simulations: open source implementation and reproducible benchmarks 2026-02-26T23:21:31Z

Hyper-reduction methods have gained increasing attention for their potential to accelerate reduced order models for nonlinear systems, yet their comparative accuracy and computational efficiency are not well understood. Motivated by this gap, we evaluate a range of hyper-reduction techniques for nonlinear finite element models across benchmark problems of varying complexity, assessing the inevitable tradeoff between accuracy and speedup. More specifically, we consider interpolation methods based on the gappy proper orthogonal decomposition as well as the empirical quadrature procedure (EQP), and apply them to the hyper-reduction of problems in nonlinear diffusion, nonlinear elasticity and Lagrangian hydrodynamics. Our numerical results are generated using the open source libROM, Laghos and MFEM numerical libraries. Our findings reveal that the comparative performance of hyper-reduction methods depends on both the problem and the choice of time integration method. The EQP method generally achieves lower relative errors than interpolation methods and is more efficient in terms of quadrature point usage, resulting in a lower wall time for the nonlinear diffusion and elasticity problems. However, its online computational cost is observed to be relatively high for Lagrangian hydrodynamics problems. Conversely, interpolation methods exhibit greater variability, especially with respect to the use of different time integration methods in the Lagrangian hydrodynamics problems. The presented results underscore the need for problem-specific method selection to balance accuracy and efficiency, while also offering useful guidance for future comparisons and refinements of hyper-reduction techniques.

2026-02-26T23:21:31Z Axel Larsson Minji Kim Chris Vales Sigrid Adriaenssens Dylan Matthew Copeland Youngsoo Choi Siu Wun Cheung