http://arxiv.org/abs/2603.15920v1
DiFVM: A Vectorized Graph-Based Finite Volume Solver for Differentiable CFD on Unstructured Meshes
2026-03-16T21:14:18Z
Differentiable programming has emerged as a structural prerequisite for gradient-based inverse problems and end-to-end hybrid physics--machine learning in computational fluid dynamics. However, existing differentiable CFD platforms are confined to structured Cartesian grids, excluding the geometrically complex domains where body-conforming unstructured discretizations are indispensable. We present DiFVM, the first GPU-accelerated, end-to-end differentiable finite-volume CFD solver operating natively on unstructured polyhedral meshes. The key enabling insight is a structural isomorphism between finite-volume discretization and graph neural network message-passing: by reformulating all FVM operators as static scatter/gather primitives on the mesh connectivity graph, DiFVM transforms irregular unstructured connectivity into a first-class GPU data structure. All operations are implemented in JAX/XLA, providing just-in-time compilation, operator fusion, and automatic differentiation through the complete simulation pipeline. Differentiable Windkessel outlet boundary conditions are provided for cardiovascular applications, and DiFVM accepts standard OpenFOAM case directories without modification for seamless adoption in existing workflows. Forward validation across benchmarks spanning canonical flows to patient-specific hemodynamics demonstrates close agreement with OpenFOAM, and end-to-end differentiability is demonstrated through inference of Windkessel parameters from sparse observations.
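The scatter/gather view of a finite-volume operator described above can be illustrated with a small sketch. This is not DiFVM code: it is plain Python rather than JAX, and the names (`owner`, `neighbour`, `face_gradient_flux`, `net_outflow`) and the three-cell mesh are hypothetical. It assumes a face-based layout in which each interior face stores the indices of the two cells it separates, so that cell values are gathered to faces and face fluxes are scatter-added back to cells, the same pattern as message passing along graph edges.

```python
def face_gradient_flux(phi, owner, neighbour, coeff):
    """Gather cell values to faces and form a two-point diffusive flux.

    owner[f] and neighbour[f] are the two cells sharing interior face f.
    The flux is positive when it flows from owner to neighbour.
    """
    return [coeff[f] * (phi[owner[f]] - phi[neighbour[f]])
            for f in range(len(owner))]


def net_outflow(n_cells, owner, neighbour, face_flux):
    """Scatter-add face fluxes into a per-cell net outflow."""
    out = [0.0] * n_cells
    for f, flux in enumerate(face_flux):
        out[owner[f]] += flux       # flux leaves the owner cell
        out[neighbour[f]] -= flux   # and enters the neighbour cell
    return out


# Hypothetical 1-D chain of three cells: 0 | 1 | 2 (two interior faces).
owner = [0, 1]
neighbour = [1, 2]
phi = [3.0, 1.0, 0.0]     # cell-centred field
coeff = [1.0, 1.0]        # assumed face coefficient (diffusivity * area / distance)

flux = face_gradient_flux(phi, owner, neighbour, coeff)   # gather step
residual = net_outflow(3, owner, neighbour, flux)         # scatter step
print(flux, residual)
```

Because every interior flux is added to one cell and subtracted from another, the per-cell values sum to zero, which is the discrete conservation property of the finite-volume method. In an array framework the two loops would typically become an indexed gather and a segment sum over connectivity arrays that stay fixed during the simulation.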
DiFVM bridges the critical gap between differentiable programming and unstructured-mesh CFD, enabling gradient-based inverse problems and physics-integrated machine learning on complex engineering geometries.
2026-03-16T21:14:18Z
44 pages, 13 figures
Pan Du, Yongqi Li, Mingqi Xu, Jian-Xun Wang

http://arxiv.org/abs/2603.14103v1
Scorio.jl: A Julia package for ranking stochastic responses
2026-03-14T20:12:56Z
Scorio.jl is a Julia package for evaluating and ranking systems from repeated responses to shared tasks. It provides a common tensor-based interface for direct score-based, pairwise, psychometric, voting, graph, and listwise methods, so the same benchmark can be analyzed under multiple ranking assumptions. We describe the package design, position it relative to existing Julia tools, and report pilot experiments on synthetic rank recovery, stability under limited trials, and runtime scaling.
2026-03-14T20:12:56Z
Mohsen Hariri, Michael Hinczewski, Vipin Chaudhary

http://arxiv.org/abs/2603.14040v1
Pyroclast: A Modular High-Performance Python Solver for Geodynamics
2026-03-14T17:23:57Z
This monograph presents the design, implementation, and evaluation of Pyroclast, a modular high-performance Python framework for large-scale geodynamic simulations. Pyroclast addresses limitations of legacy geodynamics solvers, often implemented in monolithic Fortran, C++, or C codebases with limited GPU support and extensibility, by combining modern numerical methods, hardware-accelerated execution, and a flexible object-oriented architecture. Designed for distributed and GPU-accelerated environments, Pyroclast provides an accessible and efficient platform for simulating mantle convection and lithospheric deformation using the marker-in-cell method and a matrix-free finite difference discretization. The work focuses on a scalable two-dimensional viscous mechanical solver that forms the computational core for future visco-elasto-plastic models.
The solver includes a stress-conservative staggered grid discretization of the incompressible Stokes equations, a matrix-free geometric multigrid solver, Krylov and quasi-Newton methods, and MPI-based domain decomposition for distributed execution. Benchmarks evaluate performance and scalability. Shared-memory tests show strong scaling of the Stokes solver and demonstrate a 5-10x speedup on NVIDIA A100 GPUs compared to a multi-core CPU baseline. Distributed advection benchmarks show near-ideal weak scaling up to 896 CPU cores across seven compute nodes. These results demonstrate that Pyroclast achieves high performance while remaining accessible through a high-level Python interface. The framework also provides a blueprint for modernizing legacy geodynamics codes. Its modular architecture and Python-native implementation lower the barrier to entry while enabling interoperability with modern machine learning libraries, which supports hybrid physics-based and data-driven workflows.
2026-03-14T17:23:57Z
138 pages. Research monograph describing the Pyroclast geodynamics solver
Marcel Ferrari

http://arxiv.org/abs/2503.08126v2
Trilinos: Enabling Scientific Computing Across Diverse Hardware Architectures at Scale
2026-03-12T19:47:46Z
Trilinos is a community-developed, open-source software framework that facilitates building large-scale, complex, multiscale, multiphysics simulation code bases for scientific and engineering problems. Since the Trilinos framework has undergone substantial changes to support new applications and new hardware architectures, this document is an update to "An Overview of the Trilinos project" by Heroux et al. (ACM Transactions on Mathematical Software, 31(3):397-423, 2005). It describes the design of Trilinos, introduces its new organization in product areas, and highlights established and new features available in Trilinos.
Particular focus is put on the modernized software stack based on the Kokkos ecosystem to deliver performance portability across heterogeneous hardware architectures. This paper also outlines the organization of the Trilinos community and the contribution model to help onboard interested users and contributors.
2025-03-11T07:44:20Z
32 pages, 1 figure
Matthias Mayr, Alexander Heinlein, Christian Glusa, Siva Rajamanickam, Maarten Arnst, Roscoe Bartlett, Luc Berger-Vergiat, Erik Boman, Karen Devine, Graham Harper, Michael Heroux, Mark Hoemmen, Jonathan Hu, Brian Kelley, Kyungjoo Kim, Drew P. Kouri, Paul Kuberry, Kim Liegeois, Curtis C. Ober, Roger Pawlowski, Carl Pearson, Mauro Perego, Eric Phipps, Denis Ridzal, Nathan V. Roberts, Christopher Siefert, Heidi Thornquist, Romin Tomasetti, Christian R. Trott, Raymond S. Tuminaro, James M. Willenbring, Michael M. Wolf, Ichitaro Yamazaki

http://arxiv.org/abs/2603.10599v1
Self-Scaled Broyden Family of Quasi-Newton Methods in JAX
2026-03-11T09:53:11Z
We present a JAX implementation of the Self-Scaled Broyden family of quasi-Newton methods, fully compatible with JAX and building on the Optimistix~\cite{rader_optimistix_2024} optimisation library. The implementation includes BFGS, DFP, Broyden and their Self-Scaled variants (SSBFGS, SSDFP, SSBroyden), together with a Zoom line search satisfying the strong Wolfe conditions. This is a short technical note, not a research paper, as it does not claim any novel contribution; its purpose is to document the implementation and ease the adoption of these optimisers within the JAX community.
The code is available at https://github.com/IvanBioli/ssbroyden_optimistix.git.
2026-03-11T09:53:11Z
Ivan Bioli, Mikel Mendibe Abarrategi

http://arxiv.org/abs/2510.14964v2
Efficient and Flexible Multirate Temporal Adaptivity
2026-03-10T13:22:42Z
In this work we present two new families of multirate time step adaptivity controllers that are designed to work with embedded multirate infinitesimal (MRI) time integration methods for adapting time steps when solving problems with multiple time scales. We compare these controllers against competing approaches on two benchmark problems, showing that the proposed methods offer dramatically improved performance and flexibility. The combination of embedded MRI methods and the proposed controllers enables adaptive simulations of problems with a potentially arbitrary number of time scales, achieving high accuracy while maintaining low computational cost. Additionally, we introduce a new set of embeddings for the family of explicit multirate exponential Runge--Kutta (MERK) methods of orders 2 through 5, resulting in the first-ever fifth-order embedded MRI method. Finally, we compare the performance of a wide range of embedded MRI methods on our benchmark problems to provide guidance on how to select an appropriate MRI method and multirate controller.
2025-10-16T17:59:16Z
Daniel R. Reynolds, Sylvia Amihere, Dashon Mitchell, Vu Thai Luan

http://arxiv.org/abs/2603.08957v1
Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation
2026-03-09T21:43:39Z
A \emph{tensor-relational} computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays.
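As a rough illustration of this definition, the sketch below treats a block-sparse matrix as a relation whose tuples are keyed by block indices and carry a small dense block. It is not the paper's \texttt{EinSum} notation or system; the function names and the example data are hypothetical, and plain Python lists stand in for optimized kernels. Matrix multiplication then becomes a join on the shared index followed by a grouped sum.

```python
from collections import defaultdict


def block_matmul(a, b):
    """Dense product of two small blocks stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def block_add(a, b):
    """Element-wise sum of two blocks of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def relational_matmul(A, B):
    """Join A(i, k, block) with B(k, j, block) on k, then sum per (i, j)."""
    b_by_k = defaultdict(list)
    for (k, j), blk in B.items():
        b_by_k[k].append((j, blk))
    C = {}
    for (i, k), a_blk in A.items():
        for j, b_blk in b_by_k.get(k, []):       # join on the shared key k
            prod = block_matmul(a_blk, b_blk)    # dense kernel per tuple pair
            C[(i, j)] = block_add(C[(i, j)], prod) if (i, j) in C else prod
    return C


# Only non-zero blocks are stored; missing keys are implicit zero blocks.
A = {(0, 0): [[1, 2], [3, 4]], (0, 1): [[1, 1], [1, 1]]}
B = {(0, 0): [[1, 0], [0, 1]], (1, 0): [[2, 0], [0, 2]], (2, 1): [[9, 9], [9, 9]]}
print(relational_matmul(A, B))
```

The block keyed `(2, 1)` in `B` never finds a join partner in `A`, so it costs nothing, while each matching pair is handed to a dense kernel.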
An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case \texttt{EinSum}, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case \texttt{EinSum} so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.
2026-03-09T21:43:39Z
Yuxin Tang, Zhiyuan Xin, Zhimin Ding, Xinyu Yao, Daniel Bourgeois, Tirthak Patel, Chris Jermaine

http://arxiv.org/abs/2603.07850v1
A Lock-Free, Fully GPU-Resident Architecture for the Verification of Goldbach's Conjecture
2026-03-08T23:58:47Z
We present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double-sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency. In this work, we migrate the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path. To fully leverage heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, enabling $99.7$% parallel efficiency at 2 GPUs and $98.6$% at $4$ GPUs. We further implement strict mathematical overflow guards guaranteeing the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$.
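For readers unfamiliar with the verification task itself, the following toy sketch checks the conjecture on a CPU for small bounds. It is not the paper's GPU architecture and uses neither segmentation nor work stealing; the function names are hypothetical. It builds one sieve of Eratosthenes and, for each even number, searches for the smallest prime whose complement is also prime. The last line prints 2**64; the quoted ceiling of about 1.84 x 10^19 is consistent with the range of an unsigned 64-bit integer.

```python
def prime_sieve(limit):
    """Return a bytearray with is_prime[k] == 1 iff k is prime (0 <= k <= limit)."""
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # Cross out multiples of p, starting at p * p.
            is_prime[p * p :: p] = bytes(len(range(p * p, limit + 1, p)))
        p += 1
    return is_prime


def smallest_goldbach_prime(n, is_prime):
    """Smallest prime p with n - p also prime, or None if there is none."""
    for p in range(2, n // 2 + 1):
        if is_prime[p] and is_prime[n - p]:
            return p
    return None


def verify_goldbach(limit):
    """Check every even n in [4, limit]; return the first failure, else None."""
    is_prime = prime_sieve(limit)
    for n in range(4, limit + 1, 2):
        if smallest_goldbach_prime(n, is_prime) is None:
            return n
    return None


print(verify_goldbach(10_000))                          # None: no counterexample
print(smallest_goldbach_prime(100, prime_sieve(100)))   # 100 = 3 + 97
print(2**64)
```

A single monolithic sieve like this one needs memory proportional to the bound, which is the kind of limit that segmented designs are meant to avoid.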
On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at N = $10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in $36.5$ seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in $133.5$ seconds on a four-GPU system. All code is open-source and reproducible on commodity hardware.
2026-03-08T23:58:47Z
14 pages, 4 figures, 3 tables. The presented work details a major architectural overhaul: migration of the segmented sieve to GPU L1 shared memory and the implementation of a lock-free multi-GPU work pool. Source code available at: https://github.com/isaac-6/goldbach-gpu
Isaac Llorente-Saguer

http://arxiv.org/abs/2511.00292v2
Numerically stable evaluation of closed-form expressions for eigenvalues of $3 \times 3$ matrices
2026-03-05T12:50:39Z
Trigonometric formulas for eigenvalues of $3 \times 3$ matrices that build on Cardano's and Viète's work on algebraic solutions of the cubic are numerically unstable for matrices with repeated eigenvalues. This work presents numerically stable, closed-form evaluation of eigenvalues of real, diagonalizable $3 \times 3$ matrices via four invariants: the trace $I_1$, the deviatoric invariants $J_2$ and $J_3$, and the discriminant $Δ$. We analyze the conditioning of these invariants and derive tight forward error bounds. For $J_2$ we propose an algorithm and prove its accuracy. We benchmark all invariants and the resulting eigenvalue formulas, relating observed forward errors to the derived bounds. In particular, we show that, for the special case of matrices with a well-conditioned eigenbasis, the newly proposed algorithms have errors within the forward stability bounds. Performance benchmarks show that the proposed algorithm is approximately ten times faster than the highly optimized LAPACK library for a challenging test case, while maintaining comparable accuracy.
2025-10-31T22:20:28Z
24 pages.
Numer Algor (2026)
Michal Habera, Andreas Zilian
10.1007/s11075-026-02328-5

http://arxiv.org/abs/2602.10878v2
Simple generators of rational function fields
2026-03-05T12:22:31Z
Consider a subfield of the field of rational functions in several indeterminates. We present an algorithm that, given a set of generators of such a subfield, finds a simple generating set. We provide an implementation of the algorithm and show that it improves upon the state of the art both in efficiency and the quality of the results. Furthermore, we demonstrate the utility of simplified generators through several case studies from different application domains, such as structural parameter identifiability. The main algorithmic novelties include performing only partial Gröbner basis computation via sparse interpolation and efficient search for polynomials of a fixed degree in a subfield of the rational function field.
2026-02-11T14:07:00Z
Alexander Demin, Gleb Pogudin

http://arxiv.org/abs/2603.02298v1
CuTe Layout Representation and Algebra
2026-03-02T18:31:12Z
Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be correctly propagated through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors.
CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations, enabling the representation of complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion -- that enables sophisticated layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables powerful compile-time reasoning about layout properties and the expression of generic tensor transformations.
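The step from flat shape-and-stride tensors to hierarchical layouts can be sketched in a few lines. The code below is not the CuTe API; `size`, `offset`, and `table` are hypothetical helper names. It assumes the colexicographic convention in which the leftmost mode varies fastest, and it maps a one-dimensional coordinate to a memory offset by recursing through nested shape and stride tuples.

```python
def size(shape):
    """Number of coordinates in a (possibly nested) shape."""
    if isinstance(shape, int):
        return shape
    total = 1
    for mode in shape:
        total *= size(mode)
    return total


def offset(i, shape, stride):
    """Map a 1-D coordinate i in [0, size(shape)) to a memory offset."""
    if isinstance(shape, int):
        return i * stride
    result = 0
    for sub_shape, sub_stride in zip(shape, stride):
        n = size(sub_shape)
        result += offset(i % n, sub_shape, sub_stride)  # coordinate within this mode
        i //= n                                         # remainder goes to later modes
    return result


def table(shape, stride):
    """Offsets of all coordinates, listed in 1-D coordinate order."""
    return [offset(i, shape, stride) for i in range(size(shape))]


print(table((2, 3), (1, 2)))             # flat column-major 2x3
print(table((2, 3), (3, 1)))             # flat row-major 2x3
print(table(((2, 2), 2), ((4, 1), 2)))   # nested first mode
```

In the third example the first mode has four elements at offsets 0, 4, 1, 5. No single integer stride produces that sequence, so a flat two-mode shape and stride cannot express this mapping, whereas the nested pair `(2, 2)` with strides `(4, 1)` can.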
In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions.
CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL.
2026-03-02T18:31:12Z
Cris Cecka

http://arxiv.org/abs/2603.02621v1
GoldbachGPU: An Open Source GPU-Accelerated Framework for Verification of Goldbach's Conjecture
2026-03-02T15:51:57Z
We present GoldbachGPU, an open-source framework for large-scale computational verification of Goldbach's conjecture using commodity GPU hardware. Prior GPU-based approaches reported a hard memory ceiling near 10^11 due to monolithic prime-table allocation. We show that this limitation is architectural rather than fundamental: a dense bit-packed prime representation provides a 16x reduction in memory footprint, and a segmented double-sieve design removes the VRAM ceiling entirely. By inverting the verification loop and combining a GPU fast-path with a multi-phase primality oracle, the framework achieves exhaustive verification up to 10^12 on a single NVIDIA RTX 3070 (8 GB VRAM), with no counterexamples found. Each segment requires 14 MB of VRAM, yielding O(N) wall-clock time and O(1) memory in N. A rigorous CPU fallback guarantees mathematical completeness, though it was never invoked in practice. An arbitrary-precision checker using GMP and OpenMP extends single-number verification to 10^10000 via a synchronised batch-search strategy. The segmented architecture also exhibits clean multi-GPU scaling on data-centre hardware (tested on 8 x H100). All code is open-source, documented, and reproducible on both commodity and high-end hardware.
2026-03-02T15:51:57Z
11 pages, 7 tables, 2 figures.
Accompanies the v1.1.0 release of GoldbachGPU (Zenodo DOI: https://zenodo.org/records/18837081)
Isaac Llorente-Saguer

http://arxiv.org/abs/2603.00880v1
A natural language framework for non-conforming hybrid polytopal methods in Gridap.jl
2026-03-01T03:04:07Z
Hybrid finite element methods such as hybridizable discontinuous Galerkin, hybrid high-order and weak Galerkin have emerged as powerful techniques for solving partial differential equations on general polytopal meshes. Despite their diverse mathematical origins, these methods share a common computational structure involving hybrid discrete spaces, local projection operators and static condensation. This work presents a comprehensive framework for implementing such methods within the Gridap finite element library. We introduce new abstractions for polytopal mesh representation using graph-based structures, broken polynomial spaces on arbitrary mesh entities, patch-based local assembly for cell-wise linear systems, high-level local operator construction and automated static condensation. These abstractions enable concise implementations of hybrid methods while maintaining computational efficiency through Julia's just-in-time compilation and Gridap's lazy evaluation strategies. We demonstrate the framework through implementations of several non-conforming polytopal methods for the Poisson problem, linear elasticity, incompressible Stokes flow and optimal control on polytopal meshes.
2026-03-01T03:04:07Z
25 pages, 8 figures, 13 listings
Jordi Manyer, Jai Tushar, Santiago Badia

http://arxiv.org/abs/2603.00214v1
Agentic Scientific Simulation: Execution-Grounded Model Construction and Reconstruction
2026-02-27T15:42:05Z
LLM agents are increasingly used for code generation, but physics-based simulation poses a deeper challenge: natural-language descriptions of simulation models are inherently underspecified, and different admissible resolutions of implicit choices produce physically valid but scientifically distinct configurations.
Without explicit detection and resolution of these ambiguities, neither the correctness of the result nor its reproducibility from the original description can be assured.
This paper investigates agentic scientific simulation, where model construction is organized as an execution-grounded interpret-act-validate loop and the simulator serves as the authoritative arbiter of physical validity rather than merely a runtime. We present JutulGPT, a reference implementation built on the fully differentiable Julia-based reservoir simulator JutulDarcy. The agent combines structured retrieval of documentation and examples with code synthesis, static analysis, execution, and systematic interpretation of solver diagnostics. Underspecified modelling choices are detected explicitly and resolved either autonomously (with logged assumptions) or through targeted user queries.
The results demonstrate that agent-mediated model construction can be grounded in simulator validation, while also revealing a structural limitation: choices resolved tacitly through simulator defaults are invisible to the assumption log and to any downstream representation. A secondary experiment with autonomous reconstruction of a reference model from progressively abstract textual descriptions shows that reconstruction variability exposes latent degrees of freedom in simulation descriptions and provides a practical methodology for auditing reproducibility. All code, prompts, and agent logs are publicly available.
2026-02-27T15:42:05Z
Knut-Andreas Lie, Olav Møyner, Elling Svee, Jakob Torben

http://arxiv.org/abs/2602.23551v1
Hyper-reduction methods for accelerating nonlinear finite element simulations: open source implementation and reproducible benchmarks
2026-02-26T23:21:31Z
Hyper-reduction methods have gained increasing attention for their potential to accelerate reduced order models for nonlinear systems, yet their comparative accuracy and computational efficiency are not well understood. Motivated by this gap, we evaluate a range of hyper-reduction techniques for nonlinear finite element models across benchmark problems of varying complexity, assessing the inevitable tradeoff between accuracy and speedup. More specifically, we consider interpolation methods based on the gappy proper orthogonal decomposition as well as the empirical quadrature procedure (EQP), and apply them to the hyper-reduction of problems in nonlinear diffusion, nonlinear elasticity and Lagrangian hydrodynamics. Our numerical results are generated using the open source libROM, Laghos and MFEM numerical libraries. Our findings reveal that the comparative performance between hyper-reduction methods depends on both the problem and the choice of time integration method.
The EQP method generally achieves lower relative errors than interpolation methods and is more efficient in terms of quadrature point usage, resulting in a lower wall time for the nonlinear diffusion and elasticity problems. However, its online computational cost is observed to be relatively high for Lagrangian hydrodynamics problems. Conversely, interpolation methods exhibit greater variability, especially with respect to the use of different time integration methods in the Lagrangian hydrodynamics problems. The presented results underscore the need for problem specific method selection to balance accuracy and efficiency, while also offering useful guidance for future comparisons and refinements of hyper-reduction techniques.
2026-02-26T23:21:31Z
Axel Larsson, Minji Kim, Chris Vales, Sigrid Adriaenssens, Dylan Matthew Copeland, Youngsoo Choi, Siu Wun Cheung