https://arxiv.org/api/aT2Wi5KJFyUMR1GdSnkaha6BAQQ 2026-06-21T10:41:11Z 2664 90 15 http://arxiv.org/abs/2603.09038v2 Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores 2026-04-09T18:07:55Z

Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2$\times$ performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system. The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.

2026-03-10T00:12:47Z Jiqun Tu Ian Karlin John Camier Veselin Dobrev Tzanio Kolev Stefan Henneking Omar Ghattas http://arxiv.org/abs/2601.17979v2 An Efficient Batch Solver for the Singular Value Decomposition on GPUs 2026-04-09T13:58:01Z

The singular value decomposition (SVD) is a powerful tool in modern numerical linear algebra, which underpins computational methods such as principal component analysis (PCA), low-rank approximations, and randomized algorithms. Many practical scenarios require solving numerous small SVD problems, a regime generally referred to as "batch SVD". Existing programming models can handle this efficiently on parallel CPU architectures, but high-performance solutions for GPUs remain immature. A GPU-oriented batch SVD solver is introduced. This solver uses the one-sided Jacobi algorithm to exploit fine-grained parallelism, and a number of algorithmic and design optimizations achieve unmatched performance. Starting from a baseline solver, a sequence of optimizations is applied to obtain incremental performance gains. Numerical experiments show that the new solver is robust across problems with different numerical properties, matrix shapes, and arithmetic precisions. Performance benchmarks on both NVIDIA and AMD systems show significant performance speedups over vendor solutions as well as existing open-source solvers.
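
For readers unfamiliar with the one-sided Jacobi algorithm named in this abstract, the following is a minimal pure-Python sketch of the textbook (Hestenes) variant for a single small matrix. It is not the authors' GPU solver: the function name `one_sided_jacobi_svd`, the sweep limit, and the tolerance are assumptions chosen for illustration, and a batch would simply apply the routine to each matrix independently.

```python
import math

def one_sided_jacobi_svd(a, sweeps=30, tol=1e-12):
    """Hypothetical helper: return (u, s, v) with a = u * diag(s) * v^T.

    `a` is an m x n list of lists with m >= n. Columns of `u` are rotated
    pairwise until they are mutually orthogonal; `v` accumulates the rotations.
    """
    m, n = len(a), len(a[0])
    u = [[float(x) for x in row] for row in a]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(u[i][p] * u[i][p] for i in range(m))
                beta = sum(u[i][q] * u[i][q] for i in range(m))
                gamma = sum(u[i][p] * u[i][q] for i in range(m))
                # Skip pairs of columns that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Plane rotation that zeroes the inner product of columns p and q.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for mat, rows in ((u, m), (v, n)):
                    for i in range(rows):
                        xp, xq = mat[i][p], mat[i][q]
                        mat[i][p] = c * xp - s * xq
                        mat[i][q] = s * xp + c * xq
        if not rotated:
            break
    # Singular values are the column norms; normalizing the columns gives U.
    sing = [math.sqrt(sum(u[i][j] ** 2 for i in range(m))) for j in range(n)]
    for j in range(n):
        if sing[j] > 0.0:
            for i in range(m):
                u[i][j] /= sing[j]
    return u, sing, v

u, s, v = one_sided_jacobi_svd([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
```

Each column pair (p, q) is processed independently of the other pairs that share no column with it, which is the source of the fine-grained parallelism the abstract refers to.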

2026-01-25T20:19:21Z Ahmad Abdelfattah Massimiliano Fasi http://arxiv.org/abs/2604.06575v2 Polylab: A MATLAB Toolbox for Multivariate Polynomial Modeling 2026-04-09T02:52:52Z

Polylab is a MATLAB toolbox for multivariate polynomial scalars and polynomial matrices with a unified symbolic-numeric interface across CPU and GPU-oriented backends. The software exposes three aligned classes: MPOLY for CPU execution, MPOLY_GPU as a legacy GPU baseline, and MPOLY_HP as an improved GPU-oriented implementation. Across these backends, Polylab supports polynomial construction, algebraic manipulation, simplification, matrix operations, differentiation, Jacobian and Hessian construction, LaTeX export, CPU-side LaTeX reconstruction, backend conversion, and interoperability with YALMIP and SOSTOOLS. Versions 3.0 and 3.1 add two practically important extensions: explicit variable identity and naming for safe mixed-variable expression handling, and affine-normal direction computation via automatic differentiation, MF-logDet-Exact, and MF-logDet-Stochastic. The toolbox has already been used successfully in prior research applications, and Polylab Version 3.1 adds a new geometry-oriented computational layer on top of a mature polynomial modeling core. This article documents the architecture and user-facing interface of the software, organizes its functionality by workflow, presents representative MATLAB sessions with actual outputs, and reports reproducible benchmarks. The results show that MPOLY is the right default for lightweight interactive workloads, whereas MPOLY_HP becomes advantageous for reduction-heavy simplification and medium-to-large affine-normal computation; the stochastic log-determinant variant becomes attractive in larger sparse regimes under approximation-oriented parameter choices.

2026-04-08T01:50:23Z 21 pages, 4 figures, 12 tables Yi-Shuai Niu Shing-Tung Yau http://arxiv.org/abs/2604.07311v1 A Proposed Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science (FAMLIES) 2026-04-08T17:22:00Z

We leverage highly successful prior projects sponsored by multiple NSF grants and gifts from industry: the BLAS-like Library Instantiation Software (BLIS) and the libflame efforts to lay the foundation for a new flexible framework by vertically integrating the dense linear and multi-linear (tensor) software stacks that are important to modern computing. This vertical integration will enable high-performance computations from node-level to massively-parallel, and across both CPU and GPU architectures. The effort builds on decades of experience by the research team turning fundamental research on the systematic derivation of algorithms (the NSF-sponsored FLAME project) into practical software for this domain, targeting single and multi-core (BLIS, TBLIS, and libflame), GPU-accelerated (SuperMatrix), and massively parallel (PLAPACK, Elemental, and ROTE) compute environments. This project will implement key linear algebra and tensor operations which highlight the flexibility and effectiveness of the new framework, and set the stage for further work in broadening functionality and integration into diverse scientific and machine learning software.

2026-04-08T17:22:00Z 24 pages Devin A. Matthews Tze Meng Low Margaret E. Myers Devangi N. Parikh Robert A. van de Geijn http://arxiv.org/abs/2604.07240v1 $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture 2026-04-08T16:06:43Z

We introduce a code-based challenge for automated, open-ended mathematical discovery based on the $k$-server conjecture, a central open problem in competitive analysis. The task is to discover a potential function satisfying a large graph-structured system of simple linear inequalities. The resulting evaluation procedure is sound but incomplete: any violated inequality definitively refutes a candidate, whereas satisfying all inequalities does not by itself constitute a proof of the corresponding conjecture's special case. Nevertheless, a candidate that passes all constraints would be strong evidence toward a valid proof and, to the best of our knowledge, no currently known potential achieves this under our formulation in the open $k=4$ circle case. As such, a successful candidate would already be an interesting contribution to the $k$-server conjecture, and could become a substantial theoretical result when paired with a full proof. Experiments on the resolved $k=3$ regime show that current agentic methods can solve nontrivial instances, and in the open $k=4$ regime they reduce the number of violations relative to existing potentials without fully resolving the task. Taken together, these results suggest that the task is challenging but plausibly within reach of current methods. Beyond its relevance to the $k$-server community, where the developed tooling enables researchers to test new hypotheses and potentially improve on the current record, the task also serves as a useful \emph{benchmark} for developing code-based discovery agents. In particular, our $k=3$ results show that it mitigates important limitations of existing open-ended code-based benchmarks, including early saturation and the weak separation between naive random baselines and more sophisticated methods.

2026-04-08T16:06:43Z Kirill Brilliantov Etienne Bamas Emmanuel Abbé http://arxiv.org/abs/2604.06258v1 Accurate Residues for Floating-Point Debugging 2026-04-06T22:46:06Z

Floating-point arithmetic is error-prone and unintuitive. Floating-point debuggers instrument programs to monitor floating-point arithmetic at run time and flag numerical issues. They estimate residues, i.e., the difference between actual floating-point and ideal real values, for every floating-point value in the program. Prior work explores various approaches for computing these residues accurately and efficiently. Unfortunately, the most efficient methods, based on "error-free transformations", have a high rate of false reports, while the most accurate methods, based on high-precision arithmetic, are very slow. This paper builds on error-free-transformations-based approaches and aims to improve their accuracy while preserving efficiency. To more accurately compute residues, this paper divides residue computation into two steps (rounding error computation and residue function evaluation) and shows how to perform each step accurately via careful improvements to the current state of the art. We evaluate on 44 large scientific computing workloads, focusing on the 14 benchmarks where prior tools produce false reports: our approach eliminates false reports on 10 benchmarks and substantially reduces them on the remaining 3 benchmarks. Moreover, complex numerical issues require additional care due to absorption, where two machine-precision residues cannot both be computed accurately in a single execution. This paper introduces residue override, which re-executes the program multiple times, computing different residues in different executions and assembling a final "patchwork" execution. We evaluate on 169 standard benchmarks drawn from numerical analysis papers and textbooks, requiring only 3.6 re-executions on average. Among 34 benchmarks with false reports in the initial run, residue override is triggered on 29 of them and reduces false reports on 25 of them, averaging 7.1 re-executions.
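
As background for the "error-free transformations" mentioned in this abstract, here is a minimal sketch of the classic TwoSum transformation, which recovers the exact rounding error of one floating-point addition. It illustrates the building block only, not the paper's residue computation; the function name `two_sum` is an assumption, and the exactness claim assumes round-to-nearest double precision without overflow.

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) where s is the rounded sum a + b
    and e is the rounding error, so that s + e equals a + b exactly."""
    s = a + b
    b_virtual = s - a            # the part of b that actually made it into s
    a_virtual = s - b_virtual    # the part of a that actually made it into s
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

# The small addend is absorbed by the rounded sum and recovered in the error term.
s, e = two_sum(1.0, 1e-17)

# Exact rational arithmetic confirms that nothing was lost.
s2, e2 = two_sum(0.1, 0.2)
exact = Fraction(0.1) + Fraction(0.2)
is_exact = Fraction(s2) + Fraction(e2) == exact
```

The error term costs only a handful of extra floating-point operations, which is why approaches built on such transformations are fast compared with high-precision arithmetic.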

2026-04-06T22:46:06Z Yumeng He Pavel Panchekha http://arxiv.org/abs/1704.00605v6 Faster Base64 Encoding and Decoding Using AVX2 Instructions 2026-04-06T14:59:10Z

Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files. We estimate that billions of base64 messages are decoded every day. We are motivated to improve the efficiency of base64 encoding and decoding. Compared to state-of-the-art implementations, we multiply the speeds of both the encoding (~10x) and the decoding (~7x). We achieve these good results by using the single-instruction-multiple-data (SIMD) instructions available on recent Intel processors (AVX2). Our accelerated software abides by the specification and reports errors when encountering characters outside of the base64 set. It is available online as free software under a liberal license.
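
To make the encoding and the error reporting concrete, the sketch below shows the scalar mapping that any base64 codec must implement: three input bytes become four 6-bit indices into a 64-character alphabet, and decoding rejects characters outside that alphabet. This is a plain reference version, not the AVX2 code from the paper; the helper names `encode_block` and `decode_block` are assumptions, and padding is omitted for brevity.

```python
import base64

# Standard base64 alphabet (RFC 4648).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_block(three_bytes):
    """Encode exactly three bytes into four base64 characters."""
    b0, b1, b2 = three_bytes
    word = (b0 << 16) | (b1 << 8) | b2          # pack into one 24-bit word
    return "".join(ALPHABET[(word >> shift) & 0x3F] for shift in (18, 12, 6, 0))

def decode_block(four_chars):
    """Decode four base64 characters into three bytes, rejecting invalid input."""
    word = 0
    for ch in four_chars:
        index = ALPHABET.find(ch)
        if index < 0:
            raise ValueError(f"character {ch!r} is outside the base64 set")
        word = (word << 6) | index
    return bytes(((word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF))

encoded = encode_block(b"Man")
matches_stdlib = encoded == base64.b64encode(b"Man").decode("ascii")
decoded = decode_block(encoded)
```

A SIMD implementation applies the same bit regrouping and table lookup to many blocks per instruction, which is where the reported speedups come from.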

2017-03-30T19:04:09Z software at https://github.com/lemire/fastbase64 ACM Transactions on the Web 12 (3), 2018 Wojciech Muła Daniel Lemire 10.1145/3132709 http://arxiv.org/abs/2604.02531v1 \texttt{DR-DAQP}: An Hybrid Operator Splitting and Active-Set Solver for Affine Variational Inequalities 2026-04-02T21:30:42Z

We present \texttt{DR-DAQP}, an open-source solver for strongly monotone affine variational inequalities that combines Douglas-Rachford operator splitting with an active-set acceleration strategy. The key idea is to estimate the active set along the iterations to attempt a Newton-type correction. This step yields the exact AVI solution when the active set is correctly estimated, thus overcoming the asymptotic convergence limitation inherent in first-order methods. Moreover, we exploit warm-starting and pre-factorization of relevant matrices to further accelerate evaluation of the algorithm iterations. We prove convergence and establish conditions under which the algorithm terminates in finite time with the exact solution. Numerical experiments on randomly generated AVIs show that \texttt{DR-DAQP} is up to two orders of magnitude faster than the state-of-the-art solver \texttt{PATH}. On a game-theoretic MPC benchmark, \texttt{DR-DAQP} achieves solve times several orders of magnitude below those of the mixed-integer solver \texttt{NashOpt}. A high-performing C implementation is available at \texttt{https://github.com/darnstrom/daqp}, with easily-accessible interfaces to Julia, MATLAB, and Python.

2026-04-02T21:30:42Z Daniel Arnström Emilio Benenati Giuseppe Belgioioso http://arxiv.org/abs/2603.29807v2 BioNetFlux: A Python Framework for Reaction--Diffusion--Chemotaxis Simulations on One-Dimensional Network Geometries 2026-04-02T07:52:49Z

We present BioNetFlux, an open-source Python framework for the numerical simulation of coupled systems of partial differential equations (PDEs) on one-dimensional multi-arc networks by the Hybridized Discontinuous Galerkin method. Its design targets biological transport phenomena on graph-like geometries that arise naturally in microfluidic organ-on-chip (OoC) devices, vascular networks, and in-vitro cell-migration assays.

2026-03-31T14:35:42Z Silvia Bertoluzza http://arxiv.org/abs/2505.00311v3 A Practical GPU-Enhanced Matrix-Free Primal-Dual Method for Large-Scale Conic Programs 2026-04-01T20:54:55Z

In this paper, we introduce a practical GPU-enhanced matrix-free first-order method for solving large-scale conic programming problems, which we refer to as PDCS, standing for the Primal-Dual Conic Programming Solver. Problems that it solves include linear programs, second-order cone programs, convex quadratic programs, and exponential cone programs. The method avoids matrix factorizations and leverages sparse matrix-vector multiplication as its core computational operation, which is both memory-efficient and well-suited for GPU acceleration. The method builds on the restarted primal-dual hybrid gradient method but further incorporates several enhancements. Additionally, it employs a bisection-based method to compute projections onto rescaled cones. cuPDCS is a GPU implementation of PDCS that implements customized computational schemes that utilize different levels of GPU architecture to handle cones of different types and sizes. Numerical experiments demonstrate that cuPDCS is generally more efficient than state-of-the-art commercial solvers and other first-order methods on large-scale conic program applications, including Fisher market equilibrium problems, Lasso regression, and multi-period portfolio optimization. Furthermore, cuPDCS also exhibits better scalability, efficiency, and robustness compared to other first-order methods on the conic program benchmark dataset CBLIB. These advantages are more pronounced in large-scale, lower-accuracy settings.
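
The primal-dual hybrid gradient (PDHG) iteration that this abstract builds on can be sketched for the simplest cone, the nonnegative orthant, as follows. This is a plain, unrestarted PDHG for an equality-form linear program, not PDCS itself: the restarts, rescaling, and cone projections described above are omitted, and the function name `pdhg_lp`, the problem form, and the step sizes are assumptions. The step sizes must satisfy tau * sigma * ||A||^2 < 1.

```python
def pdhg_lp(c, a, b, tau, sigma, iters=500):
    """Plain PDHG for: minimize c.x  subject to  a x = b,  x >= 0.

    Only matrix-vector products with `a` and its transpose are used;
    no factorization is formed.
    """
    m, n = len(a), len(c)
    x = [0.0] * n
    y = [0.0] * m
    for _ in range(iters):
        # Primal step: gradient step on the Lagrangian, then projection onto x >= 0.
        at_y = [sum(a[i][j] * y[i] for i in range(m)) for j in range(n)]
        x_new = [max(0.0, x[j] - tau * (c[j] - at_y[j])) for j in range(n)]
        # Dual step: evaluated at the extrapolated point 2 * x_new - x.
        x_bar = [2.0 * x_new[j] - x[j] for j in range(n)]
        a_x = [sum(a[i][j] * x_bar[j] for j in range(n)) for i in range(m)]
        y = [y[i] + sigma * (b[i] - a_x[i]) for i in range(m)]
        x = x_new
    return x, y

# Toy problem: minimize x1 + 2*x2 subject to x1 + x2 = 1, x >= 0.
# Here ||A||^2 = 2, so tau = sigma = 0.5 satisfies the step-size condition.
x, y = pdhg_lp([1.0, 2.0], [[1.0, 1.0]], [1.0], tau=0.5, sigma=0.5)
```

For other cones, the `max(0.0, ...)` projection would be replaced by the projection onto the corresponding cone, which is the step the abstract says is handled by a bisection-based method for rescaled cones.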

2025-05-01T05:13:48Z 37 pages, 8 figures Zhenwei Lin Zikai Xiong Dongdong Ge Yinyu Ye http://arxiv.org/abs/2602.08639v3 Comparison of Structure Preserving Schemes for the Cahn-Hilliard-Navier-Stokes Equations with Degenerate Mobility and Adaptive Mesh Refinement 2026-04-01T17:03:50Z

The Cahn-Hilliard-Navier-Stokes (CHNS) system utilizes a diffusive phase-field for interface tracking of multi-phase fluid flows. Recently structure preserving methods for CHNS have moved into focus to construct numerical schemes that, for example, are mass conservative or obey initial bounds of the phase-field variable. In this work decoupled implicit-explicit formulations based on the Discontinuous Galerkin (DG) methodology are considered and compared to existing schemes from the literature. For the fluid flow a standard continuous Galerkin approach is applied. An adaptive conforming grid is utilized to further draw computational focus on the interface regions, while coarser meshes are utilized around pure phases. All presented methods are compared against each other in terms of bound preservation, mass conservation, and energy dissipation for different examples found in the literature, including a classical rising droplet problem.

2026-02-09T13:34:48Z 51 pages, 19 figures Jimmy Kornelije Gunnarsson Robert Klöfkorn http://arxiv.org/abs/2509.07690v6 HYLU: Hybrid Parallel Sparse LU Factorization 2026-04-01T11:52:10Z

This article introduces HYLU, a hybrid parallel LU factorization-based general-purpose solver designed for efficiently solving sparse linear systems (Ax=b) on multi-core shared-memory architectures. The key technical feature of HYLU is the integration of hybrid numerical kernels so that it can adapt to various sparsity patterns of coefficient matrices. Tests on 37 sparse matrices from SuiteSparse Matrix Collection reveal that HYLU outperforms Intel MKL PARDISO in the numerical factorization phase by geometric means of 2.36X (for one-time solving) and 2.90X (for repeated solving). HYLU can be downloaded from https://github.com/chenxm1986/hylu.

2025-09-09T12:55:44Z Xiaoming Chen http://arxiv.org/abs/2603.29986v1 ParetoEnsembles.jl: A Julia Package for Multiobjective Parameter Estimation Using Pareto Optimal Ensemble Techniques 2026-03-31T16:51:24Z

Mathematical models of natural and man-made systems often have many adjustable parameters that must be estimated from multiple, potentially conflicting datasets. Rather than reporting a single best-fit parameter vector, it is often more informative to generate an ensemble of parameter sets that collectively map out the trade-offs among competing objectives. This paper presents ParetoEnsembles.jl, an open-source Julia package that generates such ensembles using Pareto Optimal Ensemble Techniques (POETs), a simulated-annealing-based algorithm that requires no gradient information. The implementation corrects the original dominance relation from weak to strict Pareto dominance, reduces the per-iteration ranking cost from $O(n^2 m)$ to $O(nm)$ through an incremental update scheme, and adds multi-chain parallel execution for improved front coverage. We demonstrate the package on a cell-free gene expression model fitted to experimental data and a blood coagulation cascade model with ten estimated rate constants and three objectives. A controlled synthetic-data study reveals parameter identifiability structure, with individual rate constants off by several-fold yet model predictions accurate to 7%. A five-replicate coverage analysis confirms that timing features are reliably covered while peak amplitude is systematically overconfident. Validation against published experimental thrombin generation data demonstrates that the ensemble predicts held-out conditions to within 10% despite inherent model approximation error. By making ensemble generation lightweight and accessible, ParetoEnsembles.jl aims to lower the barrier to routine uncertainty characterization in mechanistic modeling.
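
The distinction between weak and strict dominance mentioned in this abstract can be illustrated with a short sketch. The definitions below are one common reading and are assumptions, not the package's code: all objectives are minimized, "weak" means no worse in every objective, and "strict" additionally requires being better in at least one. The rank used here (the number of other points that strictly dominate a point) and the function names are likewise hypothetical, and this naive version recomputes every rank from scratch rather than using the incremental scheme described above.

```python
def weakly_dominates(f, g):
    """f is no worse than g in every objective (minimization)."""
    return all(fi <= gi for fi, gi in zip(f, g))

def strictly_dominates(f, g):
    """f is no worse than g everywhere and better in at least one objective."""
    return weakly_dominates(f, g) and any(fi < gi for fi, gi in zip(f, g))

def pareto_ranks(errors):
    """Rank of each point = number of other points that strictly dominate it."""
    return [
        sum(strictly_dominates(other, point)
            for k, other in enumerate(errors) if k != i)
        for i, point in enumerate(errors)
    ]

# Two identical points weakly dominate each other but neither strictly
# dominates the other, so duplicates do not inflate each other's rank.
errors = [(1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.0, 2.0)]
ranks = pareto_ranks(errors)
```

Under weak dominance the two copies of `(1.0, 2.0)` would each count the other as a dominator, which shows why the choice of relation matters for the ranking.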

2026-03-31T16:51:24Z Jeffrey D. Varner http://arxiv.org/abs/2603.29129v1 Computing FFTs at Target Precision Using Lower-Precision FFTs 2026-03-31T01:25:25Z

Modern processors deliver higher throughput for lower-precision arithmetic than for higher-precision arithmetic. For matrix multiplication, the Ozaki scheme exploits this performance gap by splitting the inputs into lower-precision components and delegating the computation to optimized lower-precision routines. However, no similar approach exists for the fast Fourier transform (FFT). Here, we propose a method that computes target-precision FFTs using lower-precision FFTs by applying the Ozaki scheme to the cyclic convolution in the Bluestein FFT. The split component convolutions are computed exactly using the number theoretic transform (NTT), an FFT over a finite field, instead of floating-point FFTs, combined with the Chinese remainder theorem. We introduce an upper bound on the number of splits and an NTT-domain accumulation strategy to reduce the NTT call count. As a concrete implementation, we implement a double-precision FFT using 32-bit NTTs and confirm reduced relative error compared with those for FFTs based on FFTW and Triple-Single precision arithmetic, with stable error across FFT lengths, at most 96 NTT calls, or 64 NTT calls with NTT-domain accumulation. On an Intel Xeon Platinum 8468 for lengths $n=2^{10}$-$2^{18}$, the execution time is approximately 107-1315$\times$ that of FFTW's double-precision FFT, with NTTs accounting for approximately 80% of the total time.
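
The statement that the number theoretic transform computes convolutions exactly can be illustrated with a single-modulus sketch. It covers only that one ingredient: the Ozaki splitting, the Bluestein reformulation, and the Chinese remainder step across several moduli are not shown. The modulus 998244353 and primitive root 3 are a common choice assumed here for illustration and are not taken from the paper; the result is exact only when the inputs are nonnegative integers and every true output is below the modulus.

```python
MOD = 998244353   # 119 * 2**23 + 1, a prime that fits in 32 bits (assumed choice)
ROOT = 3          # a primitive root modulo MOD

def ntt(values, invert=False):
    """Recursive radix-2 NTT over the integers modulo MOD (length a power of two).
    The inverse transform is left unscaled; the caller divides by the length."""
    n = len(values)
    if n == 1:
        return values[:]
    w_n = pow(ROOT, (MOD - 1) // n, MOD)       # primitive n-th root of unity
    if invert:
        w_n = pow(w_n, MOD - 2, MOD)           # its modular inverse
    even = ntt(values[0::2], invert)
    odd = ntt(values[1::2], invert)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % MOD
        out[k] = (even[k] + t) % MOD
        out[k + n // 2] = (even[k] - t) % MOD
        w = w * w_n % MOD
    return out

def cyclic_convolution(a, b):
    """Exact cyclic convolution of two equal-length integer sequences."""
    n = len(a)
    fa, fb = ntt(a), ntt(b)
    product = [x * y % MOD for x, y in zip(fa, fb)]
    n_inv = pow(n, MOD - 2, MOD)
    return [x * n_inv % MOD for x in ntt(product, invert=True)]

def direct_cyclic_convolution(a, b):
    """O(n^2) reference used to check the NTT result."""
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) for k in range(n)]

result = cyclic_convolution([1, 2, 3, 4], [5, 6, 7, 8])
```

Because all arithmetic is on integers modulo a prime, there is no rounding error at all, in contrast to a floating-point FFT of the same length.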

2026-03-31T01:25:25Z Shota Kawakami Daisuke Takahashi http://arxiv.org/abs/2603.28756v1 Fast Large-Scale Model-Based Iterative Tomography via Exploiting Mathematical Structure, Hierarchical Optimization, Smart Initialization, and Distributed GPU Computing 2026-03-30T17:57:21Z

Model-Based Iterative Reconstruction (MBIR) is important because direct methods, such as Filtered Back-Projection (FBP), can introduce significant noise and artifacts in sparse-angle tomography, especially for time-evolving samples. Although MBIR produces high-quality reconstructions through prior-informed optimization, its computational cost has traditionally limited its broader adoption. In previous work, we addressed this limitation by expressing the Radon transform and its adjoint using non-uniform fast Fourier transforms (NUFFTs), reducing computational complexity relative to conventional projection-based methods. We further accelerated computation by employing a multi-GPU system for parallel processing. In this work, we further accelerate our Fourier-domain framework by introducing four main strategies: (1) a reformulation of the MBIR forward and adjoint operators that exploits their multi-level Toeplitz structure for efficient Fourier-domain computation; (2) an improved initialization strategy that uses back-projected data filtered with a standard ramp filter as the starting estimate; (3) a hierarchical multi-resolution reconstruction approach that first solves the problem on coarse grids and progressively transitions to finer grids using Lanczos interpolation; and (4) a distributed-memory implementation using MPI that enables near-linear scaling on large high-performance computing (HPC) systems. Together, these innovations significantly reduce iteration counts, improve parallel efficiency, and make high-quality MBIR reconstruction practical for large-scale tomographic imaging. These advances open the door to near-real-time MBIR for applications such as in situ, in operando, and time-evolving experiments.

2026-03-30T17:57:21Z Dinesh Kumar Jeffrey Donatelli