http://arxiv.org/abs/2603.27613v1
High-Precision Computation and PSLQ Identification of Stokes Multipliers for Anharmonic Oscillators
2026-03-29T10:13:53Z
We present a large-scale computational study combining arbitrary-precision arithmetic, sequence acceleration, and the PSLQ integer relation algorithm to discover exact closed-form expressions for fundamental constants arising in asymptotic analysis. We compute the Stokes multipliers C_M of the one-dimensional anharmonic oscillators H = p^2/2 + x^2/2 + g x^{2M} for M = 2, 3, ..., 11, extracting 17-30 significant digits from up to 1200 perturbation coefficients computed at 300-digit working precision. The computational pipeline consists of three stages: (i) Rayleigh-Schrodinger recursion in the harmonic oscillator basis, (ii) Richardson extrapolation of order 40-100 to accelerate convergence of ratio sequences, and (iii) PSLQ searches over bases of Gamma-function values and algebraic numbers. This pipeline discovers three new exact identities: C_3^2 pi^4 = 32, C_5^4 Gamma(1/4)^4 pi^5 = 2^{12} 3^2, and C_7^6 Gamma(1/3)^9 pi^6 = 2^{20} 3^3, in addition to confirming the known C_2^2 pi^3 = 6. Equally significant is a negative result: exhaustive PSLQ searches at 30-digit precision with coefficient bounds up to 2000 find no closed form for C_4, strongly suggesting the x^8 case introduces a genuinely new transcendental number. A number-theoretic pattern emerges: closed-form existence correlates with Euler's totient function phi(M-1)/2, which counts algebraically independent Gamma-function transcendentals at denominator M-1.
We formulate conjectures connecting computational constant recognition to classical number theory, and provide all code and data for full reproducibility.
2026-03-29T10:13:53Z
Jian Zhou

http://arxiv.org/abs/2603.27569v1
Beating vDSP: A 138 GFLOPS Radix-8 Stockham FFT on Apple Silicon via Two-Tier Register-Threadgroup Memory Decomposition
2026-03-29T08:07:54Z
We present an optimized Fast Fourier Transform (FFT) implementation for Apple Silicon GPUs, achieving 138.45~GFLOPS for $N\!=\!4096$ complex single-precision transforms -- a 29\% improvement over Apple's highly optimized vDSP/Accelerate baseline (107~GFLOPS). Our approach is grounded in a \emph{two-tier local memory model} that formally characterizes the Apple GPU's 208~KiB register file as the primary data-resident tier and the 32~KiB threadgroup memory as an exchange-only tier, extending the decomposition framework established in a 2015 PhD thesis on Intel integrated GPU FFT for radar processing. We implement and evaluate radix-4 and radix-8 split-radix Stockham kernels in Metal Shading Language (MSL), demonstrating that the radix-8 decimation-in-time butterfly with 512 threads yields the best performance. We further present the first investigation of Apple's \texttt{simdgroup\_matrix} 8$\times$8 hardware MMA for FFT butterfly computation and report the counter-intuitive finding that on Apple GPU, threadgroup memory barriers are inexpensive ($\sim$2 cycles) while scattered threadgroup access patterns are the true bottleneck. Our multi-size implementation supports $N\!=\!256$ through $N\!=\!16384$ using a four-step decomposition for sizes exceeding the 32~KiB threadgroup memory limit.
All kernels are validated against vDSP reference outputs.
2026-03-29T08:07:54Z
Mohamed Amine Bergach

http://arxiv.org/abs/2603.27276v1
PyINLA: Fast Bayesian Inference for Latent Gaussian Models in Python
2026-03-28T14:16:25Z
Bayesian inference often relies on Markov chain Monte Carlo (MCMC) methods, which are particularly needed for non-Gaussian data families. When dealing with complex hierarchical models, the MCMC approach can be computationally demanding in workflows that require repeated model fitting or when working with models of large dimensions with limited hardware resources. The Integrated Nested Laplace Approximations (INLA) method is a deterministic alternative for models with non-Gaussian data that belong to the class of latent Gaussian models (LGMs), yielding accurate approximations to posterior marginals in many applied settings. The INLA method was implemented in C as a standalone program, inla, that is widely used in R through the INLA package. This paper introduces PyINLA, a dedicated Python package that provides a Pythonic interface directly to the inla program. PyINLA thus enables specifying LGMs, running INLA-based inference, and accessing posterior summaries directly from Python while leveraging the established INLA implementation.
We describe the package design and illustrate its use on representative models, including generalized linear mixed models, time series forecasting, disease mapping, and geostatistical prediction, demonstrating how deterministic Bayesian inference can be performed in Python using INLA in a way that integrates naturally with common scientific computing workflows.
2026-03-28T14:16:25Z
41 pages, 9 figures
Esmail Abdul Fattah, Elias Krainski, Havard Rue

http://arxiv.org/abs/2603.26818v1
Multi-GPU fast Fourier transforms in MATLAB (for large-scale phase-field crystal simulations)
2026-03-26T20:40:45Z
We present a MATLAB-based framework for two- and three-dimensional fast Fourier transforms on multiple GPUs for large-scale numerical simulations using the pseudo-spectral Fourier method. The software implements two complementary multi-GPU strategies that overcome single-GPU memory limitations and accelerate spectral solvers. This approach is motivated by and applied to phase-field crystal (PFC) models, which are governed by tenth-order partial differential equations, require fine spatial resolution, and are typically formulated in periodic domains. Our resulting numerical framework achieves significant speedups, approximately sixfold for standard PFC simulations and up to sixtyfold for multiphysics extensions, compared to a purely CPU-based implementation running on hundreds of cores.
2026-03-26T20:40:45Z
12 pages, 2 figures
Maik Punke, Marco Salvalaglio

http://arxiv.org/abs/2602.03977v4
Fast Relax-and-Round Unit Commitment with Sub-hourly Mechanical and Ramp Constraints
2026-03-25T17:54:00Z
We propose a novel computational method for unit commitment (UC), which does not require a linearized approximation and provides several orders of magnitude performance improvement over the current state of the art. The performance improvement is achieved by introducing a heuristic tailored for UC problems.
The method can be implemented using existing continuous optimization solvers and adapted for different applications. We demonstrate the value of the new method in examples of advanced UC analyses at a scale where the use of current state-of-the-art tools is infeasible. We expect that the capability demonstrated in this paper will be critical to address emerging power systems challenges with more volatile large loads, such as data centers, and generation that is composed of a larger number of smaller units, including significant behind-the-meter generation.
2026-02-03T20:01:39Z
9 pages, 7 figures
Shaked Regev, Eve Tsybina, Slaven Peles

http://arxiv.org/abs/2603.15934v2
Fast Relax-and-Round Unit Commitment with Economic Horizons
2026-03-25T17:51:03Z
We expand our novel computational method for unit commitment (UC) to include long-horizon planning. We introduce a fast novel algorithm to commit hydro-generators, provably accurately. We solve problems with thousands of generators at 5-minute market intervals. We show that our method can solve interconnect-size UC problems in approximately 1 minute on commodity hardware and that an increased planning horizon leads to sizable operational cost savings (our objective). This scale is infeasible for current state-of-the-art tools. We attain this runtime improvement by introducing a heuristic tailored for UC problems. Our method can be implemented using existing continuous optimization solvers and adapted for different applications.
Combined, the two algorithms would allow an operator of large systems with hydro units to make horizon-aware economic decisions.
2026-03-16T21:24:51Z
6 pages (journal limit), 6 figures
Shaked Regev, Eve Tsybina, Slaven Peles

http://arxiv.org/abs/2603.21444v2
Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects
2026-03-24T15:54:50Z
The multiplication of two sparse matrices, known as SpGEMM, is a key kernel in scientific computing and large-scale data analytics, underpinning graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. The unstructured sparsity makes achieving high performance challenging because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, resulting in unnecessary data movement and synchronization and failing to fully exploit the fast intra-node interconnect. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical and heterogeneous architecture of modern supercomputing interconnects. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces internode communication by leveraging the higher bandwidth between GPUs within a node compared to across nodes. Here, we evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM with a corresponding geometric mean speedup of $1.54\times$. Trident reduces internode communication volume by up to $2\times$ on NERSC's Perlmutter supercomputer.
Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies.
2026-03-22T23:18:49Z
2026 International Conference on Supercomputing (ICS '26), July 06--09, 2026, Belfast, United Kingdom
Julian Bellavita, Lorenzo Pichetti, Thomas Pasquali, Flavio Vella, Giulia Guidi
10.1145/3797905.3800543

http://arxiv.org/abs/2603.23143v1
A Systematic Framework for Stable and Cost-Efficient Matrix Polynomial Evaluation
2026-03-24T12:43:26Z
Methods for evaluating matrix polynomials have recently been developed that require one fewer matrix product ($1M$) than the Paterson--Stockmeyer (PS) method. Since the computational cost for large-scale matrices is asymptotically determined by the number of matrix products, this reduction directly affects the total execution time. However, the coefficients in these optimized formulas emerge as solutions to systems of nonlinear polynomial equations, resulting in multiple potential solution sets. An inappropriate selection of these coefficients can lead to numerical instability in floating-point arithmetic.
This paper presents a systematic framework and a MATLAB implementation, MatrixPolEval1, used to obtain and validate stable coefficient sets for matrix polynomials of degrees $m \in \{8, 10, 12\}$ and above. The framework introduces structural variants to maintain stability even when the original configuration fails to yield a robust solution. The provided tool identifies stable coefficient sets using variable precision arithmetic (VPA) and provides a reliability indicator for expected accuracy. Numerical experiments on polynomials arising in applications, including the matrix exponential and geometric series, show that the framework achieves the $1M$ saving while maintaining numerical accuracy comparable to the PS method.
2026-03-24T12:43:26Z
24 pages, 18 figures
J. M. Alonso, J. Sastre, J. Ibáñez, E. Defez

http://arxiv.org/abs/2603.21230v1
A Modular Approach to Stochastic Optimisation for Inverse Problems Using the Core Imaging Library
2026-03-22T13:39:46Z
The Core Imaging Library (CIL) is an open-source versatile Python framework for solving inverse problems with special emphasis on imaging applications such as computed tomography (CT), using a plug-in architecture for data and operators, interfacing to toolboxes such as ASTRA, TIGRE and SIRF. A key component of CIL is its optimisation module enabling users to flexibly combine mathematical operators and functionals to form smooth and non-smooth optimisation problems and solve these with a range of first-order algorithms. The present work introduces an expansion of CIL with a new modular framework for stochastic optimisation, allowing researchers to easily use a variety of existing stochastic optimisation algorithms as well as form new ones by combining modular building blocks. Users can flexibly configure algorithmic components, adapt to diverse problem structures, and experiment with various sampling and step size strategies.
Rather than individual black-box implementations of each fixed algorithm with significant redundancies, our design is modular, providing building blocks that can be flexibly combined to realise a wealth of algorithm instances. The framework is particularly well-suited for large-scale applications, where stochastic methods offer notable computational advantages over deterministic approaches. To demonstrate its versatility and practical utility, we present experiments on real-world datasets from imaging inverse problems, such as X-Ray CT and Positron Emission Tomography (PET) reconstruction. In summary, the presented software expansion aims to support the research community with a robust, extensible optimisation suite for developing, testing, and benchmarking stochastic methods for inverse problems.
2026-03-22T13:39:46Z
Evangelos Papoutsellis, Margaret A. G. Duff, Jakob S. Jørgensen, Sam Porter, Claire Delplancke, Gemma Fardell, Edoardo Pasca, Kris Thielemans

http://arxiv.org/abs/2603.20889v1
Implementation of QR factorization of tall and very skinny matrices on current GPUs
2026-03-21T17:30:58Z
We consider the problem of computing a QR (or QZ) decomposition of a real, dense, tall and very skinny matrix. That is, the number of columns is tiny compared to the number of rows, rendering most computations completely or partially memory-bandwidth limited. The paper focuses on recent NVIDIA GPGPUs still supporting 64-bit floating-point arithmetic, but the findings carry over to AMD GPUs as well. We discuss two basic algorithms: Methods based on the normal equations (Gram matrix), in particular Cholesky-QR2 and SVQB, and the "tall-skinny QR" (TSQR), based on Householder transformations in a tree-reduction scheme. We propose two primary optimization techniques: Avoiding the write-back of the Q factor ("Q-less QR"), and exploiting fast local memory (shared memory on GPUs).
We compare a straightforward implementation of Gramian-based methods, and a more sophisticated TSQR implementation, in terms of performance achieved, time-to-solution, and implementation complexity. By performance modelling and numerical experiments with our own code and a vendor-optimized library routine, we demonstrate the crucial need for specialized methods and implementations in this memory-bound to transitional (memory/compute-bound) regime, and that TSQR is competitive in terms of time-to-solution, but at the cost of an investment in low-level code optimization.
2026-03-21T17:30:58Z
submitted to the Euro-Par 2026 proceedings
Jonas Thies, Melven Röhrig-Zöllner

http://arxiv.org/abs/2512.00133v2
A Matlab code for analysis and topology optimization with Third Medium Contact
2026-03-20T10:50:16Z
We present a Matlab code for modelling and topology optimization of hyperelastic structures, including contact modelled by the Third Medium Contact (TMC) approach. By using the so-called HuHu-regularization we penalize the skew distortion of the bilinear finite elements discretizing void regions, thus promoting convergence of the nonlinear solver. First, we show how this method is implemented in a compact code, allowing the simulation of contact and force transfer in hyperelastic structures. We then solve two topology optimization problems for minimum end-compliance of structures exhibiting contact. In the first example, contact happens at the supported boundary, while the second features self-contact. The Matlab scripts that replicate the results are included, and we discuss some possible extensions to more general problems.
2025-11-28T12:21:26Z
Andreas Henrik Frederiksen, Ole Sigmund, Federico Ferrari

http://arxiv.org/abs/2603.19656v1
Cellular Automata based Resource Efficient Maximally Equidistributed Pseudo-Random Number Generators
2026-03-20T05:45:11Z
Equidistribution is a theoretical quality criterion that measures the uniformity of a linear pseudo-random number generator (PRNG).
In this work, we first show that all existing linear cellular automaton (CA) based pseudo-random number generators (PRNGs) are weak in the equidistribution characteristic. Then we propose a list of light-weight combined CA-based PRNGs with time spacing ($2 \leq s \leq 10$) using linear maximal length cellular automata of degree $31 \leq k \leq 128$ (close to computer word size). We show that these PRNGs achieve maximal period as well as satisfy the maximal equidistribution property. Finally, we show that these combined maximal length CA-based PRNGs pass almost all the empirical testbeds, with speed and performance comparable to the Mersenne Twister.
2026-03-20T05:45:11Z
Bhuvaneswari A, Kamalika Bhattacharjee

http://arxiv.org/abs/2603.19115v1
BSTModelKit.jl: A Julia Package for Constructing, Solving, and Analyzing Biochemical Systems Theory Models
2026-03-19T16:38:32Z
We present BSTModelKit.jl, an open-source Julia package for constructing, solving, and analyzing Biochemical Systems Theory (BST) models of biochemical networks. The package implements S-system representations, a canonical power-law formalism for modeling metabolic and regulatory networks. BSTModelKit.jl provides a declarative model specification format, dynamic simulation via ordinary differential equation (ODE) integration, steady-state computation, and global sensitivity analysis using the Morris and Sobol methods. The package leverages the Julia scientific computing ecosystem, in particular the SciML suite of differential equation solvers, to provide efficient and flexible model analysis tools. We describe the mathematical formulation and software design, and demonstrate the package capabilities with illustrative examples.
2026-03-19T16:38:32Z
Sandra Vadhin, Jeffrey D. Varner

http://arxiv.org/abs/2603.18458v1
Axis-Aligned Relaxations for Mixed-Integer Nonlinear Programming
2026-03-19T03:39:54Z
We present a novel relaxation framework for general mixed-integer nonlinear programming (MINLP) grounded in computational geometry. Our approach constructs polyhedral relaxations by convexifying finite sets of strategically chosen points, iteratively refining the approximation to converge toward the simultaneous convex hull of factorable function graphs. The framework is underpinned by three key contributions: (i) a new class of explicit inequalities for products of functions that strictly improve upon standard factorable and composite relaxation schemes; (ii) a proof establishing that the simultaneous convex hull of multilinear functions over axis-aligned regions is fully determined by their values at corner points, thereby generalizing existing results from hypercubes to arbitrary axis-aligned domains; and (iii) the integration of computational geometry tools, specifically voxelization and QuickHull, to efficiently approximate feasible regions and function graphs. We implement this framework and evaluate it on randomly generated polynomial optimization problems and a suite of 619 instances from \texttt{MINLPLib}. Numerical results demonstrate significant improvements over state-of-the-art benchmarks: on polynomial instances, our relaxation closes an additional 20--25\% of the optimality gap relative to standard methods on half the instances.
Furthermore, compared against an enhanced factorable programming baseline and Gurobi's root-node bounds, our approach yields superior dual bounds on approximately 30\% of \texttt{MINLPLib} instances, with roughly 10\% of cases exhibiting a gap reduction exceeding 50\%.
2026-03-19T03:39:54Z
Haisheng Zhu, Taotao He, Mohit Tawarmalani

http://arxiv.org/abs/2603.16976v1
Implementation of tangent linear and adjoint models for neural networks based on a compiler library tool
2026-03-17T13:11:22Z
This paper presents TorchNWP, a compilation library tool for the efficient coupling of artificial intelligence components and traditional numerical models. It aims to address the issues of poor cross-language compatibility, insufficient coupling flexibility, and low data transfer efficiency between operational numerical models developed in Fortran and Python-based deep learning frameworks. Based on LibTorch, it optimizes and designs a unified application-layer calling interface, converts deep learning models under the PyTorch framework into a static binary format, and provides C/C++ interfaces. Then, using hybrid Fortran/C/C++ programming, it enables the deployment of deep learning models within numerical models. Integrating TorchNWP into a numerical model only requires compiling it into a callable link library and linking it during the compilation and linking phase to generate the executable. On this basis, tangent linear and adjoint models based on neural networks are implemented at the C/C++ level, which can shield the internal structure of neural network models and simplify the construction process of four-dimensional variational data assimilation systems. Meanwhile, it supports deployment on heterogeneous platforms, is compatible with mainstream neural network models, and enables mapping of different parallel granularities and efficient parallel execution. Using this tool requires minimal code modifications to the original numerical model, thus reducing coupling costs.
It can be efficiently integrated into numerical weather prediction models such as CMA-GFS and MCV, and has been applied to the coupling of deep learning-based physical parameterization schemes (e.g., radiation, non-orographic gravity wave drag) and the development of their tangent linear and adjoint models, significantly improving the accuracy and efficiency of numerical weather prediction.
2026-03-17T13:11:22Z
Sa Xiao, Hao Jing, Honglu Sun, Haoyu Li