https://arxiv.org/api/T7qj4WHkuxAeicbXlUrQZ7FYtHs
2026-06-21T09:23:01Z
26647515

http://arxiv.org/abs/2604.26160v1
Fitting Large Nonlinear Mixed Effects Models Using Variational Expectation Maximization
2026-04-28T22:41:53Z
Nonlinear Mixed Effects (NLME) models are widely used in pharmacometrics and related fields to analyze hierarchical and longitudinal data. However, as the number of parameters and random effects increases, traditional methods for maximizing the marginal likelihood become computationally expensive. This paper explores the Variational Expectation Maximization (VEM) algorithm, a scalable alternative for fitting NLME models. Originally introduced in the context of probabilistic graphical models and later popularized through variational autoencoders, VEM has not been extensively applied to NLME modeling. By leveraging flexible variational families and reverse-mode automatic differentiation, VEM can efficiently maximize the marginal likelihood, scaling to NLME models with over 15,000 population parameters. This work provides a detailed description of VEM, compares it to other NLME fitting algorithms, and highlights its scalability through computational experiments. Using the Pumas statistical software, we fit two test models: 1) a standard warfarin model, and 2) a DeepNLME Friberg model with 15,410 population parameters and 16 random effects. The warfarin model was fitted to completion to demonstrate the correctness of VEM, while the DeepNLME Friberg model was fitted for a limited number of iterations to measure the time per iteration and demonstrate VEM's scalability.
2026-04-28T22:41:53Z
Mohamed Tarek, Pedro Afonso

http://arxiv.org/abs/2512.05919v2
A Discontinuous Galerkin Consistent Splitting Method for the Incompressible Navier-Stokes Equations
2026-04-28T16:39:48Z
This work presents the discontinuous Galerkin discretization of the consistent splitting scheme proposed by Liu [J. Liu, J. Comp. Phys., 228(19), 2009].
The method enforces the divergence-free constraint implicitly, removing velocity--pressure compatibility conditions and eliminating pressure boundary layers. Consistent boundary conditions are imposed, also for settings with open and traction boundaries. Hence, accuracy in time is no longer limited by a splitting error.
The symmetric interior penalty Galerkin method is used for second spatial derivatives. The convective term is treated in a semi-implicit manner, which relaxes the CFL restriction of explicit schemes while avoiding the need to solve nonlinear systems required by fully implicit formulations. For improved mass conservation, Leray projection is combined with divergence and normal continuity penalty terms.
By selecting appropriate fluxes for both the divergence of the velocity field and the divergence of the convective operator, the consistent pressure boundary condition can be shown to reduce to contributions arising solely from the acceleration and the viscous term for the $L^2$ discretization. Per time step, the decoupled nature of the scheme with respect to the velocity and pressure fields leads to a single pressure Poisson equation followed by a single vector-valued convection-diffusion-reaction equation. We verify optimal convergence rates of the method in both space and time and demonstrate compatibility with higher-order time integration schemes. A series of numerical experiments, including the two-dimensional flow around a cylinder benchmark and the three-dimensional Taylor--Green vortex problem, verify the applicability to practically relevant flow problems.
2025-12-05T17:56:40Z
Computer Methods in Applied Mechanics and Engineering, Volume 458, 2026, 119008
Dominik Still, Natalia Nebulishvili, Richard Schussnig, Katharina Kormann, Martin Kronbichler
10.1016/j.cma.2026.119008

http://arxiv.org/abs/2305.06709v4
NUBO: A Transparent Python Package for Bayesian Optimization
2026-04-28T07:08:12Z
NUBO, short for Newcastle University Bayesian Optimisation, is a Bayesian optimization framework for the optimization of expensive-to-evaluate black-box functions, such as physical experiments and computer simulators. Bayesian optimization is a cost-efficient optimization strategy that uses surrogate modelling via Gaussian processes to represent an objective function and acquisition functions to guide the selection of candidate points to approximate the global optimum of the objective function. NUBO itself focuses on transparency and user experience to make Bayesian optimization easily accessible to researchers from all disciplines.
Clean and understandable code, precise references, and thorough documentation ensure transparency, while user experience is ensured by a modular and flexible design, easy-to-write syntax, and careful selection of Bayesian optimization algorithms. NUBO allows users to tailor Bayesian optimization to their specific problem by writing the optimization loop themselves using the provided building blocks. It supports sequential single-point, parallel multi-point, and asynchronous optimization of bounded, constrained, and/or mixed (discrete and continuous) parameter input spaces. Only algorithms and methods that are extensively tested and validated to perform well are included in NUBO. This ensures that the package remains compact and does not overwhelm the user with an unnecessarily large number of options. The package is written in Python but does not require expert knowledge of Python to optimize your simulators and experiments. NUBO is distributed as open-source software under the BSD 3-Clause license.
2023-05-11T10:34:27Z
Journal of Statistical Software, 114(1), 1-28 (2025)
Mike Diessner, Kevin J. Wilson, Richard D. Whalley
10.18637/jss.v114.i01

http://arxiv.org/abs/2604.25148v1
Extending UNIQuE: Quantum Simulation Speedup for the HHL Algorithm
2026-04-28T02:53:15Z
In an extension of the Unconventional Noiseless Intermediate Quantum Emulator, this work introduces a classical emulation of the quantum Harrow-Hassidim-Lloyd algorithm for sampling from the solution space of linear systems. The emulated HHL algorithm scales exponentially with the number of qubits required to represent the linear system, which is an advantage over the state vector simulation of the HHL algorithm, which scales exponentially as a function of both the size of the linear system and the magnitude of its largest (scaled) eigenvalue.
We benchmark our emulator by comparing it with the Intel Quantum Simulator and demonstrate a runtime advantage for small linear systems.
2026-04-28T02:53:15Z
Reece Robertson, Ameya Bhave

http://arxiv.org/abs/2605.08114v1
Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant
2026-04-27T23:11:12Z
We analyse three KV cache quantization schemes under a fair bit budget: \textbf{KV} (scalar MSE baseline), \textbf{KQV} (WHT + MSE on $K$; WHT + MSE + QJL on $V$), and \textbf{QKQV} (WHT + MSE + QJL on both). Starting from the Beta distribution on the hypersphere, we trace how QJL on $K$ inflates inner product variance by $\pi/2$, which softmax amplifies nonlinearly via Jensen's inequality, and we present statistical inference and information metrics to highlight practical differences.
Three empirical findings emerge. (1)~At $n=4$ (the practically dominant budget), KQV wins on every measure -- KL divergence, geometric $K$ error, and 6D distance -- across all distributions and ranks tested. (2)~The K--V asymmetry is unconditional: QKQV is consistently worse than KQV in KL divergence at every budget and distribution. (3)~A budget-dependent crossover exists: QKQV achieves better geometric $K$ reconstruction at $n \in \{2,3,5\}$, KQV at $n \in \{4,6\}$, invariant to rank and tail weight -- an open rate-distortion problem.
$\mathrm{KL}(p_{\mathrm{ref}} \| p_{\mathrm{quant}})$, K-only by construction, bridges K direction error to routing corruption and output collapse. We present a sufficient condition for when the Jensen mechanism amplifies superlinearly through the softmax. At $n \in \{2,3,5\}$, QKQV wins geometrically because this assumption does not bind. At $n=4$, elevated K error and KL divergence for QKQV strongly suggest the Jensen mechanism is the operative cause of the crossover, providing a new perspective and explanation.
2026-04-27T23:11:12Z
23 pages, 7 Figures, multiple tables, the process is highly assisted by AI
Paolo D'Alberto

http://arxiv.org/abs/2601.01413v2
GlycoPy: A CasADi-based Python Framework for Hierarchical Modeling, Optimization, and Control of Bioprocesses
2026-04-25T11:47:50Z
Efficient implementation of nonlinear model predictive control (NMPC) for bioprocesses remains challenging because large nonlinear models are difficult to organize, simulate, and embed within optimization and control workflows. This difficulty is particularly pronounced for large-scale and multiscale systems that require hierarchical model construction and customized simulation strategies. To address this issue, we present GlycoPy, a CasADi-based Python framework for hierarchical modeling, optimization, and control of bioprocesses. GlycoPy combines an equation-oriented, object-oriented modeling architecture with CasADi's symbolic and differentiable computational capabilities, enabling hierarchical model composition, numerical and symbolic simulation, parameter estimation, dynamic optimization, and NMPC within a unified workflow. A key feature of the framework is its support for customized differentiable simulation algorithms that can be embedded directly in gradient-based optimization and control.
GlycoPy is demonstrated on a multiscale monoclonal antibody glycosylation process in Chinese hamster ovary cell culture, where it is used for hierarchical model construction, quasi-steady-state simulation, and adaptive NMPC. The results show that GlycoPy provides a practical and reusable framework for applying advanced optimization and control methods to computationally demanding bioprocesses.
2026-01-04T07:36:11Z
Yingjie Ma, Jing Guo, Richard D. Braatz

http://arxiv.org/abs/2604.22242v1
Fast GPU Linear Algebra via Compile Time Expression Fusion
2026-04-24T05:34:59Z
We describe the Bandicoot GPU linear algebra toolkit, a C++ based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy transition for existing CPU-based codebases. Unlike other GPU-focused toolkits, Bandicoot uses template metaprogramming to generate fused GPU kernels directly at compile time, yielding efficient kernels that are often able to saturate memory bandwidth. This removes the need for runtime overhead or JIT infrastructure. Empirical results show that Bandicoot outperforms (sometimes by considerable margins) commonly-used linear algebra toolkits including PyTorch, TensorFlow, and JAX.
2026-04-24T05:34:59Z
Ryan R. Curtin, Marcus Edel, Conrad Sanderson

http://arxiv.org/abs/2604.22087v1
JetSCI: A Hybrid JAX-PETSc Framework for Scalable Differentiable Simulation
2026-04-23T21:40:01Z
The rapid rise of scientific machine learning (SciML) has expanded the role of differentiable modeling, surrogate modeling, and data-driven constitutive laws in large-scale simulation. The JAX framework provides an attractive environment for these workflows through automatically differentiable programs, vectorization, and GPU acceleration, while enabling seamless learning of surrogate models. However, large-scale simulation still relies on mature HPC infrastructure.
Libraries such as PETSc provide scalable MPI-based parallelism, robust linear and nonlinear solvers, and advanced preconditioning capabilities that remain difficult to reproduce in JAX-only workflows. We present JetSCI, a hybrid JAX-PETSc framework that unifies these complementary strengths. JetSCI uses JAX for GPU-parallel differentiable discretizations and PETSc for robust, scalable solution of the resulting systems on distributed-memory architectures, exposing multilevel parallelism through GPU acceleration within nodes and MPI parallelism across nodes. For finite element discretizations of heterogeneous micromechanics problems, JetSCI outperforms JAX-only implementations in efficiency and accuracy.
2026-04-23T21:40:01Z
Alberto Cattaneo, M Keith Ballard, Robert M. Kirby, Varun Shankar

http://arxiv.org/abs/2604.21504v1
Efficient generation of expected-degree graphs via edge-arrivals
2026-04-23T10:06:31Z
We study the efficient generation of random graphs with a prescribed expected degree sequence, focusing on rank-1 inhomogeneous models in which vertices are assigned weights and edges are drawn independently with probabilities proportional to the product of endpoint weights. We adopt a temporal viewpoint, adding edges to the graph one at a time up to a fixed time horizon, and allowing for self-loops or duplicate edges in the first stage. Then, the simple projection of the resulting multigraph recovers exactly the simple Norros--Reittu random graph, whose expected degrees match the prescribed targets under mild conditions. Building on this representation, we develop an exact generator based on \textit{edge-arrivals} for expected-degree random graphs with running time $O(n+m)$, where $m$ is the number of generated edges, and hence proportional to the output size. This removes the typical vertex sorting used by widely used fast generator algorithms based on \textit{edge-skipping} for rank-1 expected-degree models, which leads to a total running time of $O(n \log n + m)$.
In addition, our algorithm is simpler than those in the literature, easy to implement, and very flexible, thus opening the way to extensions to directed and temporal random graphs, generalization to higher-order structures, and improvements through parallelization.
2026-04-23T10:06:31Z
18 pages, 2 figures, submitted to 34th Annual European Symposium on Algorithms (ESA 2026)
Gianlorenzo D'Angelo, Riccardo Michielan

http://arxiv.org/abs/2604.19227v1
SignatureTensors.jl: A Package for Signature Tensors in Julia
2026-04-21T08:34:14Z
We introduce SignatureTensors.jl, a new package for computing signature tensors of paths in Julia. We present its core functionality and demonstrate its use through illustrative examples. The package is compatible with the computer algebra system OSCAR, enabling both exact and numerical computations with signatures.
2026-04-21T08:34:14Z
Gabriel Riffo, Leonard Schmitz

http://arxiv.org/abs/2604.19004v1
Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU
2026-04-21T02:46:07Z
In computational science and data analytics, many workloads involve irregular and sparse computations that are inherently difficult to optimize for modern hardware. A key kernel is Sparse General Matrix-Matrix Multiplication (SpGEMM), which underpins simulations, graph analytics, and machine learning applications. SpGEMM exhibits irregular memory access patterns and workload imbalance, making it challenging to achieve high performance on GPUs. Current GPU SpGEMM solutions typically rely on a two-pass workflow to address load imbalance and reduce memory access. The symbolic pass, which determines the number of output elements per row, accounts for roughly 28% of the total runtime on average. In this work, we question the necessity of exact symbolic computation and introduce an estimation-based SpGEMM workflow.
Our approach replaces the costly symbolic step with lightweight HyperLogLog estimators, combined with an analysis strategy that dynamically selects the optimal workflow and guides accumulator configuration. In addition, we introduce a hybrid accumulator design, including a novel hash-based accumulator that leverages both shared and global memory. Our approach consistently outperforms leading GPU SpGEMM implementations across a wide range of both square and rectangular matrices, achieving speedups of 1.4x-2.8x on NVIDIA A100 and H100 architectures.
2026-04-21T02:46:07Z
2026 International Conference on Supercomputing (ICS '26), July 06--09, 2026, Belfast, United Kingdom
Yifan Li, Giulia Guidi
10.1145/3797905.3807868

http://arxiv.org/abs/2604.18276v1
Block-encodings as programming abstractions: The Eclipse Qrisp BlockEncoding Interface
2026-04-20T13:51:06Z
Block-encoding is a foundational technique in modern quantum algorithms, enabling the implementation of non-unitary operations by embedding them into larger unitary matrices. While theoretically powerful and essential for advanced protocols like Quantum Singular Value Transformation (QSVT) and Quantum Signal Processing (QSP), the generation of compilable implementations of block-encodings poses a formidable challenge. This work presents the BlockEncoding interface within the Eclipse Qrisp framework, establishing block-encodings as a high-level programming abstraction accessible to a broad scientific audience. Serving as both a technical framework introduction and a hands-on tutorial, this paper explicitly details key underlying concepts abstracted away by the interface, such as block-encoding construction and qubitization, and their practical integration into methods like the Childs-Kothari-Somma (CKS) algorithm.
We outline the interface's software architecture, encompassing constructors, core utilities, arithmetic composition, and algorithmic applications such as matrix inversion, polynomial filtering, and Hamiltonian simulation. Through code examples, we demonstrate how this interface simplifies both the practical realization of advanced quantum algorithms and their associated resource estimation.
2026-04-20T13:51:06Z
11 pages
Matic Petrič, René Zander

http://arxiv.org/abs/2510.02878v2
On the energy efficiency of sparse matrix computations on multi-GPU clusters
2026-04-15T12:38:42Z
We investigate the energy efficiency of a library designed for parallel computations with sparse matrices. The library leverages high-performance, energy-efficient Graphics Processing Unit (GPU) accelerators to enable large-scale scientific applications. Our primary development objective was to maximize parallel performance and scalability in solving sparse linear systems whose dimensions far exceed the memory capacity of a single node. To this end, we devised methods that expose a high degree of parallelism while optimizing algorithmic implementations for efficient multi-GPU usage. Previous work has already demonstrated the library's performance efficiency on large-scale systems comprising thousands of NVIDIA GPUs, achieving improvements over state-of-the-art solutions. In this paper, we extend those results by providing energy profiles that address the growing sustainability requirements of modern HPC platforms. We present our methodology and tools for accurate runtime energy measurements of the library's core components and discuss the findings. Our results confirm that optimizing GPU computations and minimizing data movement across memory and computing nodes reduces both time-to-solution and energy consumption.
Moreover, we show that the library delivers substantial advantages over comparable software frameworks on standard benchmarks.
2025-10-03T10:35:14Z
Massimo Bernaschi, Alessandro Celestini, Pasqua D'Ambra, Giorgio Richelli

http://arxiv.org/abs/2603.21011v2
ALL-FEM: Agentic Large Language models Fine-tuned for Finite Element Methods
2026-04-13T22:50:31Z
Finite element (FE) analysis guides the design and verification of nearly all manufactured objects. It is at the core of computational engineering, enabling simulation of complex physical systems, from fluids and solids to multiphysics systems. However, implementing FE codes and analyzing simulation results demands expertise across numerical analysis, continuum mechanics, and programming. Conventional Large Language Models (LLMs) can generate FE code, but they hallucinate, lack awareness of variational structures, and cannot close the loop from problem statement to a verified solution. Here, we propose ALL-FEM, an autonomous simulation system that integrates agentic AI with domain-specific, fine-tuned LLMs for FEniCS code generation across solid, fluid, and multiphysics applications. We construct a corpus of 1000+ verified FEniCS scripts by combining 500+ curated expert codes with a retrieval-augmented, multi-LLM pipeline that generates and filters codes for diverse PDEs, geometries, and boundary conditions. We used the corpus to fine-tune LLMs with 3B to 120B parameters. Our agentic framework orchestrates specialized agents, powered by fine-tuned LLMs, to formulate problems as PDEs, generate and debug code, and visualize the results. We evaluated the system on 39 benchmarks that include problems of linear/nonlinear elasticity, plasticity, Newtonian/non-Newtonian flow, thermofluids, fluid-structure interaction, phase separation, and transport on moving domains.
Embedded in a multi-agent workflow with runtime feedback, the best fine-tuned model (GPT OSS 120B) achieves code-level success of 71.79%, outperforming a non-agentic deployment of GPT 5 Thinking. By showing that relatively small, fine-tuned LLMs, orchestrated through agentic frameworks, can automate FE workflows, ALL-FEM offers a blueprint for autonomous simulation systems in computational science and engineering.
2026-01-08T21:25:59Z
Rushikesh Deotale, Adithya Srinivasan, Yuan Tian, Tianyi Zhang, Pavlos Vlachos, Hector Gomez

http://arxiv.org/abs/2507.01770v4
Global optimization tailored for graphics processing units: Complete and rigorous search for large-scale nonlinear minimization
2026-04-13T19:14:59Z
This paper introduces a numerical method to enclose the global minimum of a nonlinear function subject to simple bounds on the variables. Using interval analysis, coupled with the computational power and architecture of graphics processing units (GPUs), the method iteratively rules out the regions in the search domain where the global minimum cannot exist and leaves a finite set of regions where the global minimum must exist. For effectiveness, because of the rigor of interval analysis, the method is guaranteed to enclose the global minimum even in the presence of rounding errors. For efficiency, the method employs a novel GPU-based single program, single data parallel programming style to circumvent major GPU performance bottlenecks, and a variable cycling technique is also integrated into the method to reduce computational cost when minimizing large-scale nonlinear functions. The method is validated by minimizing 11 benchmark test functions with scalable dimensions, including the well-known Ackley function, Griewank function, Levy function, Rastrigin function, and Rosenbrock function.
These benchmark test functions represent grand challenges of global optimization, and enclosing the guaranteed global minimum of these benchmark test functions with more than 80 dimensions has not been reported in the literature. Our method completely searches the feasible domain and successfully encloses the guaranteed global minimum of these 11 benchmark test functions with up to 10,000 dimensions using only one GPU in a reasonable computation time, far exceeding the reported results in the literature due to the unique method design and implementation based on GPU architecture.
2025-07-02T14:54:52Z
35 pages, 4 figures
PNAS Nexus, 5(4), pp. pgag103 (2026)
Guanglu Zhang, Qihang Shan, Jonathan Cagan
10.1093/pnasnexus/pgag103