https://arxiv.org/api/kmz9Z1eMWjlsousgGXjThZWkjgg 2026-06-15T08:36:27Z 2652 30 15 http://arxiv.org/abs/2605.18686v2 critband: A Python Package for Critical Bandwidth Analysis of Multimodal Distributions 2026-05-19T17:56:25Z

Multimodal density estimation is a fundamental problem in scientific computing. Determining the number of modes in a distribution is a core numerical challenge with applications across ecology, economics, genomics, and astronomy. While the R ecosystem provides mature tools through the multimode package, the Python ecosystem has lacked an equivalent cohesive implementation. We present critband, a Python package for critical bandwidth bimodality detection based on Silverman's kernel density approach. The package implements critical bandwidth search with a robust bracketed mode-count solver and FFT-accelerated KDE, and provides additional features including k-mode detection, component decomposition, bimodality strength quantification, and excess mass estimation. Validation against twelve benchmark cases spanning separation regimes, unequal variances, unequal weights, and small sample sizes shows stable estimates for clearly separated cases and expected instability for boundary cases. Performance benchmarks show critband is typically 3-10 times faster per case than R's modetest() in the tested setup.

2026-05-18T17:23:41Z 12 pages Ruiyu Zhang Qihao Wang http://arxiv.org/abs/2605.19254v1 A C implementation of the Smith massager algorithm 2026-05-19T01:58:00Z

We describe a C implementation of the Las Vegas algorithm of Birmpilis, Labahn and Storjohann from 2020 for computing the Smith normal form of a nonsingular integer matrix. The algorithm computes a Smith massager for the input matrix using $O(n^ω\, \B(\log n + \log \|A\|)\, (\log n)^2)$ bit operations, which is softly equivalent to the cost of multiplying two matrices of the same dimension and entry size. We describe the key implementation techniques that bridge the gap between the theoretical algorithm and practical performance, including BLAS-accelerated modular arithmetic via the Residue Number System and an adaptive batching scheme that collapses the theoretical $O(\log n)$ iterations to $O(1)$ in practice. Experiments on matrices of dimension up to $n = 10007$ show that the implementation's running time scales proportionally to that of a single BLAS matrix multiplication, with both exhibiting the same effective growth rate on a log-log plot.

2026-05-19T01:58:00Z Ziwen Wang Stavros Birmpilis George Labahn Arne Storjohann http://arxiv.org/abs/2605.17884v1 Optimizing Optimizations, Declaratively: Optimizing the Higher-Order Functions in Mathematical Optimization with egglog 2026-05-18T05:45:01Z

We present two applications of egglog to mathematical optimization in JijModeling 2, a mathematical modeller whose internal representation is based on simply typed $λ$-calculus. First, we use egglog to improve $\LaTeX$ output for mathematical models expressed with higher-order functions. Python comprehensions are desugared into stream operations such as $\textsf{map}$, $\textsf{flat_map}$, and $\textsf{filter}$; emitting these terms directly produces unnatural mathematical notation. We reconstruct comprehension syntax by \emph{ensugaring} higher-order terms and use equality saturation with a custom cost model to minimize temporary variable rebindings. Second, we use egglog as a declarative engine for \emph{constraint detection}, extending the previous egg-based approach presented at EGRAPHS '25. Egglog's datalog-style rules let us express multi-step detection logic directly, without external Rust orchestration code. We encode parametrized constraints using \emph{Henkin-like constants} and propagate side conditions on subterms and indices through egglog facts. Finally, we show that the same ensugaring procedure also reduces large domain-set conditions before saturation, turning a problematic detection case from minutes or nontermination into a few seconds. Through these topics, we want to provide an example of an industrial application of egglog, demonstrate the trick to propagate the constraints using the ideas from mathematical logic, and show the importance of optimizing \emph{premises} of egglog rules to get practical performance in egglog programs.

2026-05-18T05:45:01Z To be presented at EGRAPHS '26 https://pldi26.sigplan.org/home/egraphs-2026 Hiromi Ishii http://arxiv.org/abs/2602.15965v3 FLoPS: Semantics, Operations, and Properties of P3109 Floating-Point Representations in Lean 2026-05-17T14:25:56Z

The upcoming IEEE-P3109 standard for low-precision floating-point arithmetic can become the foundation of future machine learning hardware and software. Unlike IEEE-754, P3109 introduces a parametric framework defined by bitwidth, precision, signedness, and domain. This flexibility results in a vast combinatorial space of formats -- some with as little as one bit of precision -- alongside novel features such as stochastic rounding and saturation arithmetic. These deviations create a unique verification gap that this paper intends to address. This paper presents FLoPS, Formalization in Lean of the P3109 Standard, which is a comprehensive formal model of P3109 in Lean. Our work serves as a rigorous, machine-checked specification that facilitates deep analysis of the standard. We demonstrate the model's utility by verifying foundational properties and analyzing key algorithms within the P3109 context. Specifically, we reveal that FastTwoSum exhibits a novel property of computing exact "overflow error" under saturation using any rounding mode, whereas previously established properties of the ExtractScalar algorithm fail for formats with one bit of precision. This work provides a verified foundation for reasoning about P3109 and enables formal verification of future numerical software. Our Lean development is open source and publicly available.

2026-02-17T19:30:24Z 18 pages Tung-Che Chang Sehyeok Park Jay P Lim Santosh Nagarakatte http://arxiv.org/abs/2605.17424v1 A Hybrid Optimization Framework for Spatial Packaging of Interconnected Systems 2026-05-17T12:39:05Z

This paper presents an optimization framework for Spatial Packaging of Interconnected Systems with Physical Interactions (SPI2) that addresses the geometric challenges of three-dimensional component placement and routing. While SPI2 generally includes physical interactions, this study isolates the spatial optimization aspect to evaluate placement and routing performance independently. The framework integrates the Maximal Disjoint Ball Decomposition (MDBD) for geometric abstraction with a hybrid optimization strategy that combines stochastic initialization and gradient-based refinement with interior point optimization. It is formulated to handle the nonlinear, non-convex, and continuous characteristics of spatially coupled design problems. The proposed framework is evaluated against a use case from prior SPI2 research and tested with a newly introduced benchmark that enables verifiable assessment of optimization performance. Results indicate that the presented method achieves more than a 10% improvement over existing SPI2 implementations and converges to spatially analytical optima across various benchmark scenarios. Benchmark experiments show solution accuracy of 0.6-2% relative to the ground truth.

2026-05-17T12:39:05Z S. Westerhof T. Hofman http://arxiv.org/abs/2605.17387v1 Spatial Optimization of Interconnected Systems in Non-Convex Design Spaces 2026-05-17T11:11:23Z

This paper presents a spatial optimization methodology that extends the Spatial Packaging of Interconnected Systems with Physical Interaction (SPI2) framework to support arbitrary, non-convex design boundaries. We introduce a smooth, differentiable inside-outside evaluation for components represented using the Maximal Disjoint Ball Decomposition (MDBD) method. The framework also incorporates center-of-gravity and moment-of-inertia calculations directly into the optimization, and provides an end-to-end computer-aided design (CAD) workflow for importing components and reconstructing the optimized assembly. The method is demonstrated on a fictional aircraft auxiliary unit. Results show that the optimizer can place multiple interconnected components within a custom geometry while simultaneously handling routing and physics-based objectives. The approach maintains geometric feasibility within numerical tolerance and illustrates the potential of MDBD-based SPI2 methods for practical engineering design applications.

2026-05-17T11:11:23Z n/a S. Westerhof T. Hofman http://arxiv.org/abs/2605.16169v1 LeanBET: Formally-verified surface area calculations in Lean 2026-05-15T16:48:52Z

The Brunauer--Emmett--Teller (BET) method is a standard tool for estimating surface areas from adsorption isotherms, yet practical implementations involve multiple algorithmic steps whose correctness is rarely made explicit. In this work, we present a fully executable and formally verified BET analysis pipeline implemented in the Lean~4 theorem prover. Our formalization covers the complete BET Surface Identification (BETSI)-style workflow, including window enumeration, monotonicity checks, knee selection, and linear regression. We carry out computations in floating-point arithmetic and develop the corresponding correctness proofs over the real numbers, using a shared polymorphic implementation that supports both. On the proof side, we show that the regression coefficients returned by the algorithm agree with their specification-level definitions and minimize the least-squares error under the stated assumptions. We also formalize the algebraic derivation of the BET linearized expression and connect that result directly to the executable analysis pipeline. We further prove that the window enumeration is sound and complete, and that the admissibility checks and knee-based selection satisfy their formal specifications. We evaluate the implementation against the BETSI reference method on benchmark adsorption isotherms. Compared to BETSI, LeanBET agrees to machine precision for 18 of the 19 isotherms, with only a 0.03\% deviation for the UiO-66 dataset. This demonstrates that a scientific computing workflow can be built in Lean, yielding both formal verification guarantees and numerical agreement with an established Python reference implementation.

2026-05-15T16:48:52Z Ejike D. Ugwuanyi Colin T. Jones John Velkey Tyler R. Josephson http://arxiv.org/abs/2601.17198v3 Odd but Error-Free FastTwoSum: More General Conditions for FastTwoSum as an Error-Free Transformation for Faithful Rounding Modes 2026-05-15T16:42:49Z

This paper proposes sufficient, yet more general conditions for applying FastTwoSum as an error-free transformation (EFT) under all faithful rounding modes. Additionally, it also identifies guarantees tailored to round-to-odd for establishing FastTwoSum as an EFT. This paper also describes a floating-point splitting tailored for round-to-odd that is an EFT where the distribution of bits is configurable (i.e., ExtractScalar for round-to-odd). Our sufficient conditions are more general than those previously known in the literature (i.e., it applies to a wider operand domain).

2026-01-23T22:06:52Z 10 pages Sehyeok Park Jay P. Lim Santosh Nagarakatte http://arxiv.org/abs/2605.16058v1 High-Performance Star-M SVD for Big Data Compression 2026-05-15T15:24:03Z

In the era of big data, effectively compressing large datasets while performing complex mathematical operations is crucial. Tensor-based decomposition methods have shown superior compression capabilities with minimal loss of accuracy compared to traditional matrix methods. Under the star-M tensor framework, tensors can be decomposed in a matrix-mimetic way, including using the star-M SVD. This tensor SVD has optimality guarantees and has shown exceptional performance on specific types of data, but software implementations have been mostly limited to productivity-oriented languages. In this work, we present our development of a shared-memory parallel, high-performance solution designed to efficiently implement the underlying algorithms. This software will enable optimal compression of extensive scientific datasets, paving the way for enhanced data analysis and insights.

2026-05-15T15:24:03Z Md Taufique Hussain Grey Ballard Aditya Devarakonda Srinivas Eswar Naman Pesricha Vishwas Rao http://arxiv.org/abs/2605.15547v1 Correctly Rounded Functions For Vector Applications: A Performance Study 2026-05-15T02:39:39Z

Following recent interest in correctly rounded math library functions (as currently recommended by the IEEE 754 standard), we have designed several SIMD algorithms for one-input single precision functions and integrated them into our CPU math library; these will form the core of the first correctly rounded vector math library, to be available to users in mid-2026. To take advantage of the cross-platform bitwise reproducibility afforded by correct rounding, we adapted and evaluated a few SIMD implementations on graphics processing units (GPU). In addition, we designed and evaluated proof-of-concept SIMD implementations of two correctly rounded double precision functions.

2026-05-15T02:39:39Z 5 pages, 2 figures, 4 tables Cristina Anderson Marius Cornea Andrey Stepin Mihai Tudor Panu http://arxiv.org/abs/2605.13736v1 Porting the Nonlinear Optimization Library HiOp to Accelerator-Based Hardware Architectures 2026-05-13T16:13:32Z

While interior point methods have been the centerpiece of nonlinear programming tools used in science and engineering, their reliance on linear solvers that can tackle sparse symmetric indefinite and highly ill-conditioned problems made it difficult to implement them effectively on hardware accelerators. At this time, there are few sparse linear solvers that can be used in this context. Here, we present a novel formulation of an interior point method implemented in our HiOp library, which is designed to be able to run entirely on hardware accelerators. This formulation avoids dependence on sparse solvers altogether, which is achieved by compressing the underlying sparse linear problem into a dense one of manageable size. We demonstrate feasibility of this approach and provide a baseline for future interior point method implementations on hardware accelerators. Our investigation is motivated by problems arising in optimal power flow analysis in power systems engineering and our approach is tailored to the broad class of problems arising in that important domain. We also demonstrate utility of modern programming models based on performance portability libraries, namely, Umpire and RAJA. We discuss trade-offs between performance, portability and development cost in the solution space for this non-linear optimization problem. As a result of this research, we demonstrate for the first time that interior point methods for sparse problems can be efficiently realized on modern computing systems where more than 90% of processing power is in GPUs.

2026-05-13T16:13:32Z Slaven Peles Kalyan S. Perumalla Maksudul Alam Asher J. Mancinelli R. Cameron Rutherford Jake Ryan Cosmin G. Petra http://arxiv.org/abs/2605.13607v1 Ergodicity Library: A Python Toolkit for Stochastic-Process Simulation, Time-Average Diagnostics, and Agent-Based Experiments 2026-05-13T14:41:28Z

ergodicity is an open-source Python library for computational work on stochastic dynamics, with particular emphasis on non-ergodicity, time-average behavior, heavy-tailed processes, and decision making under uncertainty. The package brings together three layers that are often split across ad hoc scripts: process definitions and simulators, analysis and fitting tools, and agent-based experimentation. This article documents the implemented software rather than presenting new stochastic theory. We describe the package architecture, the supported process families, the analysis workflow, and the practical boundaries of the current implementation. We also provide fully reproducible examples covering heavy-tailed ensemble spread, multiplicative Levy growth diagnostics, adaptive memory mean reversion, preasymptotic fluctuation analysis, and partial stochastic differential equation simulation. The package is positioned as an integration layer on top of the scientific Python stack, reducing the amount of glue code required to move from process specification to diagnostics and comparative experiments.

2026-05-13T14:41:28Z Ihor Kendiukhov http://arxiv.org/abs/2605.12443v1 Basilisk and Docker for Reproducible GN&C Simulation: A Workflow Reference 2026-05-12T17:37:37Z

Basilisk is an open-source astrodynamics simulation framework widely used for spacecraft guidance, navigation, and control (GN&C) research and development. Despite its flexibility and computational capabilities, configuring Basilisk consistently across heterogeneous development environments presents practical challenges due to dependency management, operating system compatibility, and software configuration requirements. This paper presents a Docker-based containerization workflow for Basilisk that encapsulates the complete build environment, dependencies, and simulation infrastructure within a portable container image. The workflow is demonstrated through a progression of simulation scenarios of increasing complexity, from standalone orbital dynamics scripts to BSKSim-based attitude dynamics and control simulations with Monte Carlo analysis. The BSKSim class hierarchy, dynamics model architecture, flight software implementation, and scenario execution patterns are described in detail. The presented workflow provides a self-contained implementation reference for GN&C engineers and researchers seeking reproducible and portable Basilisk simulation environments. This work expands upon a workshop presentation delivered at the 46th Rocky Mountain AAS GN&C Conference, February 2024, available at https://doi.org/10.5281/zenodo.15008785.

2026-05-12T17:37:37Z 21 pages, 8 figures Anubhav Gupta http://arxiv.org/abs/2605.12583v1 QuPort: Topology-, Port-, and Congestion-Aware Compilation for Modular Multi-QPU Quantum Systems 2026-05-12T17:12:30Z

Modular quantum processors require a compiler to reason about two resources at the same time: local device connectivity and communication across QPUs. A mapping that is acceptable on a single coupling graph may be unsuitable for a modular machine if it creates excessive cross-QPU traffic, concentrates that traffic on a small number of interconnect links, or assigns many boundary qubits to a QPU with few communication ports. This paper presents QuPort, a Python and Qiskit-based compilation framework that studies this setting through an explicit three-level model: a weighted logical interaction graph, a directed physical coupling map, and an undirected QPU-level interconnect graph. The main partitioning method, TPCCAP, optimizes the implemented objective formed by weighted cut distance, communication-port overflow, and routed link-load congestion. The framework also includes heavy-edge clustering, balanced greedy partitioning, simulated-annealing refinement, communication-port-aware layout, extraction of remote two-qubit operations, local-only routing of per-QPU circuits, and topology-aware schedule estimation. The model is a compiler-level abstraction. It does not claim a calibrated hardware runtime or an implementation of a physical remote-gate protocol.

2026-05-12T17:12:30Z Soumyadip Sarkar Subhasree Bhattacharjee http://arxiv.org/abs/2605.10678v1 A Performance-Portable, Massively Parallel Distributed Nonuniform FFT 2026-05-11T14:56:41Z

The nonuniform fast Fourier transform (NUFFT) enables spectral methods for problems with irregularly spaced samples, with applications in medical imaging, molecular dynamics, and kinetic plasma simulations. Existing implementations are limited to shared-memory execution, restricting problem sizes to what fits on a single node. We present the first distributed, performance-portable NUFFT for heterogeneous supercomputers. Our Kokkos-based implementation runs without modification on NVIDIA and AMD GPUs. We develop multiple spreading and interpolation kernels optimized for different accuracy requirements and architectures. Our spreading kernels match or exceed the single-GPU throughput of the state-of-the-art CUDA-based NUFFT library cuFINUFFT at production particle densities, while our Kokkos-based implementation additionally supports AMD GPUs. Strong scaling experiments on Alps (NVIDIA GH200), JUWELS Booster (NVIDIA A100), and LUMI (AMD MI250X) demonstrate scaling up to 1024 GPUs. At scale, the distributed FFT is a significant part of the total runtime, making higher NUFFT accuracy less expensive. We apply the method to massively parallel Particle-in-Fourier simulations of Landau damping with up to $1024^3$ Fourier modes and 8.6 billion particles on Alps, JUWELS, and LUMI, demonstrating that distributed NUFFTs enable kinetic plasma simulations at resolutions previously inaccessible to spectral particle methods.

2026-05-11T14:56:41Z Accepted in The Platform for Advanced Scientific Computing (PASC26) conference proceedings Paul Fischill Andreas Adelmann Sriramkrishnan Muralikrishnan