https://arxiv.org/api/0sItsfNY+OZ6Ii0JgxdByBrdpvk 2026-06-09T20:32:15Z 2652 0 15 http://arxiv.org/abs/2606.09686v1 An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats 2026-06-08T16:04:15Z Numeric format proliferation in machine learning hardware -- FP8 (E4M3 and E5M2), BF16, MXFP4, microscaling block formats, and dozens of research variants -- has outpaced the availability of vendor-neutral, bit-exact reference material. Engineers porting models across accelerators encounter silent divergences that are difficult to diagnose without a shared ruler. This paper describes a catalog of 84 numeric formats spanning 13 families, a suite of six bit-exact conformance packs covering GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, and E8M0 block scale, and an IEEE P3109 v3.2.0 cross-walk that maps each pack to its corresponding standards-track configured format. Each pack is a self-contained JSON document with a SHA-256 fingerprint, a shared row schema, and an anchor vector that encodes 3.0 -- the identity phi^2 + 1/phi^2 = 3 -- as a cross-pack sanity check. Packs are cross-validated against ml_dtypes 0.5.4 (Google/JAX); any divergence is documented explicitly and interpreted as a spec-permitted interpretation gap rather than hidden. The work is framed as registry filling: it does not propose new formats, make model-accuracy claims, or assert superiority over any vendor's implementation. All artifacts are publicly available at https://github.com/gHashTag/t27 under an open license. 2026-06-08T16:04:15Z 17 pages. Source repository: https://github.com/gHashTag/paper3-methodology tag v4.0-trinity. Paper CC BY 4.0; code MIT. 
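As a rough illustration of the anchor vector described in the abstract above, the following sketch checks the identity phi^2 + 1/phi^2 = 3 numerically and encodes 3.0 in three of the listed formats. The helper name encode_normal is hypothetical, and the exponent biases (127 for BF16, 7 for FP8 E4M3, 15 for FP8 E5M2) are the standard ones, assumed here; the packs' actual JSON row schema is not described in this listing and is not reproduced.

```python
import math

def encode_normal(x, exp_bits, man_bits, bias):
    """Encode a positive, exactly representable normal value as an integer bit pattern.

    Hypothetical helper for illustration only: no rounding, no subnormals,
    no special values, and a simplified exponent range check.
    """
    m, e = math.frexp(x)                      # x = m * 2**e with 0.5 <= m < 1
    significand = m * 2                       # rescale to [1, 2)
    biased = (e - 1) + bias                   # biased exponent field
    frac = (significand - 1) * (1 << man_bits)
    if frac != int(frac):
        raise ValueError("value is not exactly representable")
    if not 0 < biased < (1 << exp_bits):
        raise ValueError("exponent out of range for a normal value")
    return (biased << man_bits) | int(frac)   # sign bit is 0

phi = (1 + math.sqrt(5)) / 2
anchor = phi**2 + 1 / phi**2                  # equals 3 up to floating-point error

bf16 = encode_normal(3.0, exp_bits=8, man_bits=7, bias=127)
e4m3 = encode_normal(3.0, exp_bits=4, man_bits=3, bias=7)
e5m2 = encode_normal(3.0, exp_bits=5, man_bits=2, bias=15)

print(f"{anchor:.12f}")                       # 3.000000000000
print(hex(bf16), hex(e4m3), hex(e5m2))        # 0x4040 0x44 0x42
```

Because 3.0 = 1.5 * 2^1 needs only one fraction bit, it is exactly representable in each of these three formats, which is presumably what makes it convenient as a cross-pack sanity check.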
ORCID 0009-0008-4294-6159 Dmitrii Vasilev http://arxiv.org/abs/2604.27210v2 Fast-Vollib: A Fast Implied Volatility Library for Python with PyTorch, JAX, and CUDA Fused-Kernel Backends 2026-06-08T10:21:56Z We present fast-vollib, an open-source Python library that provides high-performance European option pricing, implied volatility (IV) computation, and Greeks under the Black-76, Black-Scholes, and Black-Scholes-Merton models. The library is designed as a drop-in alternative to the de-facto-standard py_vollib and py_vollib_vectorized packages, with pluggable PyTorch and JAX execution backends, a CUDA fused-kernel Triton contribution for batched IV workloads, and a compatibility-first public API. In addition to a vectorized Halley-method IV solver, fast-vollib ships an experimental, fully-vectorized implementation of Jäckel's "Let's Be Rational" (LBR) algorithm with NumPy/Numba, torch.compile, JAX, and Triton single-pass GPU kernels for batched option chains. This note announces the library and describes its public API surface, with source, documentation, and packaging artifacts available at: GitHub (https://github.com/raeidsaqur/fast-vollib), Docs (https://raeidsaqur.github.io/fast-vollib/), PyPI (https://pypi.org/project/fast-vollib/). 2026-04-29T21:29:32Z 5 pages, 1 figure, 1 table. Software announcement / reference note. Code: https://github.com/raeidsaqur/fast-vollib. Install: pip install fast-vollib Raeid Saqur http://arxiv.org/abs/2604.06922v4 A Practical Introduction to Tensor Network Renormalization with TNRKit.jl 2026-06-08T09:07:30Z We present TNRKit, an open-source Julia package for Tensor Network Renormalization (TNR) of two- and three-dimensional classical statistical models and Euclidean lattice field theories. Built on top of TensorKit, it provides a symmetry-aware framework for constructing tensor-network representations of partition functions and coarse-graining them using methods such as TRG, HOTRG, and LoopTNR. 
Beyond thermodynamic quantities, the package enables the extraction of universal conformal data -- including scaling dimensions and the central charge -- directly from fixed-point tensors. TNRKit is designed with both usability and extensibility in mind, offering a practical platform for applying, benchmarking, and developing modern tensor renormalization algorithms. This paper also serves as a self-contained introduction to the TNR framework. 2026-04-08T10:23:24Z Victor Vanthilt Adwait Naravane Chenqi Meng Atsushi Ueda http://arxiv.org/abs/2606.09001v1 JAX-AMG: A GPU-Accelerated Differentiable Sparse Linear Solver Library for JAX 2026-06-08T03:57:19Z Sparse linear systems from PDE discretizations are central to scientific computing, yet no existing JAX-ecosystem solver simultaneously provides GPU-accelerated algebraic multigrid (AMG), automatic differentiation (AD), and distributed multi-GPU execution. JAX-AMG fills this gap by wrapping the Nvidia AmgX solver suite as a native JAX primitive, exposing AMG and Krylov methods with configurable preconditioners through a unified interface compatible with JIT compilation, reverse-mode AD via adjoint methods, batched solves, and MPI-based distributed execution. Solver caching amortizes setup costs across repeated solves, making JAX-AMG practical for PDE-constrained optimization and inverse problems. The result is a robust, scalable sparse linear algebra layer that integrates seamlessly into differentiable simulation and scientific machine learning pipelines. 2026-06-08T03:57:19Z Yi Liu Xiantao Fan Jian-Xun Wang http://arxiv.org/abs/2605.06057v3 FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication 2026-06-08T03:12:53Z Peak breaking Matrix Multiplication is a promising technique to improve the performance of DL, especially in LLM training and inference. 
We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. There are three key innovations: (1) a Deployment Module that enables portable execution across various hardware and input configurations through code generation; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model to select the optimal strategy based on matrix shapes and hardware profiles. Extensive evaluation is conducted on LLM workloads across GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM succeeds in delivering peak breaking performance and outperforms GEMM libraries (e.g., cuBLAS, CUTLASS, and Intel MKL) by 7.59%-17.85% and LCMA competitors like AlphaTensor by 12.41%-55.61%. Our framework makes the theoretical promise of LCMAs practical for production deployment across the heterogeneous landscape of modern hardware. 2026-05-07T11:41:54Z Honglin Zhu Jiaping Cao Jiang Shao Siyuan Feng Qian Qiu Peng Chen Xu Zhang Yixian Zhou Man Lung Yiu Guang Ji Minwen Deng Jintao Meng Wenxi Zhu http://arxiv.org/abs/2510.12705v3 Accelerating Bidiagonalization of Banded Matrices through Memory-Aware Bulge-Chasing on GPUs 2026-06-08T01:40:20Z The reduction of a banded matrix to bidiagonal form is a critical step in the calculation of Singular Values, a cornerstone of scientific computing and AI. Although inherently parallel, this step has traditionally been considered unsuitable for GPUs due to its memory-bound nature. However, recent advances in GPU architectures, such as increased L1 memory per Streaming Multiprocessor or Compute Unit and larger L2 caches, have shifted this paradigm. 
In this work, we present the first GPU-accelerated algorithm for reducing a banded matrix to bidiagonal form, integrated into an open-source software package. Our algorithm builds on prior multicore CPU cache-efficient bulge-chasing methods, adapted to modern GPU architectures to optimize throughput. Leveraging Julia's high-level array abstractions and KernelAbstractions.jl, we implement a single function that is both hardware-agnostic and data-precision-aware, running efficiently across NVIDIA, AMD, Intel, and Apple Metal GPUs. We develop a hardware-aware performance model to guide tuning and identify key hyperparameters that govern optimal GPU performance for memory-bound workloads. We show that such workloads, when carefully optimized, can achieve substantial speed-ups on modern GPUs: our implementation outperforms multithreaded CPU libraries (PLASMA, SLATE) starting from matrix sizes as small as 1024x1024, and achieves over 100x speed-up on 32k x 32k matrices. Moreover, the algorithm's performance scales linearly with the matrix bandwidth, enabling efficient reduction of matrices with larger bandwidths, previously considered impractical. 2025-10-14T16:39:29Z 14 pages, 7 figures, 3 tables Evelyne Ringoot Rabab Alomairy Alan Edelman http://arxiv.org/abs/2601.16510v3 Learning to Optimize by Differentiable Programming 2026-06-07T04:00:33Z Solving massive-scale optimization problems requires scalable first-order methods with low per-iteration cost. This tutorial highlights a shift in optimization: using differentiable programming not only to execute algorithms but to learn how to design them. Modern frameworks such as PyTorch, TensorFlow, and JAX enable this paradigm through efficient automatic differentiation. Embedding first-order methods within these systems allows end-to-end training that improves convergence and solution quality. 
Guided by Fenchel-Rockafellar duality, the tutorial demonstrates how duality-informed iterative schemes such as ADMM and PDHG can be learned and adapted. Case studies across LP, NNV, Sum-Rate maximization, OPF, and LRMP illustrate these gains. 2026-01-23T07:18:07Z Liping Tao Xindi Tong Chee Wei Tan http://arxiv.org/abs/2606.08366v1 MetaboliSim: a Python implementation of the Mader model for dynamic and steady-state simulation of muscular energy metabolism 2026-06-06T22:59:09Z The Mader model is the most widely used mathematical framework for muscular energy metabolism in German-language sport science, underpinning lactate diagnostics, maximal lactate steady state (MLSS) estimation and training prescription. Despite decades of use, neither its dynamic ODE formulation nor its steady-state equations have been available as open code, leaving results based on the model impossible to reproduce independently. We close this gap with MetaboliSim, an open-source Python implementation of both formulations: a dynamic model that integrates the five-variable ODE system (phosphate potential, $\dot{V}\mathrm{O}_2$, muscle and blood lactate, and glycogen) with a fourth-order Runge-Kutta scheme, and a steady-state model that computes MLSS power and the lactate-power relationship in one- and two-compartment variants. We verified implementation correctness against published reference values and assessed physiological plausibility across constant-load, step-test, sprint and running protocols. The implementation reproduces the published reference output within stated tolerances and remains numerically stable throughout (halving the time step changes blood lactate by less than 0.01 mmol/L), with both formulations yielding congruent MLSS estimates. 
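The step-halving stability check mentioned in the abstract above can be illustrated on a toy problem. The sketch below is not the Mader model: the five-variable ODE system is not given in this listing, so a single exponential-decay equation stands in for it, and the names rk4_step, integrate, and decay are hypothetical. It only shows the general pattern of integrating with a classical fourth-order Runge-Kutta scheme at step dt and again at dt/2, then comparing the end states.

```python
import math

def rk4_step(f, t, y, dt):
    """Advance y' = f(t, y) by one classical fourth-order Runge-Kutta step."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + dt / 2, y + dt / 2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(f, y0, t_end, dt):
    """Integrate from t = 0 to t_end with a fixed step size dt."""
    t, y = 0.0, y0
    for _ in range(round(t_end / dt)):
        y = rk4_step(f, t, y, dt)
        t += dt
    return y

def decay(t, y):
    """Toy stand-in ODE (assumption): first-order decay with rate 0.5."""
    return -0.5 * y

coarse = integrate(decay, 10.0, 4.0, dt=0.1)    # reference step size
fine = integrate(decay, 10.0, 4.0, dt=0.05)     # halved step size
exact = 10.0 * math.exp(-2.0)                   # closed-form solution at t = 4

print(abs(coarse - fine), abs(fine - exact))
```

If the difference between the two runs is small relative to the tolerance of interest, the step size is adequate for that quantity; the abstract reports such a criterion for blood lactate.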
Key physiological behaviour ($\dot{V}\mathrm{O}_2$ on-kinetics, lactate accumulation, PCr dynamics and the sub/supra-MLSS separation) emerges directly from the model equations without protocol-specific tuning, and a sensitivity analysis shows MLSS power varying approximately linearly with $\dot{V}\mathrm{O}_{2\max}$ and nonlinearly with $\dot{V}\mathrm{La}_{\max}$. As the first openly available implementation of the complete Mader model (AGPL-3.0), MetaboliSim lets independent groups reproduce, verify and build on published model-based results. Source code: https://codeberg.org/3phos/metabolisim; Platform: https://metabolisim.org 2026-06-06T22:59:09Z Katharina Dunst Vincent Scharf Clemens Hesse Alexander Asteroth http://arxiv.org/abs/2606.08339v1 Floating-point autotuning with customized precisions 2026-06-06T21:10:12Z Reduced-precision arithmetic offers significant opportunities to improve performance, memory usage, and energy efficiency in numerical applications, provided that numerical accuracy is preserved. This work investigates automated precision tuning through customized floating-point formats with user-defined exponent and significand sizes, enabling the emulation of emerging low-precision formats and the exploration of non-standard precision configurations within a unified mixed-precision framework. The proposed methodology, implemented in the PROMISE precision autotuning tool, combines numerical validation with a systematic search to generate program variants that satisfy user-defined accuracy requirements. To address the computational cost of this exploration, a containerized benchmarking framework supports parallel execution across multiple algorithms and parameter configurations. The approach is evaluated on a suite of numerical programs, including linear solvers and applications from the Rodinia benchmark. 
Results show that a substantial proportion of variables can be safely reduced to lower precision while preserving accuracy, indicating that standard double precision is often over-provisioned. These findings highlight the potential of automated precision tuning to derive efficient mixed-precision configurations tailored to application-specific accuracy requirements. 2026-06-06T21:10:12Z Xinye Chen Thibault Hilaire Fabienne Jézéquel http://arxiv.org/abs/2606.07062v1 CATEKAPPA: An R Shiny Application for Design and Analysis of Consistency Tests Based on the Kappa Statistic for Categorical Responses 2026-06-05T09:02:13Z The kappa statistic is the most widely used measure of inter-rater agreement for categorical data. Despite its popularity, applied researchers often encounter two major hurdles: (i) determining the sample size required to achieve a desired level of agreement with given power, and (ii) computing appropriate kappa coefficients with proper interpretation. Existing R packages such as irr and kappaSize provide these functionalities but require programming skills and lack an integrated, user-friendly interface. We present CATEKAPPA, an R package that bridges this gap by combining sample size planning (via kappaSize) and agreement analysis (via irr) into a single Shiny-based web application. The package supports Cohen's kappa for two raters, Fleiss' kappa for three or more raters, and Light's kappa, and provides automatic interpretation using the Landis & Koch scale. Users can either launch an interactive graphical interface or use command-line functions for scripting. The package is freely available on CRAN. 2026-06-05T09:02:13Z 10 pages, 4 figures; This open-source R package CATEKAPPA is available on CRAN at https://CRAN.R-project.org/package=catekappa, source code repository is hosted at https://github.com/satellite837/catekappa. Manuscript planned for submission to Journal of Statistical Software (JSS). 
Supplementary R package source code uploaded as ancillary file Zheng Gai Li Xincheng Jiang Wangyingjie Zhao Panwei http://arxiv.org/abs/2606.06386v1 On GPU Implementation for Multi-Precision Integer Division 2026-06-04T16:51:22Z This paper presents the issues arising in implementing a fast integer division algorithm on general purpose GPUs. The algorithm uses a Newton iteration based on the shifted inverse operation, keeping all arithmetic in the integer domain and relying on data-parallel operators. The principal contribution is an efficient GPU/CUDA implementation for integer precisions from $2^{15}$ to $2^{18}$ -- sizes not supported by CGBN division. We propose algorithmic refinements, define a cost model in terms of multiplications, build on prefix sums and previous work on multi-precision multiplication, and present an evaluation showing near-optimal performance relative to the model for the target precision. 2026-06-04T16:51:22Z Martin B. Marchioro Aske N. Raahauge Marc I. Løvenskjold Cosmin E. Oancea Stephen M. Watt http://arxiv.org/abs/2606.06310v1 RedZeD: Computing persistent homology by Reduction to Zero Differentials 2026-06-04T15:51:27Z We introduce a new algorithm for computing persistent homology of Vietoris--Rips filtrations, which in many cases offers a considerable speedup over the existing implementation of the persistence pairing algorithm. The key innovation, called active enumeration, is made possible by a new theoretical framework of Reduction to Zero Differentials (hence RedZeD) in which to view persistent homology. 2026-06-04T15:51:27Z 30 pages; comments welcome Chris Kapulkin Nathan Kershaw http://arxiv.org/abs/2606.05466v1 Look Before You Leap: Checking in on Type Tag Checking 2026-06-03T21:44:02Z Tagging of generic dynamic values is important in symbolic-computation and dynamic-language systems, but the trade-offs change as machine architectures and workloads evolve. 
In particular, old folklore about boxed values, immediate values, and type tags must be recalibrated from time to time. We revisit the performance of badged object headers, low-bit tagging, and two NaN-boxing layouts on a range of platforms in use today, including AArch64 and x86-64 architectures from different manufacturers. The experiments isolate two distinct effects: the cost avoided by not heap-allocating common scalar values, and the cost avoided by obtaining tag information from the value word rather than by performing a heap read. The results show that several local bit operations are often cheaper than opening a heap object to obtain a tag or small value. Low-bit tagging remains the simplest and usually fastest choice for mostly symbolic workloads, while NaN-boxing is close in access cost and avoids the time and space of heap allocation for ordinary floating-point values. 2026-06-03T21:44:02Z Stephen M. Watt http://arxiv.org/abs/2606.05017v1 GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF256 with a Lucas-Exact Integer Identity 2026-06-03T15:41:16Z We present a hardware-oriented description of GoldenFloat (GF), a static-split floating-point family generated by a single closed rule, and three concrete artefacts: (i) an open multi-width RTL generator covering GF4-GF256 with a continuous-integration differential sweep against a correctly-rounded reference; (ii) an integer-backed Lucas-exact accumulator path verified at 500-digit precision for n = 1, ..., 256; and (iii) a GF16 FPGA codec passing a 35-of-35 testbench at 323 MHz on Artix-7 (Xilinx XC7A35T). For each total width N >= 4, the exponent width is e = round((N-1)/phi^2) with fraction f = N-1-e and phi = (1+sqrt(5))/2. The rule reproduces the realised exponent widths of nine formats (9/9) and extends consistently to GF128, GF512, GF1024. The rule is positioned alongside posit, takum, OCP-MX, and the IEEE P3109 multi-width float draft. 
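The closed rule quoted above, e = round((N-1)/phi^2) with f = N-1-e, can be evaluated directly. The following is a minimal sketch, not the paper's RTL generator: the function name gf_split is hypothetical, Python's built-in round is assumed as the rounding step (its tie-breaking behaviour should not matter, since phi^2 is irrational and exact ties cannot occur), and the remaining bit is presumably the sign.

```python
import math

PHI = (1 + math.sqrt(5)) / 2   # golden ratio, as defined in the abstract

def gf_split(total_bits):
    """Return (exponent_bits, fraction_bits) for a total width N >= 4.

    Hypothetical helper applying e = round((N - 1) / phi^2), f = N - 1 - e.
    """
    if total_bits < 4:
        raise ValueError("the rule is stated for N >= 4")
    exp_bits = round((total_bits - 1) / PHI**2)
    frac_bits = total_bits - 1 - exp_bits
    return exp_bits, frac_bits

# Tabulate the split for a few widths in the GF4-GF256 range.
for n in (4, 8, 16, 32, 64):
    e, f = gf_split(n)
    print(f"GF{n}: exponent={e} fraction={f}")
```

For N = 16, (N-1)/phi^2 is roughly 5.73, so the rule gives a 6-bit exponent and a 9-bit fraction.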
We make no per-rung accuracy or superiority claim against any of them. The breadth/toolchain-coherence framing is recorded as an open conjecture with a pre-registered falsification path. A falsification ledger (FL-002) records open questions and the experiments that would settle them. An RTL-correctness erratum dated 2026-05-31 is reported; the fabricated TTSKY26b dies carry the defective multiplier portfolio, and the corrected generator is the regeneration baseline. 2026-06-03T15:41:16Z 19 pages, single-file LaTeX, ASCII source. RTL generator and CI artefacts at github.com/gHashTag/goldenfloat-preprint Dmitrii Vasiliev http://arxiv.org/abs/2606.04670v1 Fitting scattered data with optional monotonicity constraints on GPU: LipFit package 2026-06-03T09:51:56Z This paper presents a method of multivariate scattered data interpolation and approximation that produces an optimal Lipschitz-continuous approximation, subject to the desired monotonicity constraints. This method relies on tight upper and lower approximations to the data, and is similar in spirit to the nearest-neighbour approximation but does not suffer from discontinuities. Local Lipschitz interpolation and Lipschitz smoothing are also presented. This approach falls under the umbrella of instance-based approximation with no training phase, and it is suitable for GPU-based parallelisation. A GPU-friendly Python package, LipFit, which implements the methods discussed, is also presented. 2026-06-03T09:51:56Z Gleb Beliakov