https://arxiv.org/api/0sItsfNY+OZ6Ii0JgxdByBrdpvk2026-06-09T20:32:15Z2652015http://arxiv.org/abs/2606.09686v1An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats2026-06-08T16:04:15ZNumeric format proliferation in machine learning hardware -- FP8 (E4M3 and E5M2), BF16, MXFP4, microscaling block formats, and dozens of research variants -- has outpaced the availability of vendor-neutral, bit-exact reference material. Engineers porting models across accelerators encounter silent divergences that are difficult to diagnose without a shared ruler.
This paper describes a catalog of 84 numeric formats spanning 13 families, a suite of six bit-exact conformance packs covering GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, and E8M0 block scale, and an IEEE P3109 v3.2.0 cross-walk that maps each pack to its corresponding standards-track configured format. Each pack is a self-contained JSON document with a SHA-256 fingerprint, a shared row schema, and an anchor vector that encodes 3.0 -- the identity phi^2 + 1/phi^2 = 3 -- as a cross-pack sanity check. Packs are cross-validated against ml_dtypes 0.5.4 (Google/JAX); any divergence is documented explicitly and interpreted as a spec-permitted interpretation gap rather than hidden. The work is framed as registry filling: it does not propose new formats, make model-accuracy claims, or assert superiority over any vendor's implementation. All artifacts are publicly available at https://github.com/gHashTag/t27 under an open license.2026-06-08T16:04:15Z17 pages. Source repository: https://github.com/gHashTag/paper3-methodology tag v4.0-trinity. Paper CC BY 4.0; code MIT. ORCID 0009-0008-4294-6159Dmitrii Vasilevhttp://arxiv.org/abs/2604.27210v2Fast-Vollib: A Fast Implied Volatility Library for Pythonwith PyTorch, JAX, and CUDA Fused-Kernel Backends2026-06-08T10:21:56ZWe present fast-vollib, an open-source Python library that provides high-performance European option pricing, implied volatility (IV) computation, and Greeks under the Black-76, Black-Scholes, and Black-Scholes-Merton models. The library is designed as a drop-in alternative to the de-facto-standard py_vollib and py_vollib_vectorized packages, with pluggable PyTorch and JAX execution backends, a CUDA fused-kernel Triton contribution for batched IV workloads, and a compatibility-first public API. In addition to a vectorized Halley-method IV solver, fast-vollib ships an experimental, fully-vectorized implementation of Jäckel's "Let's Be Rational" (LBR) algorithm with NumPy/Numba, torch.compile, JAX, and Triton single-pass GPU kernels for batched option chains. This note announces the library and describes its public API surface, with source, documentation, and packaging artifacts available at: GitHub (https://github.com/raeidsaqur/fast-vollib), Docs (https://raeidsaqur.github.io/fast-vollib/), PyPI (https://pypi.org/project/fast-vollib/).2026-04-29T21:29:32Z5 pages, 1 figure, 1 table. Software announcement / reference note. Code: https://github.com/raeidsaqur/fast-vollib. Install: pip install fast-vollibRaeid Saqurhttp://arxiv.org/abs/2604.06922v4A Practical Introduction to Tensor Network Renormalization with TNRKit.jl2026-06-08T09:07:30ZWe present TNRKit, an open-source Julia package for Tensor Network Renormalization (TNR) of two- and three-dimensional classical statistical models and Euclidean lattice field theories. Built on top of TensorKit, it provides a symmetry-aware framework for constructing tensor-network representations of partition functions and coarse-graining them using methods such as TRG, HOTRG, and LoopTNR. Beyond thermodynamic quantities, the package enables the extraction of universal conformal data -- including scaling dimensions and the central charge -- directly from fixed-point tensors. TNRKit is designed with both usability and extensibility in mind, offering a practical platform for applying, benchmarking, and developing modern tensor renormalization algorithms. This paper also serves as a self-contained introduction to the TNR framework.2026-04-08T10:23:24ZVictor VanthiltAdwait NaravaneChenqi MengAtsushi Uedahttp://arxiv.org/abs/2606.09001v1JAX-AMG: A GPU-Accelerated Differentiable Sparse Linear Solver Library for JAX2026-06-08T03:57:19ZSparse linear systems from PDE discretizations are central to scientific computing, yet no existing JAX-ecosystem solver simultaneously provides GPU-accelerated algebraic multigrid (AMG), automatic differentiation (AD), and distributed multi-GPU execution. JAX-AMG fills this gap by wrapping the Nvidia AmgX solver suite as a native JAX primitive, exposing AMG and Krylov methods with configurable preconditioners through a unified interface compatible with JIT compilation, reverse-mode AD via adjoint methods, batched solves, and MPI-based distributed execution. Solver caching amortizes setup costs across repeated solves, making JAX-AMG practical for PDE-constrained optimization and inverse problems. The result is a robust, scalable sparse linear algebra layer that integrates seamlessly into differentiable simulation and scientific machine learning pipelines.2026-06-08T03:57:19ZYi LiuXiantao FanJian-Xun Wanghttp://arxiv.org/abs/2605.06057v3FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication2026-06-08T03:12:53ZPeak breaking Matrix Multiplication is a promising technique to improve the performance of DL, especially in LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. There are three key innovations: (1) a Deployment Module that enables portable execution across various hardware and input configurations through code generation; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model to select the optimal strategy based on matrix shapes and hardware profiles. Extensive evaluation is conducted on LLM workloads across GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM succeeds in delivering peak breaking performance and outperforms GEMM libraries (e.g., cuBLAS, CUTLASS, Intel MKL, etc) by 7.59%-17.85% and LCMA competitors like AlphaTensor by 12.41%-55.61%. Our framework makes the theoretical promise of LCMAs practical for production deployment across the heterogeneous landscape of modern hardware.2026-05-07T11:41:54ZHonglin ZhuJiaping CaoJiang ShaoSiyuan FengQian QiuPeng ChenXu ZhangYixian ZhouMan Lung YiuGuang JiMinwen DengJintao MengWenxi Zhuhttp://arxiv.org/abs/2510.12705v3Accelerating Bidiagonalization of Banded Matrices through Memory-Aware Bulge-Chasing on GPUs2026-06-08T01:40:20ZThe reduction of a banded matrix to bidiagonal form is a critical step in the calculation of Singular Values, a cornerstone of scientific computing and AI. Although inherently parallel, this step has traditionally been considered unsuitable for GPUs due to its memory-bound nature. However, recent advances in GPU architectures, such as increased L1 memory per Streaming Multiprocessor or Compute Unit and larger L2 caches, have shifted this paradigm. In this work, we present the first GPU-accelerated algorithm for reducing a banded matrix to bidiagonal form, integrated into an open-source software package. Our algorithm builds on prior multicore CPU cache-efficient bulge-chasing methods, adapted to modern GPU architectures to optimize throughput. Leveraging Julia's high-level array abstractions and KernelAbstractions.jl, we implement a single function that is both hardware-agnostic and data-precision-aware, running efficiently across NVIDIA, AMD, Intel, and Apple Metal GPUs. We develop a hardware-aware performance model to guide tuning and identify key hyperparameters that govern optimal GPU performance for memory-bound workloads. We show that such workloads, when carefully optimized, can achieve substantial speed-ups on modern GPUs: our implementation outperforms multithreaded CPU libraries (PLASMA,SLATE) starting from matrix sizes as small as 1024x1024, and achieves over 100x speed-up on 32k x 32k matrices. Moreover, the algorithm's performance scales linearly with the matrix bandwidth, enabling efficient reduction of matrices with larger bandwidths, previously considered impractical.2025-10-14T16:39:29Z14 pages, 7 figures, 3 tablesEvelyne RingootRabab AlomairyAlan Edelmanhttp://arxiv.org/abs/2601.16510v3Learning to Optimize by Differentiable Programming2026-06-07T04:00:33ZSolving massive-scale optimization problems requires scalable first-order methods with low per-iteration cost. This tutorial highlights a shift in optimization: using differentiable programming not only to execute algorithms but to learn how to design them. Modern frameworks such as PyTorch, TensorFlow, and JAX enable this paradigm through efficient automatic differentiation. Embedding first-order methods within these systems allows end-to-end training that improves convergence and solution quality. Guided by Fenchel-Rockafellar duality, the tutorial demonstrates how duality-informed iterative schemes such as ADMM and PDHG can be learned and adapted. Case studies across LP, NNV, Sum-Rate maximization, OPF, and LRMP illustrate these gains.2026-01-23T07:18:07ZLiping TaoXindi TongChee Wei Tanhttp://arxiv.org/abs/2606.08366v1MetaboliSim: a Python implementation of the Mader model for dynamic and steady-state simulation of muscular energy metabolism2026-06-06T22:59:09ZThe Mader model is the most widely used mathematical framework for muscular energy metabolism in German-language sport science, underpinning lactate diagnostics, maximal lactate steady state (MLSS) estimation and training prescription. Despite decades of use, neither its dynamic ODE formulation nor its steady-state equations have been available as open code, leaving results based on the model impossible to reproduce independently. We close this gap with MetaboliSim, an open-source Python implementation of both formulations: a dynamic model that integrates the five-variable ODE system (phosphate potential, $\dot{V}\mathrm{O}_2$, muscle and blood lactate, and glycogen) with a fourth-order Runge-Kutta scheme, and a steady-state model that computes MLSS power and the lactate-power relationship in one- and two-compartment variants. We verified implementation correctness against published reference values and assessed physiological plausibility across constant-load, step-test, sprint and running protocols. The implementation reproduces the published reference output within stated tolerances and remains numerically stable throughout (halving the time step changes blood lactate by less than 0.01 mmol/L), with both formulations yielding congruent MLSS estimates. Key physiological behaviour ($\dot{V}\mathrm{O}_2$ on-kinetics, lactate accumulation, PCr dynamics and the sub/supra-MLSS separation) emerges directly from the model equations without protocol-specific tuning, and a sensitivity analysis shows MLSS power varying approximately linearly with $\dot{V}\mathrm{O}_{2\max}$ and nonlinearly with $\dot{V}\mathrm{La}_{\max}$. As the first openly available implementation of the complete Mader model (AGPL-3.0), MetaboliSim lets independent groups reproduce, verify and build on published model-based results. Source code: https://codeberg.org/3phos/metabolisim; Platform: https://metabolisim.org2026-06-06T22:59:09ZKatharina DunstVincent ScharfClemens HesseAlexander Asterothhttp://arxiv.org/abs/2606.08339v1Floating-point autotuning with customized precisions2026-06-06T21:10:12ZReduced-precision arithmetic offers significant opportunities to improve performance, memory usage, and energy efficiency in numerical applications, provided that numerical accuracy is preserved. This work investigates automated precision tuning through customized floating-point formats with user-defined exponent and significand sizes, enabling the emulation of emerging low-precision formats and the exploration of non-standard precision configurations within a unified mixed-precision framework. The proposed methodology, implemented in the PROMISE precision autotuning tool, combines numerical validation with a systematic search to generate program variants that satisfy user-defined accuracy requirements. To address the computational cost of this exploration, a containerized benchmarking framework supports parallel execution across multiple algorithms and parameter configurations. The approach is evaluated on a suite of numerical programs, including linear solvers and applications from the Rodinia benchmark. Results show that a substantial proportion of variables can be safely reduced to lower precision while preserving accuracy, indicating that standard double precision is often over-provisioned. These findings highlight the potential of automated precision tuning to derive efficient mixed-precision configurations tailored to application-specific accuracy requirements.2026-06-06T21:10:12ZXinye ChenThibault HilaireFabienne Jézéquelhttp://arxiv.org/abs/2606.07062v1CATEKAPPA: An R Shiny Application for Design and Analysis of Consistency Tests Based on the Kappa Statistic for Categorical Responses2026-06-05T09:02:13ZThe kappa statistic is the most widely used measure of inter-rater agreement for categorical data. Despite its popularity, applied researchers often encounter two major hurdles: (i) determining the sample size required to achieve a desired level of agreement with given power, and (ii) computing appropriate kappa coefficients with proper interpretation. Existing R packages such as irr and kappaSize provide these functionalities but require programming skills and lack an integrated, user-friendly interface. We present CATEKAPPA, an R package that bridges this gap by combining sample size planning (via kappaSize) and agreement analysis (via irr) into a single Shiny-based web application. The package supports Cohen's kappa for two raters, Fleiss' kappa for three or more raters, and Light's kappa, and provides automatic interpretation using the Landis & Koch scale. Users can either launch an interactive graphical interface or use command-line functions for scripting. The package is freely available on CRAN.2026-06-05T09:02:13Z10 pages, 4 figures; This open-source R package CATEKAPPA is available on CRAN at https://CRAN.R-project.org/package=catekappa, source code repository is hosted at https://github.com/satellite837/catekappa. Manuscript planned for submission to Journal of Statistical Software (JSS). Supplementary R package source code uploaded as ancillary fileZheng GaiLi XinchengJiang WangyingjieZhao Panweihttp://arxiv.org/abs/2606.06386v1On GPU Implementation for Multi-Precision Integer Division2026-06-04T16:51:22ZThis paper presents the issues arising in implementing a fast integer division algorithm on general purpose GPUs. The algorithm uses a Newton iteration based on the shifted inverse operation, keeping all arithmetic in the integer domain and relying on data-parallel operators. The principal contribution is an efficient GPU/CUDA implementation for integer precisions from $2^{15}$ to $2^{18}$ -- sizes not supported by \cgbn{} division. We propose algorithmic refinements, define a cost model in terms of multiplications, build on prefix sums and previous work on multi-precision multiplication, and present an evaluation showing near-optimal performance relative to the model for the target precision.2026-06-04T16:51:22ZMartin B. MarchioroAske N. RaahaugeMarc I. LøvenskjoldCosmin E. OanceaStephen M. Watthttp://arxiv.org/abs/2606.06310v1RedZeD: Computing persistent homology by Reduction to Zero Differentials2026-06-04T15:51:27ZWe introduce a new algorithm for computing persistent homology of Vietoris--Rips filtrations, which in many cases offers a considerable speedup over the existing implementation of the persistence pairing algorithm. The key innovation, called active enumeration, is made possible by a new theoretical framework of Reduction to Zero Differentials (hence RedZeD) in which to view persistent homology.2026-06-04T15:51:27Z30 pages; comments welcomeChris KapulkinNathan Kershawhttp://arxiv.org/abs/2606.05466v1Look Before You Leap: Checking in on Type Tag Checking2026-06-03T21:44:02ZTagging of generic dynamic values is important in symbolic-computation and dynamic-language systems, but the trade-offs change as machine architectures and workloads evolve. In particular, old folklore about boxed values, immediate values, and type tags must be recalibrated from time to time. We revisit the performance of badged object headers, low-bit tagging, and two NaN-boxing layouts on a range of platforms in use today, including AArch64 and x86-64 architectures from different manufacturers. The experiments isolate two distinct effects: the cost avoided by not heap-allocating common scalar values, and the cost avoided by obtaining tag information from the value word rather than by performing a heap read. The results show that several local bit operations are often cheaper than opening a heap object to obtain a tag or small value. Low-bit tagging remains the simplest and usually fastest choice for mostly symbolic workloads, while NaN-boxing is close in access cost and avoids the time and space of heap allocation for ordinary floating-point values.2026-06-03T21:44:02ZStephen M. Watthttp://arxiv.org/abs/2606.05017v1GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF256 with a Lucas-Exact Integer Identity2026-06-03T15:41:16ZWe present a hardware-oriented description of GoldenFloat (GF), a static-split floating-point family generated by a single closed rule, and three concrete artefacts: (i) an open multi-width RTL generator covering GF4-GF256 with a continuous-integration differential sweep against a correctly-rounded reference; (ii) an integer-backed Lucas-exact accumulator path verified at 500-digit precision for n = 1, ..., 256; and (iii) a GF16 FPGA codec passing a 35-of-35 testbench at 323 MHz on Artix-7 (Xilinx XC7A35T). For each total width N >= 4, the exponent width is e = round((N-1)/phi^2) with fraction f = N-1-e and phi = (1+sqrt(5))/2. The rule reproduces the realised exponent widths of nine formats (9/9) and extends consistently to GF128, GF512, GF1024. The rule is positioned alongside posit, takum, OCP-MX, and the IEEE P3109 multi-width float draft. We make no per-rung accuracy or superiority claim against any of them. The breadth/toolchain-coherence framing is recorded as an open conjecture with a pre-registered falsification path. A falsification ledger (FL-002) records open questions and the experiments that would settle them. An RTL-correctness erratum dated 2026-05-31 is reported; the fabricated TTSKY26b dies carry the defective multiplier portfolio, and the corrected generator is the regeneration baseline.2026-06-03T15:41:16Z19 pages, single-file LaTeX, ASCII source. RTL generator and CI artefacts at github.com/gHashTag/goldenfloat-preprintDmitrii Vasilievhttp://arxiv.org/abs/2606.04670v1Fitting scattered data with optional monotonicity constraints on GPU: LipFit package2026-06-03T09:51:56ZThis paper presents a method of multivariate scattered data interpolation and approximation that produces optimal Lipschitz-continuous approximation, subject to the desired monotonicity constraints. This method relies on tight upper and lower approximations to the data, and is similar in its spirit to the nearest-neighbour approximation but does not suffer from discontinuities. Local Lipschitz interpolation and Lipschitz smoothing are also presented. This approach falls under the umbrella of instance-based approximation with no training phase, and it is suitable for GPU-based parallelisation. A Python GPU-friendly package LipFit which implements the methods discussed is discussed.2026-06-03T09:51:56ZGleb Beliakov