https://arxiv.org/api/Hx+fxzYJi2W25Sp4YhCMW4HdYuE
2026-03-24T09:45:44Z

http://arxiv.org/abs/2603.09555v1
Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference
2026-03-10T12:03:00Z
State-space model releases are typically coupled to fused CUDA and Triton kernels, inheriting a hard dependency on NVIDIA hardware. We show that Mamba-2's state space duality algorithm -- diagonal state structure, chunkable recurrence, and einsum-dominated compute with static control flow -- maps cleanly onto what XLA's fusion and tiling passes actually optimise, making custom kernels optional rather than required. We implement the full inference path (prefill, cached autoregressive decoding) as shaped standard primitives under XLA, without hand-written kernels, and realise the architecture's theoretical $O(1)$ state management as a compiled on-device cache requiring no host synchronisation during generation. The implementation runs unmodified on CPU, NVIDIA GPU, and Google Cloud TPU from a single JAX source. On TPU v6e across five model scales (130M--2.7B parameters), XLA-generated code reaches approximately 140 TFLOPS on single-stream prefill ($15\%$ MFU) and up to $64\%$ bandwidth utilisation on decode. Greedy decoding matches the PyTorch/CUDA reference token-for-token across 64 steps, with hidden-state agreement within float32 rounding tolerance. The pattern transfers to any SSM recurrence satisfying the same structural conditions, on any platform with a mature XLA backend. The implementation is publicly available at https://github.com/CosmoNaught/mamba2-jax and merged into the Bonsai JAX model library.
2026-03-10T12:03:00Z
18 pages, 6 figures.
Code available at: https://github.com/CosmoNaught/mamba2-jax
Cosmo Santoni

http://arxiv.org/abs/2603.09333v1
Dynamic Precision Math Engine for Linear Algebra and Trigonometry Acceleration on Xtensa LX6 Microcontrollers
2026-03-10T08:10:21Z
Low-cost embedded processors such as the ESP32 (Xtensa LX6, 32-bit dual-core, 240 MHz) are increasingly used in edge computing applications that require real-time physical simulation, sensor fusion, and control systems. Although the ESP32 integrates a single-precision IEEE 754 floating-point unit, floating-point operations introduce pipeline overhead and higher energy consumption compared to integer arithmetic, limiting throughput for floating-point intensive workloads. This paper presents the design, formal specification, and empirical evaluation of a Dynamic Precision Math Engine for the ESP32. The system integrates three main components: a Q16.16 fixed-point arithmetic core that maps mathematical operations onto the integer pipeline of the Xtensa LX6, a 16-iteration CORDIC trigonometric module that computes sine and cosine using only additions and bit shifts, and a cache-aware tiled matrix multiplication kernel with deferred correction to reduce rounding operations. The architecture introduces a runtime precision switching mechanism implemented through function pointer dispatch and a synchronization protocol compatible with FreeRTOS. This mechanism allows applications to dynamically transition between a fast fixed-point execution path and a precise IEEE 754 floating-point path without recompilation. Experimental evaluation on ESP32-WROOM-32 hardware using 300 paired measurements shows that the CORDIC trigonometric module achieves median latencies of 293 cycles for both sine and cosine, corresponding to mean speedups of 18.5x and 24.7x compared to the standard math library.
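To make the CORDIC description above concrete, here is a minimal Python sketch of a 16-iteration rotation-mode CORDIC operating on Q16.16 integers. It is not the paper's implementation: the names `to_q16`, `from_q16`, `ATAN_TABLE`, `GAIN` and `cordic_sin_cos` are hypothetical, the input range is assumed to be limited to [-pi/2, pi/2], and floating point is used only to build the lookup table, which on a microcontroller would presumably be stored as precomputed constants. The inner loop itself uses only integer additions, subtractions and shifts.

```python
import math

FRAC_BITS = 16            # Q16.16: 16 integer bits, 16 fractional bits
ONE = 1 << FRAC_BITS
ITERATIONS = 16           # matches the 16-iteration module described above


def to_q16(x: float) -> int:
    """Convert a float to a Q16.16 integer (illustrative helper)."""
    return int(round(x * ONE))


def from_q16(q: int) -> float:
    """Convert a Q16.16 integer back to a float (illustrative helper)."""
    return q / ONE


# atan(2^-i) for each iteration, stored in Q16.16.
ATAN_TABLE = [to_q16(math.atan(2.0 ** -i)) for i in range(ITERATIONS)]

# Each micro-rotation stretches the vector; starting from x = K cancels that gain.
GAIN = to_q16(math.prod(1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
                        for i in range(ITERATIONS)))


def cordic_sin_cos(angle_q16: int) -> tuple:
    """Return (sin, cos) in Q16.16 for an angle in [-pi/2, pi/2] radians."""
    x, y, z = GAIN, 0, angle_q16
    for i in range(ITERATIONS):
        # Rotate towards z == 0 using only shifts and adds.
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y, x


if __name__ == "__main__":
    s, c = cordic_sin_cos(to_q16(0.5))
    print(from_q16(s), math.sin(0.5))
    print(from_q16(c), math.cos(0.5))
```

With 16 iterations and 16 fractional bits, the result should agree with `math.sin` and `math.cos` to roughly three decimal places over the supported range; angles outside it would need a quadrant reduction step that this sketch omits.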
The results demonstrate that precision-aware software architecture can significantly improve numerical performance on low-cost microcontrollers.
2026-03-10T08:10:21Z
22 pages, 2 figures, experimental evaluation on ESP32-WROOM-32 hardware
Elian Alfonso Lopez Preciado

http://arxiv.org/abs/2603.09038v1
Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores
2026-03-10T00:12:47Z
Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2$\times$ performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system.
The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.
2026-03-10T00:12:47Z
Jiqun Tu, Ian Karlin, John Camier, Veselin Dobrev, Tzanio Kolev, Stefan Henneking, Omar Ghattas

http://arxiv.org/abs/2603.08960v1
The $qs$ Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference
2026-03-09T21:48:04Z
Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, massive resident expert pools reduce high-bandwidth memory (HBM) headroom for the KV cache. This phenomenon, formalized as reuse fragmentation, pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths.
We introduce the $qs$ inequality, a predictive criterion that identifies when MoE is structurally disadvantaged relative to a quality-matched dense model. This criterion unifies sparsity ($s$), the fraction of parameters activated per token, and the quality-equivalence factor ($q$), the size multiplier required for a dense model to match MoE performance. Our evaluation across frontier models including DeepSeek-V3, Qwen3-235B, Grok-1, and Switch-C demonstrates that this fragmentation is a general architectural phenomenon. For DeepSeek-V3 at 128k context, this results in a 4.5x throughput advantage for a quality-matched dense baseline. Crucially, massive architectures like Switch-C can become infeasible on cluster sizes where a quality-matched dense model remains viable.
Our results suggest that training-time FLOP efficiency is an incomplete proxy for inference-time performance in long-context serving. They also indicate that MoE may be best viewed as a training-time optimization, with distillation into dense models as a possible path toward inference-efficient deployment.
2026-03-09T21:48:04Z
10 pages, 6 tables
Vignesh Adhinarayanan, Nuwan Jayasena

http://arxiv.org/abs/2603.08929v1
bsort: A theoretically efficient non-comparison-based sorting algorithm for integer and floating-point numbers
2026-03-09T20:58:03Z
This paper presents bsort, a non-comparison-based sorting algorithm for signed and unsigned integers, and floating-point values. The algorithm unifies these cases through an approach derived from binary quicksort, achieving $O(wn)$ runtime asymptotic behavior and $O(w)$ auxiliary space, where $w$ is the element word size. This algorithm is highly efficient for data types with small word sizes, where empirical analysis exhibits performance competitive with highly optimized hybrid algorithms from popular libraries.
2026-03-09T20:58:03Z
9 pages, 9 figures, for sources go to https://benjaminguzman.dev
Benjamín Guzmán

http://arxiv.org/abs/2603.08026v1
DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
2026-03-09T07:02:01Z
Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens.
DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
2026-03-09T07:02:01Z
18 pages, 10 figures
Younjoo Lee, Junghoo Lee, Seungkyun Dan, Jaiyoung Park, Jung Ho Ahn

http://arxiv.org/abs/2603.07850v1
A Lock-Free, Fully GPU-Resident Architecture for the Verification of Goldbach's Conjecture
2026-03-08T23:58:47Z
We present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double-sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency. In this work, we migrate the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path. To fully leverage heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, enabling $99.7$% parallel efficiency at 2 GPUs and $98.6$% at $4$ GPUs. We further implement strict mathematical overflow guards guaranteeing the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$. On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at N = $10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in $36.5$ seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in $133.5$ seconds on a four-GPU system.
All code is open-source and reproducible on commodity hardware.
2026-03-08T23:58:47Z
14 pages, 4 figures, 3 tables. The presented work details a major architectural overhaul: migration of the segmented sieve to GPU L1 shared memory and the implementation of a lock-free multi-GPU work pool. Source code available at: https://github.com/isaac-6/goldbach-gpu
Isaac Llorente-Saguer

http://arxiv.org/abs/2505.23819v5
Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using $\mathbb{F}_2$
2026-03-06T16:48:24Z
Efficient tensor computation is a cornerstone of modern deep learning (DL) workloads, yet existing approaches struggle to achieve flexible and performant design and implementation of tensor layouts -- mappings between logical tensors and hardware resources. The increasing complexity of DL algorithms and hardware demands a generic and systematic approach to handling tensor layouts. In this work, we introduce Linear Layouts, a novel approach that models tensor layouts using linear algebra over $\mathbb{F}_2$. By representing tensor layouts as binary matrices acting on the bits of the hardware representation, our approach enables a generic layout definition -- as opposed to the classical case-by-case approach -- and allows for generic layout-to-layout conversions, eliminating the quadratic explosion that plagues existing solutions. We integrate linear layouts with Triton and demonstrate their effectiveness in optimizing individual Triton operators as well as kernels written in Triton.
We also show that linear layouts reduce engineering effort in the compiler backend while fixing several bugs in Triton's legacy layout system.
2025-05-28T00:45:50Z
Keren Zhou, Mario Lezcano, Adam Goucher, Akhmed Rakhmati, Jeff Niu, Justin Lebar, Pawel Szczerbuk, Peter Bell, Phil Tillet, Thomas Raoux, Zahi Moudallal

http://arxiv.org/abs/2512.15028v5
Reexamining Paradigms of End-to-End Data Movement
2026-03-06T05:55:17Z
The pursuit of high-performance data transfer often focuses on raw network bandwidth, where international links of 100 Gbps or higher are frequently considered the primary enabler. While necessary, this network-centric view is incomplete. It equates provisioned link speeds with practical, sustainable data movement capabilities. It is a common observation that lower-than-desired data rates manifest even on 10 Gbps links and commodity hardware, with higher-speed networks only amplifying their visibility. We investigate six paradigms -- from network latency and TCP congestion control to host-side factors such as CPU performance and virtualization -- that critically impact data movement workflows. These paradigms represent widely accepted engineering assumptions that inform system design, procurement decisions, and operational practices in production data movement environments. We introduce the Drainage Basin Pattern conceptual model for reasoning about end-to-end data flow constraints across heterogeneous hardware and software components at varying desired data rates to address the fidelity gap between raw bandwidth and application-level throughput. Our findings are validated through rigorous production-scale deployments, from 10 Gbps links to U.S. DOE ESnet technical evaluations and transcontinental production trials over 100 Gbps operational links.
The results demonstrate that principal bottlenecks often reside outside the network core, and that a holistic hardware-software co-design enables consistent, predictable performance for moving data at scale and speed.
2025-12-17T02:38:06Z
27 pages and 13 figures
Chin Fang, Timothy Stitt, Michael J. McManus, Toshio Moriya

http://arxiv.org/abs/2603.05692v1
Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks
2026-03-05T21:33:24Z
Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly expanding ecosystem, dense LLMs--those that activate all model parameters for each token generation--form the foundation for advanced expert-based variants. Dense models continue to dominate because of their strong generalization ability, scalability, ease of fine-tuning, and versatility across diverse tasks. In LLM inference systems, performance is mainly characterized by latency, response time, and throughput (i.e., tokens generated per unit of time). Latency and throughput are inherently coupled: optimizing for one often comes at the expense of the other. Moreover, batching strategies and parallelism configurations, which are essential when dense model parameters exceed device memory capacity, can significantly affect both latency and overall system throughput. This paper (i) investigates the workloads of two representative dense LLMs--Llama-3.1-70B and Llama-3.1-405B, focusing in particular on intra-node parallelization schemes, (ii) analyzes how input characteristics, batching, and parallelism strategies influence latency flexibility and the latency-throughput tradeoff, and (iii) identifies key performance bottlenecks that inform design choices for meeting service-level agreements (SLAs) and sustaining inference quality.
Our empirical evaluations reveal that Tensor Parallelism (TP) improves the latency objectives while Pipeline Parallelism (PP) is better suited for throughput-oriented applications. We highlight that their hybrid usage, by controlling the TP and PP degrees, provides control over the latency-throughput interplay.
2026-03-05T21:33:24Z
17 pages, 8 figures, 3 tables
Burak Topcu, Musa Oguzhan Cim, Poovaiah Palangappa, Meena Arunachalam, Mahmut Taylan Kandemir

http://arxiv.org/abs/2309.09359v2
Concurrent Deterministic Skiplist and Other Data Structures
2026-03-05T18:38:35Z
Skiplists are used in a variety of applications for storing data subject to order criteria. In this article we discuss the design, analysis and performance of a concurrent deterministic skiplist on many-core NUMA nodes. We also evaluate the performance of a concurrent lock-free unbounded queue implementation and two concurrent multi-reader, multi-writer (MWMR) hash table implementations and compare them with those from Intel's Threading Building Blocks (TBB) library. We introduce strategies for memory management that reduce page faults and cache misses for the memory access patterns in these data structures. This paper proposes hierarchical usage of concurrent data structures in programs to improve memory latencies by reducing memory accesses from remote NUMA nodes.
2023-09-17T19:50:26Z
Aparna Sasidharan

http://arxiv.org/abs/2603.04937v1
FluxSieve: Unifying Streaming and Analytical Data Planes for Scalable Cloud Observability
2026-03-05T08:36:59Z
Despite many advances in query optimization, indexing techniques, and data storage, modern data platforms still face difficulties in delivering robust query performance under high concurrency and computationally intensive queries. This challenge is particularly pronounced in large-scale observability platforms handling high-volume, high-velocity data records.
For instance, recurrent, expensive filtering queries at query time impose substantial computational and storage overheads in the analytical data plane. In this paper, we propose FluxSieve, a unified architecture that reconciles traditional pull-based query processing with push-based stream processing by embedding a lightweight in-stream precomputation and filtering layer directly into the data ingestion path. This avoids the complexity and operational burden of running queries in dedicated stream processing frameworks. Concretely, this work (i) introduces a foundational architecture that unifies streaming and analytical data planes via in-stream filtering and record enrichment, (ii) designs a scalable multi-pattern matching mechanism that supports concurrent evaluation and on-the-fly updates of filtering rules with minimal per-record overhead, (iii) demonstrates how to integrate this ingestion-time processing with two open-source analytical systems -- Apache Pinot as a Real-Time Online Analytical Processing (RTOLAP) engine and DuckDB as an embedded analytical database, and (iv) performs a comprehensive experimental evaluation of our approach. Our evaluation across different systems, query types, and performance metrics shows up to orders-of-magnitude improvements in query performance at the cost of negligible additional storage and very low computational overhead.
2026-03-05T08:36:59Z
Adriano Vogel, Sören Henning, Otmar Ertl

http://arxiv.org/abs/2603.04860v1
Rethinking Temporal Models for TinyML: LSTM versus 1D-CNN in Resource-Constrained Devices
2026-03-05T06:34:21Z
Time series classification underpins applications such as human activity recognition, healthcare monitoring, and gesture detection in the IoT domain. Tiny Machine Learning enables models to run directly on low-power microcontroller units, improving efficiency, ensuring privacy, and reducing cost by avoiding reliance on cloud or edge computing.
While Long Short-Term Memory networks are widely used for capturing temporal dependencies, their high computational and memory demands make real-time MCU deployment impractical. In this work, we conduct a hardware-aware feasibility study of LSTM versus 1D Convolutional Neural Networks across five benchmark datasets. Results show that 1D-CNN consistently achieves comparable or higher accuracy (around 95%) than LSTM (around 89%), while requiring 35% less RAM and approximately 25% less Flash, and enabling real-time inference (27.6 ms vs. 2038 ms). Being so lightweight, 1D-CNN is particularly suitable for on-device processing in wearables and other low-power, battery-operated systems, establishing it as a practical and resource-efficient choice for TinyML deployment.
2026-03-05T06:34:21Z
Bidyut Saha, Riya Samanta

http://arxiv.org/abs/2603.04782v1
Unlocking Python's Cores: Hardware Usage and Energy Implications of Removing the GIL
2026-03-05T04:01:30Z
Python's Global Interpreter Lock prevents execution on more than one CPU core at the same time, even when multiple threads are used. However, starting with Python 3.13 an experimental build allows disabling the GIL. While prior work has examined speedup implications of this disabling, the effects on energy consumption and hardware utilization have received less attention. This study measures execution time, CPU utilization, memory usage, and energy consumption using four workload categories: NumPy-based, sequential kernels, threaded numerical workloads, and threaded object workloads, comparing GIL and free-threaded builds of Python 3.14.2.
The results highlight a trade-off. For parallelizable workloads operating on independent data, the free-threaded build reduces execution time by up to 4 times, with a proportional reduction in energy consumption, and effective multi-core utilization, at the cost of an increase in memory usage. In contrast, sequential workloads do not benefit from removing the GIL and instead show a 13-43% increase in energy consumption. Similarly, workloads where threads frequently access and modify the same objects show reduced improvements or even degradation due to lock contention. Across all workloads, energy consumption is proportional to execution time, indicating that disabling the GIL does not significantly affect power consumption, even when CPU utilization increases. When it comes to memory, the no-GIL build shows a general increase, more visible in virtual memory than in physical memory. This increase is primarily attributed to per-object locking, additional thread-safety mechanisms in the runtime, and the adoption of a new memory allocator.
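As a rough illustration of the kind of check this abstract recommends, the following Python sketch reports whether the running interpreter has the GIL enabled and times a CPU-bound workload split across threads that operate on independent ranges. It is not the study's benchmark: `gil_enabled`, `count_primes` and `timed_run` are hypothetical helpers, the prime-counting workload is an assumed stand-in for a "threaded numerical workload", and it measures wall-clock time only, not energy or memory. `sys._is_gil_enabled()` is available from Python 3.13, so the sketch falls back to assuming the GIL is present on older interpreters.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def gil_enabled() -> bool:
    """Report whether the GIL is active in the running interpreter."""
    # sys._is_gil_enabled() exists from Python 3.13; older versions always have the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()


def count_primes(lo: int, hi: int) -> int:
    """CPU-bound pure-Python work on an independent range (no shared objects)."""
    total = 0
    for n in range(max(lo, 2), hi):
        d = 2
        is_prime = True
        while d * d <= n:
            if n % d == 0:
                is_prime = False
                break
            d += 1
        total += is_prime
    return total


def timed_run(workers: int, limit: int = 200_000):
    """Split [0, limit) into one range per worker; return (prime count, seconds)."""
    step = limit // workers
    ranges = [(i * step, limit if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(lambda r: count_primes(*r), ranges))
    return sum(counts), time.perf_counter() - start


if __name__ == "__main__":
    print("GIL enabled:", gil_enabled())
    for workers in (1, 4):
        total, seconds = timed_run(workers)
        print(f"{workers} thread(s): {total} primes in {seconds:.2f}s")
```

Running the same script under a standard build and a free-threaded build gives a first indication of whether a workload of this shape scales with threads; a workload whose threads contend on shared objects would need a different test.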
These findings suggest that Python's no-GIL build is not a universal improvement. Developers should evaluate whether their workload can effectively benefit from parallel execution before adoption.
2026-03-05T04:01:30Z
José Daniel Montoya Salazar

http://arxiv.org/abs/2603.04092v1
Characterizing Machine Learning Force Fields as Emerging Molecular Dynamics Workloads on Graphics Processing Units
2026-03-04T14:02:56Z
Molecular dynamics (MD) simulates the time evolution of atomic systems governed by interatomic forces, and the fidelity of these simulations depends critically on the underlying force model. Classical force fields (CFFs) rely on fixed functional forms fitted to experimental or theoretical data, offering computational efficiency and broad applicability but limited accuracy in chemically diverse or reactive environments. In contrast, machine learning force fields (MLFFs) deliver near-quantum-chemical accuracy at molecular-mechanics cost by learning interatomic interactions directly from high-level electronic structure data. While MLFFs offer improved accuracy at a fraction of the cost of quantum methods, they introduce significant computational overhead, particularly in descriptor evaluation and neural network inference. These operations pose challenges for parallel hardware due to irregular memory access, minimal data reuse and inefficient kernel execution. This work investigates the hardware performance of such models using polyalanine chains, a novel benchmark molecular system with controllable input size, as performance-evaluation test cases that highlight the computational bottlenecks of graphics processing units when scaling out MLFF simulations.
The analysis identifies key bottlenecks in descriptor and force computation and in memory handling, highlighting opportunities for improvement in the emerging area of MLFF-based MD in drug discovery, which has received limited attention from a computer architecture perspective.
2026-03-04T14:02:56Z
Accepted to IEEE ISPASS - 2026
Udari De Alwis, Benjamin E. Mayer, Tom J. Ashby, Maria Barrera, Timon Evenblij, Joyjit Kundu