https://arxiv.org/api/zwQdaH0Z71HHVsrqyO/owzAVZhw
2026-04-03T08:31:04Z

http://arxiv.org/abs/2511.22551v1
3RSeT: Read Disturbance Rate Reduction in STT-MRAM Caches by Selective Tag Comparison
2025-11-27T15:39:45Z
Recent development in memory technologies has introduced Spin-Transfer Torque Magnetic RAM (STT-MRAM) as the most promising replacement for SRAMs in on-chip cache memories. Besides its lower leakage power, higher density, immunity to radiation-induced particles, and non-volatility, an unintentional bit flip during read operation, referred to as read disturbance error, is a severe reliability challenge in STT-MRAM caches. One major source of read disturbance error in STT-MRAM caches is simultaneous accesses to all tags for parallel comparison operation in a cache set, which has not been addressed in previous work. This paper first demonstrates that high read accesses to tag array extremely increase the read disturbance rate and then proposes a low-cost scheme, so-called Read Disturbance Rate Reduction in STT-MRAM Caches by Selective Tag Comparison (3RSeT), to reduce the error rate by eliminating a significant portion of tag reads. 3RSeT proactively disables the tags that have no chance for hit, using low significant bits of the tags on each access request. Our evaluations using gem5 full-system cycle-accurate simulator show that 3RSeT reduces the read disturbance rate in the tag array by 71.8%, which results in 3.6x improvement in Mean Time To Failure (MTTF). In addition, the energy consumption is reduced by 62.1% without compromising performance and with less than 0.4% area overhead.
2025-11-27T15:39:45Z
Elham Cheshmikhani, Hamed Farbeh, Hossein Asad

http://arxiv.org/abs/2511.22467v1
Motion-to-Motion Latency Measurement Framework for Connected and Autonomous Vehicle Teleoperation
2025-11-27T13:58:27Z
Latency is a key performance factor for the teleoperation of Connected and Autonomous Vehicles (CAVs).
It affects how quickly an operator can perceive changes in the driving environment and apply corrective actions. Most existing work focuses on Glass-to-Glass (G2G) latency, which captures delays only in the video pipeline. However, there is no standard method for measuring Motion-to-Motion (M2M) latency, defined as the delay between the physical steering movement of the remote operator and the corresponding steering motion in the vehicle. This paper presents an M2M latency measurement framework that uses Hall-effect sensors and two synchronized Raspberry Pi 5 devices. The system records interrupt-based timestamps on both sides to estimate M2M latency, independently of the underlying teleoperation architecture. Precision tests show an accuracy of 10-15 ms, while field results indicate that actuator delays dominate M2M latency, with median values above 750 ms.
2025-11-27T13:58:27Z
François Provost, Faisal Hawlader, Mehdi Testouri, Raphaël Frank

http://arxiv.org/abs/2506.17084v2
JANUS: Resilient and Adaptive Data Transmission for Enabling Timely and Efficient Cross-Facility Scientific Workflows
2025-11-26T19:13:42Z
In modern science, the growing complexity of large-scale scientific projects has led to an increasing reliance on cross-facility scientific workflows, where resources and expertise from multiple institutions and geographic locations are leveraged to accelerate scientific discovery. These workflows often require transmitting huge amounts of scientific data through wide-area networks. Although high-speed networks like ESnet and transfer services such as Globus have improved data mobility, several challenges remain. The sheer volume of data can overwhelm network bandwidth, widely used transport protocols such as TCP suffer from inefficiencies due to retransmissions triggered by packet loss, and existing fault-tolerance mechanisms like erasure coding introduce substantial overhead.
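The erasure-coding overhead mentioned in the JANUS abstract above can be made concrete with a minimal sketch. It uses a single XOR parity block, the simplest possible erasure code, chosen here purely for illustration. It is not the code or the parameters used by JANUS, and the function names (`encode`, `recover`) and block contents are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(blocks):
    """Append one XOR parity block to k equal-length data blocks."""
    parity = reduce(xor_bytes, blocks)
    return list(blocks) + [parity]


def recover(received):
    """Rebuild at most one missing block (marked None), return the data blocks."""
    missing = [i for i, block in enumerate(received) if block is None]
    if len(missing) > 1:
        raise ValueError("a single-parity code can repair only one loss")
    received = list(received)
    if missing:
        present = [block for block in received if block is not None]
        # XOR of all surviving blocks equals the lost block.
        received[missing[0]] = reduce(xor_bytes, present)
    return received[:-1]  # drop the parity block


data = [b"abcd", b"efgh", b"ijkl"]   # illustrative payloads (assumption)
coded = encode(data)
coded[1] = None                      # simulate one lost datagram
restored = recover(coded)
overhead = 1 / len(data)             # one extra block per k data blocks
```

With k = 3 data blocks the sender transmits one third more data and can repair a single loss without a retransmission. Tolerating more losses requires more parity blocks, which is the overhead trade-off the abstract refers to.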
In this paper, we propose JANUS, a resilient and adaptable data transmission approach designed for cross-facility scientific workflows. Unlike traditional TCP-based methods, JANUS leverages UDP, integrates erasure coding for fault tolerance, and combines it with error-bounded lossy compression to reduce overhead. This novel design allows users to balance data transmission time and accuracy, optimizing transfer performance based on specific scientific requirements. Additionally, JANUS dynamically adjusts erasure coding parameters in response to real-time network conditions, ensuring efficient data transfers even in fluctuating environments. We develop optimization models for determining ideal configurations and implement adaptive data transfer protocols to enhance reliability. Through extensive simulations and real-network experiments, we demonstrate that JANUS significantly improves transfer efficiency while maintaining data fidelity.
2025-06-20T15:40:14Z
Vladislav Esaulov, Jieyang Chen, Norbert Podhorszki, Fred Suter, Scott Klasky, Anu G Bourgeois, Lipeng Wan

http://arxiv.org/abs/2511.21535v1
Modeling the Effect of Data Redundancy on Speedup in MLFMA Near-Field Computation
2025-11-26T16:01:32Z
The near-field (P2P) operator in the Multilevel Fast Multipole Algorithm (MLFMA) is a performance bottleneck on GPUs due to poor memory locality. This work introduces data redundancy to improve spatial locality by reducing memory access dispersion. For validation of results, we propose an analytical model based on a Locality metric that combines data volume and access dispersion to predict speedup trends without hardware-specific profiling. The approach is validated on two MLFMA-based applications: an electromagnetic solver (DBIM-MLFMA) with regular structure, and a stellar dynamics code (PhotoNs-2.0) with irregular particle distribution. Results show up to 7X kernel speedup due to improved cache behavior.
However, increased data volume raises overheads in data restructuring, limiting end-to-end application speedup to 1.04X. While the model cannot precisely predict absolute speedups, it reliably captures performance trends across different problem sizes and densities. The technique is injectable into existing implementations with minimal code changes. This work demonstrates that data redundancy can enhance GPU performance for the P2P operator, provided locality gains outweigh data movement costs.
2025-11-26T16:01:32Z
Morteza Sadeghi

http://arxiv.org/abs/2505.05623v3
Characterizing GPU Energy Usage in Exascale-Ready Portable Science Applications
2025-11-26T15:04:27Z
We characterize the GPU energy usage of two widely adopted exascale-ready applications representing two classes of particle and mesh solvers: (i) QMCPACK, a quantum Monte Carlo package, and (ii) AMReX-Castro, an adaptive mesh astrophysical code. We analyze power, temperature, utilization, and energy traces from double-/single (mixed)-precision benchmarks on NVIDIA's A100 and H100 and AMD's MI250X GPUs using queries in NVML and rocm_smi_lib, respectively. We explore application-specific metrics to provide insights on energy vs. performance trade-offs. Our results suggest that mixed-precision energy savings range between 6-25% on QMCPACK and 45% on AMReX-Castro. Also, we found gaps in the AMD tooling used on Frontier GPUs that need to be understood, while query resolutions on NVML have little variability between 1 ms-1 s. Overall, application level knowledge is crucial to define energy-cost/science-benefit opportunities for the codesign of future supercomputer architectures in the post-Moore era.
2025-05-08T20:02:45Z
13 pages, 8 figures, 3 tables. Accepted at the Energy Efficiency with Sustainable Performance: Techniques, Tools, and Best Practices, EESP Workshop, in conjunction with ISC High Performance 2025
In: Neuwirth, S., Paul, A.K., Weinzierl, T., Carson, E.C. (eds) High Performance Computing.
ISC High Performance 2025. Lecture Notes in Computer Science, vol 16091. Springer, Cham
William F. Godoy, Oscar Hernandez, Paul R. C. Kent, Maria Patrou, Kazi Asifuzzaman, Narasinga Rao Miniskar, Pedro Valero-Lara, Jeffrey S. Vetter, Matthew D. Sinclair, Jason Lowe-Power, Bobby R. Bruce
10.1007/978-3-032-07612-0_14

http://arxiv.org/abs/2511.21413v1
Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM
2025-11-26T14:06:22Z
Due to rising demands for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing dynamic AI application workloads. In this paper, we propose our solution that serves LLMs by integrating vLLM, Slurm and Kubernetes on the supercomputer RAMSES. The initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, incurring only an overhead of approximately 500 ms in terms of end-to-end latency.
2025-11-26T14:06:22Z
6 pages, 3 figures
Tim Trappen, Robert Keßler, Roland Pabel, Viktor Achter, Stefan Wesner
10.1145/3774902.3776632

http://arxiv.org/abs/2511.20834v1
Accelerating Sparse Convolutions in Voxel-Based Point Cloud Networks
2025-11-25T20:34:37Z
Sparse Convolution (SpC) powers 3D point cloud networks widely used in autonomous driving and AR/VR. SpC builds a kernel map that stores mappings between input voxel coordinates, output coordinates, and weight offsets, then uses this map to compute feature vectors for output coordinates.
Our work identifies three key properties of voxel coordinates: they are integer-valued, bounded within a limited spatial range, and geometrically continuous, meaning that neighboring voxels on the same object surface are highly likely to exist at small spatial offsets from each other. Prior SpC engines do not fully exploit these properties and suffer from high pre-processing and post-processing overheads during kernel map construction. To address this, we design Spira, the first voxel-property-aware SpC engine for GPUs. Spira proposes: (i) a high-performance one-shot search algorithm that builds the kernel map with no preprocessing and high memory locality, (ii) an effective packed-native processing scheme that accesses packed voxel coordinates at low cost, (iii) a flexible dual-dataflow execution mechanism that efficiently computes output feature vectors by adapting to layer characteristics, and (iv) a network-wide parallelization strategy that builds kernel maps for all SpC layers concurrently at network start. Our evaluation shows that Spira significantly outperforms prior SpC engines by 1.71x on average and up to 2.31x for end-to-end inference, and by 2.13x on average and up to 3.32x for layer-wise execution across diverse layer configurations.
2025-11-25T20:34:37Z
Dionysios Adamopoulos, Anastasia Poulopoulou, Georgios Goumas, Christina Giannoula

http://arxiv.org/abs/2511.20048v1
Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design
2025-11-25T08:15:17Z
LLM-based search agents achieve strong performance but suffer from severe latency, as each step requires serialized LLM reasoning followed by a tool-execution action. We revisit this bottleneck through the lens of speculation. While the traditional predict-verify speculation paradigm can break serial execution, its benefit remains limited, as it retains the full original workload and adds extra inference overhead.
We observe that early agent steps often involve simple evidence-gathering, where correct actions can often be predicted without full reasoning. Building on these observations, we present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency. Algorithmically, SPAgent introduces a two-phase adaptive speculation mechanism that selectively omits verification when safe. System-wise, a two-level scheduler regulates speculative requests based on engine load to ensure speculation remains beneficial. We implement SPAgent in real-world systems. Across extensive experimental settings, SPAgent achieves up to $1.65\times$ end-to-end speedup while maintaining the same or even achieving higher accuracy, enabling practical deployment of multi-step search agents.
2025-11-25T08:15:17Z
Zixiao Huang, Wen Zeng, Tianyu Fu, Tengxuan Liu, Yizhou Sun, Ke Hong, Xinhao Yang, Chengchun Liu, Yan Li, Quanlu Zhang, Guohao Dai, Zhenhua Zhu, Yu Wang

http://arxiv.org/abs/2507.16274v2
STAlloc: Enhancing Memory Efficiency in Large-Scale Model Training with Spatio-Temporal Planning
2025-11-25T07:36:10Z
The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Such fragmentation stems from the use of online GPU memory allocators in popular deep learning frameworks like PyTorch, which disregard tensor lifespans. As a result, this inefficiency can waste as much as 43% of memory and trigger out-of-memory errors, undermining the effectiveness of optimization methods. To address this, we introduce STAlloc, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads.
STAlloc introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch memory allocator, STAlloc reduces fragmentation ratio on average by 85.1% (up to 100%) across both dense and MoE models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves throughput performance by up to 32.5%.
2025-07-22T06:39:07Z
Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang
10.1145/3767295.3769335

http://arxiv.org/abs/2504.06443v2
cuTeSpMM: Accelerating Sparse-Dense Matrix Multiplication using GPU Tensor Cores
2025-11-24T04:13:17Z
Many recent GPUs feature matrix multiplication engines (aka Tensor Core Units or TCUs) that perform small fixed-size matrix-matrix products at very high throughput. They have been used very effectively to speed up dense matrix-matrix multiplication libraries like Nvidia's cuBLAS, enabling significantly higher performance over use of the traditional scalar GPU cores. There has also been recent interest in using these dense TCUs for the important sparse-dense matrix-matrix multiplication (SpMM) kernel via explicit zero-filling.
However, an examination of the attainable performance of TC-GNN, the state-of-the-art TCU-enhanced SpMM implementation, indicates that for a substantial majority of the sparse matrices in the SuiteSparse collection, the achieved performance falls significantly short of the state-of-the-art SpMM kernels that only utilize scalar cores.
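As a rough illustration of the explicit zero-filling mentioned in the abstract above, the sketch below counts how many fixed-size dense tiles a sparse pattern touches and what fraction of the stored tile entries are real non-zeros. The 16x8 tile shape and the function name `tile_fill_stats` are assumptions made for this example; the metric is not taken from the cuTeSpMM paper.

```python
def tile_fill_stats(nonzeros, tile_rows=16, tile_cols=8):
    """Return (num_tiles, fill_ratio) when non-zeros are stored in dense tiles.

    nonzeros: iterable of (row, col) coordinates of non-zero entries.
    fill_ratio: non-zeros divided by total entries in all touched tiles;
    the remainder is explicit zero padding.
    """
    coords = set(nonzeros)
    # Each non-zero falls into exactly one tile, identified by integer division.
    tiles = {(r // tile_rows, c // tile_cols) for r, c in coords}
    if not tiles:
        return 0, 0.0
    stored = len(tiles) * tile_rows * tile_cols
    return len(tiles), len(coords) / stored


# A fully populated 16x8 block: one tile, no padding.
clustered = [(r, c) for r in range(16) for c in range(8)]
# Eight isolated non-zeros, each landing in a different tile.
scattered = [(16 * i, 8 * i) for i in range(8)]

clustered_stats = tile_fill_stats(clustered)
scattered_stats = tile_fill_stats(scattered)
```

In the clustered case every stored entry is useful work, whereas in the scattered case only 1 in 128 stored entries is a real non-zero, so most of the dense tile arithmetic would be spent on padding.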
In this paper, we therefore address the question: Can dense TCUs be effectively used to accelerate SpMM for a range of sparse matrices arising from multiple application domains, such as those found in the SuiteSparse matrix collection? We answer this question in the affirmative by developing a very efficient TCU-based GPU kernel - cuTeSpMM (cuda Tensor core SpMM) that achieves substantially higher performance over TC-GNN. We also develop a notion of the TCU-Synergy of a sparse-matrix, based on its non-zero structure and a modeled Operational Intensity. For sparse matrices with high TCU-Synergy, cuTeSpMM outperforms state-of-the-art scalar-core SpMM implementations, while achieving only slightly lower performance on matrices with low TCU-Synergy.
2025-04-08T21:30:58Z
Lizhi Xiang, Omid Asudeh, Gerald Sabin, Aravind Sukumaran-Rajam, P. Sadayappan

http://arxiv.org/abs/2409.19156v2
ZERNIPAX: A Fast and Accurate Zernike Polynomial Calculator in Python
2025-11-24T03:48:54Z
Zernike polynomials serve as an orthogonal basis on the unit disc, and have proven to be effective in optics simulations, astrophysics, and more recently in plasma simulations. Unlike Bessel functions, Zernike polynomials are inherently finite and smooth at the disc center (r=0), ensuring continuous differentiability along the axis. This property makes them particularly suitable for simulations, requiring no additional handling at the origin. We developed ZERNIPAX, an open-source Python package capable of utilizing CPU/GPUs, leveraging Google's JAX package and available on GitHub as well as the Python software repository PyPI. Our implementation of the recursion relation between Jacobi polynomials significantly improves computation time compared to alternative methods by use of parallel computing while still performing more accurately for high-mode numbers.
2024-09-27T21:44:37Z
Yigit Gunsur Elmacioglu, Rory Conlin, Daniel W.
Dudt, Dario Panici, Egemen Kolemen
10.1016/j.amc.2025.129534

http://arxiv.org/abs/2511.18692v1
VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking
2025-11-24T02:27:19Z
Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.
2025-11-24T02:27:19Z
Kichang Yang, Seonjun Kim, Minjae Kim, Nairan Zhang, Chi Zhang, Youngki Lee

http://arxiv.org/abs/2511.18674v1
Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
2025-11-24T01:13:52Z
Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N=20480$, providing 75% memory savings and $7.8\times$ speedup over PyTorch FP32 for large matrices.
The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking on NVIDIA RTX 4090 demonstrates that Low-Rank GEMM becomes the fastest approach for matrices $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.
2025-11-24T01:13:52Z
Alfredo Metere

http://arxiv.org/abs/2511.08803v3
PANDA: Noise-Resilient Antagonist Identification in Production Datacenters
2025-11-23T19:45:48Z
Modern warehouse-scale datacenters commonly collocate multiple jobs on shared machines to improve resource utilization. However, such collocation often leads to performance interference caused by antagonistic jobs that overconsume shared resources. Existing antagonist-detection approaches either rely on offline profiling, which is costly and unscalable, or use a sample-from-production approach, which suffers from noisy measurements and fails under multi-victim scenarios. We present PANDA, a noise-resilient antagonist identification framework for production-scale datacenters. Like prior correlation-based methods, PANDA uses cycles per instruction (CPI) as its performance metric, but it differs by (i) leveraging global historical knowledge across all machines to suppress sampling noise and (ii) introducing a machine-level CPI metric that captures shared-resource contention among multiple co-located tasks. Evaluation on a recent Google production trace shows that PANDA ranks true antagonists far more accurately than prior methods (improving average suspicion percentile from 50-55% to 82.6%) and achieves consistent antagonist identification under multi-victim scenarios, all with negligible runtime overhead.
2025-11-11T22:18:43Z
Add acknowledgement
Sixiang Zhou, Nan Deng, Krzysiek Rzadca, Xiaojun Lin, Y.
Charlie Hu

http://arxiv.org/abs/2511.18222v1
Using MLIR Transform to Design Sliced Convolution Algorithm
2025-11-22T23:51:51Z
This paper proposes SConvTransform, a Transform dialect extension that provides operations for optimizing 2D convolutions in MLIR. Its main operation, SConvOp, lowers Linalg convolutions into tiled and packed generic operations through a fully declarative transformation pipeline. The process is guided by a Convolution Slicing Analysis that determines tile sizes and data layout strategies based on input and filter shapes, as well as target architecture parameters. SConvOp handles edge cases by splitting irregular regions and adjusting affine maps where needed. All packing and tiling operations are derived from a parametric set of affine equations, enabling reusable and analyzable transformations. Although functional correctness was the primary goal of this work, the experimental evaluation demonstrates the effectiveness of SConvTransform, achieving good enough performance across different target architectures. Future work will focus on optimizing performance and porting to other target devices. When applied to standard convolution configurations, the generated code achieves up to 60% of peak performance on ARM SME and 67% on Intel AVX512. These results validate the benefit of combining static shape analysis with structured tiling and packing strategies within the MLIR Transform dialect. Furthermore, the modular design of SConvTransform facilitates integration with future extensions, enabling continued optimization of convolution workloads through MLIR's extensible compilation infrastructure.
2025-11-22T23:51:51Z
Victor Ferrari, Marcio Pereira, Lucas Alvarenga, Gustavo Leite, Guido Araujo
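
The tiling-with-edge-handling idea described in the SConvTransform abstract above can be sketched in plain Python. This is only an analogue under stated assumptions: it is not MLIR, it omits packing, and it clips partial tiles at the borders instead of splitting irregular regions into separate operations. The function names and the 2x3 tile size are hypothetical.

```python
def conv2d_direct(inp, ker):
    """Reference 'valid' 2D convolution (cross-correlation form)."""
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(inp) - kh + 1, len(inp[0]) - kw + 1
    return [[sum(inp[i + a][j + b] * ker[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)]
            for i in range(oh)]


def conv2d_tiled(inp, ker, tile_h=2, tile_w=3):
    """Same result, computed one output tile at a time."""
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(inp) - kh + 1, len(inp[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i0 in range(0, oh, tile_h):
        i1 = min(i0 + tile_h, oh)        # clip a partial tile at the bottom edge
        for j0 in range(0, ow, tile_w):
            j1 = min(j0 + tile_w, ow)    # clip a partial tile at the right edge
            for i in range(i0, i1):
                for j in range(j0, j1):
                    out[i][j] = sum(inp[i + a][j + b] * ker[a][b]
                                    for a in range(kh) for b in range(kw))
    return out


# 6x7 input and 3x3 filter give a 4x5 output, so the last tile column is partial.
image = [[r * 7 + c for c in range(7)] for r in range(6)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
tiled = conv2d_tiled(image, kernel)
reference = conv2d_direct(image, kernel)
```

Because the output width (5) is not a multiple of the tile width (3), the right-most tiles cover only two columns, which is the kind of irregular region a tiling transformation has to treat separately from full tiles.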