https://arxiv.org/api/zwQdaH0Z71HHVsrqyO/owzAVZhw 2026-04-03T08:31:04Z 5096 285 15 http://arxiv.org/abs/2511.22551v1 3RSeT: Read Disturbance Rate Reduction in STT-MRAM Caches by Selective Tag Comparison 2025-11-27T15:39:45Z

Recent development in memory technologies has introduced Spin-Transfer Torque Magnetic RAM (STT-MRAM) as the most promising replacement for SRAMs in on-chip cache memories. Besides its lower leakage power, higher density, immunity to radiation-induced particles, and non-volatility, an unintentional bit flip during read operation, referred to as read disturbance error, is a severe reliability challenge in STT-MRAM caches. One major source of read disturbance error in STT-MRAM caches is simultaneous accesses to all tags for parallel comparison operation in a cache set, which has not been addressed in previous work. This paper first demonstrates that high read accesses to tag array extremely increase the read disturbance rate and then proposes a low-cost scheme, so-called Read Disturbance Rate Reduction in STT-MRAM Caches by Selective Tag Comparison (3RSeT), to reduce the error rate by eliminating a significant portion of tag reads. 3RSeT proactively disables the tags that have no chance for hit, using low significant bits of the tags on each access request. Our evaluations using gem5 full-system cycle-accurate simulator show that 3RSeT reduces the read disturbance rate in the tag array by 71.8%, which results in 3.6x improvement in Mean Time To Failure (MTTF). In addition, the energy consumption is reduced by 62.1% without compromising performance and with less than 0.4% area overhead.

2025-11-27T15:39:45Z Elham Cheshmikhani Hamed Farbeh Hossein Asad http://arxiv.org/abs/2511.22467v1 Motion-to-Motion Latency Measurement Framework for Connected and Autonomous Vehicle Teleoperation 2025-11-27T13:58:27Z

Latency is a key performance factor for the teleoperation of Connected and Autonomous Vehicles (CAVs). It affects how quickly an operator can perceive changes in the driving environment and apply corrective actions. Most existing work focuses on Glass-to-Glass (G2G) latency, which captures delays only in the video pipeline. However, there is no standard method for measuring Motion-to-Motion (M2M) latency, defined as the delay between the physical steering movement of the remote operator and the corresponding steering motion in the vehicle. This paper presents an M2M latency measurement framework that uses Hall-effect sensors and two synchronized Raspberry Pi~5 devices. The system records interrupt-based timestamps on both sides to estimate M2M latency, independently of the underlying teleoperation architecture. Precision tests show an accuracy of 10--15~ms, while field results indicate that actuator delays dominate M2M latency, with median values above 750~ms.

2025-11-27T13:58:27Z François Provost Faisal Hawlader Mehdi Testouri Raphaël Frank http://arxiv.org/abs/2506.17084v2 JANUS: Resilient and Adaptive Data Transmission for Enabling Timely and Efficient Cross-Facility Scientific Workflows 2025-11-26T19:13:42Z

In modern science, the growing complexity of large-scale scientific projects has led to an increasing reliance on cross-facility scientific workflows, where resources and expertise from multiple institutions and geographic locations are leveraged to accelerate scientific discovery. These workflows often require transmitting huge amounts of scientific data through wide-area networks. Although high-speed networks like ESnet and transfer services such as Globus have improved data mobility, several challenges remain. The sheer volume of data can overwhelm network bandwidth, widely used transport protocols such as TCP suffer from inefficiencies due to retransmissions triggered by packet loss, and existing fault-tolerance mechanisms like erasure coding introduce substantial overhead. In this paper, we propose JANUS, a resilient and adaptable data transmission approach designed for cross-facility scientific workflows. Unlike traditional TCP-based methods, JANUSleverages UDP, integrates erasure coding for fault tolerance, and combines it with error-bounded lossy compression to reduce overhead. This novel design allows users to balance data transmission time and accuracy, optimizing transfer performance based on specific scientific requirements. Additionally, JANUS dynamically adjusts erasure coding parameters in response to real-time network conditions, ensuring efficient data transfers even in fluctuating environments. We develop optimization models for determining ideal configurations and implement adaptive data transfer protocols to enhance reliability. Through extensive simulations and real-network experiments, we demonstrate that JANUS significantly improves transfer efficiency while maintaining data fidelity.

2025-06-20T15:40:14Z Vladislav Esaulov Jieyang Chen Norbert Podhorszki Fred Suter Scott Klasky Anu G Bourgeois Lipeng Wan http://arxiv.org/abs/2511.21535v1 Modeling the Effect of Data Redundancy on Speedup in MLFMA Near-Field Computation 2025-11-26T16:01:32Z

The near-field (P2P) operator in the Multilevel Fast Multipole Algorithm (MLFMA) is a performance bottleneck on GPUs due to poor memory locality. This work introduces data redundancy to improve spatial locality by reducing memory access dispersion. For validation of results, we propose an analytical model based on a Locality metric that combines data volume and access dispersion to predict speedup trends without hardware-specific profiling. The approach is validated on two MLFMA-based applications: an electromagnetic solver (DBIM-MLFMA) with regular structure, and a stellar dynamics code (PhotoNs-2.0) with irregular particle distribution. Results show up to 7X kernel speedup due to improved cache behavior. However, increased data volume raises overheads in data restructuring, limiting end-to-end application speedup to 1.04X. While the model cannot precisely predict absolute speedups, it reliably captures performance trends across different problem sizes and densities. The technique is injectable into existing implementations with minimal code changes. This work demonstrates that data redundancy can enhance GPU performance for P2P operator, provided locality gains outweigh data movement costs.

2025-11-26T16:01:32Z Morteza Sadeghi http://arxiv.org/abs/2505.05623v3 Characterizing GPU Energy Usage in Exascale-Ready Portable Science Applications 2025-11-26T15:04:27Z

We characterize the GPU energy usage of two widely adopted exascale-ready applications representing two classes of particle and mesh solvers: (i) QMCPACK, a quantum Monte Carlo package, and (ii) AMReXCastro, an adaptive mesh astrophysical code. We analyze power, temperature, utilization, and energy traces from double-/single (mixed)-precision benchmarks on NVIDIA's A100 and H100 and AMD's MI250X GPUs using queries in NVML and rocm_smi_lib, respectively. We explore application-specific metrics to provide insights on energy vs. performance trade-offs. Our results suggest that mixed-precision energy savings range between 6-25% on QMCPACK and 45% on AMReX-Castro. Also, we found gaps in the AMD tooling used on Frontier GPUs that need to be understood, while query resolutions on NVML have little variability between 1 ms-1 s. Overall, application level knowledge is crucial to define energy-cost/science-benefit opportunities for the codesign of future supercomputer architectures in the post-Moore era.

2025-05-08T20:02:45Z 13 pages, 8 figures, 3 tables. Accepted at the Energy Efficiency with Sustainable Performance: Techniques, Tools, and Best Practices, EESP Workshop, in conjunction with ISC High Performance 2025 In: Neuwirth, S., Paul, A.K., Weinzierl, T., Carson, E.C. (eds) High Performance Computing. ISC High Performance 2025. Lecture Notes in Computer Science, vol 16091. Springer, Cham William F. Godoy Oscar Hernandez Paul R. C. Kent Maria Patrou Kazi Asifuzzaman Narasinga Rao Miniskar Pedro Valero-Lara Jeffrey S. Vetter Matthew D. Sinclair Jason Lowe-Power Bobby R. Bruce 10.1007/978-3-032-07612-0_14 http://arxiv.org/abs/2511.21413v1 Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM 2025-11-26T14:06:22Z

Due to rising demands for Artificial Inteligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing dynamic AI application workloads. In this paper, we propose our solution that serves LLMs by integrating vLLM, Slurm and Kubernetes on the supercomputer \textit{RAMSES}. The initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, incurring only an overhead of approximately 500 ms in terms of end-to-end latency.

2025-11-26T14:06:22Z 6 pages, 3 figures Tim Trappen Robert Keßler Roland Pabel Viktor Achter Stefan Wesner 10.1145/3774902.3776632 http://arxiv.org/abs/2511.20834v1 Accelerating Sparse Convolutions in Voxel-Based Point Cloud Networks 2025-11-25T20:34:37Z

Sparse Convolution (SpC) powers 3D point cloud networks widely used in autonomous driving and AR/VR. SpC builds a kernel map that stores mappings between input voxel coordinates, output coordinates, and weight offsets, then uses this map to compute feature vectors for output coordinates. Our work identifies three key properties of voxel coordinates: they are integer-valued, bounded within a limited spatial range, and geometrically continuous-neighboring voxels on the same object surface are highly likely to exist at small spatial offsets from each other. Prior SpC engines do not fully exploit these properties and suffer from high pre-processing and post-processing overheads during kernel map construction. To address this, we design Spira, the first voxel-property-aware SpC engine for GPUs. Spira proposes: (i) a high-performance one-shot search algorithm that builds the kernel map with no preprocessing and high memory locality, (ii) an effective packed-native processing scheme that accesses packed voxel coordinates at low cost, (iii) a flexible dual-dataflow execution mechanism that efficiently computes output feature vectors by adapting to layer characteristics, and (iv) a network-wide parallelization strategy that builds kernel maps for all SpC layers concurrently at network start. Our evaluation shows that Spira significantly outperforms prior SpC engines by 1.71x on average and up to 2.31x for end-to-end inference, and by 2.13x on average and up to 3.32x for layer-wise execution across diverse layer configurations.

2025-11-25T20:34:37Z Dionysios Adamopoulos Anastasia Poulopoulou Georgios Goumas Christina Giannoula http://arxiv.org/abs/2511.20048v1 Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design 2025-11-25T08:15:17Z

LLM-based search agents achieve strong performance but suffer from severe latency, as each step requires serialized LLM reasoning followed by action of tool execution. We revisit this bottleneck through the lens of speculation. While traditional predict-verify speculation paradigm can break serial execution, its benefit remains limited, as it retains the full original workload and adds extra inference overhead. We observe that early agent steps often involve simple evidence-gathering, where correct actions can often be predicted without full reasoning. Building on these observations, we present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency. Algorithmically, SPAgent introduces a two-phase adaptive speculation mechanism that selectively omits verification when safe. System-wise, a two-level scheduler regulates speculative requests based on engine load to ensure speculation remains beneficial. We implement SPAgent in real-world systems. Across extensive experimental settings, SPAgent achieves up to $1.65\times$ end-to-end speedup while maintaining same or even achieving higher accuracy, enabling practical deployment of multi-step search agents.

2025-11-25T08:15:17Z Zixiao Huang Wen Zeng Tianyu Fu Tengxuan Liu Yizhou Sun Ke Hong Xinhao Yang Chengchun Liu Yan Li Quanlu Zhang Guohao Dai Zhenhua Zhu Yu Wang http://arxiv.org/abs/2507.16274v2 STAlloc: Enhancing Memory Efficiency in Large-Scale Model Training with Spatio-Temporal Planning 2025-11-25T07:36:10Z

The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Such fragmentation stems from the use of online GPU memory allocators in popular deep learning frameworks like PyTorch, which disregard tensor lifespans. As a result, this inefficiency can waste as much as 43% of memory and trigger out-of-memory errors, undermining the effectiveness of optimization methods. To address this, we introduce STAlloc, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STAlloc introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch memory allocator, STAlloc reduces fragmentation ratio on average by 85.1% (up to 100%) across both dense and MoE models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves throughput performance by up to 32.5%.

2025-07-22T06:39:07Z Zixiao Huang Junhao Hu Hao Lin Chunyang Zhu Yueran Tang Quanlu Zhang Zhen Guo Zhenhua Li Shengen Yan Zhenhua Zhu Guohao Dai Yu Wang 10.1145/3767295.3769335 http://arxiv.org/abs/2504.06443v2 cuTeSpMM: Accelerating Sparse-Dense Matrix Multiplication using GPU Tensor Cores 2025-11-24T04:13:17Z

Many recent GPUs feature matrix multiplication engines (aka Tensor Core Units or TCUs) that perform small fixed-size matrix-matrix products at very high throughput. They have been used very effectively to speed up dense matrix-matrix multiplication libraries like Nvidia's cuBLAS, enabling significantly higher performance over use of the traditional scalar GPU cores. There also been recent interest in using these dense TCUs for the important sparse-dense matrix-matrix multiplication (SpMM) kernel via explicit zero-filling. However, an examination of the attainable performance of TC-GNN, the state-of-the-art TCU-enhanced SpMM implementation, indicates that for a substantial majority of the sparse matrices in the SuiteSparse collection, the achieved performance falls significantly short of the state-of-the-art SpMM kernels that only utilize scalar cores. In this paper, we therefore address the question: Can dense TCUs be effectively used to accelerate SpMM for a range of sparse matrices arising from multiple application domains, such as those found in the SuiteSparse matrix collection? We answer this question in the affirmative by developing a very efficient TCU-based GPU kernel - cuTeSpMM (cuda Tensor core SpMM) that achieves substantially higher performance over TC-GNN. We also develop a notion of the TCU-Synergy of a sparse-matrix, based on its non-zero structure and a modeled Operational Intensity. For sparse matrices with high TCU-synergy, cuTeSpMM outperforms state-of-the-art scalar-core SpMM implementations, while achieving only slightly lower performance on matrices with low TCU-Synergy.

2025-04-08T21:30:58Z Lizhi Xiang Omid Asudeh Gerald Sabin Aravind Sukumaran-Rajam P. Sadayappan http://arxiv.org/abs/2409.19156v2 ZERNIPAX: A Fast and Accurate Zernike Polynomial Calculator in Python 2025-11-24T03:48:54Z

Zernike polynomials serve as an orthogonal basis on the unit disc, and have proven to be effective in optics simulations, astrophysics, and more recently in plasma simulations. Unlike Bessel functions, Zernike polynomials are inherently finite and smooth at the disc center (r=0), ensuring continuous differentiability along the axis. This property makes them particularly suitable for simulations, requiring no additional handling at the origin. We developed ZERNIPAX, an open-source Python package capable of utilizing CPU/GPUs, leveraging Google's JAX package and available on GitHub as well as the Python software repository PyPI. Our implementation of the recursion relation between Jacobi polynomials significantly improves computation time compared to alternative methods by use of parallel computing while still performing more accurately for high-mode numbers.

2024-09-27T21:44:37Z Yigit Gunsur Elmacioglu Rory Conlin Daniel W. Dudt Dario Panici Egemen Kolemen 10.1016/j.amc.2025.129534 http://arxiv.org/abs/2511.18692v1 VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking 2025-11-24T02:27:19Z

Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.

2025-11-24T02:27:19Z Kichang Yang Seonjun Kim Minjae Kim Nairan Zhang Chi Zhang Youngki Lee http://arxiv.org/abs/2511.18674v1 Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration 2025-11-24T01:13:52Z

Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On a NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N=20480$, providing 75\% memory savings and $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking on NVIDIA RTX 4090 demonstrates that Low-Rank GEMM becomes the fastest approach for matrices $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.

2025-11-24T01:13:52Z Alfredo Metere http://arxiv.org/abs/2511.08803v3 PANDA: Noise-Resilient Antagonist Identification in Production Datacenters 2025-11-23T19:45:48Z

Modern warehouse-scale datacenters commonly collocate multiple jobs on shared machines to improve resource utilization. However, such collocation often leads to performance interference caused by antagonistic jobs that overconsume shared resources. Existing antagonist-detection approaches either rely on offline profiling, which is costly and unscalable, or use a sample-from-production approach, which suffers from noisy measurements and fails under multi-victim scenarios. We present PANDA, a noise-resilient antagonist identification framework for production-scale datacenters. Like prior correlation-based methods, PANDA uses cycles per instruction (CPI) as its performance metric, but it differs by (i) leveraging global historical knowledge across all machines to suppress sampling noise and (ii) introducing a machine-level CPI metric that captures shared-resource contention among multiple co-located tasks. Evaluation on a recent Google production trace shows that PANDA ranks true antagonists far more accurately than prior methods -- improving average suspicion percentile from 50-55% to 82.6% -- and achieves consistent antagonist identification under multi-victim scenarios, all with negligible runtime overhead.

2025-11-11T22:18:43Z Add acknowledgement Sixiang Zhou Nan Deng Krzysiek Rzadca Xiaojun Lin Y. Charlie Hu http://arxiv.org/abs/2511.18222v1 Using MLIR Transform to Design Sliced Convolution Algorithm 2025-11-22T23:51:51Z

This paper proposes SConvTransform, a Transform dialect extension that provides operations for optimizing 2D convolutions in MLIR. Its main operation, SConvOp, lowers Linalg convolutions into tiled and packed generic operations through a fully declarative transformation pipeline. The process is guided by a Convolution Slicing Analysis that determines tile sizes and data layout strategies based on input and filter shapes, as well as target architecture parameters. SConvOp handles edge cases by splitting irregular regions and adjusting affine maps where needed. All packing and tiling operations are derived from a parametric set of affine equations, enabling reusable and analyzable transformations. Although functional correctness was the primary goal of this work, the experimental evaluation demonstrates the effectiveness of SConvTransform, achieving good enough performance across different target architectures. Future work will focus on optimizing performance and porting to other target devices. When applied to standard convolution configurations, the generated code achieves up to 60% of peak performance on ARM SME and 67% on Intel AVX512. These results validate the benefit of combining static shape analysis with structured tiling and packing strategies within the MLIR Transform dialect. Furthermore, the modular design of SConvTransform facilitates integration with future extensions, enabling continued optimization of convolution workloads through MLIR's extensible compilation infrastructure.

2025-11-22T23:51:51Z Victor Ferrari Marcio Pereira Lucas Alvarenga Gustavo Leite Guido Araujo