https://arxiv.org/api/Gkr77sAx4ZjI3LN+CxeN8vreoWM 2026-03-24T12:40:24Z 5073 60 15 http://arxiv.org/abs/2602.23556v1 Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents 2026-02-26T23:39:42Z Large-scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching polices. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state-of-the-art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder's adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning. We find this behavior well-suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder-llm-agent. 2026-02-26T23:39:42Z Accepted to the 40th ACM International Conference on Supercomputing (ICS 2026) Aishwarya Sarkar Sayan Ghosh Nathan Tallent Aman Chadha Tanya Roosta Ali Jannesari http://arxiv.org/abs/2601.01265v3 CounterPoint: Using Hardware Event Counters to Refute and Refine Microarchitectural Assumptions (Extended Version) 2026-02-26T22:04:31Z Hardware event counters offer the potential to reveal not only performance bottlenecks but also detailed microarchitectural behavior. In practice, this promise is undermined by their vague specifications, opaque designs, and multiplexing noise, making event counter data hard to interpret. We introduce CounterPoint, a framework that tests user-specified microarchitectural models - expressed as $μ$path Decision Diagrams - for consistency with performance counter data. When mismatches occur, CounterPoint pinpoints plausible microarchitectural features that could explain them, using multi-dimensional counter confidence regions to mitigate multiplexing noise. We apply CounterPoint to the Haswell Memory Management Unit as a case study, shedding light on multiple undocumented and underdocumented microarchitectural behaviors. These include a load-store queue-side TLB prefetcher, merging page table walkers, abortable page table walks, and more. Overall, CounterPoint helps experts reconcile noisy hardware performance counter measurements with their mental model of the microarchitecture - uncovering subtle, previously hidden hardware features along the way. 2026-01-03T19:24:00Z This is an extended version of a paper which has been accepted to the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems conference (ASPLOS, March 2026). 20 pages, 20 figures, 8 tables Nick Lindsay Yale University Caroline Trippel Stanford University Anurag Khandelwal Yale University Abhishek Bhattacharjee Yale University 10.1145/3779212.3790145 http://arxiv.org/abs/2602.22540v1 The Road to Useful Quantum Computers 2026-02-26T02:27:54Z Building a useful quantum computer is a grand science and engineering challenge, currently pursued intensely by teams around the world. In the 1980s, Richard Feynman and Yuri Manin observed independently that computers based on quantum mechanics might enable better simulations of quantum phenomena. Their vision remained an intellectual curiosity until Peter Shor published his famous quantum algorithm for integer factoring, and shortly thereafter a proof that errors in quantum computations can be corrected. Since then, quantum computing R&D has progressed rapidly, from small-scale experiments in university physics laboratories to well-funded industrial efforts and prototypes. Hype notwithstanding, quantum computers have yet to solve scientifically or practically important problems -- a target often called quantum utility. In this article, we describe the capabilities of contemporary quantum computers, compare them to the requirements of quantum utility, and illustrate how to track progress from today to utility. We highlight key science and engineering challenges on the road to quantum utility, touching on relevant aspects of our own research. 2026-02-26T02:27:54Z This article was written by invitation for the National Academy of Engineering (NAE)'s magazine, The Bridge, based on a talk given by the first author at The Grainger Foundation Frontiers of Engineering 2025 Symposium. It was written for a general audience of engineers. It is close to the published article, which can be found at https://www.nae.edu/344306/the-road-to-useful-quantum-computers The Bridge 55, 4, 29-36 (2026) Timothy Proctor Robin Blume-Kohout Andrew Baczewski http://arxiv.org/abs/2602.22423v1 CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems 2026-02-25T21:34:59Z Tuning parallel file system in High-Performance Computing (HPC) systems remains challenging due to the complex I/O paths, diverse I/O patterns, and dynamic system conditions. While existing autotuning frameworks have shown promising results in tuning PFS parameters based on applications' I/O patterns, they lack scalability, adaptivity, and the ability to operate online. In this work, focusing on scalable online tuning, we present CARAT, an ML-guided framework to co-tune client-side RPC and caching parameters of PFS, leveraging only locally observable metrics. Unlike global or pattern-dependent approaches, CARAT enables each client to make independent and intelligent tuning decisions online, responding to real-time changes in both application I/O behaviors and system states. We then prototyped CARAT using Lustre and evaluated it extensively across dynamic I/O patterns, real-world HPC workloads, and multi-client deployments. The results demonstrated that CARAT can achieve up to 3x performance improvement over the default or static configurations, validating the effectiveness and generality of our approach. Due to its scalability and lightweight, we believe CARAT has the potential to be widely deployed into existing PFS and benefit various data-intensive applications. 2026-02-25T21:34:59Z to be published in 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2026 Md Hasanur Rashid Nathan R. Tallent Forrest Sheng Bao Dong Dai http://arxiv.org/abs/2602.22409v1 AdapTBF: Decentralized Bandwidth Control via Adaptive Token Borrowing for HPC Storage 2026-02-25T21:03:04Z Modern high-performance computing (HPC) applications run on compute resources but share global storage systems. This design can cause problems when applications consume a disproportionate amount of storage bandwidth relative to their allocated compute resources. For example, an application running on a single compute node can issue many small, random writes and consume excessive I/O bandwidth from a storage server. This can hinder larger jobs that write to the same storage server and are allocated many compute nodes, resulting in significant resource waste. A straightforward solution is to limit each application's I/O bandwidth on storage servers in proportion to its allocated compute resources. This approach has been implemented in parallel file systems using Token Bucket Filter (TBF). However, strict proportional limits often reduce overall I/O efficiency because HPC applications generate short, bursty I/O. Limiting bandwidth can waste server capacity when applications are idle or prevent applications from temporarily using higher bandwidth during bursty phases. We argue that I/O control should maximize per-application performance and overall storage efficiency while ensuring fairness (e.g., preventing small jobs from blocking large-scale ones). We propose AdapTBF, which builds on TBF in modern parallel file systems (e.g., Lustre) and introduces a decentralized bandwidth control approach using adaptive borrowing and lending. We detail the algorithm, implement AdapTBF in Lustre, and evaluate it using synthetic workloads modeled after real-world scenarios. Results show that AdapTBF manages I/O bandwidth effectively while maintaining high storage utilization, even under extreme conditions. 2026-02-25T21:03:04Z 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 775-788) Md Hasanur Rashid Dong Dai 10.1109/IPDPS64566.2025.00074 http://arxiv.org/abs/2602.22392v1 DIAL: Decentralized I/O AutoTuning via Learned Client-side Local Metrics for Parallel File System 2026-02-25T20:41:45Z Enabling efficient, high-performance data access in parallel file systems (PFS) is critical for today's high-performance computing systems. PFS client-side I/O heavily impacts the final I/O performance delivered to individual applications and the entire system. Autotuning the key client-side I/O behaviors has been extensively studied and shows promising results. However, existing work has heavily relied on extensive number of global runtime metrics to monitor and accurate modeling of applications' I/O patterns. Such heavy overheads significantly limit the ability to enable fine-grained, dynamic tuning in practical systems. In this study, we propose DIAL (Decentralized I/O AutoTuning via Learned Client-side Local Metrics) which takes a drastically different approach. Instead of trying to extract the global I/O patterns of applications, DIAL takes a decentralized approach, treating each I/O client as an independent unit and tuning configurations using only its locally observable metrics. With the help of machine learning models, DIAL enables multiple tunable units to make independent but collective decisions, reacting to what is happening in the global storage systems in a timely manner and achieving better I/O performance globally for the application. 2026-02-25T20:41:45Z 2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (pp. 01-04) Md Hasanur Rashid Xinyi Li Youbiao He Forrest Sheng Bao Dong Dai 10.1109/CCGRID64434.2025.00037 http://arxiv.org/abs/2602.22103v1 PASTA: A Modular Program Analysis Tool Framework for Accelerators 2026-02-25T16:51:24Z The increasing complexity and diversity of hardware accelerators in modern computing systems demand flexible, low-overhead program analysis tools. We present PASTA, a low-overhead and modular Program AnalysiS Tool Framework for Accelerators. PASTA abstracts over low-level profiling APIs and diverse deep learning frameworks, offering users a unified interface to capture and analyze runtime events at multiple levels. Its extensible design enables researchers and practitioners to rapidly prototype custom tools with minimal overhead. We demonstrate the utility of PASTA by developing several analysis tools, including a deep learning workload characterization tool and a UVM optimization tool. Through extensive evaluation on mainstream deep learning workloads tested on NVIDIA and AMD GPUs under both single- and multi-GPU scenarios, we demonstrate PASTA's broad applicability. On NVIDIA GPUs, we further show that PASTA provides detailed performance insights with significantly lower overhead, up to 1.3*10^4 faster than conventional analysis tools, thanks to its GPU-accelerated backend. PASTA strikes a practical balance between usability, extensibility, and efficiency, making it well-suited for modern accelerator-based computing environments. 2026-02-25T16:51:24Z Mao Lin Hyeran Jeon Keren Zhou http://arxiv.org/abs/2603.10026v1 RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators 2026-02-24T12:29:57Z Operator fusion, as a key performance optimization technique in the deployment of AI models, significantly improves execution efficiency and has been widely adopted in modern AI compilers. However, for cascaded reduction operations involving multiple loops with inter-loop data dependencies, such as the safe softmax followed by GEMM within attention mechanisms, existing compilers lack effective automated fusion and kernel generation capabilities. Although some works have addressed specific instances through hand-crafted fusion strategies, their solutions are limited in generality and difficult to extend to other similar structures. Given the prevalence of such computational patterns in deep learning models, there remains significant untapped potential in achieving general and automated fusion optimization. In this paper, we present a formal theoretical methodology for analyzing cascaded reductions which can fuse them into a single loop and introduce an incremental computation form. Based on this methodology, we design Reduction Fuser (RedFuser), a framework that automatically identifies supported cascaded reduction patterns and generates optimized fused kernels. Experiments show that RedFuser successfully fuses diverse workloads, achieving up to 2$\times$ to 5$\times$ speedup over state-of-the-art AI compilers and matching the performance of highly optimized hand-written kernels. The code is available at https://github.com/alibaba/redfuser 2026-02-24T12:29:57Z 22 pages, 13 figures, ASPLOS '26 Xinsheng Tang Yangcheng Li Nan Wang Zhiyi Shu Xingyu Ling Junna Xing Peng Zhou Qiang Liu 10.1145/3779212.3790209 http://arxiv.org/abs/2603.04445v1 Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey 2026-02-23T21:57:27Z The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time. While smaller models suffice for routine queries, complex tasks demand more capable models. However, static model deployment does not account for the complexity and domain of incoming queries, leading to suboptimal performance and increased costs. Dynamic routing systems that adaptively select models based on query characteristics have emerged as a solution to this challenge. We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when decisions are made, what information is used, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints. Our analysis demonstrates that effective multi-LLM routing requires balancing competing objectives. Choosing the optimal routing strategy depends on deployment and computational constraints. Well-designed routing systems can outperform even the most powerful individual models by strategically leveraging specialized capabilities across models while maximizing efficiency gains. Meanwhile, open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications. 2026-02-23T21:57:27Z Work funded by ADAPT Centre, Trinity College Dublin, and Huawei Ireland Yasmin Moslem John D. Kelleher http://arxiv.org/abs/2508.19073v3 CARMA: Collocation-Aware Resource Manager 2026-02-23T11:03:09Z GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource manager for the server-scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out the high-risk GPUs; (2) task placement policies that cap GPU utilization to limit OOMs and interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. This results in a ~35% and ~15% reduction in the end-to-end execution time (makespan) and GPU energy consumption, respectively, for this workload. 2025-08-26T14:29:34Z Ehsan Yousefzadeh-Asl-Miandoab Florina M. Ciorba Pınar Tözün http://arxiv.org/abs/2603.00126v1 QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference 2026-02-23T10:14:03Z Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs. 2026-02-23T10:14:03Z Miao Zhang Ruixiao Zhang Jianxin Shi Hengzhi Wang Hao Fang Jiangchuan Liu http://arxiv.org/abs/2602.05695v2 SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference 2026-02-23T09:49:19Z Large Language Models (LLMs) inference is central to modern AI applications, dominating worldwide datacenter workloads, making it critical to predict its energy footprint. Existing approaches estimate energy consumption as a simple linear function of input and output sequence. However, by analyzing the autoregressive structure of Transformers, which implies a fundamentally non-linear relationship between input and output sequence lengths and energy consumption, we demonstrate the existence of a generation energy minima. Peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs. Consequently, we propose SweetSpot, an analytical model derived from the computational and memory-access complexity of the Transformer architecture, which accurately characterizes the efficiency curve as a function of input and output lengths. To assess accuracy, we measure energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite. We test input and output lengths from 64 to 4096 tokens and achieve a mean MAPE of 1.79%. Our results show that aligning sequence lengths with these efficiency "sweet spots" reduce energy usage, up to 33.41x, enabling informed truncation, summarization, and adaptive generation strategies in production systems. 2026-02-05T14:21:00Z To appear at ICPE 2026 (International Conference on Performance Engineering) Hiari Pizzini Cavagna Andrea Proia Giacomo Madella Giovanni B. Esposito Francesco Antici Daniele Cesarini Zeynep Kiziltan Andrea Bartolini http://arxiv.org/abs/2409.09874v3 The Landscape of GPU-Centric Communication 2026-02-22T12:56:03Z In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches in multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library supports. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. Then, it explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers insights on how to exploit multi-GPU systems at their best. 2024-09-15T21:50:20Z Didem Unat Ilyas Turimbetov Mohammed Kefah Taha Issa Doğan Sağbili Flavio Vella Daniele De Sensi Ismayil Ismayilov http://arxiv.org/abs/2602.17808v1 Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs 2026-02-19T20:17:18Z IoT applications are increasingly relying on on-device AI accelerators to ensure high performance, especially in limited connectivity and safety-critical scenarios. However, the limited on-chip memory of these accelerators forces inference runtimes to swap model segments between host and accelerator memory, substantially inflating latency. While collaborative processing by partitioning the model processing between CPU and accelerator resources can reduce accelerator memory pressure and latency, naive partitioning may worsen end-to-end latency by either shifting excessive computation to the CPU or failing to sufficiently curb swapping, a problem that is further amplified in multi-tenant and dynamic environments. To address these issues, we present SwapLess, a system for adaptive, multi-tenant TPU-CPU collaborative inference for memory-constrained Edge TPUs. SwapLess utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across different workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation on Edge TPU-equipped platforms demonstrates that SwapLess reduces mean latency by up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler. 2026-02-19T20:17:18Z Nathan Ng Walid A. Hanafy Prashanthi Kadambi Balachandra Sunil Ayush Gupta David Irwin Yogesh Simmhan Prashant Shenoy http://arxiv.org/abs/2603.08727v1 ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs 2026-02-19T16:24:08Z Large Language Models (LLMs) are increasingly deployed in scenarios demanding ultra-long context reasoning, such as agentic workflows and deep research understanding. However, long-context inference is constrained by the KV cache, a transient memory structure that grows linearly with sequence length and batch size, quickly dominating GPU memory usage. Existing memory reduction techniques, including eviction and quantization, often rely on static heuristics and suffer from degraded quality under tight budgets. In this paper, we propose ARKV, a lightweight and adaptive framework that dynamically allocates precision levels to cached tokens based on per-layer attention dynamics and token-level importance. During a short prefill phase, ARKV estimates the original quantization (OQ) ratio of each layer by computing statistical scores such as attention entropy, variance and kurtosis. During decoding, tokens are assigned to one of three states, Original (full precision), Quantization (low precision), or Eviction, according to a fast heavy-hitter scoring strategy. Our experiments on LLaMA3 and Qwen3 models across diverse long- and short-context tasks demonstrate that ARKV preserves ~97% of baseline accuracy on long-context benchmarks while reducing KV memory usage by 4x, with minimal throughput loss. On short-context tasks, ARKV matches full-precision baselines; on GSM8K math reasoning, it significantly outperforms uniform quantization. These results highlight the practical viability of ARKV for scalable LLM deployment, offering fine-grained, data-driven memory control without retraining or architectural modifications. The source code and artifacts can be found in: https://github.com/Large-scale-Sustainable-Computing-LSC/ARKV 2026-02-19T16:24:08Z Accepted in ACM/IEEE CCGRID 2025 conference Jianlong Lei Shashikant Ilager