https://arxiv.org/api/Gkr77sAx4ZjI3LN+CxeN8vreoWM2026-03-24T12:40:24Z50736015http://arxiv.org/abs/2602.23556v1Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents2026-02-26T23:39:42ZLarge-scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching polices. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state-of-the-art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder's adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning. We find this behavior well-suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder-llm-agent.2026-02-26T23:39:42ZAccepted to the 40th ACM International Conference on Supercomputing (ICS 2026)Aishwarya SarkarSayan GhoshNathan TallentAman ChadhaTanya RoostaAli Jannesarihttp://arxiv.org/abs/2601.01265v3CounterPoint: Using Hardware Event Counters to Refute and Refine Microarchitectural Assumptions (Extended Version)2026-02-26T22:04:31ZHardware event counters offer the potential to reveal not only performance bottlenecks but also detailed microarchitectural behavior. In practice, this promise is undermined by their vague specifications, opaque designs, and multiplexing noise, making event counter data hard to interpret.
We introduce CounterPoint, a framework that tests user-specified microarchitectural models - expressed as $μ$path Decision Diagrams - for consistency with performance counter data. When mismatches occur, CounterPoint pinpoints plausible microarchitectural features that could explain them, using multi-dimensional counter confidence regions to mitigate multiplexing noise. We apply CounterPoint to the Haswell Memory Management Unit as a case study, shedding light on multiple undocumented and underdocumented microarchitectural behaviors. These include a load-store queue-side TLB prefetcher, merging page table walkers, abortable page table walks, and more.
Overall, CounterPoint helps experts reconcile noisy hardware performance counter measurements with their mental model of the microarchitecture - uncovering subtle, previously hidden hardware features along the way.2026-01-03T19:24:00ZThis is an extended version of a paper which has been accepted to the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems conference (ASPLOS, March 2026). 20 pages, 20 figures, 8 tablesNick LindsayYale UniversityCaroline TrippelStanford UniversityAnurag KhandelwalYale UniversityAbhishek BhattacharjeeYale University10.1145/3779212.3790145http://arxiv.org/abs/2602.22540v1The Road to Useful Quantum Computers2026-02-26T02:27:54ZBuilding a useful quantum computer is a grand science and engineering challenge, currently pursued intensely by teams around the world. In the 1980s, Richard Feynman and Yuri Manin observed independently that computers based on quantum mechanics might enable better simulations of quantum phenomena. Their vision remained an intellectual curiosity until Peter Shor published his famous quantum algorithm for integer factoring, and shortly thereafter a proof that errors in quantum computations can be corrected. Since then, quantum computing R&D has progressed rapidly, from small-scale experiments in university physics laboratories to well-funded industrial efforts and prototypes. Hype notwithstanding, quantum computers have yet to solve scientifically or practically important problems -- a target often called quantum utility. In this article, we describe the capabilities of contemporary quantum computers, compare them to the requirements of quantum utility, and illustrate how to track progress from today to utility. We highlight key science and engineering challenges on the road to quantum utility, touching on relevant aspects of our own research.2026-02-26T02:27:54ZThis article was written by invitation for the National Academy of Engineering (NAE)'s magazine, The Bridge, based on a talk given by the first author at The Grainger Foundation Frontiers of Engineering 2025 Symposium. It was written for a general audience of engineers. It is close to the published article, which can be found at https://www.nae.edu/344306/the-road-to-useful-quantum-computersThe Bridge 55, 4, 29-36 (2026)Timothy ProctorRobin Blume-KohoutAndrew Baczewskihttp://arxiv.org/abs/2602.22423v1CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems2026-02-25T21:34:59ZTuning parallel file system in High-Performance Computing (HPC) systems remains challenging due to the complex I/O paths, diverse I/O patterns, and dynamic system conditions. While existing autotuning frameworks have shown promising results in tuning PFS parameters based on applications' I/O patterns, they lack scalability, adaptivity, and the ability to operate online. In this work, focusing on scalable online tuning, we present CARAT, an ML-guided framework to co-tune client-side RPC and caching parameters of PFS, leveraging only locally observable metrics. Unlike global or pattern-dependent approaches, CARAT enables each client to make independent and intelligent tuning decisions online, responding to real-time changes in both application I/O behaviors and system states. We then prototyped CARAT using Lustre and evaluated it extensively across dynamic I/O patterns, real-world HPC workloads, and multi-client deployments. The results demonstrated that CARAT can achieve up to 3x performance improvement over the default or static configurations, validating the effectiveness and generality of our approach. Due to its scalability and lightweight, we believe CARAT has the potential to be widely deployed into existing PFS and benefit various data-intensive applications.2026-02-25T21:34:59Zto be published in 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2026Md Hasanur RashidNathan R. TallentForrest Sheng BaoDong Daihttp://arxiv.org/abs/2602.22409v1AdapTBF: Decentralized Bandwidth Control via Adaptive Token Borrowing for HPC Storage2026-02-25T21:03:04ZModern high-performance computing (HPC) applications run on compute resources but share global storage systems. This design can cause problems when applications consume a disproportionate amount of storage bandwidth relative to their allocated compute resources. For example, an application running on a single compute node can issue many small, random writes and consume excessive I/O bandwidth from a storage server. This can hinder larger jobs that write to the same storage server and are allocated many compute nodes, resulting in significant resource waste.
A straightforward solution is to limit each application's I/O bandwidth on storage servers in proportion to its allocated compute resources. This approach has been implemented in parallel file systems using Token Bucket Filter (TBF). However, strict proportional limits often reduce overall I/O efficiency because HPC applications generate short, bursty I/O. Limiting bandwidth can waste server capacity when applications are idle or prevent applications from temporarily using higher bandwidth during bursty phases.
We argue that I/O control should maximize per-application performance and overall storage efficiency while ensuring fairness (e.g., preventing small jobs from blocking large-scale ones). We propose AdapTBF, which builds on TBF in modern parallel file systems (e.g., Lustre) and introduces a decentralized bandwidth control approach using adaptive borrowing and lending. We detail the algorithm, implement AdapTBF in Lustre, and evaluate it using synthetic workloads modeled after real-world scenarios. Results show that AdapTBF manages I/O bandwidth effectively while maintaining high storage utilization, even under extreme conditions.2026-02-25T21:03:04Z2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 775-788)Md Hasanur RashidDong Dai10.1109/IPDPS64566.2025.00074http://arxiv.org/abs/2602.22392v1DIAL: Decentralized I/O AutoTuning via Learned Client-side Local Metrics for Parallel File System2026-02-25T20:41:45ZEnabling efficient, high-performance data access in parallel file systems (PFS) is critical for today's high-performance computing systems. PFS client-side I/O heavily impacts the final I/O performance delivered to individual applications and the entire system. Autotuning the key client-side I/O behaviors has been extensively studied and shows promising results. However, existing work has heavily relied on extensive number of global runtime metrics to monitor and accurate modeling of applications' I/O patterns. Such heavy overheads significantly limit the ability to enable fine-grained, dynamic tuning in practical systems. In this study, we propose DIAL (Decentralized I/O AutoTuning via Learned Client-side Local Metrics) which takes a drastically different approach. Instead of trying to extract the global I/O patterns of applications, DIAL takes a decentralized approach, treating each I/O client as an independent unit and tuning configurations using only its locally observable metrics. With the help of machine learning models, DIAL enables multiple tunable units to make independent but collective decisions, reacting to what is happening in the global storage systems in a timely manner and achieving better I/O performance globally for the application.2026-02-25T20:41:45Z2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (pp. 01-04)Md Hasanur RashidXinyi LiYoubiao HeForrest Sheng BaoDong Dai10.1109/CCGRID64434.2025.00037http://arxiv.org/abs/2602.22103v1PASTA: A Modular Program Analysis Tool Framework for Accelerators2026-02-25T16:51:24ZThe increasing complexity and diversity of hardware accelerators in modern computing systems demand flexible, low-overhead program analysis tools. We present PASTA, a low-overhead and modular Program AnalysiS Tool Framework for Accelerators. PASTA abstracts over low-level profiling APIs and diverse deep learning frameworks, offering users a unified interface to capture and analyze runtime events at multiple levels. Its extensible design enables researchers and practitioners to rapidly prototype custom tools with minimal overhead. We demonstrate the utility of PASTA by developing several analysis tools, including a deep learning workload characterization tool and a UVM optimization tool. Through extensive evaluation on mainstream deep learning workloads tested on NVIDIA and AMD GPUs under both single- and multi-GPU scenarios, we demonstrate PASTA's broad applicability. On NVIDIA GPUs, we further show that PASTA provides detailed performance insights with significantly lower overhead, up to 1.3*10^4 faster than conventional analysis tools, thanks to its GPU-accelerated backend. PASTA strikes a practical balance between usability, extensibility, and efficiency, making it well-suited for modern accelerator-based computing environments.2026-02-25T16:51:24ZMao LinHyeran JeonKeren Zhouhttp://arxiv.org/abs/2603.10026v1RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators2026-02-24T12:29:57ZOperator fusion, as a key performance optimization technique in the deployment of AI models, significantly improves execution efficiency and has been widely adopted in modern AI compilers. However, for cascaded reduction operations involving multiple loops with inter-loop data dependencies, such as the safe softmax followed by GEMM within attention mechanisms, existing compilers lack effective automated fusion and kernel generation capabilities. Although some works have addressed specific instances through hand-crafted fusion strategies, their solutions are limited in generality and difficult to extend to other similar structures. Given the prevalence of such computational patterns in deep learning models, there remains significant untapped potential in achieving general and automated fusion optimization.
In this paper, we present a formal theoretical methodology for analyzing cascaded reductions which can fuse them into a single loop and introduce an incremental computation form. Based on this methodology, we design Reduction Fuser (RedFuser), a framework that automatically identifies supported cascaded reduction patterns and generates optimized fused kernels. Experiments show that RedFuser successfully fuses diverse workloads, achieving up to 2$\times$ to 5$\times$ speedup over state-of-the-art AI compilers and matching the performance of highly optimized hand-written kernels. The code is available at https://github.com/alibaba/redfuser2026-02-24T12:29:57Z22 pages, 13 figures, ASPLOS '26Xinsheng TangYangcheng LiNan WangZhiyi ShuXingyu LingJunna XingPeng ZhouQiang Liu10.1145/3779212.3790209http://arxiv.org/abs/2603.04445v1Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey2026-02-23T21:57:27ZThe rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time. While smaller models suffice for routine queries, complex tasks demand more capable models. However, static model deployment does not account for the complexity and domain of incoming queries, leading to suboptimal performance and increased costs. Dynamic routing systems that adaptively select models based on query characteristics have emerged as a solution to this challenge.
We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when decisions are made, what information is used, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints.
Our analysis demonstrates that effective multi-LLM routing requires balancing competing objectives. Choosing the optimal routing strategy depends on deployment and computational constraints. Well-designed routing systems can outperform even the most powerful individual models by strategically leveraging specialized capabilities across models while maximizing efficiency gains. Meanwhile, open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications.2026-02-23T21:57:27ZWork funded by ADAPT Centre, Trinity College Dublin, and Huawei IrelandYasmin MoslemJohn D. Kelleherhttp://arxiv.org/abs/2508.19073v3CARMA: Collocation-Aware Resource Manager2026-02-23T11:03:09ZGPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource manager for the server-scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out the high-risk GPUs; (2) task placement policies that cap GPU utilization to limit OOMs and interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. This results in a ~35% and ~15% reduction in the end-to-end execution time (makespan) and GPU energy consumption, respectively, for this workload.2025-08-26T14:29:34ZEhsan Yousefzadeh-Asl-MiandoabFlorina M. CiorbaPınar Tözünhttp://arxiv.org/abs/2603.00126v1QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference2026-02-23T10:14:03ZVideo-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.2026-02-23T10:14:03ZMiao ZhangRuixiao ZhangJianxin ShiHengzhi WangHao FangJiangchuan Liuhttp://arxiv.org/abs/2602.05695v2SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference2026-02-23T09:49:19ZLarge Language Models (LLMs) inference is central to modern AI applications, dominating worldwide datacenter workloads, making it critical to predict its energy footprint. Existing approaches estimate energy consumption as a simple linear function of input and output sequence. However, by analyzing the autoregressive structure of Transformers, which implies a fundamentally non-linear relationship between input and output sequence lengths and energy consumption, we demonstrate the existence of a generation energy minima. Peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs. Consequently, we propose SweetSpot, an analytical model derived from the computational and memory-access complexity of the Transformer architecture, which accurately characterizes the efficiency curve as a function of input and output lengths. To assess accuracy, we measure energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite. We test input and output lengths from 64 to 4096 tokens and achieve a mean MAPE of 1.79%. Our results show that aligning sequence lengths with these efficiency "sweet spots" reduce energy usage, up to 33.41x, enabling informed truncation, summarization, and adaptive generation strategies in production systems.2026-02-05T14:21:00ZTo appear at ICPE 2026 (International Conference on Performance Engineering)Hiari Pizzini CavagnaAndrea ProiaGiacomo MadellaGiovanni B. EspositoFrancesco AnticiDaniele CesariniZeynep KiziltanAndrea Bartolinihttp://arxiv.org/abs/2409.09874v3The Landscape of GPU-Centric Communication2026-02-22T12:56:03ZIn recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches in multi-GPU communication and computation.
This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library supports. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. Then, it explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers insights on how to exploit multi-GPU systems at their best.2024-09-15T21:50:20ZDidem UnatIlyas TurimbetovMohammed Kefah Taha IssaDoğan SağbiliFlavio VellaDaniele De SensiIsmayil Ismayilovhttp://arxiv.org/abs/2602.17808v1Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs2026-02-19T20:17:18ZIoT applications are increasingly relying on on-device AI accelerators to ensure high performance, especially in limited connectivity and safety-critical scenarios. However, the limited on-chip memory of these accelerators forces inference runtimes to swap model segments between host and accelerator memory, substantially inflating latency. While collaborative processing by partitioning the model processing between CPU and accelerator resources can reduce accelerator memory pressure and latency, naive partitioning may worsen end-to-end latency by either shifting excessive computation to the CPU or failing to sufficiently curb swapping, a problem that is further amplified in multi-tenant and dynamic environments.
To address these issues, we present SwapLess, a system for adaptive, multi-tenant TPU-CPU collaborative inference for memory-constrained Edge TPUs. SwapLess utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across different workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation on Edge TPU-equipped platforms demonstrates that SwapLess reduces mean latency by up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.2026-02-19T20:17:18ZNathan NgWalid A. HanafyPrashanthi KadambiBalachandra SunilAyush GuptaDavid IrwinYogesh SimmhanPrashant Shenoyhttp://arxiv.org/abs/2603.08727v1ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs2026-02-19T16:24:08ZLarge Language Models (LLMs) are increasingly deployed in scenarios demanding ultra-long context reasoning, such as agentic workflows and deep research understanding. However, long-context inference is constrained by the KV cache, a transient memory structure that grows linearly with sequence length and batch size, quickly dominating GPU memory usage. Existing memory reduction techniques, including eviction and quantization, often rely on static heuristics and suffer from degraded quality under tight budgets. In this paper, we propose ARKV, a lightweight and adaptive framework that dynamically allocates precision levels to cached tokens based on per-layer attention dynamics and token-level importance. During a short prefill phase, ARKV estimates the original quantization (OQ) ratio of each layer by computing statistical scores such as attention entropy, variance and kurtosis. During decoding, tokens are assigned to one of three states, Original (full precision), Quantization (low precision), or Eviction, according to a fast heavy-hitter scoring strategy. Our experiments on LLaMA3 and Qwen3 models across diverse long- and short-context tasks demonstrate that ARKV preserves ~97% of baseline accuracy on long-context benchmarks while reducing KV memory usage by 4x, with minimal throughput loss. On short-context tasks, ARKV matches full-precision baselines; on GSM8K math reasoning, it significantly outperforms uniform quantization. These results highlight the practical viability of ARKV for scalable LLM deployment, offering fine-grained, data-driven memory control without retraining or architectural modifications. The source code and artifacts can be found in: https://github.com/Large-scale-Sustainable-Computing-LSC/ARKV2026-02-19T16:24:08ZAccepted in ACM/IEEE CCGRID 2025 conferenceJianlong LeiShashikant Ilager