https://arxiv.org/api/Lzio2PqyAj6K9Ndo3BORWvCR9Gs
2026-04-07T21:04:13Z

http://arxiv.org/abs/2603.14630v1
Towards an Adaptive Runtime System for Cloud-Native HPC
Updated: 2026-03-15T22:03:48Z
The ongoing convergence of HPC and cloud computing presents a fundamental challenge: HPC applications, designed for static and homogeneous supercomputers, are ill-suited for the dynamic, heterogeneous, and volatile nature of the cloud. Traditional parallel programming models like MPI struggle to leverage key cloud advantages, such as resource elasticity and low-cost spot instances, while also failing to address challenges like performance variability and processor heterogeneity. This paper demonstrates how the asynchronous, message-driven paradigm of the Charm++ parallel runtime system can bridge this gap. We present a set of tools and strategies that enable HPC applications to run efficiently and resiliently on dynamic cloud infrastructure across both CPU and GPU resources. Our work makes two key contributions. First, we demonstrate that rate-aware load balancing in Charm++ improves performance for applications running on heterogeneous CPU and GPU instances on the cloud. We further demonstrate how core Charm++ principles mitigate performance degradation from common cloud challenges like network contention and processor performance variability, which are exacerbated by the tightly coupled, globally synchronized nature of many science and engineering applications. Second, we extend an existing resource management framework to support GPU and CPU spot instances with minimal interruption overhead.
Together, these contributions provide a robust framework for adapting HPC applications to achieve efficient, resilient, and cost-effective performance on the cloud.
Published: 2026-03-15T22:03:48Z
Authors: Aditya Bhosale, Advait Tahilyani, Laxmikant Kale, Sara Kokkila-Schumacher

http://arxiv.org/abs/2603.14583v1
Machine Learning-Driven Intelligent Memory System Design: From On-Chip Caches to Storage
Updated: 2026-03-15T20:02:05Z
Despite the data-rich environment in which memory systems of modern computing platforms operate, many state-of-the-art architectural policies employed in the memory system rely on static, human-designed heuristics that fail to truly adapt to the workload and system behavior via principled learning methodologies. In this article, we propose a fundamentally different design approach: using lightweight and practical machine learning (ML) methods to enable adaptive, data-driven control throughout the memory hierarchy.
We present three ML-guided architectural policies: (1) Pythia, a reinforcement learning-based data prefetcher for on-chip caches, (2) Hermes, a perceptron learning-based off-chip predictor for multi-level cache hierarchies, and (3) Sibyl, a reinforcement learning-based data placement policy for hybrid storage systems. Our evaluation shows that Pythia, Hermes, and Sibyl significantly outperform the best-prior human-designed policies, while incurring modest hardware overheads. Collectively, this article demonstrates that integrating adaptive learning into memory subsystems can lead to intelligent, self-optimizing architectures that unlock performance and efficiency gains beyond what is possible with traditional human-designed approaches.
Published: 2026-03-15T20:02:05Z
Comment: Extended version of the IEEE Micro 2026 article
Authors: Rahul Bera, Rakesh Nadig, Onur Mutlu
DOI: 10.1109/MM.2026.3667076

http://arxiv.org/abs/2603.14577v1
Covariance-Guided Resource Adaptive Learning for Efficient Edge Inference
Updated: 2026-03-15T19:54:08Z
For deep learning inference on edge devices, hardware configurations achieving the same throughput can differ by 2$\times$ in power consumption, yet operators often struggle to find the efficient ones without exhaustive profiling. Existing approaches often rely on inefficient static presets or require expensive offline profiling that must be repeated for each new model or device. To address this problem, we present CORAL, an online optimization method that discovers near-optimal configurations without offline profiling. CORAL leverages distance covariance to statistically capture the non-linear dependencies between hardware settings, e.g., DVFS and concurrency levels, and performance metrics. Unlike prior work, we explicitly formulate the challenge as a throughput-power co-optimization problem to satisfy power budgets and throughput targets simultaneously. We evaluate CORAL on two NVIDIA Jetson devices across three object detection models ranging from lightweight to heavyweight.
In single-target scenarios, CORAL achieves 96% to 100% of the optimal performance found by exhaustive search. In strict dual-constraint scenarios where baselines fail or exceed power budgets, CORAL consistently finds proper configurations online with minimal exploration.
Published: 2026-03-15T19:54:08Z
Comment: 8 pages, 10 figures
Authors: Ahmad N. L. Nabhaan, Zaki Sukma, Rakandhiya D. Rachmanto, Muhammad Husni Santriaji, Byungjin Cho, Arief Setyanto, In Kee Kim

http://arxiv.org/abs/2603.14445v1
Committee Configuration Optimization for Parallel Byzantine Consensus in a Trusted Execution Environment
Updated: 2026-03-15T15:42:44Z
Parallel Byzantine Fault Tolerant (BFT) protocols based on committee-based sharding improve scalability but weaken safety since smaller node groups are responsible for consensus. Recent approaches integrate trusted execution environments (TEEs) into parallel BFT frameworks to enhance safety. While the scalability and safety issues are addressed by trusted parallel BFT, existing committee configuration methods often rely on randomized assignment, which can degrade performance. This paper proposes a committee configuration optimization (CCO) model based on mixed integer programming to improve transaction performance for trusted parallel BFT. The model considers communication delays and node failure rates to determine an optimal committee configuration that minimizes transaction latency under both normal operations and scenarios of trusted hardware failures. We integrate CCO into a trusted parallel BFT protocol and evaluate the performance on Microsoft virtual machines.
Experimental results demonstrate 15% and 21% higher transaction throughput under normal operation and during the fallback process, respectively, highlighting the benefits of optimization-driven committee configuration in trusted parallel BFT systems.
Published: 2026-03-15T15:42:44Z
Authors: Yifei Xie, Btissam Er-Rahmadi, Xiao Chen, Tiejun Ma, Jane Hillston

http://arxiv.org/abs/2603.14357v1
Idiosyncrasies of Programmable Caching Engines
Updated: 2026-03-15T12:47:06Z
Programmable caching engines like CacheLib are widely used in production systems to support diverse workloads in multi-tenant environments. CacheLib's design focuses on performance, portability, and configurability, allowing applications to inherit caching improvements with minimal implementation effort. However, its behavior under dynamic and evolving workloads remains largely unexplored. This paper presents an empirical study of CacheLib in multi-tenant settings under dynamic and volatile environments. Our evaluation across multiple CacheLib configurations reveals several limitations that hinder its effectiveness in such environments, including rigid configurations, limited runtime adaptability, and a lack of quality-of-service support and coordination, which lead to suboptimal performance, inefficient memory usage, and tenant starvation. Based on these findings, we outline future research directions to improve the adaptability, fairness, and programmability of future caching engines.
Published: 2026-03-15T12:47:06Z
Comment: Paper accepted at the Workshop on Reliable Large-scale Data Management (co-located with IEEE SRDS 2025).
Preliminary version of the paper "Holpaca: Holistic and Adaptable Cache Management for Shared Environments", accepted at 17th ACM/SPEC International Conference on Performance Engineering (ICPE 2026)
Authors: José Peixoto, Alexis Gonzalez, Janki Bhimani, Raju Rangaswami, Cláudia Brito, João Paulo, Ricardo Macedo

http://arxiv.org/abs/2603.14236v1
AeroGen: Agentic Drone Autonomy through Single-Shot Structured Prompting & Drone SDK
Updated: 2026-03-15T06:16:02Z
Designing correct UAV autonomy programs is challenging due to joint navigation, sensing and analytics requirements. While LLMs can generate code, their reliability for safety-critical UAVs remains uncertain. This paper presents AeroGen, an open-loop framework that enables consistently correct single-shot AI-generated drone control programs through structured guardrail prompting and integration with the AeroDaaS drone SDK. AeroGen encodes API descriptions, flight constraints and operational world rules directly into the system context prompt, enabling generic LLMs to produce constraint-aware code from user prompts, with minimal example code. We evaluate AeroGen across a diverse benchmark of 20 navigation tasks and 5 drone missions in urban, farm and inspection environments, using both imperative and declarative user prompts. AeroGen generates about 40 lines of AeroDaaS Python code in about 20s per mission, in both real-world and simulated settings, showing that structured prompting with a well-defined SDK improves robustness, correctness and deployability of LLM-generated drone autonomy programs.
Published: 2026-03-15T06:16:02Z
Authors: Kautuk Astu, Yogesh Simmhan

http://arxiv.org/abs/2504.18658v2
The Big Send-off: Scalable and Performant Collectives for Deep Learning
Updated: 2026-03-15T03:52:28Z
Collective communication is becoming increasingly important in data center and supercomputer workloads with an increase in distributed AI related jobs.
However, existing libraries that provide collective support such as NCCL, RCCL, and Cray-MPICH exhibit several performance and scalability limitations on modern GPU supercomputers. To address these challenges, we introduce the Performant Collective Communication Library (PCCL), specifically targeted for distributed deep learning (DL) workloads. PCCL provides highly optimized implementations of key collectives used in distributed DL: all-gather, reduce-scatter, and all-reduce. PCCL uses a hierarchical design with learning-based adaptive selection of the best performing algorithms to scale efficiently to thousands of GPUs. It achieves substantial performance speedups over RCCL on 2048 GCDs of Frontier -- up to 168x for reduce-scatter, 33x for all-gather and 10x for all-reduce. More modest but still significant gains up to 5.7x over NCCL are observed on Perlmutter. These gains translate directly to performance improvement of production DL workloads: up to 4.9x speedup over RCCL in DeepSpeed ZeRO-3 training, and up to 2.4x speedup in DDP training.
Published: 2025-04-25T19:23:46Z
Authors: Siddharth Singh, Keshav Pradeep, Mahua Singh, Cunyang Wei, Abhinav Bhatele

http://arxiv.org/abs/2604.03265v1
On the First Computer Science Research Paper in an Indian Language and the Future of Science in Indian Languages
Updated: 2026-03-14T20:33:49Z
I describe my experience writing the first original, modern Computer Science research paper expressed entirely in an Indian language. The paper is in Telugu, a language with approximately 100 million speakers. The paper is in the field of distributed computing and it introduces a technique for proving epistemic logic based lower bounds for multiprocessor algorithms. A key hurdle to writing the paper was developing technical terminology for advanced computer science concepts, including those in algorithms, distributed computing, and discrete mathematics.
I overcame this challenge by deriving and coining native language scientific terminology through the powerful, productive, Pāninian grammar of Samskrtam. The typesetting of the paper was an additional challenge, since mathematical typesetting in Telugu is underdeveloped. I overcame this problem by developing a Telugu XeLaTeX template, which I call TeluguTeX. Leveraging this experience of writing an original computer science research paper in an Indian language, I lay out a vision for how to ameliorate the state of scientific writing at all levels in Indic languages -- languages whose native speakers exceed one billion people -- through the further development of the Sanskrit technical lexicon and through technological internationalization.
Published: 2026-03-14T20:33:49Z
Comment: 15 pages, some text in Telugu
Authors: Siddhartha Visveswara Jayanti

http://arxiv.org/abs/2603.13945v1
A Case for CATS: A Conductor-driven Asymmetric Transport Scheme for Semantic Prioritization
Updated: 2026-03-14T13:36:15Z
Standard transport protocols like TCP operate as a blind, FIFO conveyor belt for data, a model that is increasingly suboptimal for latency-sensitive and interactive applications. This paper challenges this model by introducing CATS (Conductor-driven Asymmetric Transport Scheme), a framework that provides TCP with the semantic awareness necessary to prioritize critical content. By centralizing scheduling intelligence in a transport-native "Conductor", CATS significantly improves user-perceived performance by delivering essential data first. This architecture directly confronts a cascade of historical performance workarounds and their limitations, including the high overhead of parallel connections in HTTP/1.1, the transport-layer Head-of-Line blocking in HTTP/2, and the observed implementation heterogeneity of prioritization in HTTP/3 over QUIC.
Built upon TCP BBR, our ns-3 implementation demonstrates this principle by reducing the First Contentful Paint by over 78% in a representative webpage download configured as a deliberate worst-case scenario, with no penalty to total page load time compared to the baseline.
Published: 2026-03-14T13:36:15Z
Comment: 2025 6th International Conference on Innovative Computing (ICIC)
Authors: Syed Muhammad Aqdas Rizvi
DOI: 10.1109/ICIC68258.2025.11413235

http://arxiv.org/abs/2510.17015v2
Justitia: Fair and Efficient Scheduling of Task-parallel LLM Agents with Selective Pampering
Updated: 2026-03-14T10:58:56Z
LLM agents, which often comprise parallel inference tasks, are commonly adopted to solve real-world problems. When serving such task-parallel LLM agents in shared GPU servers, the scheduler is expected to attain fast agent completion with guaranteed worst-case performance. For that objective, our insight is to selectively pamper agents based on their completion order under idealized fair-sharing. We design Justitia, a fair and also efficient scheduler for task-parallel LLM agents. Noticing that memory is prevalently a bottleneck in LLM serving, Justitia quantifies the true agent cost in a memory-centric manner. It also adopts a light-weight yet accurate method to predict agent costs. Finally, Justitia adopts a virtual-time based fair queuing algorithm to improve overall performance with a guaranteed worst-case delay.
We have implemented Justitia atop vLLM, and experimental results involving diverse agents show that it can substantially enhance the scheduling efficiency with fairness preserved.
Published: 2025-10-19T21:34:34Z
Authors: Mingyan Yang, Guanjie Wang, Manqi Luo, Yifei Liu, Chen Chen, Han Zhao, Yu Feng, Quan Chen, Minyi Guo

http://arxiv.org/abs/2512.20163v3
Population Protocols Revisited: Parity and Beyond
Updated: 2026-03-14T08:20:32Z
For nearly two decades, population protocols have been extensively studied, yielding efficient solutions for central problems in distributed computing, including leader election and majority computation, a predicate type in Presburger Arithmetic closely tied to population protocols. Surprisingly, no protocols have achieved both time- and space-efficiency for congruency predicates, such as parity computation, which are complementary in this arithmetic framework. This gap highlights a significant challenge in the field. To address this gap, we explore the parity problem, where agents are tasked with computing the parity of the given sub-population size. Then we extend the solution for parity to compute congruences modulo an arbitrary $m$.
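For intuition about the parity and congruence problem stated above, the following is a minimal sketch of a naive token-merging baseline, not the efficient protocol this abstract describes. The function name `congruence_mod_m`, the uniformly random pairwise scheduler, and the `seed` parameter are assumptions made for illustration only.

```python
import random


def congruence_mod_m(inputs, m, seed=0):
    """Simulate a naive token-merging population protocol (illustrative only).

    inputs: list of 0/1 flags, one per agent; 1 marks membership in the
            sub-population whose size is being counted.
    Returns the number of flagged agents modulo m.
    """
    rng = random.Random(seed)
    n = len(inputs)
    # Each agent's state is (active, value); flagged agents start active with 1.
    active = [bool(x) for x in inputs]
    value = [x % m for x in inputs]
    if n < 2:
        return sum(value) % m
    while sum(active) > 1:
        # Assumed scheduler: a uniformly random ordered pair of distinct agents.
        a, b = rng.sample(range(n), 2)
        if active[a] and active[b]:
            # Merge: the initiator keeps the sum mod m, the responder goes passive.
            value[a] = (value[a] + value[b]) % m
            value[b] = 0
            active[b] = False
    # At most one active agent remains and it holds the total residue.
    return sum(v for v, act in zip(value, active) if act) % m


print(congruence_mod_m([1] * 7 + [0] * 3, m=2))  # parity of 7 flagged agents
```

This baseline needs only a constant number of states per agent for fixed $m$, but it is slow: once two tokens remain, the scheduler needs on the order of $n^2$ interactions in expectation to pair them. The sketch also omits broadcasting the result to the passive agents.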
Previous research on efficient population protocols has focused on protocols that minimise both stabilisation time and state utilisation for specific problems. In contrast, this work slightly relaxes this expectation, permitting protocols to place less emphasis on full optimisation and more on universality, robustness, and probabilistic guarantees. This allows us to propose a novel computing paradigm that integrates population weights (or simply weights), a robust clocking mechanism, and efficient anomaly detection coupled with a switching mechanism (which ensures slow but always correct solutions). This paradigm facilitates universal design of efficient multistage stable population protocols. Specifically, the first efficient parity and congruence protocols introduced here both use $O(\log^3 n)$ states and achieve silent stabilisation in $O(\log^3 n)$ time. We conclude by discussing the impact of implicit conversion between unary and binary representations enabled by the weight system, with applications to other problems, including the computation and representation of (sub-)population sizes.
Published: 2025-12-23T08:41:10Z
Authors: Leszek Gąsieniec, Tytus Grodzicki, Tomasz Jurdziński, Jakub Kowalski, Grzegorz Stachowiak

http://arxiv.org/abs/2603.13750v1
The Forward-In-Time-Only Assumption in SmartNIC Resource Management: A Critique of Wave and the Case for Bilateral Interaction
Updated: 2026-03-14T04:51:06Z
The datacenter industry is converging on SmartNIC-based resource management. Wave (Humphries et al., ASPLOS '25) demonstrates the practical feasibility of offloading kernel thread scheduling, memory management, and RPC stacks to the ARM cores of Intel's Mount Evans Infrastructure Processing Unit (IPU). The engineering is careful and the results are honest: without Wave's PCIe latency mitigations, offloaded workloads degrade by 350%.
We argue that this 350% degradation is not an engineering problem to be optimized away but a diagnostic symptom of a deeper architectural issue: Wave's communication model is Forward-In-Time-Only (FITO). Every interaction between host and SmartNIC is a unidirectional message -- event forward, decision back -- creating a temporal vulnerability window in which decisions can become stale before they are enforced. Wave's entire optimization stack (write-combining page table entries, prestaging, prefetching, atomic transaction abort) exists to hide or tolerate this window.
We apply the FITO diagnostic to Wave's architecture systematically, identify the category mistake it inherits from Lamport's happened-before and Shannon's channel model, and show how Open Atomic Ethernet's bilateral swap primitive -- implemented on the same Intel IPU hardware -- dissolves the latency, atomicity, and timeout problems without engineering around them. The SmartNIC is the right location for resource management; what is missing is the right communication primitive at that location.
Published: 2026-03-14T04:51:06Z
Comment: 14 pages, 19 references. Part of the Category Mistake Series
Authors: Paul Borrill

http://arxiv.org/abs/2603.13738v1
The Markovianity of Time: The Category Mistake in Open Quantum Systems
Updated: 2026-03-14T03:50:26Z
The Markov approximation is arguably the most ubiquitous tool in physics, underpinning quantum master equations, stochastic processes, and -- via Shannon's channel model and Lamport's logical clocks -- the foundational assumptions of distributed computing. It is widely assumed that Markovianity inherently implies temporal asymmetry: that the Markov property is a forward-in-time-only (FITO) construct. We show that this assumption is a category mistake in the sense of Ryle (1949).
Guff, Shastry, and Rocco (2025) have recently demonstrated that the Markov approximation applied to the Caldeira-Leggett model -- a paradigmatic open quantum system -- maintains time-reversal symmetry in the derived equations of motion. The resulting time-symmetric formulations of quantum Brownian motion, Lindblad master equations, and Pauli master equations describe thermalisation that can occur in two opposing temporal directions. Asymmetry arises not from the dynamics but from boundary conditions.
We trace how Markovianity's assumed directionality propagated from physics through Shannon's information theory to Lamport's happens-before relation and the impossibility theorems of distributed computing (FLP, CAP, Two Generals). Each step encodes FITO as convention, then treats it as physical law -- the same category mistake repeated across domains. The Surrey result establishes that this conflation is not merely philosophically suspect but mathematically unnecessary: the most fundamental approximation used to derive irreversibility is itself time-symmetric.
Published: 2026-03-14T03:50:26Z
Comment: 12 pages
Authors: Paul Borrill

http://arxiv.org/abs/2504.09844v3
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
Updated: 2026-03-14T02:46:46Z
Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of training data. When preparing data for LFM training that originates from multiple, distinct sources, two fundamental challenges arise. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to significant workload imbalance among dataloaders, degrading the training efficiency. Second, supporting diverse data sources requires per-dataset file access states that are redundantly replicated across parallel loaders, consuming excessive memory. This also hinders dynamic data mixing (e.g., curriculum learning) and causes redundant access/memory overhead in hybrid parallelism.
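The first challenge above can be made concrete with a small worked example. This is only a sketch under a stated assumption: per-sample attention cost is modeled as the square of the sequence length, and the helper names and sequence lengths are hypothetical, not taken from the paper.

```python
def attention_cost(seq_lens):
    """Relative attention cost of a batch, assuming cost grows as length squared."""
    return sum(n * n for n in seq_lens)


def imbalance(per_rank_batches):
    """Ratio of the slowest rank's cost to the mean cost (1.0 means balanced)."""
    costs = [attention_cost(batch) for batch in per_rank_batches]
    return max(costs) / (sum(costs) / len(costs))


# Two data-parallel ranks, each holding 8192 tokens in total (hypothetical numbers).
rank0 = [8192]      # one long sample
rank1 = [1024] * 8  # eight short samples

print(attention_cost(rank0) // attention_cost(rank1))  # 8
print(round(imbalance([rank0, rank1]), 2))              # 1.78
```

Although both ranks hold the same number of tokens, the rank with the single long sample does eight times the attention work under this model, so balancing by token count alone leaves the other rank idle for most of the step.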
We present MegaScale-Data, an industrial-grade distributed data loading architecture for multisource LFM training, with three key innovations: (1) Disaggregated data preprocessing via role-specific actors (Source Loaders/Data Constructors) to eliminate source- and parallelism-related redundant data access and ensure multisource scalability. (2) Centralized and declarative data plane for load-time multisource orchestration, such as long-short context, multimodality, and curriculum learning. (3) Multi-level auto-partitioning and scaling mechanism for source loaders under heterogeneous preprocessing costs. We also contribute our designs and operational experience in deployment and fault tolerance. MegaScale-Data achieves up to (1) a 4.5x improvement in end-to-end training throughput and (2) a 13.5x reduction in CPU memory usage.
Published: 2025-04-14T03:31:22Z
Authors: Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, Junda Feng, Jianyu Jiang, Yangrui Chen, Shuaishuai Cao, Jialing He, Kaihua Jiang, Yuanzhe Hu, Shibiao Nong, Yanghua Peng, Haibin Lin, Chuan Wu

http://arxiv.org/abs/2603.13668v1
Audo-Sight: AI-driven Ambient Perception Across Edge-Cloud for Blind and Low Vision Users
Updated: 2026-03-14T00:30:04Z
Despite advances in assistive technologies, Blind and Low-Vision (BLV) individuals continue to face challenges in understanding their surroundings. Delivering concise, useful, and timely scene descriptions for ambient perception remains a long-standing accessibility problem. To address this, we introduce Audo-Sight, an AI-driven assistive system across Edge-Cloud that enables BLV individuals to perceive their surroundings through voice-based conversational interaction. Audo-Sight employs a set of expert and generic AI agents, each supported by dedicated processing pipelines distributed across edge and cloud. It analyzes user queries by considering urgency and contextual information to infer the user intent and dynamically route each query, along with a scene frame, to the most suitable pipeline.
In cases where users require fast responses, the system simultaneously leverages edge and cloud processing pipelines. The edge generates an initial response quickly, while the cloud provides more detailed and accurate information. To overcome the challenge of seamlessly combining these outputs, we introduce the Response Fusion Engine, which fuses the fast edge response with the more accurate cloud output, ensuring timely, high-accuracy responses for BLV users. Systematic evaluation shows that Audo-Sight delivers speech output around 80% faster for urgent tasks and generates complete responses approximately 50% faster across all tasks compared to a commercial cloud-based solution -- highlighting the effectiveness of our system across edge-cloud. Human evaluation of Audo-Sight shows that it is the preferred choice over GPT-5 for 62% of BLV participants, with another 23% stating both perform comparably.
Published: 2026-03-14T00:30:04Z
Authors: Jacob Bradshaw, Mohsen Riahi Alam, Bhanuja Ainary, Minseo Kim, Mohsen Amini Salehi