https://arxiv.org/api/zGB9Yp7ogOThg9MwAZDmIvbOz9M2026-04-07T09:59:48Z2791321015http://arxiv.org/abs/2603.22691v1Rank-Aware Resource Scheduling for Tightly-Coupled MPI Workloads on Kubernetes2026-03-24T01:30:59ZFully provisioned Message Passing Interface (MPI) parallelism achieves near-optimal wall-clock time for Computational Fluid Dynamics (CFD) solvers. This work addresses a complementary question for shared, cloud-managed clusters: can fine-grained CPU provisioning reduce resource reservation of low-load subdomains, improving cluster packing efficiency without unacceptably degrading performance?
We propose rank-aware resource scheduling on Kubernetes, mapping each MPI rank to a pod whose CPU request is proportional to its subdomain cell count. We also demonstrate In-Place Pod Vertical Scaling (Kubernetes v1.35 GA) for mid-simulation CPU adjustment without pod restart.
Three findings emerge. First, hard CPU limits via the Linux CFS bandwidth controller cause 78x slowdown through cascading stalls at MPI_Allreduce barriers; requests-only allocation eliminates throttling entirely. Second, on non-burstable c5.xlarge instances, concentric decomposition with equal CPU is 19% faster than the Scotch baseline, while adding proportional CPU yields a further 3% improvement. Third, at 16 MPI ranks on 101K-cell meshes, proportional allocation is 20% faster than equal allocation while reducing sparse-subdomain provisioned CPU by 82%, freeing 6.5 vCPU of scheduling headroom.
Experiments are conducted on AWS EC2 c5.xlarge clusters (4-16 ranks) running k3s v1.35. All scripts and data are released as open source.2026-03-24T01:30:59Z22 pages, 10 figures, 7 tables. Submitted to Journal of Cloud ComputingTianfang Xiehttp://arxiv.org/abs/2603.22542v1Interactive and Urgent HPC: State of the Research2026-03-23T20:01:32ZWhen we think of how we use smartphones, e-commerce, collaboration platforms, LLMs, etc., most of our interactions with computers are interactive and often urgent. Similar trends of interactivity and urgency are coming to HPC, with applications from simulations to data analysis and machine learning requiring more parallel computational capability and more interactivity. This chapter overviews the progress made so far along with some vectors of what the path forward will bring for greater integration of interactive and urgent HPC policies, techniques, and technologies into our HPC ecosystems.2026-03-23T20:01:32Z32 pages, 3 figuresAlbert ReutherWilliam ArndtJohannes BlaschkeChristian BoehmeNick BrownAntony ChazapisBjoern EndersJens Henrik GoebbertRobert HenschelJulian KunkelMaxime MartinassoMichael RingenburgRollin Thomashttp://arxiv.org/abs/2603.22514v1Communication-Efficient Approximate Gradient Coding2026-03-23T19:23:23ZLarge-scale distributed learning aims at minimizing a loss function $L$ that depends on a training dataset with respect to a $d$-length parameter vector. The distributed cluster typically consists of a parameter server (PS) and multiple workers. Gradient coding is a technique that makes the learning process resilient to straggling workers. It introduces redundancy within the assignment of data points to the workers and uses coding theoretic ideas so that the PS can recover $\nabla L$ exactly or approximately, even in the presence of stragglers. Communication-efficient gradient coding allows the workers to communicate vectors of length smaller than $d$ to the PS, thus reducing the communication time. While there have been schemes that address the exact recovery of $\nabla L$ within communication-efficient gradient coding, to the best of our knowledge the approximate variant has not been considered in a systematic manner. In this work we present constructions of communication-efficient approximate gradient coding schemes. Our schemes use structured matrices that arise from bipartite graphs, combinatorial designs and strongly regular graphs, along with randomization and algebraic constraints. We derive analytical upper bounds on the approximation error of our schemes that are tight in certain cases. Moreover, we derive a corresponding worst-case lower bound on the approximation error of any scheme. For a large class of our methods, under reasonable probabilistic worker failure models, we show that the expected value of the computed gradient equals the true gradient. This in turn allows us to prove that the learning algorithm converges to a stationary point over the iterations. Numerical experiments corroborate our theoretical findings.2026-03-23T19:23:23ZSubmitted to IEEE Transactions on Information Theory. This paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), Ann Arbor, MI, USA, 2025Sifat MunimAditya Ramamoorthyhttp://arxiv.org/abs/2603.22495v1Linux and High-Performance Computing2026-03-23T19:00:30ZIn the 1980s, high-performance computing (HPC) became another tool for research in the open (non-defense) science and engineering research communities. However, HPC came with a high price tag; the first Cray-2 machines, released in 1985, cost between \$12 million and \$17 million, according to the Computer History Museum, and were largely available only at government research labs or through national supercomputing centers. In the 1990s, with demand for HPC increasing due to vast datasets, more complex modeling, and the growing computational needs of scientific applications, researchers began experimenting with building HPC machines from clusters of servers running the Linux operating system. By the late 1990s, two approaches to Linux-based parallel computing had emerged: the personal computer cluster methodology that became known as Beowulf and the Roadrunner architecture aimed at a more cost-effective supercomputer. While Beowulf attracted attention because of its low cost and thereby greater accessibility, Roadrunner took a different approach. While still affordable compared to vector processors and other commercially available supercomputers, Roadrunner integrated its commodity components with specialized networking technology. Furthermore, these systems initially served different purposes. While Beowulf focused on providing affordable parallel workstations for individual researchers at NASA, Roadrunner set out to provide a multi-user system that could compete with the commercial supercomputers that dominated the market at the time. This paper analyzes the technical decisions, performance implications, and long-term influence of both approaches. Through this analysis, we can start to judge the impact of both Roadrunner and Beowulf on the development of Linux-based supercomputers.2026-03-23T19:00:30Z18 pagesDavid A. Baderhttp://arxiv.org/abs/2603.22465v1A Theoretical Framework for Energy-Aware Gradient Pruning in Federated Learning2026-03-23T18:31:07ZFederated Learning (FL) is constrained by the communication and energy limitations of decentralized edge devices. While gradient sparsification via Top-K magnitude pruning effectively reduces the communication payload, it remains inherently energy-agnostic. It assumes all parameter updates incur identical downstream transmission and memory-update costs, ignoring hardware realities. We formalize the pruning process as an energy-constrained projection problem that accounts for the hardware-level disparities between memory-intensive and compute-efficient operations during the post-backpropagation phase. We propose Cost-Weighted Magnitude Pruning (CWMP), a selection rule that prioritizes parameter updates based on their magnitude relative to their physical cost. We demonstrate that CWMP is the optimal greedy solution to this constrained projection and provide a probabilistic analysis of its global energy efficiency. Numerical results on a non-IID CIFAR-10 benchmark show that CWMP consistently establishes a superior performance-energy Pareto frontier compared to the Top-K baseline.2026-03-23T18:31:07Z8 pages, 2 figures. This work has been submitted to the IEEE for possible publicationEmmanouil M. Athanasakoshttp://arxiv.org/abs/2603.22251v1exaCB: Reproducible Continuous Benchmark Collections at Scale Leveraging an Incremental Approach2026-03-23T17:44:49ZThe increasing heterogeneity of high-performance computing (HPC) systems and the transition to exascale architectures require systematic and reproducible performance evaluation across diverse workloads. While continuous integration (CI) ensures functional correctness in software engineering, performance and energy efficiency in HPC are typically evaluated outside CI workflows, motivating continuous benchmarking (CB) as a complementary approach. Integrating benchmarking into CI workflows enables reproducible evaluation, early detection of regressions, and continuous validation throughout the software development lifecycle.
We present exaCB, a framework for continuous benchmarking developed in the context of the JUPITER exascale system. exaCB enables application teams to integrate benchmarking into their workflows while supporting large-scale, system-wide studies through reusable CI/CD components, established harnesses, and a shared reporting protocol. The framework supports incremental adoption, allowing benchmarks to be onboarded easily and to evolve from basic runnability to more advanced instrumentation and reproducibility. The approach is demonstrated in JUREAP, the early-access program for JUPITER, where exaCB enabled continuous benchmarking of over 70 applications at varying maturity levels, supporting cross-application analysis, performance tracking, and energy-aware studies. These results illustrate the practicality using exaCB for continuous benchmarking for exascale HPC systems across large, diverse collections of scientific applications.2026-03-23T17:44:49ZJayesh BadwaikMathis BodeMichal RajskiAndreas Hertenhttp://arxiv.org/abs/2603.26758v1A Density-Delay Law for Stable Event-Driven State Progression in Open Distributed Systems2026-03-23T15:50:21ZDistributed systems in which concurrent proposals are mutually exclusive face a fundamental stability constraint under network delay. In open systems where global state progression is event-driven rather than round-driven, propagation delay creates a conflict window within which overlapping proposals may generate competing branches. This paper derives a density-delay law for such exclusive state progression processes. Under independent proposal arrivals and bounded propagation delay, overlap is approximated by a Poisson model and fork depth is represented by a birth-death process. The analysis shows that maintaining bounded fork depth as the number of participants grows requires the density-delay product $λΔ$ to remain $O(1)$, implying that aggregate proposal intensity must stay bounded and yielding an inverse-scaling law $g(N)=O(1/N)$ at the unit level. Simulation experiments across varying network sizes and propagation delays align with a common density-delay curve, supporting the predicted scaling behavior. The result provides a compact law for stable event-driven state progression in open distributed systems and offers a scaling-based interpretation of Bitcoin-style difficulty adjustment as a decentralized way to regulate effective event density.2026-03-23T15:50:21ZBin ChenDechuang Huanghttp://arxiv.org/abs/2603.09568v2Case Study: Performance Analysis of a Virtualized XRootD Frontend in Large-Scale WAN Transfers2026-03-23T13:48:29ZThis paper presents a detailed case study of the T2_BR_SPRACE storage frontend architecture and its observed performance in high-intensity data transfers. The architecture is composed of a heterogeneous cluster of XRootD [1] Virtual Machines (VMs) with 10 Gb/s and 40 Gb/s links, which aggregate data from a 77 Gb/s dCache [2] backend via pNFS to an external 100 Gb/s WAN link. We describe the system configuration, including the use of the BBR [3] congestion control algorithm and TCP extensions [4]. Under peak production conditions, we observed the system sustaining an aggregate throughput of 51.3 Gb/s. An analysis of a specific data flow to Fermilab (FNAL) showed peaks of 41.5 Gb/s, validated by external monitoring tools (CERN). This study documents the performance of a complex virtualized architecture under real load.2026-03-10T12:13:01ZJ M da SilvaM A CostaR L Iopehttp://arxiv.org/abs/2602.15510v3On the Geometric Coherence of Global Aggregation in Federated Graph Neural Networks2026-03-23T08:53:04ZFederated learning over graph-structured data exposes a fundamental mismatch between standard aggregation mechanisms and the operator nature of graph neural networks (GNNs). While federated optimization treats model parameters as elements of a shared Euclidean space, GNN parameters induce graph-dependent message-passing operators whose semantics depend on underlying topology. Under structurally and distributionally heterogeneous client graph distributions, local updates correspond to perturbations of distinct operator manifolds. Linear aggregation of such updates mixes geometrically incompatible directions, producing global models that converge numerically yet exhibit degraded relational behavior. We formalize this phenomenon as a geometric failure of global aggregation in cross-domain federated GNNs, characterized by destructive interference between operator perturbations and loss of coherence in message-passing dynamics. This degradation is not captured by conventional metrics such as loss or accuracy, as models may retain predictive performance while losing structural sensitivity. To address this, we propose GGRS (Global Geometric Reference Structure), a server-side aggregation framework operating on a data-free proxy of operator perturbations. GGRS enforces geometric admissibility via directional alignment, subspace compatibility, and sensitivity control, preserving the structure of the induced message-passing operator.2026-02-17T11:34:04ZThis is a developing preprint of an 18-page journal manuscript (6 figures), currently being prepared for formal peer-review submissionChethana Prasad KabgereShylaja SShttp://arxiv.org/abs/2603.21692v1Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces2026-03-23T08:27:54ZAs AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing infrastructure requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance; observability platforms provide execution traces for debugging; telemetry standards ensure interoperability. What current systems do not natively provide as a first-class, schema-level primitive is structured reasoning provenance -- normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and outline an evaluation methodology informed by preliminary deployment on a production platformized root cause analysis agent.2026-03-23T08:27:54Z8 pages, 2 tables, preprintNeelmani Visputehttp://arxiv.org/abs/2602.22350v3Engineered Simultaneity: The Physical Impossibility of Consolidated Price Discovery Across Spacelike-Separated Exchanges2026-03-23T07:24:51ZWe define \emph{engineered simultaneity}: the construction of a system that requires temporal comparison of events at spacelike-separated locations, implements this comparison via an implicit simultaneity convention, and represents the result as an objective measurement rather than a conventional choice. We show that the National Best Bid and Offer (NBBO) -- the regulatory cornerstone of U.S. equity markets -- is an instance of engineered simultaneity. The NBBO requires determining ``current'' prices across exchanges whose spatial separation places their price events outside each other's light cones. Special relativity proves that the temporal ordering of such events is frame-dependent: there exist inertial reference frames in which the NBBO differs from the value reported by the Securities Information Processor. The impossibility is not approximate; it is exact and unavoidable within the causal structure of Minkowski spacetime. General relativity compounds the impossibility: gravitational time dilation introduces frame-rate discrepancies between exchanges at different altitudes, and recent work on indefinite causal order in quantum information theory undermines the premise of a fixed causal structure altogether. We formalize the special-relativistic argument using the causal precedence relation, connect it to Lamport's theorem on distributed ordering, and note that approximately \$5~billion per year in latency arbitrage profits are extracted from the gap between the NBBO's implicit simultaneity convention and physical reality.2026-02-25T19:14:52Z9 pages, 2 figures, 2 tablesPaul Borrillhttp://arxiv.org/abs/2504.14145v2DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline2026-03-23T07:10:52ZLarge multimodal models (LMMs) have demonstrated excellent capabilities in both understanding and generation tasks with various modalities. While these models can accept flexible combinations of input data, their training efficiency suffers from two major issues: pipeline stage imbalance caused by heterogeneous model architectures, and training data dynamicity stemming from the diversity of multimodal data.
In this paper, we present DIP, a dynamic and modality-aware pipeline scheduling framework designed for LMM training. DIP tackles the challenge of dynamic imbalance via two key techniques: (1) separating computations of different modalities into dedicated pipeline segments to balance workloads within a continuous set of stages; (2) dynamically splitting input data into finer-grained, modality-specific sub-microbatches to balance workloads across these segments. By asynchronously generating pipeline schedules on idle CPU resources during training, DIP dynamically tailors stage executions to each input batch without stalling the training process. We validate DIP on a diverse set of five LMMs, ranging from 12B to 94B parameters and including vision-language and diffusion models. Experimental results show that our system achieves up to 97.3% higher throughput compared to state-of-the-art systems, demonstrating strong adaptability to fluctuating multimodal training workloads.2025-04-19T02:30:11ZTo be published in ASPLOS'26Zhenliang XueHanpeng HuXing ChenYimin JiangYixin SongZeyu MiYibo ZhuDaxin JiangYubin XiaHaibo Chen10.1145/3779212.3790154http://arxiv.org/abs/2603.21600v1Benchmarking Message Brokers for IoT Edge Computing: A Comprehensive Performance Study2026-03-23T05:49:19ZAsynchronous messaging is a cornerstone of modern distributed systems, enabling decoupled communication for scalable and resilient applications. Today's message queue (MQ) ecosystem spans a wide range of designs, from high-throughput streaming platforms to lightweight protocols tailored for edge and IoT environments. Despite this diversity, choosing an appropriate MQ system remains difficult. Existing evaluations largely focus on throughput and latency on fixed hardware, while overlooking CPU and memory footprint and the effects of resource constraints, factors that are critical for edge and IoT deployments. In this paper, we present a systematic performance study of eight prominent message brokers: Mosquitto, EMQX, HiveMQ, RabbitMQ, ActiveMQ Artemis, NATS Server, Redis (Pub/Sub), and Zenoh Router. We introduce mq-bench, a unified benchmarking framework to evaluate these systems under identical conditions, scaling up to 10,000 concurrent client pairs across three VM configurations representative of edge hardware. This study reveals several interesting and sometimes counter-intuitive insights. Lightweight native brokers achieve sub-millisecond latency, while feature-rich enterprise platforms incur 2-3X higher overhead. Under high connection loads, multi-threaded brokers like NATS and Zenoh scale efficiently, whereas the widely-deployed Mosquitto saturates earlier due to its single-threaded architecture. We also find that Java-based brokers consume significantly more memory than native implementations, which has important implications for memory-constrained edge deployments. Based on these findings, we provide practical deployment guidelines that map workload requirements and resource constraints to appropriate broker choices for telemetry, streaming analytics, and IoT use cases.2026-03-23T05:49:19ZAccepted at IEEE/ACM CCGrid 2026Tapajit Chandra PaulPawissanutt LertpongrujikornHai Duc NguyenMohsen Amini Salehihttp://arxiv.org/abs/2602.23036v2LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure2026-03-23T05:25:20ZLarge language model (LLM) serving infrastructures are undergoing a shift toward heterogeneity and disaggregation. Modern deployments increasingly integrate diverse accelerators and near-memory processing technologies, introducing significant hardware heterogeneity, while system software increasingly separates computation, memory, and model components across distributed resources to improve scalability and efficiency. As a result, LLM serving performance is no longer determined by hardware or software choices in isolation, but by their runtime interaction through scheduling, data movement, and interconnect behavior. However, understanding these interactions remains challenging, as existing simulators lack the ability to jointly model heterogeneous hardware and disaggregated serving techniques within a unified, runtime-driven framework.
This paper presents LLMServingSim 2.0, a unified system-level simulator designed to make runtime-driven hardware-software interactions in heterogeneous and disaggregated LLM serving infrastructures explicit and analyzable. LLMServingSim 2.0 embeds serving decisions and hardware behavior into a single runtime loop, enabling interaction-aware modeling of batching, routing, offloading, memory, and power. The simulator supports extensible integration of emerging accelerators and memory systems through profile-based modeling, while capturing dynamic serving behavior and system-level effects. We validate LLMServingSim 2.0 against real deployments, showing that it reproduces key performance, memory, and power metrics with an average error of 0.95%, while maintaining simulation times of around 10 minutes even for complex configurations. These results demonstrate that LLMServingSim 2.0 provides a practical bridge between hardware innovation and serving-system design, enabling systematic exploration and co-design for next-generation LLM serving infrastructures.2026-02-26T14:22:17Z14 pages, 11 figuresJaehong ChoHyunmin ChoiGuseul HeoJongse Parkhttp://arxiv.org/abs/2508.00596v4Information-Theoretic Decentralized Secure Aggregation with Passive Collusion Resilience2026-03-22T20:40:44ZIn decentralized federated learning (FL), multiple clients collaboratively learn a shared machine learning (ML) model by leveraging their privately held datasets distributed across the network, through interactive exchange of the intermediate model updates. To ensure data security, cryptographic techniques are commonly employed to protect model updates during aggregation. Despite growing interest in secure aggregation, existing works predominantly focus on protocol design and computational guarantees, with limited understanding of the fundamental information-theoretic limits of such systems. Moreover, optimal bounds on communication and key usage remain unknown in decentralized settings, where no central aggregator is available. Motivated by these gaps, we study the problem of decentralized secure aggregation (DSA) from an information-theoretic perspective. Specifically, we consider a network of $K$ fully-connected users, each holding a private input -- an abstraction of local training data -- who aim to securely compute the sum of all inputs. The security constraint requires that no user learns anything beyond the input sum, even when colluding with up to $T$ other users. We characterize the optimal rate region, which specifies the minimum achievable communication and secret key rates for DSA. In particular, we show that to securely compute one symbol of the desired input sum, each user must (i) transmit at least one symbol to others, (ii) hold at least one symbol of secret key, and (iii) all users must collectively hold no fewer than $K - 1$ independent key symbols. Our results establish the fundamental performance limits of DSA, providing insights for the design of provably secure and communication-efficient protocols in decentralized learning.2025-08-01T12:51:37ZAccepted by IEEE JSACXiang ZhangZhou LiShuangyang LiKai WanDerrick Wing Kwan NgGiuseppe Caire