https://arxiv.org/api/tfpLKlpwjYwHQppPpvzBQi7u0GU 2026-06-10T14:37:53Z 28838 255 15 http://arxiv.org/abs/2605.28095v1 SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference 2026-05-27T07:52:03Z The rapid adoption of large language models (LLMs) has shifted a substantial portion of inference workloads into throughput-oriented offline regimes, where fully utilizing GPU compute requires large batch sizes. However, existing deployments face a structural tension. Data parallelism (DP) scales throughput well but replicates model weights, leaving limited GPU memory for key-value (KV) cache and constraining batch size. Model parallelism reduces per-device weights, but requires fine-grained synchronization that erodes DP's independence and scheduling flexibility. We present SiDP, a memory-efficient data-parallel paradigm for offline LLM inference that treats weights as a bandwidth-backed shared resource inside a DP group. Instead of storing the full model on every GPU, SiDP organizes weights as a distributed pool: each layer is owned by a single GPU, and other replicas access its weights on demand via two complementary execution modes: a Weight-as-a-Service (WaS) mode that streams remote weights over NVLink into a small cache in the large-batch regime, and a Compute-as-a-Service (CaS) mode that ships activations to owners in the small-batch tail. Evaluated on NVIDIA H20, H200, and B200 GPUs with Qwen3-32B, Qwen2.5-72B, and Llama-3.1-70B, SiDP increases usable KV capacity by up to 1.8x under the same configurations, and converts this into up to 1.5x higher end-to-end throughput over baselines (vLLM) for offline workloads. 2026-05-27T07:52:03Z Alan Zhao Cyril Y. He http://arxiv.org/abs/2604.16457v2 Ding-Dong Ditch: Peeking Into Spot Instance Availability 2026-05-27T07:15:16Z Spot instances offer significant cost savings of up to 90% over on-demand prices, making them an attractive resource for large-scale computing workloads. 
However, understanding their availability dynamics is essential for building systems that tolerate interruptions, and observing this availability directly requires keeping instances running, which incurs costs that scale with the number of monitored instance types and their per-instance price. We propose Ding-Dong Ditch (DDD), a cost-efficient method that collects spot instance availability signals by leveraging the cloud provider's provisioning lifecycle. Since the outcome of a spot request is determined before the instance enters the running state, DDD submits requests and cancels them upon provisioning acceptance, collecting binary availability signals at near-zero instance cost. Submitting multiple concurrent requests per measurement point further yields a quantitative estimate of available capacity. We validate DDD through simultaneous collection of probing signals and actual running instance traces across 68 instance types and 15 regions on both AWS and Azure, totaling 336,033 spot requests. Analysis of 2,635 real-world interruption events reveals that co-interruptions within the same instance type and availability zone occur within three minutes in over 92% of cases, motivating a binary availability formulation. Based on this formulation, we derive three complementary features from DDD signals and demonstrate that their combination achieves an F1-macro score of up to 0.90 for current availability modeling and maintains 0.85 at a 60-minute prediction horizon. A trace-driven simulation using TPC-DS workloads further demonstrates the potential of DDD-based prediction to reduce lost computation compared to an unguided baseline. 2026-04-08T01:35:53Z Accepted at IEEE CLOUD 2026 Kyumin Kim Moohyun Song Taeyoon Kim Kyungyong Lee http://arxiv.org/abs/2605.27963v1 Throughput-Optimized Networks at Scale 2026-05-27T04:56:00Z Datacenter network design plays a critical role in AI training by supporting scaling to thousands of accelerators. 
An open problem, the design of a near-optimal throughput-oriented network (topology, routing, and collectives), has not been solved at scale with broad applicability to physical and implementation constraints. We address this problem with a compelling use case: Google's TPU v4/5p supercomputer, where the topology may be reconfigured to achieve higher all-to-all throughput, supporting large, parallelized AI training. We show that the existing TPU networks leave terabytes per second of throughput on the table and we fill that gap. This paper presents Throughput Optimized Networks at Scale (TONS), an automated network synthesis framework that meets the high-throughput demands of modern computing. TONS formulates topology synthesis as a linear optimization problem that maximizes a throughput-centric proxy metric, using theory and heuristics to scale to thousands of nodes. We further introduce a deadlock-free routing scheme compatible with limited virtual channels and optical switch faults, enabling the synthesized topologies to realize their predicted throughput gains in simulation. On uniform random and all-to-all traffic, TONS networks achieve geometric mean speedups of 2.1x and 1.6x, respectively, over the best TPU v4/5p torus variants. 2026-05-27T04:56:00Z 12 pages body, 21 pages total, 11 figures, 2 tables Conor James Green Mithuna Thottethodi http://arxiv.org/abs/2605.27918v1 Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain 2026-05-27T03:44:27Z Multimodal LLM datasets are inherently heterogeneous, with significant data variability. Although each modality exhibits independent variability, sample-level entanglement makes it difficult to balance workloads across both modalities and batches. We present Entrain, a distributed MLLM training framework that addresses both heterogeneity and variability in multimodal training workloads. 
Entrain challenges the intuition that dynamic data variability requires dynamic model parallelism by shifting the profiling paradigm from micro-level samples to macroscopic batches. We prove that a single, static model-parallel configuration suffices for optimal load balancing under this paradigm. At the microscopic scale, Entrain introduces a hierarchical microbatch assignment algorithm that defers excess workload within each iteration to stabilize variability across microbatches. Evaluations show that Entrain reduces workload variability across microbatches by up to 10.6$\times$, improving end-to-end training throughput by up to 1.40$\times$ over existing baselines. 2026-05-27T03:44:27Z Insu Jang Mosharaf Chowdhury http://arxiv.org/abs/2605.26527v2 A Formal Semantics of C with OpenMP Parallelism (Extended Version) 2026-05-27T02:00:04Z OpenMP is a popular parallelization framework that lets users transform sequential code into parallel code with a few simple annotations. Unfortunately, it is also easy to inadvertently introduce errors by adding OpenMP pragmas into otherwise correct programs, including both logic errors and race conditions. We present a formal semantics for C code with OpenMP directives, building on the C semantics of the CompCert verified compiler and its extension to concurrency. Our semantics captures subtle interactions between OpenMP directives and variable state that have been obscured by previous OpenMP semantics, and provides a basis for detecting undesired behaviors introduced by incorrect annotations: in particular, any successful execution is guaranteed to be free of data races. 
2026-05-26T04:19:20Z Ke Du Anshu Sharma Liyi Li William Mansky http://arxiv.org/abs/2605.22661v2 A Generalized Nash Equilibrium-Seeking Scheme for Trauma Resuscitation 2026-05-26T22:29:34Z Trauma resuscitation is a clinical process for treating life-threatening physiological disorders in safety-critical environments, driven by the experience of healthcare workers (HCWs). Designing and optimizing quantifiable metrics that accurately capture HCW decisions may augment current resuscitation procedures with the potential to improve patient outcomes. This motivates our socio-technical formulation of trauma resuscitation as a distributed generalized Nash equilibrium (GNE)-seeking game with coupled inequality constraints. This method is optimized over a time-varying communication graph. We introduce novel insights from clinical experience to model HCWs' behavior. This work facilitates the best possible resuscitation outcome given HCWs' workloads, schedules, competencies, and limited resources. 2026-05-21T16:02:44Z Promise Ekpo Angelique Taylor Lekan Molu http://arxiv.org/abs/2604.15919v3 Continuous benchmarking: Keeping pace with an evolving ecosystem of models and technologies 2026-05-26T21:43:12Z Drawing on ideas from continuous integration, we present concepts of an automated benchmarking pipeline for high performance applications. Customization and collaboration have been key design goals owing to the requirements of research-software development as a continuous community effort. We have extended our previous conceptual work on systematic benchmarking workflows with the functionality of user-agnostic operations as well as continuous benchmarking. This fosters reproducibility and re-use of benchmarking results to ensure sustainable technological progress. 
We provide software-engineering solutions to keep pace with the rapid evolution of both large-scale models and high-performance computing systems with a view towards the scientific domains of neuroscience and artificial intelligence. 2026-04-17T10:20:05Z 20 pages, 8 figures Jan Vogelsang Melissa Lober Catherine Mia Schöfmann José Villamar Dennis Terhorst Johanna Senk Hans Ekkehard Plesser Markus Diesmann Susanne Kunkel Anno C. Kurth http://arxiv.org/abs/2605.27691v1 SOLANET: Distributed Neighbor Graph Construction on GPU-Accelerated Systems 2026-05-26T21:13:00Z Neighbor graphs capture relationships among data points and are widely used in data analytics and AI workloads. Many studies have explored approximate construction methods for single-node systems, including GPUs. However, extending this to distributed systems for larger data and further acceleration remains challenging due to irregular computation patterns. We present SOLANET, a GPU-accelerated distributed neighbor graph construction toolkit. SOLANET first constructs local graphs on each GPU after data partitioning and then refines them via approximate nearest neighbor (ANN) searches over remote graphs pulled from other GPUs using MPI one-sided operations. SOLANET also provides a lock-free single-GPU neighbor graph construction algorithm for AMD GPUs. Our single-GPU implementation outperforms a state-of-the-art GPU-based approximate neighbor graph construction implementation across multiple datasets on a single MI300A APU. Furthermore, SOLANET demonstrates an 11x speedup from 32 to 512 APUs for 1 billion data points and a 6.9x speedup from 64 to 512 APUs for 2 billion points. 2026-05-26T21:13:00Z Keita Iwabuchi Trevor Steil Benjamin W. Priest Grace J. Li Geoffrey Sanders Roger Pearce http://arxiv.org/abs/2605.27678v1 Heterogeneous Parallelism for Multimodal Large Language Model Training 2026-05-26T20:53:06Z Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. 
As modality coverage broadens, context windows grow, and encoder and LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layout increasingly limits throughput. This coupling forces encoders to inherit LLM-driven sharding and placement choices that can add communication, limit encoder parallelism, or constrain the LLM schedule; the mismatch is most pronounced at long contexts, where LLM context parallelism is needed for the fused multimodal sequence but encoder inputs remain bounded. We present heterogeneous parallelism for multimodal large language model training, an abstraction that lets modules in one end-to-end graph use independent layouts and rank placements, supporting colocated execution on shared GPUs and non-colocated execution on disjoint rank sets. The key challenge is preserving boundary tensor semantics across independent layouts: forward activations must be materialized for the destination layout, while backward gradients must be routed back to the source layout. We address this with boundary communicators that implement forward and backward layout transforms, plus scheduling extensions for both placement modes. We evaluate optimized homogeneous, colocated heterogeneous, and non-colocated heterogeneous configurations across multimodal workloads and GPU scales to characterize when added layout and placement freedom exposes a better operating point. Across this sweep, colocated heterogeneity improves TFLOPS/GPU by up to 49.3%, while non-colocated heterogeneity improves aggregate token throughput by up to 13.0% and TFLOPS/GPU by up to 9.6%. We validate loss convergence parity against homogeneous baselines and release the system as an open-source Megatron-LM extension. 
2026-05-26T20:53:06Z Yashaswi Karnati Kamran Jafari Akash Mehra Li Ding Pranav Prashant Thombre Ali Roshan Ghias Shifang Xu Parth Mannan Yu Yao Hao Wu Eric Harper Ashwath Aithal Nima Tajbakhsh http://arxiv.org/abs/2605.27652v1 Carbon-Aware Mapping and Scheduling for Deadline-Constrained Workflows 2026-05-26T20:11:50Z As datacenters continue to grow in scale, their energy consumption and resulting carbon footprint have become pressing concerns. With the increasing share of renewable energy in a datacenter's mixed energy supply, shifting task execution to periods of high green-power availability is a promising strategy to reduce carbon emissions. However, in heterogeneous computing environments, the power consumption of compute nodes in a datacenter can also vary. In practice, workloads submitted to datacenters are often not isolated tasks, but entire workflows consisting of interdependent tasks with precedence constraints. A further challenge arises from the fact that carbon emission reductions must typically be achieved under strict workflow deadlines. In this work, we show that the problem posed by these challenges for the scheduler is NP-hard and admits no constant-factor approximation even for the uni-processor case. Motivated by this hardness, we present a novel algorithm CWM that combines carbon-aware mapping and scheduling to construct feasible solutions. Our approach integrates dynamic programming with efficient heuristics to exploit renewable energy availability and infrastructure heterogeneity. To assess the quality of the new algorithm, we evaluate it against the state-of-the-art approach CaWoSched and show that CWM achieves significant reductions in terms of carbon emissions in experiments. In particular, we are able to achieve a median carbon cost reduction of 42% over the best version of CaWoSched when the deadline is two times the makespan of a carbon-agnostic baseline. Note that CaWoSched itself already reduces the carbon-agnostic baseline by 36%. 
2026-05-26T20:11:50Z 29 pages, 11 figures, Preprint, to appear at Euro-Par'26 Dominik Schweisgut Anne Benoit Yves Robert Henning Meyerhenke http://arxiv.org/abs/2605.23066v2 Orbax: Distributed Checkpointing with JAX 2026-05-26T19:23:57Z In a landscape of high-performance distributed ML systems, JAX has emerged as a framework of choice. However, JAX's modular design philosophy leaves it without a standardized checkpointing solution. In this paper, we introduce Orbax, a modular, JAX-native checkpointing library that abstracts the complexities of distributed accelerator systems while also providing flexibility for user-friendly checkpoint manipulations throughout the ML model lifecycle. We demonstrate performance exceeding comparable PyTorch competitors by up to 3.5$\times$ for saving and 2$\times$ for loading. The library is available at https://github.com/google/orbax. 2026-05-21T21:57:28Z 18 pages, 5 tables, 6 figures Colin Gaffney Shutong Li Daniel Ng Anastasia Petrushkina Niket Kumar Adam Cogdell Mridul Sahu Yaning Liang Nikhil Bansal Justin Pan Angel Mau Abhishek Agrawal Marco Berlot Ruoxin Sang Kiranbir Sodhia Rakesh Iyer http://arxiv.org/abs/2605.27601v1 A Methodology to Assess Power Modeling in Energy-Aware Federated Learning on Heterogeneous Mobile Devices 2026-05-26T19:19:31Z Estimating CPU power on heterogeneous ARM-based commodity devices is challenging due to limited access to the CPU's voltage domains. As a result, state-of-the-art energy-aware Federated Learning (FL) frameworks typically rely on simplified approximate power models to estimate computation energy, rather than the more accurate analytical CMOS-based model. To bridge this gap, we propose a reproducible CPU power estimation methodology combined with a rail-to-cluster mapping technique to retrieve cluster-level supply voltage. 
We evaluate our approach on two commodity Android devices and show that the analytical model predicts CPU power with errors below 10%, whereas the approximate model incurs errors of up to 959%. Using AnycostFL, a state-of-the-art energy-aware FL framework, we show that the analytical model achieves the same 80% model accuracy while consuming 1.4x less energy than the approximate model. These results highlight that approximate models can severely misestimate computation energy and lead to suboptimal decisions. This work facilitates the use of analytical CPU power models on heterogeneous multi-cluster ARM-based mobile SoCs without additional hardware support or external power measurement tools. 2026-05-26T19:19:31Z 19 pages, 3 figures, 7 tables, Accepted for publication in the proceedings of Networked Systems (NETYS 2026), Springer Nature Chaimae Jallouli Karim Boubouh Robert Basmadjian http://arxiv.org/abs/2605.27599v1 The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution 2026-05-26T19:15:21Z Agentic AI workloads - where a single user goal triggers multi-step orchestration, tool calls, retries, and failure recovery - are being targeted for edge deployment, with NVIDIA, Dell, HP, ASUS, MSI, Acer, and Gigabyte all shipping GB10-based desktop AI systems in 2026. We recently demonstrated that orchestration structure dominates agentic energy cost, with workflows consuming 4.33x more energy per successful goal than linear baselines and OOI reaching 7.63x for multi-step reasoning tasks. Separately, Rajat et al. show that CPU-side processing accounts for up to 90.6% of total latency and 44% of total dynamic energy in agentic workloads. 
We report a systematic energy-observability audit of the ASUS Ascent GX10 (GB10 SoC) and find that the platform exposes no CPU energy counter, no INA power-rail monitor, no IPMI/BMC, and no SCMI powercap protocol through any supported software interface. The only on-device energy telemetry is instantaneous GPU power via NVML. We further discover that the MediaTek firmware already computes per-rail energy internally via an undocumented ACPI interface (SPBM), but NVIDIA states there are "no plans to expose CPU rail information." On-device per-process energy attribution - as performed on x86 via RAPL - is therefore not reproducible on this platform through supported interfaces. We formalize a hardware requirements specification for energy-attributed AI, propose an interim calibration bridge using external DC metering combined with GPU subtraction, and identify a standards-track path via SCMI powercap. Our findings motivate the low-carbon computing community to demand energy observability as a first-class hardware requirement. 2026-05-26T19:15:21Z Deepak Panigrahy Aakash Tyagi http://arxiv.org/abs/2512.18444v2 Snowveil: A Framework for Decentralised Preference Discovery 2026-05-26T18:28:53Z Aggregating subjective preferences in social choice traditionally assumes a trusted central authority. In contrast, this paper formalises Decentralised Preference Discovery (DPD): the reliable identification of a social choice parameter (e.g. the canonical outcome of an aggregation rule applied to the global preference profile) under conditions of partial information, asynchronous interaction, censorship resistance, and no central coordinator. To address DPD, we propose Snowveil, a gossip-based framework where agents repeatedly sample random peer rankings and update local beliefs to converge on the canonical outcome. 
Using a potential function, submartingale theory, and concentration bounds, we prove the system reaches this stable state with tunable high probability, in finite expected time. This single-winner process can then be iterated to construct a set of winning candidates for multi-winner scenarios. Snowveil is agnostic to specific aggregation rules, requiring only that the rule satisfies axioms such as Positive Responsiveness, thus offering a formal basis for a wider class of DPD protocols. Demonstrating Snowveil's modularity, we introduce the Constrained Hybrid Borda (CHB), an aggregation rule designed to balance broad consensus with plurality support. We provide an axiomatic analysis of CHB and present empirical results via extensive simulation, validating Snowveil's O(n) scalability. Overall, this work provides a foundation for how a stable consensus emerges from subjective, expressive, and diverse preference profiles in large-scale decentralised systems. 2025-12-20T17:31:55Z Grammateia Kotsialou http://arxiv.org/abs/2605.27540v1 EFaaS: A Quantum-Classical Serverless Entangled Scheduler for Hybrid Variational Algorithms 2026-05-26T18:13:50Z As quantum computing enters the Utility Era, realizing near-term advantage relies heavily on Hybrid Variational Quantum Algorithms (VQAs). These algorithms require a tightly coupled, iterative loop between a classical CPU optimizer and a Quantum Processing Unit (QPU). However, current quantum cloud access models are bottlenecked by decoupled batch-queues that sever this loop, introducing massive Time-to-Next-Shot (TTNS) latency. This delay inflates convergence time from minutes to hours and exposes the computation to quantum hardware drift, degrading algorithmic fidelity. Unlike prior works that rely on resource-wasting static hardware reservations or state-oblivious stateless functions, we propose EFaaS, a novel serverless middleware designed specifically for hybrid quantum workflows. 
EFaaS fundamentally departs from existing architectures by treating classical parameter optimization and quantum circuit execution as entangled, session-aware events. Our main technical innovations are threefold: (1) a Calibration-Aware placement strategy that dynamically routes circuits to QPUs with warm calibration caches, circumventing cold-start penalties, (2) a Dual-Resource Fair Queuing scheduler that maximizes quantum utilization by strictly prioritizing active iterative loops, and (3) the "EF-QuantumFuture" programming abstraction, a novel primitive enabling classical speculative execution to mask compute latency. Across the evaluated baselines, EFaaS achieves TTNS reductions of 11.4%-94.3%, QDC gains of 2.02-15.78 percentage points, and convergence speedups of 83.2%-98.3%, while eliminating drift penalties. 2026-05-26T18:13:50Z 12 pages, 10 figures Abolfazl Younesi Nouhaila Innan Alberto Marchisio Muhammad Shafique