https://arxiv.org/api/fQ5wuQta7T/+QdzlqjelBB9ruzI2026-06-10T06:13:33Z2883813515http://arxiv.org/abs/2606.03498v1Demystifying Pipeline Parallelism: First Theory for PipeDream2026-06-02T11:14:57ZTraining modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018). Our first contribution is theoretical: we introduce Randomized PipeDream (RPD), a stale block-SGD abstraction that yields, to our knowledge, the first clean nonconvex convergence guarantee for a PD-style method. Our second contribution is a scaling diagnosis: we prove that the delay induced by steady-state PD grows as $S^2 - S/2 + O(1)$ for $S$ stages, so the stale-read contribution in the convergence theorem scales as $Θ(γ^2 S^4)$, equivalently as $Θ(S^4/K)$ in the tuned-rate form. Our third contribution is a comparison with LocalSGD, whose periodic model averaging trades weight staleness for synchronization bubbles. In our reported simulated-time experiments, the better-performing method depends on the objective: PD performs better on the quadratic objective and on a small language-modeling training-loss task, while for logistic regression LocalSGD becomes superior as the number of stages increases.2026-06-02T11:14:57Z40 pages, 4 figuresIvan IlinPeter Richtárikhttp://arxiv.org/abs/2606.06515v1DxPTA: An Architecture Design Space Exploration with Optical Dataflow-guided Strategy for HW/SW Co-Design of Photonic Transformer Accelerators2026-06-02T10:57:12ZTransformer-based networks have emerged as prominent AI models with state-of-the-art performance, which potentially pave the way toward artificial general intelligence (AGI). However, their large sizes still hinder their efficient implementation, thus highlighting the need for alternate solutions to enable their energy-efficient acceleration. Recently, state-of-the-art works propose photonic transformer accelerators (PTAs) with significant speedup and energy efficiency improvements over the conventional electronic accelerators. However, their PTA architectures are developed without considering the application constraints (e.g., area, power, energy, and latency). Moreover, their manual design approach also requires huge design time to determine a suitable architecture for the targeted application, hence making this approach not scalable. To address these limitations, we propose DxPTA, a novel design space exploration methodology for enabling efficient hardware/software co-design of the appropriate PTA architecture that meets all constraints. It is achieved by (1) identifying the PTA architecture parameters based on the coherent optical dataflow; (2) analyzing the impact/significance of the parameters; and (3) leveraging this analysis for devising a constraint-aware architecture search algorithm. Experimental results show that, our DxPTA can find the appropriate PTA architectures for different transformer-based models (i.e., DeiT-T/S/B and BERT-B/L). It achieves up to 26mm^2 area, 4.8W power, 39mJ energy, and 6ms latency, for constraints of 50mm^2 area, 5W power, 50mJ energy, and 10ms latency; with 15.2x faster searching time than the exhaustive approach. These results demonstrate the potential of DxPTA methodology for enabling efficient PTA designs for diverse AGI-based applications.2026-06-02T10:57:12Z8 pages, 12 figuresRachmad Vidya Wicaksana PutraSolomon Micheal SerunjogiMahmoud RasrasMuhammad Shafiquehttp://arxiv.org/abs/2606.03464v1Predicting Lakehouse Performance in Clouds: An Empirical Exploration of Query Runtime Variance2026-06-02T10:45:14ZData analytics increasingly runs on distributed lakehouse systems, where platform operators must optimise monetary, resource, and environmental costs. Query Performance Prediction (QPP) helps to balance these costs and supports workload management techniques, such as adaptive resource scaling and low-carbon scheduling. However, runtimes in lakehouses can vary substantially, and the impact of runtime variance on QPP accuracy and workload orchestration has not previously been systematically studied for lakehouse systems.
This paper addresses this gap by investigating the runtime variance observed for distributed lakehouse analytical queries and its impact on QPP. First, we quantify the run-to-run variance using Kubernetes deployments across three public clouds and one private cloud, spanning multiple database scales and three analytical benchmarks. Our results demonstrate that repeated executions of the same query can vary in runtime by nearly twofold. Second, we conduct a factor analysis study assessing key sources of this runtime variance such as data locality, co-tenant load, and caching effects. Third, we examine how variance influences state-of-the-art QPP models, revealing that addressing key sources of variance can reduce prediction error up to 80%. Finally, we demonstrate the downstream implications for low-carbon scheduling as an example of a workload management technique that relies on performance prediction, showing that accounting for runtime variance can lead to a significant reduction in carbon costs.2026-06-02T10:45:14Z11 pages, 5 figures, to appear in the Proceedings of the 19th IEEE International Conference on Cloud Computing (CLOUD)James NurdinWei LiuRichard MccreadieLauritz Thamsenhttp://arxiv.org/abs/2509.26193v2I Like To Move It -- Computation Instead of Data in the Brain2026-06-02T09:15:10ZThe detailed functioning of the human brain remains incompletely understood. Large-scale brain simulations complement experimental research but face substantial computational challenges: the human brain comprises approximately $10^{11}$ neurons connected by $10^{14}$ synapses, collectively forming the connectome. Empirical evidence indicates that modifications of the connectome -- specifically the formation and elimination of synapses, referred to as structural plasticity -- are essential for processes such as learning and memory formation. Connectivity updates can be computed efficiently using a Barnes--Hut-inspired approximation that reduces computational complexity from $O(n^2)$ to $O(n \log n)$, where $n$ denotes the number of neurons. Despite this improvement, communication overhead still limits scalability. Synapse updates rely heavily on remote memory access (RMA), and spike transmission requires all-to-all communication at every simulation time step. We introduce a novel algorithm that reduces communication by migrating computation rather than data. This approach reduces connectivity update time by a factor of 6 and spike transmission time by more than 2 orders of magnitude.2025-09-30T12:50:12ZFabian CzappaMarvin KasterFelix Wolf10.1109/IPDPS65963.2026.00028http://arxiv.org/abs/2606.03364v1BlobShuffle: Cost-Effective Repartitioning in Stream Processing Systems via Object Storage Exemplified with Kafka Streams2026-06-02T09:13:48ZShuffling or repartitioning data streams is an essential operation of state-of-the-art stream processing frameworks to support stateful workloads in a large-scale, distributed setting. In today's cloud deployments, however, shuffling can become a major cost driver due to substantial network traffic across multiple availability zones (AZs) as well as an operational burden when operating a high-throughput, strongly consistent messaging backbone at scale. We present BlobShuffle, a novel approach to cost-effective shuffling for stream processing systems that leverages cloud object storage as an intermediate exchange layer. Instead of sending all shuffled records directly, BlobShuffle groups records into batches, stores these batches in cloud object storage, and forwards only compact notifications. Downstream operators use these notifications to retrieve the relevant batches and extract the corresponding records. BlobShuffle balances cost efficiency and latency through configurable batching and a distributed caching mechanism. BlobShuffle is implemented as an add-on for Kafka Streams that requires only minimal code changes to existing applications, leaves Kafka and the underlying infrastructure unmodified, and preserves Kafka Streams' consistency and correctness guarantees. In a large-scale experimental evaluation on a Kubernetes-based AWS deployment, we show that BlobShuffle can reduce shuffling costs by more than 40x compared to native Kafka Streams shuffling while keeping the 95th percentile shuffle latency below 2 seconds. Moreover, it scales to processing more than 2 GiB/s without encountering a scalability limit in our experiments, indicating that BlobShuffle can economically support shuffle-intensive workloads at large scale.2026-06-02T09:13:48ZSören HenningOtmar ErtlAdriano Vogelhttp://arxiv.org/abs/2605.09114v2Light Cone Consistency: Closure, Ordering, and the Single-Observer Boundary2026-06-02T08:30:14ZEvery distributed system is a message-passing system, and every message-passing system is a growing causal DAG observed by a set of observers. We treat each observer's consistency as two operators on its visible sub-DAG (a causal-closure filter $C$, fixing which dependencies it must have seen, and a fork resolution $O$, ordering the concurrent forks the filter admits) and give the resulting space the structure the flat catalog of named models lacks. The operators are coupled, asymmetrically: an order that refines causality supplies closure its filter never demanded. That coupling yields a decidable readability order (which configuration's data another can read honestly) with a factoring dichotomy: the order splits across the $C$ and $O$ axes exactly when ordering does not refine causality, and refuses to when it does, the cross-axis gap being the closure ordering supplies. On that order sit a consistency ratchet (a level lost under migration is never regained) and a Detection = Prevention bound: a system can tell its order inverted causality only if it retained exactly what would have prevented the inversion.
The classical results land at clean coordinates in the same system, not as new claims: resolving a fork demands retaining the causal history that distinguishes its branches (database folklore, here an impossibility for every message-passing system) and linearizability resolves as a composite of two systems, a store and a global real-time serializer supplying an order no single observer's light cone can. The named models are configurations of $(C, O)$, exact over the standard-safety fragment and generative past it, predicting configurations the catalog has not named. LCC is a formalization of the observer-relative consistency model of Burgess and Gerlits.2026-05-09T18:54:26Z32 pages, 4 figures, 3 tables. Preprint of work submitted to DISC 2026Rob LandersKaben Kramerhttp://arxiv.org/abs/2512.11775v2Hypergraph based Multi-Party Payment Channel2026-06-02T08:22:38ZPublic blockchains inherently offer low throughput and high latency, motivating off-chain scalability solutions such as Payment Channel Networks (PCNs). However, existing PCNs suffer from liquidity fragmentation-funds locked in one channel cannot be reused elsewhere-and channel depletion, both of which limit routing efficiency and reduce transaction success rates. Multi-party channel (MPC) constructions mitigate these issues, but they typically rely on leaders or coordinators, creating single points of failure and providing only limited flexibility for inter-channel payments.
We introduce Hypergraph-based Multi-Party Payment Channels (COALESCE), a new off-chain construction that replaces bilateral channels with collectively funded hyperedges. These hyperedges enable fully concurrent, leaderless intra- and inter-hyperedge payments through verifiable, proposer-ordered DAG updates, offering significantly greater flexibility and concurrency than prior designs. Hence our, design eliminates routing dependencies, avoids directional liquidity lock-up, and does not require central monitoring services such as watchtowers.
Our implementation on a 150-node intra-hyperedge achieves a transaction success rate of approximately 94% under heavy load (larger payment sizes), while full hyperedge evaluation over a 15,000-node network sustains success rates in the range of 85% to 95%, without HTLC expiry or routing failures, highlighting the robustness of COALESCE.2025-12-12T18:37:28ZAyush NainwalAtharva KambleNitin Awatharehttp://arxiv.org/abs/2310.16370v2PartRePer-MPI: Combining Fault Tolerance and Performance for MPI Applications2026-06-02T04:55:59ZAs we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency resulting in an excessive amount of overhead which would not be sustainable for many scientific applications. Replication allows for fast recovery from failures by simply dropping the failed processes and using their replicas to continue the regular operation of the application.
In this paper, we have implemented PartRePer-MPI, a novel fault-tolerant MPI library that adopts partial replication of some of the launched MPI processes in order to provide resilience from failures. The novelty of our work is that it combines both fault tolerance, due to the use of the User Level Failure Mitigation (ULFM) framework in the Open MPI library, and high performance, due to the use of communication protocols in the native MPI library that is generally fine-tuned for specific HPC platforms. We have implemented efficient and parallel communication strategies with computational and replica processes, and our library can seamlessly provide fault tolerance support to an existing MPI application. Our experiments using seven NAS Parallel Benchmarks and two scientific applications show that the failure-free overheads in PartRePer-MPI when compared to the baseline MVAPICH2, are only up to 6.4% for the NAS parallel benchmarks and up to 9.7% for the scientific applications.2023-10-25T05:18:48ZThis paper describes a prototype with many flaws such as the virtual address differences across processes which have been addressed in our newer implementation (arXiv:2504.09989). There are significant fundamental differences in these implementations which makes the vast majority of this paper redundant in valueSarthak JoshiSathish Vadhiyarhttp://arxiv.org/abs/2606.03077v1Libra: Efficient Resource Management for Agentic RL Post-Training2026-06-02T03:09:13ZReinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that challenge conventional resource-management assumptions. Three fundamental challenges arise. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Third, as the RL policy evolves, the trajectory-length distribution drifts over time, rendering any static resource split progressively suboptimal.
We present Libra, which introduces two core mechanisms. The first is a periodic global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0$\times$ higher throughput and converges up to 2.5$\times$ faster in reward compared to the baselines.2026-06-02T03:09:13Z18 pages, 13 figuresKaiwen ChenXin TanJingzong LiHong Xuhttp://arxiv.org/abs/2606.03061v1Brief Announcement: Generative Markov Model for Distributed Computing Systems2026-06-02T02:50:58ZEmerging distributed computing paradigms, such as the computing continuum, are inherently heterogeneous, stochastic, and complex. Efficiently and effectively utilizing all available resources across the continuum demands a unified formal model of the system. To address this gap, we propose a general framework for modeling distributed computing systems as a generative Markov model, factorized over a structured system state. In our model, the state decomposes into high-dimensional variables, each further factorized over its elements, reflecting the sparse dependency structure inherent to distributed systems. This yields a tractable model enabling simulation, inference, and policy learning over otherwise intractable system states, bridging distributed computing with Markov chain theory and reinforcement learning (RL). We demonstrate our framework through a case study of collaborative AI inference, in which a dedicated server combines resources with those volunteered by service users. Our results show that centralized scheduling becomes a bottleneck at scale, while distributing computation across user devices reduces both latency and server resource consumption. These findings highlight the value of adaptive decision-making in distributed computing systems and demonstrate the framework's utility for modeling, simulation, and optimization.2026-06-02T02:50:58ZSubmitted to 40th International Symposium on Distributed Computing (DISC 2026)Alfreds LapkovskisAli BeikmohammadiSindri MagnússonPraveen Kumar Dontahttp://arxiv.org/abs/2606.09870v1Safecloud: A Distributed, Encrypted Storage Cloud for Streaming2026-06-02T02:12:42ZWe present Safecloud, a distributed, encrypted, self-pricing storage and streaming network whose storage and routing nodes never see plaintext and never hold keys. Each file is split into chunks, encrypted on the owner's device, and distributed across Drops (browser tabs storing ciphertext in IndexedDB) and Jets (federated routing servers). Only the owner, or an authorised grantee, can decrypt. We make five contributions: (1) A one-root key hierarchy: every key derives deterministically from a single root via HKDF, and owner and range-scoped grantee derive identical chunk keys (derivation agreement); a subtree key derives its range and nothing else (delegation containment). (2) Convergent content addressing: identical content yields identical ciphertext and identifiers, enabling deduplication without plaintext exposure, with identifiers binding authenticated ciphertext so a keyless Drop verifies integrity (blind verifiability). (3) Three parallel trees over one navigation path (Merkle for integrity, key-derivation for confidentiality, access for authorisation), with sound Merkle-verified retrieval. (4) The key tree doubles as a streaming index: a player derives each segment key in O(1), seeking by derivation, while parallel tracks (video, audio, captions) are independent subtrees unlockable per-track and per-segment, a combination we believe no prior encrypted-storage network offers. (5) Jets and Drops earn Safebux verifiably, kept honest by a one-signature proof-of-storage challenge under chilling-effect Proof-of-Corruption, a zero-sum economy that is significantly cheaper than Filecoin's proof-of-replication sealing (which is slow and provides no confidentiality). We give the architecture, cryptographic construction, a threat model, and an open-source reference implementation, stating precisely what is implemented versus designed.2026-06-02T02:12:42Z7 pages, 2 tables. Reference implementation open-source. Companion to Intercloud (arXiv:2605.22830) and a forthcoming Safecloud 2.0 compute paperGregory Magarshakhttp://arxiv.org/abs/2606.03001v1FOLD: Fuzzy Online Deduplication for Very Large Evolving Datasets via Approximate Nearest Neighbor Search2026-06-02T01:16:26ZFuzzy deduplication is key to constructing large language model training corpora. However, classic Locality-Sensitive Hashing pipelines scale poorly as corpora grow and are ill-suited to continuous ingestion. We present FOLD (Fuzzy Online Deduplication), an online fuzzy deduplication system that delivers high recall and throughput for evolving datasets. FOLD maintains an incrementally updated HNSW index over admitted documents, retrieving a small, high-quality candidate neighborhood for each incoming document instead of repeatedly rebuilding global buckets or rescanning the accumulated corpus. To our knowledge, FOLD is the first online fuzzy deduplication system to use HNSW. However, applying Jaccard similarity out of the box causes score crowding, making graph traversal unreliable within a small number of steps. FOLD addresses this with a bitmap representation that provides a more discriminative, Jaccard-aligned signal during HNSW search. Across four LLM-scale datasets (LM1B, C4, RealNews, and Common Crawl), FOLD stays fast and accurate as the corpus grows: at the largest evaluated scales, it maintains 93-97% recall and achieves up to 2.09x higher throughput than competing alternatives, whose best recall reaches only 76%.2026-06-02T01:16:26ZNelson BorePritish MishraConstantin AdamEyal de LaraOana Balmauhttp://arxiv.org/abs/2506.01969v3FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs2026-06-02T00:53:06ZEfficient inference of Multi-Head Latent Attention (MLA) is challenged by deploying the DeepSeek-R1 671B model on a single Multi-GPU server. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the \(M\)-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE (\(1.25 \times 10^{-5}\)) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks like FlashAttention-3 and FlashInfer, supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at https://github.com/pengcuo/FlashMLA-ETAP.2025-05-13T17:45:34ZAccepted by ICONIP2025Pengcuo DegeQiuming LuoRui MaoChang Konghttp://arxiv.org/abs/2606.02982v1DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference2026-06-02T00:39:31ZThe rapid growth of large language model (LLM) inference services has increased the demand for efficient multi-tenant GPU scheduling. While modern inference runtimes such as vLLM improve throughput through continuous batching and optimized memory management, accurately estimating the runtime cost of heterogeneous inference requests remains a significant challenge. In practice, observed output lengths often deviate from admission-time estimates, creating runtime token drift that can lead to workload misclassification, queue imbalance, increased tail latency, and degraded Quality-of-Service (QoS).
This paper presents DriftSched, an adaptive QoS-aware scheduling framework for multi-tenant LLM inference serving on NVIDIA L4 GPUs. DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and runtime feedback-driven drift compensation to improve admission-time scheduling decisions. The framework evaluates FIFO, Priority, Weighted, Shortest-Job-First (SJF), and Aging Priority scheduling policies under heterogeneous multi-tenant workloads.
Experimental results demonstrate measurable runtime token drift across workload categories. Adaptive bias correction reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability and scheduling accuracy. Among all evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention.
The work contributes an adaptive drift-aware scheduling architecture, a runtime token-drift compensation mechanism, and a reproducible benchmarking framework for evaluating QoS-aware LLM inference scheduling on shared GPU infrastructure.2026-06-02T00:39:31Z17 pages, 22 figures, 7 tablesKathiravan Palaniappanhttp://arxiv.org/abs/2511.19959v2ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models2026-06-02T00:32:45ZFederated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, federated block coordinate descent scheme has become a popular option in training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (LLMs), even a single block can contain a significant number of parameters, posing substantial communication latency, particularly for resource-constrained clients. To address this challenge in federated training/fine-tuning LLMs, we propose ParaBlock, a novel approach that establishes two parallel threads for communication and computation to enhance communication efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as the standard federated block coordinate descent methods. Empirical evaluations on fine-tuning LLMs on general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves communication efficiency.2025-11-25T06:09:21ZAccepted by TMLRYujia WangYuanpu CaoJinghui Chen