http://arxiv.org/abs/2603.02376v1 CUCo: An Agentic Framework for Compute and Communication Co-design 2026-03-02T20:35:50Z

Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks new optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$.

2026-03-02T20:35:50Z Bodun Hu Yoga Sri Varshan Saurabh Agarwal Aditya Akella http://arxiv.org/abs/2602.22593v2 FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving 2026-03-02T17:26:15Z

Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool to amortize collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance by up to $4.79\times$ under high load and $3.47\times$ under low load while supporting latency- and memory-driven requests.

2026-02-26T03:55:51Z This paper is accepted by the 40th ACM International Conference on Supercomputing (ICS 2026) Shouwei Gao Junqi Yin Feiyi Wang Wenqian Dong http://arxiv.org/abs/2305.04979v2 FedHB: Hierarchical Bayesian Federated Learning 2026-03-02T17:12:02Z

We propose a novel hierarchical Bayesian approach to Federated Learning (FL), where our model reasonably describes the generative process of clients' local data via hierarchical Bayesian modeling: constituting random variables of local models for clients that are governed by a higher-level global variate. Interestingly, the variational inference in our Bayesian model leads to an optimisation problem whose block-coordinate descent solution becomes a distributed algorithm that is separable over clients and allows them not to reveal their own private data at all, and is thus fully compatible with FL. We also highlight that our block-coordinate algorithm has particular forms that subsume the well-known FL algorithms including Fed-Avg and Fed-Prox as special cases. Beyond introducing novel modeling and derivations, we also offer convergence analysis showing that our block-coordinate FL algorithm converges to a (local) optimum of the objective at the rate of $O(1/\sqrt{t})$, the same rate as regular (centralised) SGD, as well as the generalisation error analysis where we prove that the test error of our model on unseen data is guaranteed to vanish as we increase the training data size, and is thus asymptotically optimal.
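For readers unfamiliar with the two special cases named in the abstract above, the following is a minimal sketch of the standard FedAvg aggregation rule and a FedProx-style local update. It is not the paper's block-coordinate algorithm; the function names, the plain-list parameter vectors, and the values of `lr` and `mu` are illustrative assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def fedavg(client_models: Sequence[Vector], client_sizes: Sequence[int]) -> Vector:
    """Average client parameter vectors, weighted by local dataset size (FedAvg)."""
    total = sum(client_sizes)
    dim = len(client_models[0])
    return [
        sum(w[i] * n for w, n in zip(client_models, client_sizes)) / total
        for i in range(dim)
    ]


def fedprox_local_step(
    w: Vector, w_global: Vector, grad: Vector, lr: float, mu: float
) -> Vector:
    """One gradient step on local_loss(w) + (mu / 2) * ||w - w_global||^2.

    `grad` is the gradient of the local loss at `w`. The proximal term pulls
    the local model toward the global one; with mu = 0 this is a plain local
    SGD step, as used in FedAvg.
    """
    return [
        wi - lr * (gi + mu * (wi - wgi))
        for wi, wgi, gi in zip(w, w_global, grad)
    ]
```

With `mu = 0` the two methods differ only in how many local steps each client takes before aggregation, which is one informal way to see why a single framework could cover both.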

2023-05-08T18:21:41Z Minyoung Kim Timothy Hospedales http://arxiv.org/abs/2603.02075v1 Trident: Adaptive Scheduling for Heterogeneous Multimodal Data Pipelines 2026-03-02T17:00:22Z

The rapid adoption of large language models and multimodal foundation models has made multimodal data preparation pipelines critical AI infrastructure. These pipelines interleave CPU-heavy preprocessing with accelerator-backed (GPU/NPU/TPU) inference and produce massive intermediate artifacts. Achieving high throughput is difficult because workloads are highly non-stationary: regime shifts, input-dependent inference, and transient memory spikes cause rapid performance fluctuations and out-of-memory (OOM) failures. Existing schedulers typically rely on threshold-based autoscaling or assume synchronous, homogeneous operators, leading to poor efficiency. We present Trident, an adaptive scheduling framework for heterogeneous multimodal pipelines on fixed-resource clusters. Trident closes the loop across three coupled layers: (i) an observation layer that estimates per-operator sustainable throughput for asynchronous operators via Gaussian Process regression with anomaly filtering; (ii) an adaptation layer that detects workload shifts online and performs memory-constrained Bayesian optimization to recommend OOM-safe configurations; and (iii) a scheduling layer that solves a mixed-integer linear program to jointly optimize operator parallelism, placement, and configuration transitions under heterogeneous compute and bandwidth constraints, accounting for cold-start overhead via rolling updates. Decisions trigger sample invalidation and model refresh to keep estimates consistent with the active configuration. Implemented on Ray Data, Trident improves end-to-end throughput by up to 2.01x on a document curation (PDF) pipeline and 1.88x on a video curation pipeline over a static baseline, with low overhead suitable for online re-optimization.

2026-03-02T17:00:22Z 22 pages, 3 figures Ding Pan Zhuangzhuang Zhou Long Qian Binhang Yuan http://arxiv.org/abs/2603.02071v1 Subcubic Coin Tossing in Asynchrony without Setup 2026-03-02T16:58:44Z

We consider an asynchronous network of $n$ parties connected to each other via secure channels, up to $t$ of which are byzantine. We study common coin tossing, a task where the parties try to agree on an unpredictable random value, with some chance of failure due to the byzantine parties' influence. Coin tossing is a well-known and often studied task due to its use in byzantine agreement. In this work, we present an adaptively secure committee-based method to, roughly speaking, turn strong but costly common coins into cheaper but lower-quality ones. For all $k > 2$ and $\varepsilon > 0$, we show how to use a strong (very rarely failing) coin that costs $\widetilde{O}(n^k)$ bits of communication to get a cheaper coin that costs $\widetilde{O}(\varepsilon^{-2k}n^{3 - 2/k})$ bits of communication. This latter coin tolerates $\varepsilon n$ fewer byzantine parties than the former, and it fails with an arbitrarily small constant probability. For any $\varepsilon > 0$, our method allows us to get a perfectly secure binary coin that tolerates $t \leq (\frac{1}{4} - \varepsilon)n$ faults with $O(n^{2.5}(\varepsilon^{-8} + \log n))$ messages of size $O(\log n)$, as well as a setup-free cryptographically secure binary coin that tolerates $t \leq (\frac{1}{3} - \varepsilon)n$ faults with $O(n^{7/3}\varepsilon^{-6}κ\log n)$ bits of communication (where $κ = Ω(\log n)$ is a cryptographic security parameter). These coins both have $O(\log n)$ latency. They are to our knowledge the first setup-free coins that cost $o(n^3)$ bits of communication but still succeed with at least constant probability against $t = Θ(n)$ adaptive byzantine faults. As such, they for the first time enable setup-free (and even perfectly secure) asynchronous byzantine agreement with $o(n^3)$ communication against $Θ(n)$ adaptive byzantine faults.
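As a quick consistency check on the bounds quoted in the abstract above, the exponents of $n$ and $\varepsilon$ in the two concrete coins match the general bound $\widetilde{O}(\varepsilon^{-2k}n^{3 - 2/k})$ evaluated at $k = 4$ and $k = 3$. This is only arithmetic on the stated formulas; which strong coins the paper actually instantiates is not stated in the abstract, so the pairing below is an assumption.

```latex
\begin{align*}
k = 4:&\quad \varepsilon^{-2k}\, n^{3 - 2/k} = \varepsilon^{-8}\, n^{3 - 1/2} = \varepsilon^{-8}\, n^{2.5}
  && \text{(compare } O(n^{2.5}(\varepsilon^{-8} + \log n)) \text{ messages)} \\
k = 3:&\quad \varepsilon^{-2k}\, n^{3 - 2/k} = \varepsilon^{-6}\, n^{3 - 2/3} = \varepsilon^{-6}\, n^{7/3}
  && \text{(compare } O(n^{7/3}\varepsilon^{-6}\kappa\log n) \text{ bits)}
\end{align*}
```

Since $3 - 2/k < 3$ for every finite $k$, any such instantiation stays below cubic in $n$, consistent with the $o(n^3)$ claim.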

2026-03-02T16:58:44Z 17 pages, preprint Mose Mizrahi Roger Wattenhofer http://arxiv.org/abs/2603.02057v1 Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads 2026-03-02T16:47:28Z

Large language model (LLM) services have become an integral part of search, assistance, and decision-making applications. However, unlike traditional web or microservices, the hardware and software stack enabling LLM inference deployment is of higher complexity and far less field-tested, making it more susceptible to failures that are difficult to resolve. Keeping outage costs and quality of service degradations in check depends on shortening mean time to repair, which in practice is gated by how quickly the fault is identified, located, and diagnosed. Automated root cause analysis (RCA) accelerates failure localization by identifying the system component that failed and tracing how the failure propagated. Numerous RCA methods have been developed for traditional services, using request path tracing, resource metric and log data analysis. Yet, existing RCA methods have not been designed for LLM deployments that present distinct runtime characteristics. In this study, we evaluate the effectiveness of RCA methods on a best-practice LLM inference deployment under controlled failure injections. Across 24 methods (20 metric-based, two trace-based, and two multi-source), we find that multi-source approaches achieve the highest accuracy, metric-based methods show fault-type-dependent performance, and trace-based methods largely fail. These results reveal that existing RCA tools do not generalize to LLM systems, motivating tailored analysis techniques and enhanced observability, for which we formulate guidelines.

2026-03-02T16:47:28Z 13 pages, 8 figures, 1 table Dominik Scheinert Alexander Acker Thorsten Wittkopp Soeren Becker Hamza Yous Karnakar Reddy Ibrahim Farhat Hakim Hacid Odej Kao http://arxiv.org/abs/2603.02621v1 GoldbachGPU: An Open Source GPU-Accelerated Framework for Verification of Goldbach's Conjecture 2026-03-02T15:51:57Z

We present GoldbachGPU, an open-source framework for large-scale computational verification of Goldbach's conjecture using commodity GPU hardware. Prior GPU-based approaches reported a hard memory ceiling near 10^11 due to monolithic prime-table allocation. We show that this limitation is architectural rather than fundamental: a dense bit-packed prime representation provides a 16x reduction in memory footprint, and a segmented double-sieve design removes the VRAM ceiling entirely. By inverting the verification loop and combining a GPU fast-path with a multi-phase primality oracle, the framework achieves exhaustive verification up to 10^12 on a single NVIDIA RTX 3070 (8 GB VRAM), with no counterexamples found. Each segment requires 14 MB of VRAM, yielding O(N) wall-clock time and O(1) memory in N. A rigorous CPU fallback guarantees mathematical completeness, though it was never invoked in practice. An arbitrary-precision checker using GMP and OpenMP extends single-number verification to 10^10000 via a synchronised batch-search strategy. The segmented architecture also exhibits clean multi-GPU scaling on data-centre hardware (tested on 8 x H100). All code is open-source, documented, and reproducible on both commodity and high-end hardware.
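The bit-packed prime table and the verification loop described in the abstract above can be illustrated with a small CPU-only sketch. It stores one bit per odd integer and checks every even number up to a limit. It omits the segmentation, the GPU fast-path, and the primality oracle, and all function names are hypothetical rather than taken from the GoldbachGPU code base.

```python
def build_odd_sieve(limit: int) -> bytearray:
    """Bit-packed sieve over odd numbers: bit i stands for the odd number 2*i + 1.

    Requires limit >= 1.
    """
    n_odds = (limit + 1) // 2                  # count of odd numbers 1, 3, ..., <= limit
    bits = bytearray(b"\xff" * ((n_odds + 7) // 8))
    bits[0] &= 0xFE                            # 1 is not prime, so clear bit 0
    i = 1                                      # bit index of the odd number 3
    while (2 * i + 1) ** 2 <= limit:
        if bits[i >> 3] & (1 << (i & 7)):
            p = 2 * i + 1
            # Clear odd multiples of p from p*p upward; a step of p in bit
            # index corresponds to a step of 2*p in value.
            for j in range((p * p) // 2, n_odds, p):
                bits[j >> 3] &= ~(1 << (j & 7)) & 0xFF
        i += 1
    return bits


def is_prime(bits: bytearray, n: int) -> bool:
    """Look up n in the table; n must not exceed the sieve limit."""
    if n == 2:
        return True
    if n < 2 or n % 2 == 0:
        return False
    i = n // 2
    return bool(bits[i >> 3] & (1 << (i & 7)))


def goldbach_witness(bits: bytearray, n: int):
    """Return the smallest prime p such that n - p is also prime, else None."""
    if is_prime(bits, n - 2):
        return 2
    for p in range(3, n // 2 + 1, 2):
        if is_prime(bits, p) and is_prime(bits, n - p):
            return p
    return None


def verify_up_to(limit: int) -> bool:
    """Check that every even n with 4 <= n <= limit has a Goldbach witness."""
    bits = build_odd_sieve(limit)
    return all(goldbach_witness(bits, n) is not None for n in range(4, limit + 1, 2))
```

In this layout, a table covering the integers up to 1600 occupies 100 bytes, which is 16 times smaller than one byte per integer. That is one way to arrive at a 16x factor; the abstract does not say which baseline its own 16x figure is measured against.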

2026-03-02T15:51:57Z 11 pages, 7 tables, 2 figures. Accompanies the v1.1.0 release of GoldbachGPU (Zenodo DOI: https://zenodo.org/records/18837081) Isaac Llorente-Saguer http://arxiv.org/abs/2504.14960v3 MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core 2026-03-02T15:01:07Z

Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for Attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model Flops Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.
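The difference between token-dropping and token-dropless dispatch mentioned in the abstract above can be shown with a toy single-process sketch. It is not the Megatron-Core dispatcher or its API; the `dispatch` function, its arguments, and the first-come capacity rule are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def dispatch(
    expert_ids: Sequence[int],
    num_experts: int,
    capacity: Optional[int] = None,
) -> Tuple[Dict[int, List[int]], List[int]]:
    """Group token indices by the expert each token was routed to.

    capacity=None models dropless dispatch: every token is kept, so the
    per-expert bucket sizes vary with the routing decisions.
    With an integer capacity, each expert keeps at most that many tokens
    (in arrival order) and the overflow is returned as dropped.
    """
    buckets: Dict[int, List[int]] = {e: [] for e in range(num_experts)}
    dropped: List[int] = []
    for token, expert in enumerate(expert_ids):
        if capacity is not None and len(buckets[expert]) >= capacity:
            dropped.append(token)
        else:
            buckets[expert].append(token)
    return buckets, dropped
```

For example, routing four tokens as `[0, 0, 1, 0]` over two experts with `capacity=2` drops token 3, while the dropless call keeps it and gives expert 0 a bucket of three tokens. The uneven bucket sizes in the dropless case are a small-scale picture of the dynamic tensor shapes the abstract refers to.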

2025-04-21T08:39:47Z Dennis Liu Zijie Yan Xin Yao Tong Liu Vijay Korthikanti Evan Wu Shiqing Fan Gao Deng Hongxiao Bai Jianbin Chang Ashwath Aithal Michael Andersch Mohammad Shoeybi Jiajie Yao Chandler Zhou David Wu Xipeng Li June Yang http://arxiv.org/abs/2404.06230v2 Aggressive or Imperceptible, or Both: Network Pruning Assisted Hybrid Byzantines in Federated Learning 2026-03-02T13:55:40Z

In federated learning (FL), profiling and verifying each client is inherently difficult, which introduces a significant security vulnerability: malicious clients, commonly referred to as Byzantines, can degrade the accuracy of the global model by submitting poisoned updates during training. To mitigate this, the aggregation process at the parameter server must be robust against such adversarial behaviour. Most existing defences approach the Byzantine problem from an outlier detection perspective, treating malicious updates as statistical anomalies and ignoring the internal structure of the trained neural network (NN). Motivated by this, this work highlights the potential of leveraging side information tied to the NN architecture to design stronger, more targeted attacks. In particular, inspired by insights from sparse NNs, we introduce a hybrid sparse Byzantine attack. The attack consists of two coordinated components: (i) A sparse attack component that selectively manipulates parameters with higher sensitivity in the NN, aiming to cause maximum disruption with minimal visibility; (ii) A slow-accumulating attack component that silently poisons parameters over multiple rounds to evade detection. Together, these components create a strong but imperceptible attack strategy that can bypass common defences. We evaluate the proposed attack through extensive simulations and demonstrate its effectiveness against eight state-of-the-art defence mechanisms.

2024-04-09T11:42:32Z Emre Ozfatura Kerem Ozfatura Baturalp Buyukates Mert Coskuner Alptekin Kupcu Deniz Gunduz http://arxiv.org/abs/2603.08744v1 Extension of ACETONE C code generator for multi-core architectures 2026-03-02T13:53:59Z

As the industry's interest in machine learning has grown in recent years, some solutions have emerged to safely embed machine learning models in safety-critical systems, such as the C code generator ACETONE. However, this framework is limited to generating sequential code, which cannot make the most of multi-core architectures. In this paper, we initiate an extension of ACETONE for the generation of parallel code by formally defining our processor assignment problem and surveying the state of the art on existing solutions. In the final paper, we will introduce the completed extension, including the implementation of the scheduling heuristic, the creation of templates implementing synchronization mechanisms, and an evaluation of the worst-case execution time of the framework's layers.

2026-03-02T13:53:59Z 13th European Congress of Embedded Real Time Systems (ERTS), Feb 2026, Toulouse, France Yanis Aït-Aïssa IRIT-TRACES Thomas Carle IRIT-TRACES Sergei Chichin Benjamin Lesage Claire Pagetti http://arxiv.org/abs/2602.15529v2 Tight Communication Bounds for Distributed Algorithms in the Quantum Routing Model 2026-03-02T12:18:37Z

We present new distributed quantum algorithms for fundamental distributed computing problems, namely, leader election, broadcast, Minimum Spanning Tree (MST), and Breadth-First Search (BFS) tree, in arbitrary networks. These algorithms are (essentially) optimal with respect to their communication (message) complexity in the {\em quantum routing model} introduced in [PODC 2025]. The message complexity of our algorithms is $\tilde{O}(n)$ for leader election, broadcast, and MST, and $\tilde{O}(\sqrt{mn})$ for BFS ($n$ and $m$ are the number of nodes and edges of the network, respectively). These message bounds are nearly tight in the quantum routing model since we show almost matching corresponding quantum message lower bounds. Our results significantly improve on the prior work of [PODC 2025], who presented distributed quantum algorithms under the same model that had a message complexity of $\tilde{O}(\sqrt{mn})$ for leader election. Our algorithms demonstrate the significant communication advantage that quantum routing has over classical in distributed computing, since $Ω(m)$ is a well-established classical message lower bound for leader election, broadcast, MST, and BFS that applies even to randomized Monte-Carlo algorithms [JACM 2015]. Thus, our quantum algorithms can, in general, give a quadratic advantage in the communication cost for these fundamental problems. A main technical tool we use to design our distributed algorithms is quantum walks based on electric networks. We posit a framework for using quantum walks in the distributed setting to design communication-efficient distributed quantum algorithms. Our framework can be used as a black box to significantly reduce communication costs and may be of independent interest. Additionally, our lower-bound technique for establishing distributed quantum message lower bounds can also be applied to other problems.

2026-02-17T12:10:12Z Minor modifications compared to v1, intended to provide additional detail for the lower bounds. To prove the Query Complexity to Message Complexity Reduction Lemma (Lemma 7.3), we emphasize that we consider a slight variation of the adjacency array model with stronger queries Fabien Dufoulon Frédéric Magniez Gopal Pandurangan http://arxiv.org/abs/2603.01739v1 CA-AFP: Cluster-Aware Adaptive Federated Pruning 2026-03-02T11:04:25Z

Federated Learning (FL) faces major challenges in real-world deployments due to statistical heterogeneity across clients and system heterogeneity arising from resource-constrained devices. While clustering-based approaches mitigate statistical heterogeneity and pruning techniques improve memory and communication efficiency, these strategies are typically studied in isolation. We propose CA-AFP, a unified framework that jointly addresses both challenges by performing cluster-specific model pruning. In CA-AFP, clients are first grouped into clusters, and a separate model for each cluster is adaptively pruned during training. The framework introduces two key innovations: (1) a cluster-aware importance scoring mechanism that combines weight magnitude, intra-cluster coherence, and gradient consistency to identify parameters for pruning, and (2) an iterative pruning schedule that progressively removes parameters while enabling model self-healing through weight regrowth. We evaluate CA-AFP on two widely used human activity recognition benchmarks, UCI HAR and WISDM, under natural user-based federated partitions. Experimental results demonstrate that CA-AFP achieves a favorable balance between predictive accuracy, inter-client fairness, and communication efficiency. Compared to pruning-based baselines, CA-AFP consistently improves accuracy and lowers performance disparity across clients with limited fine-tuning, while requiring substantially less communication than dense clustering-based methods. It is also robust to different levels of non-IID data. Finally, ablation studies analyze the impact of clustering, pruning schedules, and scoring mechanisms, offering practical insights into the design of efficient and adaptive FL systems.

2026-03-02T11:04:25Z Om Govind Jha Harsh Shukla Haroon R. Lone http://arxiv.org/abs/2507.04786v3 Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms 2026-03-02T09:56:07Z

The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, its internal design remains largely opaque. The orchestration of communication channels, selection of protocols, and handling of memory movement across devices and nodes are not well understood, making it difficult to analyze performance or identify bottlenecks. This paper presents a comprehensive analysis of NCCL, focusing on its communication protocol variants (Simple, LL, and LL128), mechanisms governing intra-node and inter-node data movement, and ring- and tree-based collective communication algorithms. The insights obtained from this study serve as the foundation for ATLAHS, an application-trace-driven network simulation toolchain capable of accurately reproducing NCCL communication patterns in large-scale AI training workloads. By demystifying NCCL's internal architecture, this work provides guidance for system researchers and performance engineers working to optimize or simulate collective communication at scale.
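The ring-based collective mentioned in the abstract above follows a well-known two-phase pattern: a reduce-scatter followed by an all-gather. The sketch below simulates that data flow in a single process with one value per chunk. It models only which chunk moves where at each step, not NCCL's protocols (Simple, LL, LL128), channels, or transports, and the function name is an assumption.

```python
from typing import List


def ring_allreduce(data: List[List[int]]) -> List[List[int]]:
    """Simulate a sum all-reduce over n ranks arranged in a ring.

    data[r] is rank r's input vector and must have exactly n entries,
    one per chunk. Returns the per-rank buffers after the collective.
    """
    n = len(data)
    bufs = [list(v) for v in data]

    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) mod n
    # to its right neighbour, which adds it to its own copy. After n - 1
    # steps, rank r holds the fully reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        sends = [((r - s) % n, bufs[r][(r - s) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            bufs[(r + 1) % n][chunk] += value

    # Phase 2, all-gather: at step s, rank r forwards chunk (r + 1 - s) mod n,
    # which is already fully reduced, and the right neighbour overwrites
    # its copy with the received value.
    for s in range(n - 1):
        sends = [((r + 1 - s) % n, bufs[r][(r + 1 - s) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            bufs[(r + 1) % n][chunk] = value

    return bufs
```

Each rank sends 2(n - 1) chunks in total, so the per-rank traffic stays close to twice the vector size regardless of the number of ranks. This is the usual argument for ring algorithms on large messages.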

2025-07-07T09:03:32Z Zhiyi Hu Siyuan Shen Tommaso Bonato Sylvain Jeaugey Cedell Alexander Eric Spada James Dinan Jeff Hammond Torsten Hoefler http://arxiv.org/abs/2603.01661v1 HeRo: Adaptive Orchestration of Agentic RAG on Heterogeneous Mobile SoC 2026-03-02T09:51:01Z

With the increasing computational capability of mobile devices, deploying agentic retrieval-augmented generation (RAG) locally on heterogeneous System-on-Chips (SoCs) has become a promising way to enhance LLM-based applications. However, agentic RAG induces multi-stage workflows with heterogeneous models and dynamic execution flow, while mobile SoCs exhibit strong accelerator affinity, shape sensitivity, and shared-memory bandwidth contention, making naive scheduling ineffective. We present HeRo, a heterogeneous-aware framework for low-latency agentic RAG on mobile SoCs. HeRo builds profiling-based performance models for each sub-stage and model-PU configuration, capturing latency, workload shape, and contention-induced slowdown, and leverages them in a lightweight online scheduler that combines shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control. Experiments on commercial mobile devices show that HeRo reduces end-to-end latency by up to $10.94\times$ over existing deployment strategies, enabling practical on-device agentic RAG.

2026-03-02T09:51:01Z Will appear in DAC'2026 Maoliang Li Jiayu Chen Zihao Zheng Ziqian Li Xinhao Sun Guojie Luo Chenchen Liu Xiang Chen http://arxiv.org/abs/2603.01629v1 TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-up Cluster Design with High Bandwidth Main Memory Link 2026-03-02T09:05:43Z

Shared L1-memory clusters of streamlined instruction processors (processing elements - PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g. GP-GPUs). Scaling out these architectures by increasing the number of clusters incurs computational and power overhead, caused by the requirement to split and merge large data structures in chunks and move chunks across memory hierarchies via the high-latency global interconnect. Scaling up the cluster reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected cores-to-L1-memory crossbar grows quadratically with PE-count, posing a major physical implementation challenge. We present TeraPool, a physically implementable, >1000 floating-point-capable RISC-V PEs scaled-up cluster design, sharing a Multi-MegaByte >4000-banked L1 memory via a low latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12nm FinFET technology, TeraPool achieves a near-gigahertz frequency (910 MHz) under typical conditions (0.80 V, 25 °C). The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5pJ for memory bank accesses, just 0.74-1.1x the cost of an FP32 FMA. A high-bandwidth main memory link is designed to manage data transfers in/out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910MHz, the cluster delivers up to 1.89 single precision TFLOP/s peak performance and up to 200GFLOP/s/W energy efficiency (at a high IPC/PE of 0.8 on average) in benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in literature.

2026-03-02T09:05:43Z 14 Pages, 14 Figures, 6 Tables. Published in: IEEE Transactions on Computers ( Volume: 74, Issue: 11, November 2025) IEEE Transactions on Computers, vol. 74, no. 11, pp. 3667-3681, Nov. 2025 Yichao Zhang Marco Bertuletti Chi Zhang Samuel Riedel Diyou Shen Bowen Wang Alessandro Vanelli-Coralli Luca Benini 10.1109/TC.2025.3603692