https://arxiv.org/api/G51VdhVsp+8hvKSSl9fUNRA69dE2026-06-09T23:24:39Z288236015http://arxiv.org/abs/2606.04101v2UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing2026-06-05T06:53:34ZLarge-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns.
We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Built upon the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and layer on critical paths, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering 1.49$\times$ improvement over non-balancing, while reducing the final inter-rank imbalance from 1.30$-$4.01 to 1.01$-$1.04. Additionally, we validate UltraEP's scalability and robustness in production MoE training with 2560 GPUs.2026-06-02T18:07:51ZThe authors have identified issues related to information disclosure in the current version of the manuscript and therefore request its withdrawal. A revised version may be prepared at a later dateXinming WeiChao JinTuo DaiYinmin ZhongShan YuChengxu YangBingyang WuZili ZhangJing MaiQianchao ZhuZhouyang LiYuliang LiuGuojie Luohttp://arxiv.org/abs/2606.03163v3OpenAgenet / OAN Yellow Paper: Technical Architecture for Trust-Governed Resource Identity and Discovery2026-06-05T05:42:43ZThis yellow paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection and discoverable AI resource products. 
It specifies the role architecture, \texttt{did:oan} identity objects, registration workflow, governance-backed Root lifecycle enforcement, Root-verified package model, authorization-aware Discovery, Root-issued infrastructure authorization VCs, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations. The design is intended to support heterogeneous Agent frameworks and interaction protocols, including MCP, A2A, ANP-like systems, domain-specific Agent protocols, Skills, MCP Servers, and Tool/API resources. OAN does not define the entire business conversation among Agents or the native protocol of every resource; it defines how resource identities become admissible, discoverable, verifiable, and safe to approach before protocol-specific interaction begins.2026-06-02T05:18:14ZJinliang Xuhttp://arxiv.org/abs/2606.06910v1Communication Strategy Selection for Multi-GPU 3D FDTD with Convolutional Perfectly Matched Boundary Layers2026-06-05T05:08:58ZIn this paper we describe a communication-strategy study for multi-GPU three-dimensional finite-difference time-domain computation with convolutional perfectly matched layer boundary conditions using CUDA. The metrics used to determine the most effective implementation include runtime, throughput in millions of output points per second, strong-scaling efficiency, CPML overhead, host-staged versus direct GPU-to-GPU exchange speedup, and enlarged-ghost speedup. On a single NVIDIA Quadro RTX 6000 GPU, the CPML implementation sustains 2,889--3,290 million output points per second with less than 1\% boundary-layer overhead, providing the single-GPU baseline for the multi-GPU study. 
The results show that direct GPU-to-GPU peer exchange is the dominant optimization with a 2.46--2.76$\times$ speedup over host-staged exchange, while enlarged ghost regions give only modest benefits because the reduced communication frequency is partly offset by redundant computation and additional memory traffic. On NVIDIA Quadro RTX 8000 GPUs, the implementation gives up to a 1.51$\times$ speedup on two GPUs for the tested strong-scaling cases, while four GPUs enable larger grids that approach or exceed single-GPU memory capacity.2026-06-05T05:08:58ZVictory C. Obiekehttp://arxiv.org/abs/2603.15202v3Simple is Better: Multiplication May Be All You Need for LLM Request Scheduling2026-06-05T04:55:52ZHigh-quality LLM request scheduling requires meeting two key objectives: ensuring the routed instance has KVCache to accelerate request execution, and ensuring that the workload is balanced across instances. Achieving both objectives is challenging because pursuing one may compromise the other. Current approaches use various combinators (e.g., linear combinations) to compute a scheduling score that combines indicators for the two objectives. These approaches are complex: they either require significant workload-specific hyperparameter tuning or model-hardware-aware simulator development, yet could still lead to suboptimal performance. In this paper, we show that using a simple multiplication of two carefully chosen indicators: one KVCache-aware (new prefill tokens if routed to an instance) and one load-balancing-aware (current batch size of the instance), as the scheduling score (LMETRIC) can achieve both objectives simultaneously without any hyperparameter tuning. The key idea is that the simply multiplied score considers both objectives in a manner similar to a linear combination, but the original hyperparameters cancel out during comparison, so no tuning is needed to find the best parameters. The two indicators are chosen based on our analysis of LLM characteristics. 
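A minimal sketch of the multiplicative score described in the LMETRIC abstract above, assuming that the lowest score wins and that "new prefill tokens" equals the prompt length minus the tokens already cached on an instance. The `Instance` class, its field names, and the `kv_weight` / `load_weight` parameters are hypothetical illustrations and not the authors' implementation. The weights only show why constant factors do not change the ranking under multiplication. How a batch size of zero is handled is not stated in the abstract and is left out here.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    cached_prefix_tokens: int  # hypothetical: prompt tokens already in this instance's KVCache
    batch_size: int            # requests currently running on the instance


def new_prefill_tokens(prompt_tokens: int, inst: Instance) -> int:
    """KVCache-aware indicator: tokens that would still need prefill on this instance."""
    return max(prompt_tokens - inst.cached_prefix_tokens, 0)


def score(prompt_tokens: int, inst: Instance,
          kv_weight: float = 1.0, load_weight: float = 1.0) -> float:
    """Product of the two indicators; the weights are illustrative only."""
    return (kv_weight * new_prefill_tokens(prompt_tokens, inst)) * (load_weight * inst.batch_size)


def route(prompt_tokens: int, instances: list[Instance],
          kv_weight: float = 1.0, load_weight: float = 1.0) -> Instance:
    """Pick the instance with the smallest score (assumed convention: lower is better)."""
    return min(instances, key=lambda i: score(prompt_tokens, i, kv_weight, load_weight))


instances = [
    Instance("A", cached_prefix_tokens=900, batch_size=8),  # 100 new tokens * 8 = 800
    Instance("B", cached_prefix_tokens=0, batch_size=2),    # 1000 new tokens * 2 = 2000
    Instance("C", cached_prefix_tokens=500, batch_size=3),  # 500 new tokens * 3 = 1500
]

chosen = route(1000, instances)
# Any positive weights multiply every score by the same constant,
# so the ordering of instances is unchanged.
rescaled = route(1000, instances, kv_weight=7.0, load_weight=0.25)
print(chosen.name, rescaled.name)
```

Because both weights factor out of the product as a single positive constant, the comparison between instances is independent of them, which is one way to read the abstract's claim that no tuning is needed.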
Our extensive experiments show that this simple approach can reduce TTFT by 92% and 39%, and TPOT by 24% and 51%, compared to vLLM-v1 and an in-production scheduler on real-world workloads covering chatbots and coding agents. We also derive the mathematical conditions under which multiplication may fail, and find that such conditions are extremely rare in practice and can be detected (and mitigated) beforehand. LMETRIC has been deployed in production, and a canary release confirms its effectiveness.2026-03-16T12:43:32ZTo appear in the Proceedings of 20th USENIX Symposium on Operating Systems Design and Implementation (OSDI'26)Dingyan ZhangJinbo HanKaixi ZhangXingda WeiSijie ShenChenguang FangWenyuan YuJingren ZhouRong Chenhttp://arxiv.org/abs/2512.00711v3Cross-Domain Federated Semantic Communication with Global Representation Alignment and Domain-Aware Aggregation2026-06-05T01:56:29ZSemantic communication can significantly improve bandwidth utilization in wireless systems by exploiting the meaning behind raw data. However, the advancements achieved through semantic communication are closely dependent on the development of deep learning (DL) models for joint source-channel coding (JSCC) encoder/decoder techniques, which require a large amount of data for training. To address this data-intensive nature of DL models, federated learning (FL) has been proposed to train a model in a distributed manner, where the server broadcasts the DL model to clients in the network for training with their local data. However, the conventional FL approaches suffer from catastrophic degradation when client data are from different domains. In contrast, in this paper, a novel FL framework is proposed to address this domain shift by constructing the global representation, which aligns with the local features of the clients to preserve the semantics of different data domains. 
In addition, the dominance problem of client domains with a large number of samples is identified and, then, addressed with a domain-aware aggregation approach. This work is the first to consider the domain shift in training the semantic communication system for the image reconstruction task. Finally, simulation results demonstrate that the proposed approach outperforms the model-contrastive FL (MOON) framework by 0.5 for PSNR values under three domains at an SNR of 1 dB, and this gap continues to widen as the channel quality improves.2025-11-30T03:19:59Z13 pages, 7 figures, 6 tablesLoc X. NguyenJi Su YoonHuy Q. LeYu QiaoAvi Deb RahaEui-Nam HuhWalid SaadYumin ParkZhu HanChoong Seon Honghttp://arxiv.org/abs/2606.06818v1Terastal: Layer-Variant-based Scheduling for Real-Time Multi-DNN Workloads on Heterogeneous Accelerators2026-06-05T01:42:09ZHeterogeneous DNN accelerators improve soft real-time multi-DNN execution by mapping each layer to its preferred accelerator to reduce latency. However, under skewed workloads, large layer-latency differences across accelerators limit scheduling flexibility and increase deadline misses. To address this challenge, we introduce layer variants, customized layer implementations that reduce latency gaps on non-preferred accelerators. We then present Terastal, a soft real-time framework for layer-variant design and scheduling on heterogeneous DNN accelerators. Terastal combines offline heterogeneity-aware virtual budget assignment and layer-variant design, and online scheduling to jointly optimize accelerator mapping and variant selection under timing and accuracy constraints. Experimental results show that Terastal reduces deadline miss rate per model by 40.58%, 30.53%, and 36.27% compared with FCFS, EDF, and DREAM, respectively, while incurring only 2.24% average normalized accuracy loss across models with variants.2026-06-05T01:42:09Z8 pages, 6 figures. Accepted by RTCSA 2026. 
Author accepted manuscriptSing-Yao WuFengshuo SongEli Bozorgzadehhttp://arxiv.org/abs/2606.06751v1StageFrontier: Synchronization-Aware Stage Accounting for Distributed ML Training2026-06-04T22:22:35ZWhen a distributed training job slows down, the hard part is knowing where to look. Synchronization hides the cause: a stall on one rank shows up as a wait on the others, so a data delay on a single rank can surface as backward time across the group. The cheap dashboards that run all the time -- per-stage averages and maxima -- misread this, double-counting the same exposed delay or burying the slow rank in an average, while full profilers see it clearly but are far too heavy to leave on.
StageFrontier is an always-on signal that closes this gap. Each rank reports only a short ordered vector of coarse stage durations -- data, forward, backward, and so on -- timed with CPU wall-clock, with no synchronized clocks and no kernel tracing. At each stage boundary, StageFrontier takes the cumulative time of whichever rank is furthest along; the increments of this frontier form an exact, additive accounting of the step's exposed time and point to the stage and rank where group-visible delay first appears, telling an operator where to aim a heavy profiler, not which fix to make. The accounting is exact, but the coarse signal alone cannot tell whether a leading stage truly caused the slowdown or merely ran alongside it; StageFrontier labels the windows where that distinction needs more evidence instead of guessing.
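The frontier accounting described above can be sketched as follows. This is one reading of the abstract, not the authors' code: it assumes the frontier at each stage boundary is the largest cumulative time over all ranks, and that each rank's reported stage duration includes any time it spent waiting on other ranks. The function name `stage_frontier` and the sample numbers are hypothetical.

```python
from itertools import accumulate


def stage_frontier(durations, stages):
    """Return (stage, leading_rank, frontier_increment) for each stage.

    durations maps rank -> list of per-stage wall-clock durations in seconds,
    in the same order as `stages`.
    """
    cumulative = {rank: list(accumulate(d)) for rank, d in durations.items()}
    report = []
    previous_frontier = 0.0
    for k, stage in enumerate(stages):
        # Assumed definition: the frontier is the maximum cumulative time at boundary k.
        rank, frontier = max(
            ((r, c[k]) for r, c in cumulative.items()),
            key=lambda rc: rc[1],
        )
        report.append((stage, rank, frontier - previous_frontier))
        previous_frontier = frontier
    return report


stages = ["data", "forward", "backward"]
durations = {
    0: [1.0, 2.0, 6.0],  # rank 0: backward includes 3 s waiting on rank 1
    1: [4.0, 2.0, 3.0],  # rank 1: slow data loading
}

report = stage_frontier(durations, stages)
increments = [inc for _, _, inc in report]
per_stage_max = [max(d[k] for d in durations.values()) for k in range(len(stages))]

print(report)
print(sum(increments), sum(per_stage_max))
```

In this example the frontier increments sum to the 9-second step time and attribute 4 seconds to the data stage on rank 1, while the per-stage maxima sum to 12 seconds because rank 0's wait is counted again under backward.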
A PyTorch implementation adds under 0.2% throughput overhead through 128 ranks on Gloo and NCCL, places injected faults among its top two suspects on all 50 rows of a hidden-rank DDP test, and recovers the same top-stage routing as PyTorch Profiler, HTA, and Nsight Systems once their traces are reduced to the same coarse stages -- from a 0.11 MB summary instead of a 15.81 GB trace.2026-06-04T22:22:35Z21 pagesBoram YoonWei ChenVille Kallioniemihttp://arxiv.org/abs/2505.07833v2Harmonia: End-to-End RAG Serving Optimization2026-06-04T21:46:57ZRetrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG pipelines efficiently is challenging because requests traverse heterogeneous components spanning LLM inference, databases, and CPU-side processing. We present Harmonia, an end-to-end RAG serving framework that addresses these bottlenecks through (i) a flexible pipeline specification interface for composing custom workflows, (ii) heterogeneity-aware deployment that provisions and configures components as a distributed inference system, and (iii) a closed-loop runtime controller that monitors load and execution progress and reduces SLO violations through request prioritization and auto-scaling. Across four RAG applications, Harmonia outperforms commercial alternatives, improving throughput by more than 2.04x while reducing SLO violations by up to 78.4 percent.2025-05-01T18:58:26ZSaurabh AgarwalBodun HuLuis PabonMyungjin LeeJayanth SrinivasaAditya Akellahttp://arxiv.org/abs/2505.23131v2DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs2026-06-04T20:53:20ZWe study the problem of assigning operations in a dataflow graph to devices to minimize execution time in a work-conserving system, with emphasis on complex machine learning workloads. 
Prior learning-based methods often struggle due to three key limitations: (1) reliance on bulk-synchronous systems like TensorFlow, which under-utilize devices due to barrier synchronization; (2) lack of awareness of the scheduling mechanism of underlying systems when designing learning-based methods; and (3) exclusive dependence on reinforcement learning, ignoring the structure of effective heuristics designed by experts. In this paper, we propose Doppler, a three-stage framework for training dual-policy networks consisting of 1) a $\mathsf{SEL}$ policy for selecting operations and 2) a $\mathsf{PLC}$ policy for placing chosen operations on devices. Our experiments show that Doppler outperforms all baseline methods across tasks by reducing system execution time and additionally demonstrates sampling efficiency by reducing per-episode training time.2025-05-29T06:04:32Z32 pages, 19 figuresProceedings of the International Conference on Learning Representations (ICLR), 2026Xinyu YaoDaniel BourgeoisAbhinav JainYuxin TangJiawen YaoZhimin DingArlei SilvaChris Jermainehttp://arxiv.org/abs/2606.06687v1Towards Serverless Semi-Decentralized Federated Learning with Heterogeneous Optimizers2026-06-04T20:05:18ZWe investigate cluster formation, involving the number and composition of clusters, in decentralized federated learning (FL) with heterogeneous machine learning (ML) optimizers. While clustering in centralized FL has enabled scalability and resource savings, its value and development in fully decentralized environments have yet to be explored. Optimizing cluster formation in such environments is challenging, especially due to the complex coupling between network graph structures, local data heterogeneity, and different local ML model optimizers. To address these challenges, we propose serverless semi-decentralized FL (SSD-FL), a methodology requiring no persistent server infrastructure. 
In SSD-FL, cluster formation occurs via a lightweight, one-time device-to-device (D2D) initialization phase, after which actual ML model training (alongside consensus and convergence processes) is fully serverless. Functionally, SSD-FL segments global rounds into intra-cluster and inter-cluster regimes, ensuring global convergence and consensus through novel "effective loss functions" that integrate device-specific ML optimizers with network graph-based regularization. Next, SSD-FL leverages the consensus gap via the Cheeger inequality to develop an iterative clustering algorithm evaluated against our derived convergence and consensus bounds, which incorporate a unique scoring metric to quantify data and optimizer heterogeneity across devices. Finally, experimental evaluation against three categories of decentralized FL methodologies validates that SSD-FL improves both convergence speed and communication efficiency across various network graphs, datasets, and local optimizer regimes.2026-06-04T20:05:18ZUnder review at IEEE/ACM Transactions on NetworkingSu WangMung ChiangH. Vincent Poorhttp://arxiv.org/abs/2606.06438v1CarbonSim: A Lifecycle-Aware Framework for Evaluating Carbon Tradeoffs in Hardware Upgrade Decisions2026-06-04T17:40:13ZAs the demand for information and communication technologies (ICT) continues to rise, the environmental impact of computing systems is becoming an increasingly critical concern. Although newer hardware often improves performance and energy efficiency, these gains do not always offset the carbon cost of premature replacement, particularly under low-utilization workloads or low-carbon electricity grids. We present CarbonSim, a lifecycle-aware simulation framework for evaluating carbon tradeoffs in hardware upgrade decisions. 
CarbonSim combines workload execution profiles, machine-level power characteristics, embodied carbon inventories, scheduling policies, and time-varying grid carbon intensity to estimate total emissions under alternative deployment scenarios. The framework supports multiple embodied-carbon accounting strategies, including uniform amortization and front-loaded lifecycle attribution, enabling analysis under different hardware lifespan assumptions. Using heterogeneous CPU generations as calibration platforms, we demonstrate that newer machines do not always minimize total emissions: under lightly loaded workloads or cleaner electricity mixes, extending the useful life of existing hardware can reduce lifecycle carbon despite lower operational efficiency. These results highlight that hardware refresh decisions should be workload-aware, location-aware, and lifecycle-aware.2026-06-04T17:40:13ZKartik HansKaiwen ZhaoStephen Leehttp://arxiv.org/abs/2606.06386v1On GPU Implementation for Multi-Precision Integer Division2026-06-04T16:51:22ZThis paper presents the issues arising in implementing a fast integer division algorithm on general purpose GPUs. The algorithm uses a Newton iteration based on the shifted inverse operation, keeping all arithmetic in the integer domain and relying on data-parallel operators. The principal contribution is an efficient GPU/CUDA implementation for integer precisions from $2^{15}$ to $2^{18}$ -- sizes not supported by CGBN division. We propose algorithmic refinements, define a cost model in terms of multiplications, build on prefix sums and previous work on multi-precision multiplication, and present an evaluation showing near-optimal performance relative to the model for the target precision.2026-06-04T16:51:22ZMartin B. MarchioroAske N. RaahaugeMarc I. LøvenskjoldCosmin E. OanceaStephen M. 
Watthttp://arxiv.org/abs/2606.06381v1Discrete Incremental Voting: New Bounds for General Graphs and Expanders2026-06-04T16:47:45ZWe analyze the discrete incremental voting process (DIV) introduced by Cooper, Radzik, and Shiraga [OPODIS '23]. In this process, we consider a set $V$ of $n$ nodes connected in an undirected graph $G = (V, E)$ where each node has an integer opinion. In one step, a randomly selected node interacts with a randomly selected neighbor and changes its opinion by $1$ in the direction of the neighbor's opinion. The process converges to a unique opinion that, in expectation, is the degree-weighted average of the initial opinions.
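A small simulation sketch of the DIV step described above, assuming that both the node and its neighbor are chosen uniformly at random and that a node whose opinion already matches its neighbor's does nothing. The function names, the 6-node cycle, and the `max_steps` safety cap are illustrative choices, not part of the paper.

```python
import random


def div_step(adj, opinions, rng):
    """One DIV step: a random node moves its opinion by 1 toward a random neighbor's."""
    v = rng.choice(list(adj))
    u = rng.choice(adj[v])
    if opinions[u] > opinions[v]:
        opinions[v] += 1
    elif opinions[u] < opinions[v]:
        opinions[v] -= 1


def run_div(adj, opinions, rng, max_steps=1_000_000):
    """Run until all opinions agree (or the safety cap is hit); return (opinions, steps)."""
    opinions = dict(opinions)  # work on a copy so the caller's dict is untouched
    for step in range(max_steps):
        if len(set(opinions.values())) == 1:
            return opinions, step
        div_step(adj, opinions, rng)
    return opinions, max_steps


# Hypothetical example: a cycle on 6 nodes with initial opinions 0..5.
n = 6
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
initial = {v: v for v in range(n)}

final, steps = run_div(adj, initial, random.Random(0))
print(final, steps)
```

Since every update moves an opinion toward a neighbor's value, no opinion can leave the interval spanned by the initial opinions, so the consensus value always lies in that interval.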
We show that if the graph has conductance $\Phi(G)$, the ratio of the average to smallest degree is $\gamma(G)$, and the maximal difference between initial opinions is $K$, then the expected convergence time is $O\left(n\left(K\log(Kn)+\gamma(G)\,n\right)/\Phi(G)^2\right)$. This bound is essentially optimal for a large class of graphs of bounded expansion. We also show that for regular graphs, if the second largest eigenvalue is $o(1/\log^2 n)$ and $K$ is $o\left(n/\log^2 n\right)$, then w.h.p. DIV converges to the initial average opinion (rounded up or down).2026-06-04T16:47:45ZPetra BerenbrinkColin CooperThorsten GötteLukas HintzeTomasz Radzikhttp://arxiv.org/abs/2604.24027v2KubePACS: Kubernetes Cluster Using Performant, Highly Available, and Cost Efficient Spot Instances2026-06-04T15:46:23ZCloud users aim to minimize cost while maximizing performance by selecting the most suitable instance types for their workloads. To reduce expenses, spot instances have been widely adopted due to their steep discounts compared to on-demand pricing. However, their use introduces reliability risks due to potential interruptions, and existing research has primarily focused on mitigating this trade-off from a cost or availability perspective alone. Despite the diversity in hardware capabilities among instance types, current provisioning systems tend to ignore performance variation, selecting nodes solely based on minimum resource requirements. In this paper, we present KubePACS, a Kubernetes-native spot instance provisioning system that constructs node pools optimized for both cost and performance while guaranteeing high availability. KubePACS formulates the node selection process as a multi-objective optimization problem, incorporating real-time data such as spot prices, performance benchmarks, and availability scores, including the multi-node Spot Placement Score (SPS). 
It solves this problem efficiently using an Integer Linear Programming (ILP) approach guided by the Golden Section Search (GSS) algorithm to find the optimal configuration. By integrating with the Karpenter node autoscaler, KubePACS jointly optimizes instance-type selection and node scaling decisions within a standard provisioning workflow. KubePACS also adopts a novel heuristic to support workload-specific preferences by scaling performance metrics for specialized instances. Through extensive evaluation across synthetic and real-world workloads, KubePACS demonstrates on average 55.09% and up to 81.06% higher performance per dollar over state-of-the-art solutions such as Karpenter, SpotVerse, and SpotKube, which only reference the spot instance prices and limited availability data.2026-04-27T04:28:24ZAccepted to the 27th ACM International Middleware Conference (Middleware 2026)Taeyoon KimKyumin KimEnrique Molina-GiménezPedro García-LópezKyungyong Leehttp://arxiv.org/abs/2606.06255v1RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning2026-06-04T14:57:05ZPoint clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the most well-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the large time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. 
By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest end-to-end inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.2026-06-04T14:57:05Z28 pages, 15 figuresZiyang YuXiang LiQiong ChangJun Miyazaki