http://arxiv.org/abs/2602.18673v3
When Coordination Is Avoidable: A Monotonicity Analysis of Organizational Tasks
Updated: 2026-05-26T18:00:03Z
Organizations devote substantial resources to coordination, yet which tasks actually require it for correctness remains unclear. The problem is acute in multi-agent AI systems, where coordination cost is directly measurable and can exceed the cost of the work itself. Distributed systems theory provides a precise criterion: coordination is required when a task specification is non-monotonic, meaning that as histories grow, new information can invalidate prior conclusions. Here we show that Thompson's classic taxonomy of interdependence maps to that criterion, yielding a decision rule for when coordination is required for correctness. We formalize the correspondence in a bridge theorem, apply the rule to 65 APQC workflows and (with a calibrated LLM) 13,417 O*NET tasks, and illustrate it in multi-agent AI simulations. Under our decompositions, 74% of workflows and 42% of O*NET tasks are monotonic, implying that up to 24-57% of coordination spending is unnecessary for correctness.
Published: 2026-02-21T00:55:09Z
Comments: 25 pages, 1 figure, 10 tables
Authors: Harang Ju

http://arxiv.org/abs/2605.13779v2
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
Updated: 2026-05-26T16:10:31Z
We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes.
Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
Published: 2026-05-13T16:59:08Z
Comments: 30 pages, technical report
Authors: Mind Lab: Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Zhihui Li, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Changhai Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

http://arxiv.org/abs/2605.24461v2
Provisioning to Runtime Optimization of a 100 MW-Scale AI Cluster
Updated: 2026-05-26T16:02:20Z
The electric power supply for AI data centers is now the most significant bottleneck in the race toward Artificial General Intelligence, surpassing even the constraint of AI accelerator availability.
To our knowledge, this paper is the first to describe the end-to-end power management process for a hyper-scale AI datacenter: from early power planning to accommodate next-generation accelerators 6-12 months before their general availability, to tuning power settings after large-scale deployment, and finally to dynamic, runtime power management for evolving workloads. We present detailed power measurements for a 150 MW datacenter hosting a cluster of 83K GB200 GPUs. We share insights from building this state-of-the-art AI cluster. We hope this work encourages practitioners across the industry to share their own experiences as well.
Published: 2026-05-23T08:18:01Z
Authors: Ehsan K. Ardestani, Leonardo Piga, Jovan Stojkovic, Pavan Balaji, Mustafa Ozdal, Mikel Jimenez Fernandez, Mihaela Dimovska, Luka Tadic, Hao Shen, Devika Vishwanath, Richa Mishra, Melaku Mihret, Valentin Andrei, Mauricio Cespedes, Julien Prigent, James Monahan, Tyler Graf, Bin Li, Charles Marquez, Shobhit Kanaujia, Kaushik Veeraraghavan, Chunqiang Tang

http://arxiv.org/abs/2601.21972v5
Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
Updated: 2026-05-26T15:41:11Z
Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues; thus, we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial.
We propose two MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.
Published: 2026-01-29T16:50:30Z
Authors: Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato

http://arxiv.org/abs/2603.10768v2
HuntMS: A Framework for Microservice Geo-Distribution for Carbon and Cost Reduction
Updated: 2026-05-26T14:45:27Z
Microservices are a dominant architecture in cloud computing, offering scalability and modularity, but also posing complex deployment challenges. As data centers contribute significantly to global carbon emissions, carbon-aware scheduling has emerged as a promising mitigation strategy. However, most existing solutions target batch, high-performance, or serverless workloads and assume access to global-scale infrastructure. Such an assumption does not hold for many national or regional small to medium-sized enterprises (SMEs) with microservice applications, which represent the real-world majority. In this paper, we present HuntMS, an Adaptive Carbon and Efficiency-aware placement for microservices that considers carbon, cost, and latency constraints. HuntMS dynamically places microservices across geographically constrained regions using a scalable optimization strategy that leverages insight-based search space pruning techniques. Evaluation on a real-world deployment shows that HuntMS quickly adapts to real-time changes in workload and carbon intensity and reduces carbon emissions by 37.4% and operational cost by 3.6%, on average, compared to a static deployment within a single country, while consistently meeting SLOs.
In this way, HuntMS enables carbon- and cost-aware microservice deployment for latency-sensitive applications in regionally limited infrastructures for SMEs.
Published: 2026-03-11T13:45:58Z
Authors: Georgia Christofidi, Francisco Álvarez-Terribas, Ioannis Roumpos, Nicolas Kourtellis, Jesus Omaña Iglesias, Thaleia Dimitra Doudali

http://arxiv.org/abs/2605.27106v1
Autonomic Federated-Market Orchestration for the Edge-Cloud Continuum
Updated: 2026-05-26T14:44:21Z
The edge-cloud computing continuum demands self-management mechanisms that scale across autonomous administrative domains while honouring tenant- and operator-specified data sovereignty. We present Neural Pub/Sub, a federated-broker autonomic substrate whose self-organising behaviour emerges from market-based price signals rather than centralised control. Its MAPE-K control loop closes over per-broker health and load monitoring, marginal-cost clearing-price analysis, placement planning over a polymatroidal feasibility region, federated cross-domain dispatch, and shared peer subscription summaries with bounded-staleness price signals. The Plan step is anchored in a Walrasian convergence proposition: under gross-substitutes valuations on tree and series-parallel service-dependency DAGs, decentralised price-based allocation matches the welfare of a centralised oracle. We evaluate the substrate on a 4-VM, 4-domain, 48-worker federated edge-cloud testbed (single data centre, 50 ms emulated WAN) in a 1005-run campaign augmented by a fair-process-count sharded-oracle comparator. The federated market dominates a single-process oracle by 2-4% with 45 of 45 per-seed wins (sign-test p ~ 2.8e-14, Hodges-Lehmann median -39.6 ms); against a four-shard centralised orchestrator at equal process count the gap stays within +/-1.5% across all nine (pipeline, load) cells.
Round-robin completion rate collapses 98.8% -> 22.4% -> 3.3% across arrival rates 5/10/15 pps while the market preserves completion; the advantage decomposes into three Walrasian properties (information completeness, admission control, price discovery). Federation withstands broker death and network partition (completion rate >= 98.7% across 75 cells), and sovereignty enforcement adds no measurable runtime overhead across 60 governance-grid runs. Heterogeneous-domain stressors and cross-site WAN deployment remain future work.
Published: 2026-05-26T14:44:21Z
Comments: 35 pages, 5 figures (combined main paper + electronic supplement, folded into one document for arXiv)
Authors: Lauri Lovén, Roberto Morabito, Abhishek Kumar, Susanna Pirttikangas, Jukka Riekki, Sasu Tarkoma

http://arxiv.org/abs/2605.27081v1
ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference
Updated: 2026-05-26T14:32:56Z
Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead.
We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation.
Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77-1.99$\times$ decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE.
Published: 2026-05-26T14:32:56Z
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Authors: Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang, Liang Wang, Limin Xiao

http://arxiv.org/abs/2605.05774v2
SuperPaymaster: Eliminating Centralized Signer Authority via Asset-Oriented Abstraction to Reconcile Usability and Decentralization in Account Abstraction
Updated: 2026-05-26T14:12:07Z
Most production ERC-4337 Paymasters rely on Process-Oriented Abstraction (POA): a centralized off-chain server signs each sponsorship request, acting as a potential censorship bottleneck. We propose Asset-Oriented Abstraction (AOA), encapsulating payment capability in a persistent, user-owned on-chain asset -- the Gas Card -- rather than an off-chain signing process. Following the Design Science Research (DSR) methodology, we implement SuperPaymaster on Optimism Mainnet, anchoring sponsorship validity in on-chain Soulbound Token state and deterministic policy rules, removing the off-chain signer as a validity gate. We evaluate gas costs via single-UserOp ERC-20 transfers on Optimism Mainnet (n = 50 per system). In pure L2 execution gas (txGasUsed; actualGasUsed = txGasUsed + PVG), SuperPaymaster (167,830) is lower than both evaluated POA baselines: Alchemy Gas Manager (205,951) and Pimlico ERC-20 paymaster (328,937). It still pays a ~32,000-gas on-chain verification overhead versus Alchemy, but reduces gas by 49% versus Pimlico by replacing on-chain token liquidation with an internal balance update.
In total billed gas, SuperPaymaster (286,818) exceeds Alchemy (257,299) due to higher bundler PVG overhead, not paymaster architecture. Code structural analysis and on-chain Mainnet evidence confirm that sponsorship validity requires no off-chain signing server: validatePaymasterUserOp reads only on-chain state. These findings suggest that AOA can mitigate the usability-decentralization-efficiency trade-offs in gas payment.
Published: 2026-05-07T07:07:00Z
Comments: 56 pages, 13 figures. Includes an Optimism Mainnet measurement study (n=50 per system) and a GOMS cognitive analysis
Authors: Huifeng Jiao, Nathapon Udomlertsakul

http://arxiv.org/abs/2512.22173v2
Accelerating discovery across scientific disciplines through reproducible workflows with AiiDAlab
Updated: 2026-05-26T13:59:08Z
With ever-increasing computational capabilities, robust and automated research workflows have become essential for orchestrating large numbers of interdependent simulations. However, significant technical expertise is still required to configure execution environments, define calculation inputs, interpret outputs, and manage the complexity of parallel code execution on remote machines. To address these challenges, we developed AiiDAlab, a Jupyter-based web platform powered by the AiiDA computational infrastructure that provides a framework for managing and automating computational workflows while ensuring reproducibility through full provenance tracking. Through a collection of open-source user-friendly applications, AiiDAlab enables scientists to set up, execute, and analyze complex computational workflows without interacting directly with the underlying technical details, allowing them to focus on their research questions. In this paper, we discuss how AiiDAlab has matured over the past few years, expanding beyond computational materials science and its AiiDA origins.
We present recent developments towards integrating with electronic laboratory notebooks (ELNs) for FAIR-compliant data management, adoption in large-scale facilities for secure access to experimental data and analytical tools, and applications in educational settings. Together with community-driven efforts to simplify onboarding, improve access to computational resources, and support large-scale data workflows, these advancements position AiiDAlab as a powerful platform for accelerating scientific discovery and fostering collaboration across disciplines.
Published: 2025-12-18T08:34:08Z
Comments: Yakutovich, Hollas, Bainglass, and Yu are co-first authors
Journal reference: Digit. Discov. 5, 2310-2324 (2026)
Authors: Aliaksandr V. Yakutovich, Daniel Hollas, Edan Bainglass, Jusong Yu, Corsin Battaglia, Miki Bonacci, Lucas Fernandez Vilanova, Stephan Henne, Anders Kaestner, Michel Kenzelmann, Graham Kimbell, Jakob Lass, Fabio Lopes, Daniel G. Mazzone, Andres Ortega-Guerrero, Xing Wang, Nicola Marzari, Carlo A. Pignedoli, Giovanni Pizzi
DOI: 10.1039/D5DD00567A

http://arxiv.org/abs/2503.22452v4
On the Solvability of Byzantine-tolerant Reliable Communication in Dynamic Networks
Updated: 2026-05-26T13:32:37Z
A reliable communication primitive guarantees the delivery, integrity, and authorship of messages exchanged between correct processes of a distributed system. We investigate the necessary and sufficient conditions for reliable communication in dynamic networks, where the network topology evolves over time despite the presence of a limited number of Byzantine faulty processes that may behave arbitrarily (i.e., in the globally bounded Byzantine failure model).
We identify classes of dynamic networks where such conditions are satisfied, and extend our analysis to message losses, local computation with unbounded finite delay, and authenticated messages.
Published: 2025-03-28T14:05:33Z
Authors: Silvia Bonomi (DIAG UNIROMA), Giovanni Farina (UNICUSANO), Sébastien Tixeuil (NPA)

http://arxiv.org/abs/2606.07574v1
Accelerating Birkhoff Projection for Manifold-Constrained Hyper-Connections
Updated: 2026-05-26T13:11:08Z
Manifold-constrained hyper-connections (mHCs) have recently been proposed as a principled extension of hyper-connections, where the residual mixing matrices are constrained to be doubly stochastic via projection onto the Birkhoff polytope. In practical mHC implementations, this constraint is enforced by Sinkhorn-Knopp iterations, and the backward pass relies on unrolling the iterative solver. This design introduces substantial computation and memory overhead, and may also yield inaccurate projections when the algorithm converges slowly on challenging inputs, undermining the intended norm-control and stability guarantees of mHCs.
In this work, we focus on the practically important 4x4 Birkhoff projection setting and develop an end-to-end acceleration framework. By leveraging the dual formulation, we reduce the problem to a three-dimensional unconstrained convex problem and solve it with Newton's method, achieving fast convergence and high accuracy. For the backward pass, we replace the unrolled differentiation with implicit differentiation, yielding exact gradients without storing intermediate states. To exploit massive parallelism, we design a warp-level CUDA kernel that uses only register-level primitives, avoiding global and shared memory I/O.
Extensive experiments against representative open-source baselines demonstrate that the proposed solver yields substantially more reliable doubly stochastic projections -- especially when the input magnitude is large -- and achieves significant end-to-end speedups (including the backward pass), reaching over 20x acceleration at large batch sizes while maintaining orders of magnitude smaller marginal errors.
Published: 2026-05-26T13:11:08Z
Authors: Chenrui Wang, Yixuan Qiu

http://arxiv.org/abs/2605.26975v1
Nonlinear spectral clustering with C++ GraphBLAS
Updated: 2026-05-26T13:00:36Z
Nonlinear reformulations of the spectral clustering method have gained a lot of recent attention due to their increased numerical benefits and their solid mathematical background. However, the estimation of the multiple nonlinear eigenvectors is associated with an increased computational cost. We present an implementation of a direct multiway spectral clustering algorithm in the $p$-norm, for $p\in(1,2]$, using a novel C++ GraphBLAS API. The key operations are expressed in linear algebraic terms and are executed over the resulting sparse matrices and dense vectors, parameterized in the algebra pertinent to the computation. We demonstrate the effectiveness and accuracy of our shared-memory algorithm on several artificial test cases. Our numerical examples and comparative results against competitive methods indicate that the proposed implementation attains high quality clusters in terms of the balanced graph cut metric.
The strong scaling capabilities of our algorithm are showcased on a range of datasets with up to $8$ million nodes and $48$ million edges.
Published: 2026-05-26T13:00:36Z
Comments: Outstanding short paper award, IEEE High Performance Extreme Computing Conference (HPEC), 25 - 29 September 2023
Authors: Dimosthenis Pasadakis, Olaf Schenk, Verner Vlacic, Albert-Jan Yzelman

http://arxiv.org/abs/2605.26960v1
Extreme-Scale Interconnection Networks
Updated: 2026-05-26T12:50:20Z
Extreme-scale data centers are the backbone of next-generation computing, enabling breakthroughs in science, artificial intelligence, and global innovation through unprecedented processing power and scalability. This work examines leaf-spine network topologies that offer extreme scalability--connecting a vast number of endpoints--while delivering strong performance at low cost. It takes as a starting point two alternatives to the widely used Fat-Tree topology: the Orthogonal Fat-Tree and the Random Folded Clos. The resulting Multipass Random Leaf-Spine (MRLS) networks inherit their advantages and surpass Fat-Trees in both throughput and flexibility. To fully leverage the topological properties of these networks, various non-minimal routing strategies are considered. An exhaustive evaluation using an interconnection network simulator provides insight into the trade-offs and scalability of these topologies under realistic conditions, positioning them as a promising solution for extreme-scale systems.
The MRLS achieves a 50% speedup against a Fat-Tree for an All2All collective comprising 100k endpoints, and 100% against Dragonfly networks for the same collective.
Published: 2026-05-26T12:50:20Z
Authors: Alejandro Cano, Cristina Brinza, Cristóbal Camarero, Carmen Martínez, Ramón Beivide

http://arxiv.org/abs/2605.26930v1
Revisiting Bruck: Phase-Efficient All-to-All Communication in Reconfigurable Networks
Updated: 2026-05-26T12:24:04Z
All-to-All communication is a key performance bottleneck for distributed machine learning (ML) and high-performance computing (HPC) workloads, where dense traffic increasingly stresses scale-up interconnects. While these ML and HPC workloads have driven unprecedented infrastructure demand, optical reconfigurable networks (ORNs) offer a promising path forward. By adapting the physical topology to the active workload, they improve communication cost and bandwidth utilization. However, their benefit is critically contingent on whether the collective consists of structured phases that can be served by sparse and reusable topology states.
In this paper, we revisit Bruck's All-to-All implementation and demonstrate the benefits of topology optimization in which both communication pattern and reconfiguration strategy are co-designed. We present ReTri, a bidirectional All-to-All schedule for ORNs. ReTri uses balanced ternary block propagation to complete All-to-All in $\lceil \log_3 n\rceil$ phases. The reconfiguration strategy induced by ReTri's pairwise bidirectional exchanges allows reconfiguration delays to be amortized across multiple phases. Preliminary simulations show that ReTri improves completion time by up to $10\times$ over static All-to-All, even for millisecond-scale reconfiguration delays, and improves on reconfigurable Bruck by up to $2.1\times$.
Published: 2026-05-26T12:24:04Z
Authors: Anton Juerss, Stefan Schmid

http://arxiv.org/abs/2604.28059v2
NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures
Updated: 2026-05-26T08:45:02Z
Spiking neural networks (SNNs) are a promising paradigm for energy-efficient event-driven computation, but large-scale SNN execution remains challenging because sparse spike communication and synchronization can dominate runtime. Existing solutions across CPU, GPU, ASIC, and FPGA platforms offer different trade-offs between programmability, efficiency, and scalability. To address this gap, we present NeuroRing, a modular and scalable SNN accelerator based on a stream-dataflow architecture and a bidirectional ring topology, implemented in High-Level Synthesis (HLS) on FPGAs. NeuroRing supports modular single- and multi-FPGA deployment and is compatible with existing SNN workflows through integration with the NEST simulator. We evaluate NeuroRing on the cortical microcircuit benchmark and a Sudoku constraint-satisfaction workload.
Results show that NeuroRing preserves the key activity statistics of the NEST reference model, achieves faster-than-real-time execution of the full-scale cortical microcircuit with a real-time factor (RTF) of 0.83, exhibits meaningful strong and weak scaling, and provides competitive energy efficiency on two programmable FPGAs. These results position NeuroRing as a flexible and scalable platform for both neuroscience simulation and broader event-driven applications.
Published: 2026-04-30T16:04:26Z
Comments: Accepted at Euro-Par 2026
Authors: Muhammad Ihsan Al Hafiz, Artur Podobas