http://arxiv.org/abs/2602.18673v3 When Coordination Is Avoidable: A Monotonicity Analysis of Organizational Tasks 2026-05-26T18:00:03Z Organizations devote substantial resources to coordination, yet which tasks actually require it for correctness remains unclear. The problem is acute in multi-agent AI systems, where coordination cost is directly measurable and can exceed the cost of the work itself. Distributed systems theory provides a precise criterion: coordination is required when a task specification is non-monotonic, meaning that as histories grow, new information can invalidate prior conclusions. Here we show that Thompson's classic taxonomy of interdependence maps to that criterion, yielding a decision rule for when coordination is required for correctness. We formalize the correspondence in a bridge theorem, apply the rule to 65 APQC workflows and (with a calibrated LLM) 13,417 O*NET tasks, and illustrate it in multi-agent AI simulations. Under our decompositions, 74% of workflows and 42% of O*NET tasks are monotonic, implying that up to 24-57% of coordination spending is unnecessary for correctness. 2026-02-21T00:55:09Z 25 pages, 1 figure, 10 tables Harang Ju http://arxiv.org/abs/2605.13779v2 MinT: Managed Infrastructure for Training and Serving Millions of LLMs 2026-05-26T16:10:31Z We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. 
Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models. 2026-05-13T16:59:08Z 30 pages, technical report Mind Lab : Song Cao Vic Cao Andrew Chen Kaijie Chen Cleon Cheng Steven Chiang Kaixuan Fan Hera Feng Huan Feng Arthur Fu Jun Gao Hongquan Gu Aaron Guan Nolan Ho Mutian Hong Hailee Hou Peixuan Hua Charles Huang Miles Jiang Nora Jiang Yuyi Jiang Qiuyu Jin Fancy Kong Andrew Lei Kyrie Lei Alexy Li Lucian Li Ray Li Theo Li Zhihui Li Jiayi Lin Kairus Liu Kieran Liu Logan Liu Xiang Liu Irvine Lu Maeve Luo Runze Lv Pony Ma Verity Niu Anson Qiu Vincent Wang Rio Yang Maxwell Yao Carrie Ye Regis Ye Wenlin Ye Josh Ying Danney Zeng Yuhan Zhan Anya Zhang Di Zhang Ruijia Zhang Sueky Zhang Ya Zhang Wei Zhao Ada Zhou Changhai Zhou Yuhua Zhou Xinyue Zhu Murphy Zhuang http://arxiv.org/abs/2605.24461v2 Provisioning to Runtime Optimization of a 100 MW-Scale AI Cluster 2026-05-26T16:02:20Z The electric power supply for AI data centers is now the most significant bottleneck in the race toward Artificial General Intelligence, surpassing even the constraint of AI 
accelerator availability. To our knowledge, this paper is the first to describe the end-to-end power management process for a hyper-scale AI datacenter: from early power planning to accommodate next-generation accelerators 6-12 months before their general availability, to tuning power settings after large-scale deployment, and finally to dynamic, runtime power management for evolving workloads. We present detailed power measurements for a 150 MW datacenter hosting a cluster of 83K GB200 GPUs. We share insights from building this state-of-the-art AI cluster. We hope this work encourages practitioners across the industry to share their own experiences as well. 2026-05-23T08:18:01Z Ehsan K. Ardestani Leonardo Piga Jovan Stojkovic Pavan Balaji Mustafa Ozdal Mikel Jimenez Fernandez Mihaela Dimovska Luka Tadic Hao Shen Devika Vishwanath Richa Mishra Melaku Mihret Valentin Andrei Mauricio Cespedes Julien Prigent James Monahan Tyler Graf Bin Li Charles Marquez Shobhit Kanaujia Kaushik Veeraraghavan Chunqiang Tang http://arxiv.org/abs/2601.21972v5 Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic 2026-05-26T15:41:11Z Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues; thus, we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. 
We propose two MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge. 2026-01-29T16:50:30Z Shuo Liu Tianle Chen Ryan Amiri Christopher Amato http://arxiv.org/abs/2603.10768v2 HuntMS: A Framework for Microservice Geo-Distribution for Carbon and Cost Reduction 2026-05-26T14:45:27Z Microservices are a dominant architecture in cloud computing, offering scalability and modularity, but also posing complex deployment challenges. As data centers contribute significantly to global carbon emissions, carbon-aware scheduling has emerged as a promising mitigation strategy. However, most existing solutions target batch, high-performance, or serverless workloads and assume access to global-scale infrastructure. Such an assumption does not hold for many national or regional small to medium-sized enterprises (SMEs) with microservice applications, which represent the real-world majority. In this paper, we present HuntMS, an Adaptive Carbon and Efficiency-aware placement for microservices that considers carbon, cost, and latency constraints. HuntMS dynamically places microservices across geographically constrained regions using a scalable optimization strategy that leverages insight-based search space pruning techniques. Evaluation on a real-world deployment shows that HuntMS quickly adapts to real-time changes in workload and carbon intensity and reduces carbon emissions by 37.4% and operational cost by 3.6%, on average, compared to a static deployment within a single country, while consistently meeting SLOs. 
In this way, HuntMS enables carbon- and cost-aware microservice deployment for latency-sensitive applications in regionally limited infrastructures for SMEs. 2026-03-11T13:45:58Z Georgia Christofidi Francisco Álvarez-Terribas Ioannis Roumpos Nicolas Kourtellis Jesus Omaña Iglesias Thaleia Dimitra Doudali http://arxiv.org/abs/2605.27106v1 Autonomic Federated-Market Orchestration for the Edge-Cloud Continuum 2026-05-26T14:44:21Z The edge-cloud computing continuum demands self-management mechanisms that scale across autonomous administrative domains while honouring tenant- and operator-specified data sovereignty. We present Neural Pub/Sub, a federated-broker autonomic substrate whose self-organising behaviour emerges from market-based price signals rather than centralised control. Its MAPE-K control loop closes over per-broker health and load monitoring, marginal-cost clearing-price analysis, placement planning over a polymatroidal feasibility region, federated cross-domain dispatch, and shared peer subscription summaries with bounded-staleness price signals. The Plan step is anchored in a Walrasian convergence proposition: under gross-substitutes valuations on tree and series-parallel service-dependency DAGs, decentralised price-based allocation matches the welfare of a centralised oracle. We evaluate the substrate on a 4-VM, 4-domain, 48-worker federated edge-cloud testbed (single data centre, 50 ms emulated WAN) in a 1005-run campaign augmented by a fair-process-count sharded-oracle comparator. The federated market dominates a single-process oracle by 2-4% with 45 of 45 per-seed wins (sign-test p ~ 2.8e-14, Hodges-Lehmann median -39.6 ms); against a four-shard centralised orchestrator at equal process count the gap stays within +/-1.5% across all nine (pipeline, load) cells. 
Round-robin completion rate collapses 98.8% -> 22.4% -> 3.3% across arrival rates 5/10/15 pps while the market preserves completion; the advantage decomposes into three Walrasian properties (information completeness, admission control, price discovery). Federation withstands broker death and network partition (completion rate >= 98.7% across 75 cells), and sovereignty enforcement adds no measurable runtime overhead across 60 governance-grid runs. Heterogeneous-domain stressors and cross-site WAN deployment remain future work. 2026-05-26T14:44:21Z 35 pages, 5 figures (combined main paper + electronic supplement, folded into one document for arXiv) Lauri Lovén Roberto Morabito Abhishek Kumar Susanna Pirttikangas Jukka Riekki Sasu Tarkoma http://arxiv.org/abs/2605.27081v1 ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference 2026-05-26T14:32:56Z Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. 
Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77-1.99$\times$ decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE. 2026-05-26T14:32:56Z Accepted at the 43rd International Conference on Machine Learning (ICML 2026) Xiongwei Zhu Xiaojian Liao Tianyang Jiang Yusen Zhang Liang Wang Limin Xiao http://arxiv.org/abs/2605.05774v2 SuperPaymaster: Eliminating Centralized Signer Authority via Asset-Oriented Abstraction to Reconcile Usability and Decentralization in Account Abstraction 2026-05-26T14:12:07Z Most production ERC-4337 Paymasters rely on Process-Oriented Abstraction (POA): a centralized off-chain server signs each sponsorship request, acting as a potential censorship bottleneck. We propose Asset-Oriented Abstraction (AOA), encapsulating payment capability in a persistent, user-owned on-chain asset -- the Gas Card -- rather than an off-chain signing process. Following the Design Science Research (DSR) methodology, we implement SuperPaymaster on Optimism Mainnet, anchoring sponsorship validity in on-chain Soulbound Token state and deterministic policy rules, removing the off-chain signer as a validity gate. We evaluate gas costs via single-UserOp ERC-20 transfers on Optimism Mainnet (n = 50 per system). In pure L2 execution gas (txGasUsed; actualGasUsed = txGasUsed + PVG), SuperPaymaster (167,830) is lower than both evaluated POA baselines: Alchemy Gas Manager (205,951) and Pimlico ERC-20 paymaster (328,937). It still pays a ~32,000-gas on-chain verification overhead versus Alchemy, but reduces gas by 49% versus Pimlico by replacing on-chain token liquidation with an internal balance update. 
In total billed gas, SuperPaymaster (286,818) exceeds Alchemy (257,299) due to higher bundler PVG overhead, not paymaster architecture. Code structural analysis and on-chain Mainnet evidence confirm that sponsorship validity requires no off-chain signing server: validatePaymasterUserOp reads only on-chain state. These findings suggest that AOA can mitigate the usability-decentralization-efficiency trade-offs in gas payment. 2026-05-07T07:07:00Z 56 pages, 13 figures. Includes an Optimism Mainnet measurement study (n=50 per system) and a GOMS cognitive analysis Huifeng Jiao Nathapon Udomlertsakul http://arxiv.org/abs/2512.22173v2 Accelerating discovery across scientific disciplines through reproducible workflows with AiiDAlab 2026-05-26T13:59:08Z With ever-increasing computational capabilities, robust and automated research workflows have become essential for orchestrating large numbers of interdependent simulations. However, significant technical expertise is still required to configure execution environments, define calculation inputs, interpret outputs, and manage the complexity of parallel code execution on remote machines. To address these challenges, we developed AiiDAlab, a Jupyter-based web platform powered by the AiiDA computational infrastructure that provides a framework for managing and automating computational workflows while ensuring reproducibility through full provenance tracking. Through a collection of open-source user-friendly applications, AiiDAlab enables scientists to set up, execute, and analyze complex computational workflows without interacting directly with the underlying technical details, allowing them to focus on their research questions. In this paper, we discuss how AiiDAlab has matured over the past few years, expanding beyond computational materials science and its AiiDA origins. 
We present recent developments towards integrating with electronic laboratory notebooks (ELNs) for FAIR-compliant data management, adoption in large-scale facilities for secure access to experimental data and analytical tools, and applications in educational settings. Together with community-driven efforts to simplify onboarding, improve access to computational resources, and support large-scale data workflows, these advancements position AiiDAlab as a powerful platform for accelerating scientific discovery and fostering collaboration across disciplines. 2025-12-18T08:34:08Z Yakutovich, Hollas, Bainglass, and Yu are co-first authors Digit. Discov. 5, 2310-2324 (2026) Aliaksandr V. Yakutovich Daniel Hollas Edan Bainglass Jusong Yu Corsin Battaglia Miki Bonacci Lucas Fernandez Vilanova Stephan Henne Anders Kaestner Michel Kenzelmann Graham Kimbell Jakob Lass Fabio Lopes Daniel G. Mazzone Andres Ortega-Guerrero Xing Wang Nicola Marzari Carlo A. Pignedoli Giovanni Pizzi 10.1039/D5DD00567A http://arxiv.org/abs/2503.22452v4 On the Solvability of Byzantine-tolerant Reliable Communication in Dynamic Networks 2026-05-26T13:32:37Z A reliable communication primitive guarantees the delivery, integrity, and authorship of messages exchanged between correct processes of a distributed system. We investigate the necessary and sufficient conditions for reliable communication in dynamic networks, where the network topology evolves over time despite the presence of a limited number of Byzantine faulty processes that may behave arbitrarily (i.e., in the globally bounded Byzantine failure model). We identify classes of dynamic networks where such conditions are satisfied, and extend our analysis to message losses, local computation with unbounded finite delay, and authenticated messages. 
2025-03-28T14:05:33Z Silvia Bonomi DIAG UNIROMA Giovanni Farina UNICUSANO Sébastien Tixeuil NPA http://arxiv.org/abs/2606.07574v1 Accelerating Birkhoff Projection for Manifold-Constrained Hyper-Connections 2026-05-26T13:11:08Z Manifold-constrained hyper-connections (mHCs) have recently been proposed as a principled extension of hyper-connections, where the residual mixing matrices are constrained to be doubly stochastic via projection onto the Birkhoff polytope. In practical mHC implementations, this constraint is enforced by Sinkhorn-Knopp iterations, and the backward pass relies on unrolling the iterative solver. This design introduces substantial computation and memory overhead, and may also yield inaccurate projections when the algorithm converges slowly on challenging inputs, undermining the intended norm-control and stability guarantees of mHCs. In this work, we focus on the practically important 4x4 Birkhoff projection setting and develop an end-to-end acceleration framework. By leveraging the dual formulation, we reduce the problem to a three-dimensional unconstrained convex problem and solve it with Newton's method, achieving fast convergence and high accuracy. For the backward pass, we replace the unrolled differentiation with implicit differentiation, yielding exact gradients without storing intermediate states. To exploit massive parallelism, we design a warp-level CUDA kernel that uses only register-level primitives, avoiding global and shared memory I/O. Extensive experiments against representative open-source baselines demonstrate that the proposed solver yields substantially more reliable doubly stochastic projections -- especially when the input magnitude is large -- and achieves significant end-to-end speedups (including the backward pass), reaching over 20x acceleration at large batch sizes while maintaining orders of magnitude smaller marginal errors. 
2026-05-26T13:11:08Z Chenrui Wang Yixuan Qiu http://arxiv.org/abs/2605.26975v1 Nonlinear spectral clustering with C++ GraphBLAS 2026-05-26T13:00:36Z Nonlinear reformulations of the spectral clustering method have gained a lot of recent attention due to their increased numerical benefits and their solid mathematical background. However, the estimation of the multiple nonlinear eigenvectors is associated with an increased computational cost. We present an implementation of a direct multiway spectral clustering algorithm in the $p$-norm, for $p\in(1,2]$, using a novel C++ GraphBLAS API. The key operations are expressed in linear algebraic terms and are executed over the resulting sparse matrices and dense vectors, parameterized in the algebra pertinent to the computation. We demonstrate the effectiveness and accuracy of our shared-memory algorithm on several artificial test cases. Our numerical examples and comparative results against competitive methods indicate that the proposed implementation attains high quality clusters in terms of the balanced graph cut metric. The strong scaling capabilities of our algorithm are showcased on a range of datasets with up to $8$ million nodes and $48$ million edges. 2026-05-26T13:00:36Z Outstanding short paper award, IEEE High Performance Extreme Computing Conference (HPEC), 25 - 29 September 2023 Dimosthenis Pasadakis Olaf Schenk Verner Vlacic Albert-Jan Yzelman http://arxiv.org/abs/2605.26960v1 Extreme-Scale Interconnection Networks 2026-05-26T12:50:20Z Extreme-scale data centers are the backbone of next-generation computing, enabling breakthroughs in science, artificial intelligence, and global innovation through unprecedented processing power and scalability. This work examines leaf-spine network topologies that offer extreme scalability--connecting a vast number of endpoints--while delivering strong performance at low cost. 
It takes as a starting point two alternatives to the widely used Fat-Tree topology: the Orthogonal Fat-Tree and the Random Folded Clos. The resulting Multipass Random Leaf-Spine (MRLS) networks inherit their advantages and surpass Fat-Trees in both throughput and flexibility. To fully leverage the topological properties of these networks, various non-minimal routing strategies are considered. An exhaustive evaluation using an interconnection network simulator provides insight into the trade-offs and scalability of these topologies under realistic conditions, positioning them as a promising solution for extreme-scale systems. The MRLS achieves a 50% speedup against a Fat-Tree for an All2All collective comprising 100k endpoints, and 100% against Dragonfly networks for the same collective. 2026-05-26T12:50:20Z Alejandro Cano Cristina Brinza Cristóbal Camarero Carmen Martínez Ramón Beivide http://arxiv.org/abs/2605.26930v1 Revisiting Bruck: Phase-Efficient All-to-All Communication in Reconfigurable Networks 2026-05-26T12:24:04Z All-to-All communication is a key performance bottleneck for distributed machine learning (ML) and high-performance computing (HPC) workloads, where dense traffic increasingly stresses scale-up interconnects. While these ML and HPC workloads have driven unprecedented infrastructure demand, optical reconfigurable networks (ORNs) offer a promising path forward. By adapting the physical topology to the active workload, they improve communication cost and bandwidth utilization. However, their benefit is critically contingent on whether the collective consists of structured phases that can be served by sparse and reusable topology states. In this paper, we revisit Bruck's All-to-All implementation and demonstrate the benefits of topology optimization in which both communication pattern and reconfiguration strategy are co-designed. We present ReTri, a bidirectional All-to-All schedule for ORNs. 
ReTri uses balanced ternary block propagation to complete All-to-All in $\lceil \log_3 n\rceil$ phases. The reconfiguration strategy induced by ReTri's pairwise bidirectional exchanges allows reconfiguration delays to be amortized across multiple phases. Preliminary simulations show that ReTri improves completion time by up to $10\times$ over static All-to-All, even for millisecond-scale reconfiguration delays, and improves on reconfigurable Bruck by up to $2.1\times$. 2026-05-26T12:24:04Z Anton Juerss Stefan Schmid http://arxiv.org/abs/2604.28059v2 NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures 2026-05-26T08:45:02Z Spiking neural networks (SNNs) are a promising paradigm for energy-efficient event-driven computation, but large-scale SNN execution remains challenging because sparse spike communication and synchronization can dominate runtime. Existing solutions across CPU, GPU, ASIC, and FPGA platforms offer different trade-offs between programmability, efficiency, and scalability. To address this gap, we present NeuroRing, a modular and scalable SNN accelerator based on a stream-dataflow architecture and a bidirectional ring topology, implemented in High-Level Synthesis (HLS) on FPGAs. NeuroRing supports modular single- and multi-FPGA deployment and is compatible with existing SNN workflows through integration with the NEST simulator. We evaluate NeuroRing on the cortical microcircuit benchmark and a Sudoku constraint-satisfaction workload. Results show that NeuroRing preserves the key activity statistics of the NEST reference model, achieves faster-than-real-time execution of the full-scale cortical microcircuit with a real-time factor (RTF) of 0.83, exhibits meaningful strong and weak scaling, and provides competitive energy efficiency on two programmable FPGAs. 
These results position NeuroRing as a flexible and scalable platform for both neuroscience simulation and broader event-driven applications. 2026-04-30T16:04:26Z Accepted at Euro-Par 2026 Muhammad Ihsan Al Hafiz Artur Podobas