https://arxiv.org/api/YrJo+fRrjX3RPGLZRNhZJW7tKak
2026-06-10T10:47:43Z

http://arxiv.org/abs/2606.00714v1
The Cartan-Topos Protocol: A Unified Geometric and Categorical Framework for Resilient Multi-Agent Coordination
2026-05-30T12:57:43Z
Multi-agent coordination faces a fundamental divide between continuous Euclidean consensus, which fails under non-integrable constraints, and discrete symbolic logic, which collapses under open-world assumptions. This report presents a unified geometric and categorical framework bridging these paradigms. Agent states are modeled on homogeneous manifolds (Lie groups, Grassmannians) with consensus achieved via Riemannian center-of-mass flows. Clifford-algebraic representations (rotors, motors) enable singularity-free SE(3) pose synchronization. Network interactions are formalized as cellular sheaves, where heterogeneous stalks connected by linear restriction maps replace uniform weights; the sheaf Laplacian drives diffusion toward globally consistent sections. The Cartan connection encodes logical holonomy directly into restriction maps. Asynchronous nonlinear sheaf diffusion guarantees linear convergence to Dirichlet energy minimizers under bounded delays. Sheaf-Theoretic Planning (STP) models time as a Grothendieck topos, using intuitionistic logic and abductive repair for resilient temporal reasoning. Applications include discourse sheaves for opinion dynamics and knowledge sheaves for graph embedding. This synthesis establishes geometric consensus as a universal foundation for resilient multi-agent systems across physical, epistemic, and temporal domains.
2026-05-30T12:57:43Z
Manuel Hernández, Eduardo Sánchez-Soto

http://arxiv.org/abs/2606.07621v1
HASA: Subnet Allocation for Compute-Constrained Model-Heterogeneous Federated Learning
2026-05-30T11:21:48Z
Edge services increasingly use federated learning to personalize on-device models while keeping sensitive data local.
In practice, deployments must handle heterogeneity in both client resources and local data distributions. Model-heterogeneous federated learning lowers client cost by allowing each client to train a subnet of a shared supernet, but most subnet-allocation policies are driven by device constraints and do not explicitly account for statistical heterogeneity. This paper proposes Heterogeneity-Aware Subnet Allocation (HASA), a train-only rule that assigns subnet widths based on client heterogeneity scores computed from local training data while enforcing a fixed size-weighted compute budget. This design enables budget-matched comparisons with alternative allocation policies. On an article-title next-word prediction benchmark with seven clients, HASA improves unweighted mean client test accuracy over uniform allocation across 10 matched seeds, increasing mean client test accuracy from 13.82 percent to 14.32 percent, and improves worst-client accuracy on average. In a matched-budget comparison with representative partial-training baselines, HASA achieves the strongest worst-client and tail-client accuracy on this benchmark. A directionality ablation shows that assigning smaller subnets to more heterogeneous clients degrades both mean and tail performance. A cross-domain image-classification study further shows that the effectiveness of heterogeneity-aware allocation depends on how well the heterogeneity score reflects clients' need for additional model width.
2026-05-30T11:21:48Z
Amir Hossein Shahdadian, Ahmed M. Abdelmoniem, Mahdi Taheri, Samira Nazari, Christian Herglotz

http://arxiv.org/abs/2606.07620v1
SENTRY: Statistical Reliability Analysis of Vision Transformers Under Soft Errors
2026-05-30T11:14:26Z
With the growth of Vision Transformers in safety-critical domains like autonomous systems and medical imaging, ensuring their reliability against soft errors is paramount.
While ViTs offer state-of-the-art accuracy, their massive parameter counts render exhaustive fault injection campaigns infeasible. To bridge this gap, a statistical fault injection framework is presented, leveraging finite-population sampling theory to provide formal reliability guarantees. It is demonstrated that failure rates are bounded within a 1% margin at 99% confidence using only a few thousand samples, regardless of model scale. This methodology achieves up to a 10,700 times reduction in experimental cost compared to exhaustive approaches, while preserving the ability to localize vulnerabilities across architectural components. Through extensive evaluation of different architectures like ViT-Tiny and ViT-Small, a highly non-uniform reliability landscape is uncovered. It is shown that while only 3% of FP32 bit-flips result in failure, the vast majority of these events lead to catastrophic accuracy collapse. Specific vulnerabilities are localized to normalization layers and critical exponent bits within the IEEE-754 format, providing a mathematical foundation and actionable insights for the design of hardened, edge-deployed ViT architectures.
2026-05-30T11:14:26Z
Pramit Kumar Bhaduri, Mahdi Taheri, Samira Nazari, Maksim Jenihhin, Christian Herglotz, Michael Hubner

http://arxiv.org/abs/2606.00601v1
ScanWeaver: Compiler-Driven Parallelization of Affine Recurrences via Associative Scan Lowering
2026-05-30T07:58:02Z
Selective state-space models such as Mamba highlight the practical importance of input-dependent scan recurrences, which preserve linear-time sequence modeling while improving language modeling capabilities. However, these recurrences introduce stricter sequential dependencies than classical structured SSMs, limiting parallel execution on modern accelerators.
We present ScanWeaver, a compiler framework that transforms recurrence-based computations into associative scan representations and lowers them end-to-end to executable GPU programs. We use Mamba-style selective scan as a motivating example of a broader class of affine recurrences that arise in modern ML workloads. Rather than targeting a single model family, ScanWeaver elevates this recurrence structure to a first-class compiler abstraction, enabling systematic MLIR-based lowering to compiler-generated Blelloch scan execution on GPUs.
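The affine-recurrence idea in the paragraph above can be illustrated with a small sketch. It assumes a scalar recurrence of the form h_t = a_t * h_{t-1} + b_t and uses plain Python; the function names (`combine`, `inclusive_scan`, `scan_recurrence`) are hypothetical and are not taken from ScanWeaver, and the recursive pairwise scan shown here is only a stand-in for the compiler-generated Blelloch scan the abstract mentions. The key point is that composing affine maps is associative, so the prefixes can be computed in a tree rather than strictly left to right.

```python
from typing import List, Tuple

# An affine step h -> a * h + b is represented by the pair (a, b).
Affine = Tuple[float, float]


def combine(first: Affine, second: Affine) -> Affine:
    """Compose two affine maps: apply `first`, then `second`.

    second(first(h)) = a2 * (a1 * h + b1) + b2 = (a1 * a2) * h + (a2 * b1 + b2).
    This operator is associative, which is what makes a parallel scan legal.
    """
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_recurrence(a: List[float], b: List[float], h0: float) -> List[float]:
    """Reference implementation: evaluate h_t = a_t * h_{t-1} + b_t in order."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def inclusive_scan(items: List[Affine]) -> List[Affine]:
    """Inclusive prefix scan under `combine`, written as a pairwise recursion.

    Each level combines adjacent pairs (independent work that could run in
    parallel), recurses on the half-length list, then fills in the even slots.
    """
    n = len(items)
    if n <= 1:
        return list(items)
    pairs = [combine(items[i], items[i + 1]) for i in range(0, n - 1, 2)]
    scanned = inclusive_scan(pairs)
    out: List[Affine] = []
    for i in range(n):
        if i == 0:
            out.append(items[0])
        elif i % 2 == 1:
            out.append(scanned[i // 2])
        else:
            out.append(combine(scanned[i // 2 - 1], items[i]))
    return out


def scan_recurrence(a: List[float], b: List[float], h0: float) -> List[float]:
    """Evaluate the recurrence by scanning affine maps, then applying them to h0."""
    prefixes = inclusive_scan(list(zip(a, b)))
    return [A * h0 + B for A, B in prefixes]
```

Calling `scan_recurrence(a, b, h0)` should agree with `sequential_recurrence(a, b, h0)` up to floating-point rounding; the sequential loop serves as the correctness reference for the scan formulation.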
Across forward selective-scan workloads with matched local recurrence semantics, we validate affine recurrence decomposition, Blelloch lowering, MLIR GPU lowering, executable artifact generation, and actual GPU execution from generated MLIR. We benchmark the resulting ScanWeaver GPU execution against PyTorch and CUDA sequential baselines, and include the Mamba kernel as a fused production baseline for systems context.
2026-05-30T07:58:02Z
11 pages, 1 figure
Qiying Wu, Pavel Zolnikov

http://arxiv.org/abs/2606.00552v1
Edge-Based QoS-Aware Adaptive Task Placement: A Closed-Loop Control in Multi-Robot Systems
2026-05-30T05:54:44Z
Multi-robot systems (MRS) increasingly offload compute-intensive perception tasks to edge nodes to meet strict time-sensitive Quality-of-Service (QoS) constraints. However, static task orchestration on a shared edge node can severely degrade QoS due to network latency, jitter, and edge-resource contention. We present a pilot edge-centric MRS testbed using Raspberry Pi nodes to evaluate a camera-to-manipulator pipeline under three modes: local execution, static offloading, and a QoS-aware Adaptive Task Placement (ATP) controller. ATP scores candidate placements using a multi-metric cost (normalized latency, CPU utilization, and switching overhead) over two-second control windows. The closed-loop visual servoing testbed is instrumented with sub-millisecond clock synchronization, network emulation, and detailed monitoring of multiple metrics across nodes to capture realistic jitter. Experimental results under compute-stress and network-fault scenarios show that static edge offloading reduces on-board CPU load but amplifies tail latency and deadline misses. In contrast, the QoS-aware ATP controller, by switching task placement based on measured latency and utilization thresholds, consistently lowers deadline violations and tail latency.
Overall, the results position ATP as a practical edge-side control primitive for MRS and offer concrete design guidelines for Cloud-Edge Robotics deployments within the broader context of cloud-fog automation, while motivating QoS-aware multi-objective workload orchestration for industrial cyber-physical systems.
2026-05-30T05:54:44Z
6 pages, 2 figures, 1 algorithm, accepted as a regular paper at the 24th IEEE International Conference on Industrial Informatics (INDIN), 26-29 July, 2026, Melbourne, Australia
Thien Tran, Jonathan Kua, Thuong Hoang, Minh Tran, Honghao Lyu, Jiong Jin

http://arxiv.org/abs/2512.01646v3
StarDist: A Code Generator for Distributed Graph Algorithms
2026-05-30T04:42:05Z
We introduce StarDist, a Domain Specific Language for generating high-performance distributed graph algorithms in the message passing model. Our analysis-transformation framework optimizes graph traversal based on graph property access patterns, reduces global lock acquisitions on distributed structures, and minimizes message queues used in reduction operations. We provide a network-optimized communication runtime for reduction operations that couples with our analysis framework and optimizes the propagation of updates based on vertex residency. StarDist identifies monotonic reduction blocks and fuses reduction iterations over graphs into pulses. We evaluate StarDist
using three fundamental graph algorithms belonging to the CONGEST model: single-source shortest paths, weakly connected components, and PageRank computation, using a suite comprising both real-world and synthetic graphs across varying densities of topological compaction. Our results illustrate that the code generated with StarDist outperforms the distributed frameworks DRONE and D-Galois by an average of 19$\times$ and 7$\times$, respectively, on our high communication setup and by 1.4$\times$ and 1.92$\times$, respectively, on our high congestion network setup when averaged across all three algorithms.
2025-12-01T13:18:32Z
Barenya Kumar Nandy, Rupesh Nasre

http://arxiv.org/abs/2606.00501v1
Joint Optimization of Qubit Leasing and Quantum Circuit Distribution
2026-05-30T03:27:45Z
We consider an agent who would like to execute a given quantum circuit using resources leased from a set of quantum computers (QCs) connected by a quantum network. For this purpose, the agent needs to make the following four key decisions: (i) how many qubits to lease from each QC, (ii) at which QCs to store different circuit qubits in different time slots, (iii) at which QC to execute each gate in the circuit, and (iv) how to move qubits between QCs, choosing between migration and teleportation. We refer to this problem facing the agent as the joint qubit leasing and quantum circuit distribution (JQLQCD) problem, and provide a comprehensive integer linear programming (ILP) formulation for it. We show that the JQLQCD problem is NP-complete. Next, we identify several special cases in which the problem can be optimally solved in closed form or via polynomial-time algorithms. Also, we propose a greedy algorithm with local search refinement to solve large instances of the general JQLQCD problem. Finally, we evaluate the performance of the proposed greedy algorithm using extensive numerical computations.
2026-05-30T03:27:45Z
Anoushka Dey, Gaurav S.
Kasbekar

http://arxiv.org/abs/2603.29002v3
Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference
2026-05-30T00:45:53Z
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that heterogeneous systems are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is up to $2.2\times$ faster and consumes up to $4.7\times$ less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
2026-03-30T21:03:39Z
Accepted by ICML 2026. Code: https://github.com/OswaldHe/HeteroLLM
Zifan He, Rui Ma, Yizhou Sun, Jason Cong

http://arxiv.org/abs/2606.02627v1
Streami: An MPI Data-Parallel Library to Compute Field Lines on GPUs
2026-05-29T20:55:35Z
We present Streami, an extensible GPU-accelerated library for the computation of field lines in fluid flows on high-performance computers. Streami acts as a thin layer used for both post-hoc and in-situ analysis and can interface with existing MPI applications.
We discuss Streami's application programming interface, key design decisions that led to Streami's high performance and extensibility, as well as extensions to support different fluid flow field representations. We also present a sample application for rapid prototyping and interactive seed point placement. Streami is released under a permissive open-source software license.
2026-05-29T20:55:35Z
Stefan Zellmann, Milan Jaros, Andrea Paris, Ingo Wald, Tatiana von Landesberger

http://arxiv.org/abs/2606.00348v1
Augur: Pre-Execution Energy Prediction for Workflow Tasks in Heterogeneous Clusters
2026-05-29T20:38:10Z
Scientific workflows are widely used to process large quantities of data, leading to significant energy consumption and carbon emissions. To reduce this environmental impact, energy and carbon-aware scheduling approaches could be employed. However, such methods require runtime and energy predictions, which are typically only available for workflows that have been executed previously. Meanwhile, scientists may execute new or modified workflows, use workflows with different input data, or run them on alternative infrastructure. To address this critical gap, we propose Augur, a novel method to predict the energy consumption of scientific workflow tasks prior to execution. By efficiently profiling both the available cluster infrastructure and the workflow at hand, Augur is capable of predicting the overall energy consumption of the workflow with a median prediction error of $16.3\pm15.3\%$ compared to Ichnos, an energy estimation method that uses fitted power models, and $18.2\pm14.7\%$ compared to Intel RAPL, as observed in our experimental evaluation on public and private cloud infrastructure.
Relying on only minimal historical execution data, Augur outperforms two state-of-the-art methods in predicting both task runtime and total workflow energy, providing a robust foundation for energy-efficient and carbon-aware scientific data analysis.
2026-05-29T20:38:10Z
Accepted at 2026 IEEE International Conference on Cloud Computing (CLOUD)
Kathleen West, Vasilis Bountris, Philipp Thamm, Ulf Leser, Yehia Elkhatib, Lauritz Thamsen

http://arxiv.org/abs/2601.08082v3
Hierarchical Recursive Precision for Accelerating Symmetric Linear Solves on MXUs
2026-05-29T20:18:39Z
Symmetric positive-definite system solvers based on Cholesky factorization are fundamental to many scientific applications, such as climate modeling. We present a portable, nested recursive mixed-precision solver designed for Matrix Processing Units (MXUs), including NVIDIA Tensor Cores (H200) and AMD Matrix Cores (MI300X), that assigns low-precision FP16 arithmetic to large off-diagonal blocks, while preserving high precision on diagonal blocks to ensure numerical stability. The solver is implemented in Julia, providing a high-level, hardware-agnostic interface. We demonstrate up to a 5.07x speedup relative to the diagonal-precision vendor baseline, with 100x better accuracy than pure half precision on H200, providing higher accuracy than low-precision at higher speed than high-precision. Positive performance trends are also observed on MI300X, demonstrating broad applicability across GPUs.
2026-01-12T23:46:20Z
10 pages, 11 figures
Vicki Carrica, Rabab Alomairy, Evelyne Ringoot, Alan Edelman

http://arxiv.org/abs/2606.00287v1
Leveraging the Learning Curve: Reusing Existing Architectural Patterns to Design and Implement MAS
2026-05-29T19:20:00Z
Recent advancements in AI have led to the development of specialized systems related to multi-agent systems (MAS). However, the inherently collaborative nature of agents is often overlooked, and many of these specialized systems are used as components by other AI systems.
From a software engineering perspective, this context can benefit from aligning the architectural characteristics of distributed systems with the inherently distributed nature of MAS. We propose that introducing a minimal set of agent-related concepts into the Distributed Systems (DS) domain can improve the engineering of modern MAS by leveraging techniques from DS engineering with established agent theory. In this study, we recapitulated the common origins of MAS and DS by drawing architectural parallels to establish a unified engineering approach. We then defined a minimal set of agent concepts to perform two practical studies on leveraging MAS development. First, we incorporated these concepts into a DS architectural pattern to design a distributed MAS. We then used these concepts in a graduate course to teach MAS engineering to students with no prior knowledge of agent theory. The learning outcomes from both courses included successful MAS implementation using DS tools and techniques. Although more than two-thirds of these students had no practical experience in developing distributed systems, the average final grade in both courses was above 80%, thus validating our approach. Finally, we discuss how this study supports the development of advanced systems using modern AI techniques consistently with established agent-related research while leveraging established DS techniques and concepts.
2026-05-29T19:20:00Z
Author's accepted manuscript of an article published in IEEE Access. 17 pages, 6 figures. IEEE Access, vol. 13, pp. 45809-45825, 2025. Copyright 2025 IEEE. Personal use of this material is permitted. The final version is available at https://doi.org/10.1109/ACCESS.2025.3546526
IEEE Access, vol. 13, pp. 45809-45825, 2025
Arthur Casals, Anarosa A. F.
Brandão
10.1109/ACCESS.2025.3546526

http://arxiv.org/abs/2606.00271v1
HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneity
2026-05-29T19:02:10Z
Distributed Low-Communication (DiLoCo) training reduces communication overhead by allowing workers to perform multiple local optimization steps before sending pseudo-gradients to a global outer update. Its asynchronous variant further improves hardware utilization by removing synchronization barriers, but at the cost of stale pseudo-gradients computed from outdated model states. As a result, these updates can become misaligned with the current global optimization direction, particularly in heterogeneous systems. This issue becomes even more pronounced when data are non-IID, a setting that has not been well studied in asynchronous low-communication training. To address this limitation, we propose HeLoCo, a direction-aware correction method for asynchronous low-communication training that uses outer momentum as a reference for the current optimization trajectory and selectively adjusts incoming pseudo-gradients before the outer update. Updates that remain aligned are preserved, while directionally conflicting components are corrected. On multilingual language-model training with heterogeneous workers and non-IID data, HeLoCo consistently improves validation loss. It outperforms existing asynchronous DiLoCo-based baselines by up to 7.5% at a fixed token budget, exceeds asynchronous momentum look-ahead by up to 3.3% at a fixed wall-clock budget, and surpasses the synchronous baseline by up to 22.1% under severe system heterogeneity.
Our analysis further shows how staleness, worker speed, and data heterogeneity shape update quality and convergence in highly decentralized and heterogeneous training setups.
2026-05-29T19:02:10Z
Abdullah Al Asif, Patrick Diem, Juan Pablo Muñoz, Felix Wolf, Ali Jannesari, Arya Mazaheri

http://arxiv.org/abs/2603.28768v2
CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving
2026-05-29T18:36:29Z
Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes.
Our evaluation shows that CRAFT increases end-to-end serving throughput by $1.14\times$ on average (up to $1.2\times$) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.
2026-01-12T19:25:01Z
22 pages, 15 figures
Proceedings of the Ninth Conference on Machine Learning and Systems (MLSys 2026)
Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lingfan Yu, Haozheng Fan, Jun Wu, Yida Wang, Nandita Vijaykumar

http://arxiv.org/abs/2605.31463v1
PithTrain: A Compact and Agent-Native MoE Training System
2026-05-29T15:52:58Z
Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. With the rise of AI coding agents, parts of training-framework development could be automated, accelerating this evolution. But applying these agents to existing frameworks carries hidden costs, invisible to today's throughput-only evaluations. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, covering real-world training-framework tasks. Our evaluation shows PithTrain matches the throughput of production frameworks, and on ATE-Bench, PithTrain enables higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.
2026-05-29T15:52:58Z
Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu, Junru Shao, Todd C. Mowry, Chenyan Xiong, Tianqi Chen