https://arxiv.org/api/tfpLKlpwjYwHQppPpvzBQi7u0GU2026-04-07T14:34:01Z2791325515http://arxiv.org/abs/2603.20317v1Which Workloads Belong in Orbit? A Workload-First Framework for Orbital Data Centers Using Semantic Abstraction2026-03-19T22:25:47ZSpace-based compute is becoming plausible as launch costs fall and data-intensive AI workloads grow. This paper proposes a workload-centric framework for deciding which tasks belong in orbit versus terrestrial cloud, along with a phased adoption model tied to orbital data center maturity. We ground the framework with in-orbit semantic-reduction prototypes. An Earth-observation pipeline on Sentinel-2 imagery from Seattle and Bengaluru (formerly Bangalore) achieves 99.7-99.99% payload reduction by converting raw imagery to compact semantic artifacts. A multi-pass stereo reconstruction prototype reduces ~306 MB to ~1.57 MB of derived 3D representations (99.49% reduction). These results support a workload-first view in which semantic abstraction, not raw compute scale, drives early workload suitability.2026-03-19T22:25:47ZDurgendra Narayan Singhhttp://arxiv.org/abs/2603.19472v1Non-trivial automata networks do exist that solve the global majority problem with the local majority rule2026-03-19T21:11:02ZThe global majority problem, often referred to as the Density Classification Task, is a classical benchmark in the context of probing the computational capabilities of automata networks. It poses the simple yet challenging problem of determining, by totally local means, whether an arbitrary initial configuration of binary states can evolve to a final, homogeneous global configuration that reflects the initial global majority. Although it is known that in the specific case of cellular automata with periodic boundaries no rule is able to solve the problem, in other formulations solutions are known and, in others, the problem is still open. 
Aligned with the latter, here we explore the possibility of solving the problem with automata networks, operating only with the local majority rule, with a focus on identifying non-trivial cases where it can be solved and explaining why they do so.2026-03-19T21:11:02ZPedro Paulo BalbiKévin PerrotMarius RollandEurico Ruivohttp://arxiv.org/abs/2603.19431v1SWARM+: Scalable and Resilient Multi-Agent Consensus for Fully-Decentralized Data-Aware Workload Management2026-03-19T19:51:02ZDistributed scientific workflows increasingly span heterogeneous compute clusters, edge resources, and geo-distributed data repositories. In these environments, a centralized orchestrator is an architectural bottleneck -- introducing a single point of failure, limiting scalability, and constraining adaptability to changing resource availability or failures. Decentralized multi-agent coordination offers a compelling alternative: autonomous agents representing distributed resources collaboratively negotiate workload assignment (e.g., job selection) through peer-to-peer consensus, making decisions based on local compute capacity, data locality, and network conditions. However, scaling such systems for production workloads requires addressing challenges in coordination, resilience, and data-aware optimization. This work presents SWARM+, which builds on our prior work that demonstrated the feasibility of multi-agent decentralized consensus for distributed job selection. SWARM+ addresses three main problems: scalability of consensus for large numbers of agents, resilience of workload management under agent failure, and efficiency of job scheduling for highly distributed resources and data-intensive workloads. For each problem, we propose novel algorithms and evaluate them in the distributed FABRIC testbed. 
The results show that SWARM+ (a) scales to 1000 distributed agents with nearly equal workload distribution across the hierarchy levels and reduced coordination overhead due to hierarchical consensus, (b) is resilient to agent failures, maintaining >99% job completion rate under single agent failure, and demonstrating graceful system degradation, with at most 7.5% impact under 50% agent failures, and (c) achieves 97-98% improvement over baseline SWARM for both selection time and scheduling latency metrics.2026-03-19T19:51:02ZKomal TharejaKrishnan RaghavanAnirban MandalEwa Deelmanhttp://arxiv.org/abs/2603.19418v1Speculative Policy Orchestration: A Latency-Resilient Framework for Cloud-Robotic Manipulation2026-03-19T19:24:14ZCloud robotics enables robots to offload high-dimensional motion planning and reasoning to remote servers. However, for continuous manipulation tasks requiring high-frequency control, network latency and jitter can severely destabilize the system, causing command starvation and unsafe physical execution.
To address this, we propose Speculative Policy Orchestration (SPO), a latency-resilient cloud-edge framework. SPO utilizes a cloud-hosted world model to pre-compute and stream future kinematic waypoints to a local edge buffer, decoupling execution frequency from network round-trip time. To mitigate unsafe execution caused by predictive drift, the edge node employs an $ε$-tube verifier that strictly bounds kinematic execution errors. The framework is coupled with an Adaptive Horizon Scaling mechanism that dynamically expands or shrinks the speculative pre-fetch depth based on real-time tracking error.
We evaluate SPO on continuous RLBench manipulation tasks under emulated network delays. Results show that even when deployed with learned models of modest accuracy, SPO reduces network-induced idle time by over 60% compared to blocking remote inference. Furthermore, SPO discards approximately 60% fewer cloud predictions than static caching baselines. Ultimately, SPO enables fluid, real-time cloud-robotic control while maintaining bounded physical safety.2026-03-19T19:24:14Z9 pages, 7 figures, conference submissionChanh NguyenShutong JinFlorian T. PokornyErik Elmrothhttp://arxiv.org/abs/2603.19406v1The Bilateral Efficiency of Ethernet: Recalibrating Metcalfe and Boggs After Fifty Years2026-03-19T18:57:48ZIn July 1976, Metcalfe and Boggs published their foundational paper on Ethernet in Communications of the ACM. Their efficiency model -- E = (P/C)/(P/C + W*T) -- measures the fraction of Ether time carrying good forward packets under contention. For fifty years this model has defined how the networking community thinks about Ethernet performance. We argue that the model, while correct for its intended purpose, measures only the forward channel and is silent on the question that matters for modern distributed systems: bilateral transaction efficiency -- the fraction of link time that produces committed agreements between sender and receiver.
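The forward-channel model quoted above, E = (P/C)/(P/C + W*T), can be evaluated numerically. The following is a minimal sketch and is not taken from the abstract. It assumes the definitions usually attributed to the 1976 paper: P is the packet length in bits, C the channel capacity in bits per second, T the slot time in seconds, and W the mean number of contention slots, with W = (1 - A)/A and A = (1 - 1/Q)^(Q-1) for Q contending stations. The function names and the example parameters (3 Mb/s, 16 microsecond slot) are illustrative assumptions.

```python
def acquisition_probability(q: int) -> float:
    """Assumed model: A = (1 - 1/Q)^(Q-1), the chance that exactly one of
    Q stations transmits in a slot when each transmits with probability 1/Q."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return (1.0 - 1.0 / q) ** (q - 1)


def mean_contention_slots(q: int) -> float:
    """Assumed model: W = (1 - A) / A, the mean number of slots spent
    contending before a station acquires the channel."""
    a = acquisition_probability(q)
    return (1.0 - a) / a


def forward_efficiency(p_bits: float, c_bps: float, t_slot: float, q: int) -> float:
    """E = (P/C) / (P/C + W*T): fraction of channel time carrying good
    forward packets. It says nothing about acknowledgement or agreement."""
    transmit_time = p_bits / c_bps
    return transmit_time / (transmit_time + mean_contention_slots(q) * t_slot)


if __name__ == "__main__":
    # Hypothetical parameters: 3 Mb/s channel, 16 microsecond slot time.
    for q in (1, 2, 16, 256):
        for p in (4096, 512, 48):
            e = forward_efficiency(p, 3e6, 16e-6, q)
            print(f"Q={q:3d} P={p:4d} bits  E={e:.4f}")
```

Under these assumptions E depends only on packet length, capacity, slot time, and station count. No term accounts for the receiver's confirmation, which is the gap the abstract calls bilateral transaction efficiency.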
We show that Metcalfe and Boggs themselves understood this distinction intuitively. Their EFTP "end-dally" protocol (Section 7.2.2 of the original paper) is a three-phase bilateral handshake that attempts to achieve mutual knowledge of transfer completion -- precisely the property that their efficiency model cannot capture. We connect this observation to the Open Atomic Ethernet's bilateral transaction primitive, to the back-to-back Shannon channel formulation with Perfect Information Feedback, and to the Two-State Vector Formalism (TSVF) from physics, which provides the theoretical framework for understanding why both boundary conditions -- sender and receiver -- must be specified for a transaction to have definite value.
The correction to Table 1 of Metcalfe and Boggs is not a different set of numbers. It is a different question.2026-03-19T18:57:48Z10 pages, ACM sigconf format. 50th anniversary of Metcalfe and Boggs (1976)Paul Borrillhttp://arxiv.org/abs/2603.19163v1cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization2026-03-19T17:19:21ZCombinatorial optimization problems arise in logistics, scheduling, and resource allocation, yet existing approaches face a fundamental trade-off among generality, performance, and usability. We present cuGenOpt, a GPU-accelerated general-purpose metaheuristic framework that addresses all three dimensions simultaneously.
At the engine level, cuGenOpt adopts a "one block evolves one solution" CUDA architecture with a unified encoding abstraction (permutation, binary, integer), a two-level adaptive operator selection mechanism, and hardware-aware resource management. At the extensibility level, a user-defined operator registration interface allows domain experts to inject problem-specific CUDA search operators. At the usability level, a JIT compilation pipeline exposes the framework as a pure-Python API, and an LLM-based modeling assistant converts natural-language problem descriptions into executable solver code.
Experiments across five thematic suites on three GPU architectures (T4, V100, A800) show that cuGenOpt outperforms general MIP solvers by orders of magnitude, achieves competitive quality against specialized solvers on instances up to n=150, and attains 4.73% gap on TSP-442 within 30s. Twelve problem types spanning five encoding variants are solved to optimality. Framework-level optimizations cumulatively reduce pcb442 gap from 36% to 4.73% and boost VRPTW throughput by 75-81%.
Code: https://github.com/L-yang-yang/cugenopt2026-03-19T17:19:21Z28 pages, 9 figures. Code available at https://github.com/L-yang-yang/cugenoptYuyang Liuhttp://arxiv.org/abs/2603.19101v1FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning2026-03-19T16:26:13ZFL has emerged as a transformative paradigm for ITS, notably camera-based Road Condition Classification (RCC). However, by enabling collaboration, FL-based RCC exposes the system to adversarial participants launching Targeted Label-Flipping Attacks (TLFAs). Malicious clients (vehicles) can relabel their local training data (e.g., from an actual uneven road to a wrong smooth road), consequently compromising global model predictions and jeopardizing transportation safety. Existing countermeasures against such poisoning attacks fail to maintain resilient model performance near the necessary attack-free levels in various attack scenarios due to: 1) not tailoring poisoned local model detection to TLFAs, 2) not excluding malicious vehicular clients based on historical behavior, and 3) not remedying the already-corrupted global model after exclusion. To close this research gap, we propose FedTrident, which introduces: 1) neuron-wise analysis for local model misbehavior detection (notably including attack goal identification, critical feature extraction, and GMM-based model clustering and filtering); 2) adaptive client rating for client exclusion according to the local model detection results in each FL round; and 3) machine unlearning for corrupted global model remediation once malicious clients are excluded during FL. Extensive evaluation across diverse FL-RCC models, tasks, and configurations demonstrates that FedTrident can effectively thwart TLFAs, achieving performance comparable to that in attack-free scenarios and outperforming eight baseline countermeasures by 9.49% and 4.47% for the two most critical metrics. 
Moreover, FedTrident is resilient to various malicious client rates, data heterogeneity levels, complicated multi-task, and dynamic attacks.2026-03-19T16:26:13ZSheng LiuPanos Papadimitratoshttp://arxiv.org/abs/2603.19099v1Why Synchronized Time is a Fiction: Daylight Saving Time, Leap Seconds, and the Guillotine Sharpened for Nothing2026-03-19T16:25:13ZCivilization maintains an elaborate infrastructure devoted to the maintenance of synchronized time. Governments mandate daylight saving time. Standards bodies insert leap seconds into Coordinated Universal Time. Engineers debate leap milliseconds and leap nanoseconds. The Global Positioning System applies relativistic corrections at the nanosecond level. All of these adjustments attempt to preserve an assumption: that a single global time exists and that clocks can be made to agree upon it.
This paper argues that this assumption constitutes a category mistake in the sense of Ryle (1949). We show that special and general relativity prohibit absolute simultaneity, that the one-way speed of light is conventionally defined rather than measured, and that recent experiments on indefinite causal order demonstrate nature admits correlations with no well-defined temporal sequence. We trace the consequences of this category mistake through distributed computing, where it manifests as the Forward-In-Time-Only (FITO) assumption that underlies Lamport's logical clocks (1978), the impossibility results of Fischer-Lynch-Paterson (1985), and the CAP theorem (2000). From this perspective, daylight saving time and leap seconds are not corrections to time but corrections to conventions -- they sharpen the guillotine of synchronization in preparation for executing something that does not exist.2026-03-19T16:25:13Z18 pages, 24 referencesPaul Borrillhttp://arxiv.org/abs/2603.19016v1Literature Study on Operational Data Analytics Frameworks in Large-scale Computing Infrastructures2026-03-19T15:20:30ZBy 2025, there are zettabytes of data generated every year. The size and complexity of modern large-scale computing infrastructures like High-Performance Computing (HPC) systems continue to evolve and become complex, leaving us wondering about their manageability and sustainability concerns. Because of this reason, those complex systems are provided with fine-grained monitoring and Operational Data Analytics (ODA) capabilities to optimise their efficiency. In this literature study, we list the fundamental pillars of the large-scale computing infrastructures which enable its ODA capabilities, and conduct a study of the popular ODA frameworks operating in various such environments (predominantly HPC). 
Based on that, we propose a more holistic ODA framework matching the various layers of a large-scale graph-processing distributed ecosystem proposed by Sherif Sak et al, that extends the ODA functionalities presented in an existing novel ODA framework proposed by Netti et al. We compare the holistic ODA framework proposed by us to some of the state-of-the-art frameworks that we study as part of this literature to highlight the novelty, which would hopefully draw more attention to perform extensive research in this field. As part of creating awareness, we highlight the significant operational efficiencies observed as a result of the implementation of the state-of-the-art ODA frameworks to make the study appear beneficial for the readers, and lastly, discuss the trending research work ongoing in this field.2026-03-19T15:20:30ZShekhar SumanXiaoyu ChuAlexandru Iosuphttp://arxiv.org/abs/2603.18897v1Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution2026-03-19T13:36:50ZLLM-powered agents are emerging as a dominant paradigm for autonomous task solving. Unlike standard inference workloads, agents operate in a strictly serial "LLM-tool" loop, where the LLM must wait for external tool execution at every step. This execution model introduces severe latency bottlenecks. To address this problem, we propose PASTE, a Pattern-Aware Speculative Tool Execution method designed to hide tool latency through speculation. PASTE is based on the insight that although agent requests are semantically diverse, they exhibit stable application level control flows (recurring tool-call sequences) and predictable data dependencies (parameter passing between tools). By exploiting these properties, PASTE improves agent serving performance through speculative tool execution. 
Experimental results against state of the art baselines show that PASTE reduces average task completion time by 48.5% and improves tool execution throughput by 1.8x.2026-03-19T13:36:50ZYifan SuiHan ZhaoRui MaZhiyuan HeHao WangJianxun LiYuqing Yanghttp://arxiv.org/abs/2604.00028v1Sequence-Aware Split Heuristic to Mitigate SM Underutilization in FlashAttention-3 Low-Head-Count Decoding2026-03-19T11:44:20ZThe standard FlashAttention-3 heuristic exhibits a GPU occupancy bottleneck in low-head-count decoding configurations because it disables sequence splitting based on sequence length alone, underutilizing the Streaming Multiprocessors of Hopper GPUs. Our proposed sequence-aware split policy mitigates this by allowing sequence-level parallelism in low-head-count regimes, improving hardware utilization to deliver roughly a 21 to 24% improvement in decoder kernel efficiency on metadata-enabled inference paths, with no observed regressions.2026-03-19T11:44:20ZMartí Llopart FontJavier HernandoCristina España-Bonethttp://arxiv.org/abs/2507.06542v3On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning2026-03-19T10:05:02ZDecentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. 
Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.2025-07-09T04:56:56ZWe discover and theoretically explain why and when a single global parameter merging in decentralized learning can recover the performance of federated learning, even in highly heterogeneous and communication-constrained environmentsICLR 2026 (Oral Presentation)Tongtian ZhuTianyu ZhangMingze WangZhanpeng ZhouCan Wanghttp://arxiv.org/abs/2603.18695v1High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia2026-03-19T09:53:32ZPortable GPU frameworks such as Kokkos and RAJA reduce the burden of cross-architecture development but typically incur measurable overhead on fundamental parallel primitives relative to vendor-optimized libraries. We present KernelForge.jl, a Julia library that implements scan, mapreduce, and matrix-vector primitives through a two-layer portable architecture: KernelIntrinsics.jl provides backend-agnostic abstractions for warp-level shuffles, memory fences, and vectorized memory access, while KernelForge.jl builds high-performance algorithms exclusively on top of these interfaces. 
Evaluated on an NVIDIA A40 and an AMD MI300X, KernelForge.jl matches or exceeds CUB kernel execution time on scan and mapreduce on the A40, and matches cuBLAS throughput on matrix-vector operations across most tested configurations, demonstrating, as a proof of concept, that portable JIT-compiled abstractions can achieve vendor-level throughput without sacrificing generality.2026-03-19T09:53:32ZEmmanuel PilliatENSAIhttp://arxiv.org/abs/2603.28783v1Comprehensive Plugin-Based Monitoring of Nexflow Workflow Executions2026-03-19T09:41:22ZNextflow is a workflow management system commonly used in fields like bioinformatics and earth observation. It coordinates distributed data processing of various tools as an acyclic sequence of tasks while using containerization (e.g., Docker), orchestration (e.g., Kubernetes), or batch processing (e.g., SLURM). Monitoring such workflow executions can be challenging but aids performance analysis, debugging, and data provenance. Besides Nextflow's basic built-in monitoring, the wf-commons tool for creating wf-instances is widely regarded as the standard in the Nextflow community. The monitoring plugin we developed provides a more detailed and flexible alternative compatible with wf-instances while removing the need for a custom Nextflow fork by using Nextflow's plug-in mechanism (version 21.10), optional direct .jar file changes of static artifacts without recompilation and allows online monitoring during execution.2026-03-19T09:41:22ZAccepted as poster to the SCA/HPC Asia 2026 Conference. The poster is provided as ancillary fileSami KharmaTobias WiesFlorian Schintkehttp://arxiv.org/abs/2506.02009v2STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds2026-03-19T04:55:11ZIn cloud-scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be more frequent.
The demand for autonomous, AI-driven reliability engineering continues to grow, as existing human-in-the-loop practices can hardly keep up with the scale of modern clouds. This paper presents STRATUS, an LLM-based multi-agent system for realizing autonomous Site Reliability Engineering (SRE) of cloud services. STRATUS consists of multiple specialized agents (e.g., for failure detection, diagnosis, mitigation), organized in a state machine to assist system-level safety reasoning and enforcement. We formalize a key safety specification of agentic SRE systems like STRATUS, termed Transactional No-Regression (TNR), which enables safe exploration and iteration. We show that TNR can effectively improve autonomous failure mitigation. STRATUS significantly outperforms state-of-the-art SRE agents in terms of success rate of failure mitigation problems in AIOpsLab and ITBench (two SRE benchmark suites), by at least 1.5 times across various models. STRATUS shows a promising path toward practical deployment of agentic systems for cloud reliability.2025-05-27T19:15:19Z10 pages for main textYinfang ChenJiaqi PanJackson ClarkYiming SuNoah ZheutlinBhavya BhavyaRohan AroraYu DengSaurabh JhaTianyin Xu