https://arxiv.org/api/tfpLKlpwjYwHQppPpvzBQi7u0GU 2026-04-07T14:34:01Z 27913 255 15 http://arxiv.org/abs/2603.20317v1 Which Workloads Belong in Orbit? A Workload-First Framework for Orbital Data Centers Using Semantic Abstraction 2026-03-19T22:25:47Z Space-based compute is becoming plausible as launch costs fall and data-intensive AI workloads grow. This paper proposes a workload-centric framework for deciding which tasks belong in orbit versus terrestrial cloud, along with a phased adoption model tied to orbital data center maturity. We ground the framework with in-orbit semantic-reduction prototypes. An Earth-observation pipeline on Sentinel-2 imagery from Seattle and Bengaluru (formerly Bangalore) achieves 99.7-99.99% payload reduction by converting raw imagery to compact semantic artifacts. A multi-pass stereo reconstruction prototype reduces ~306 MB to ~1.57 MB of derived 3D representations (99.49% reduction). These results support a workload-first view in which semantic abstraction, not raw compute scale, drives early workload suitability. 2026-03-19T22:25:47Z Durgendra Narayan Singh http://arxiv.org/abs/2603.19472v1 Non-trivial automata networks do exist that solve the global majority problem with the local majority rule 2026-03-19T21:11:02Z The global majority problem, often referred to as the Density Classification Task, is a classical benchmark in the context of probing the computational capabilities of automata networks. It poses the simple yet challenging problem of determining, by totally local means, whether an arbitrary initial configuration of binary states can evolve to a final, homogeneous global configuration that reflects the initial global majority. Although it is known that in the specific case of cellular automata with periodic boundaries no rule is able to solve the problem, in other formulations solutions are known and, in others, the problem is still open. 
Aligned with the latter, here we explore the possibility of solving the problem with automata networks, operating only with the local majority rule, with a focus on identifying non-trivial cases where it can be solved and explaining why they do so. 2026-03-19T21:11:02Z Pedro Paulo Balbi Kévin Perrot Marius Rolland Eurico Ruivo http://arxiv.org/abs/2603.19431v1 SWARM+: Scalable and Resilient Multi-Agent Consensus for Fully-Decentralized Data-Aware Workload Management 2026-03-19T19:51:02Z Distributed scientific workflows increasingly span heterogeneous compute clusters, edge resources, and geo-distributed data repositories. In these environments, a centralized orchestrator is an architectural bottleneck -- introducing a single point of failure, limiting scalability, and constraining adaptability to changing resource availability or failures. Decentralized multi-agent coordination offers a compelling alternative: autonomous agents representing distributed resources collaboratively negotiate workload assignment (e.g., job selection) through peer-to-peer consensus, making decisions based on local compute capacity, data locality, and network conditions. However, scaling such systems for production workloads requires addressing challenges in coordination, resilience, and data-aware optimization. This work presents SWARM+, which builds on our prior work that demonstrated the feasibility of multi-agent decentralized consensus for distributed job selection. SWARM+ addresses three main problems: scalability of consensus for large numbers of agents, resilience of workload management under agent failure, and efficiency of job scheduling for highly distributed resources and data-intensive workloads. For each problem, we propose novel algorithms and evaluate them in the distributed FABRIC testbed. 
The results show that SWARM+ (a) scales to 1000 distributed agents with nearly equal workload distribution across the hierarchy levels and reduced coordination overhead due to hierarchical consensus, (b) is resilient to agent failures, maintaining >99% job completion rate under single agent failure, and demonstrating graceful system degradation, with at most 7.5% impact under 50% agent failures, and (c) achieves 97-98% improvement over baseline SWARM for both selection time and scheduling latency metrics. 2026-03-19T19:51:02Z Komal Thareja Krishnan Raghavan Anirban Mandal Ewa Deelman http://arxiv.org/abs/2603.19418v1 Speculative Policy Orchestration: A Latency-Resilient Framework for Cloud-Robotic Manipulation 2026-03-19T19:24:14Z Cloud robotics enables robots to offload high-dimensional motion planning and reasoning to remote servers. However, for continuous manipulation tasks requiring high-frequency control, network latency and jitter can severely destabilize the system, causing command starvation and unsafe physical execution. To address this, we propose Speculative Policy Orchestration (SPO), a latency-resilient cloud-edge framework. SPO utilizes a cloud-hosted world model to pre-compute and stream future kinematic waypoints to a local edge buffer, decoupling execution frequency from network round-trip time. To mitigate unsafe execution caused by predictive drift, the edge node employs an $ε$-tube verifier that strictly bounds kinematic execution errors. The framework is coupled with an Adaptive Horizon Scaling mechanism that dynamically expands or shrinks the speculative pre-fetch depth based on real-time tracking error. We evaluate SPO on continuous RLBench manipulation tasks under emulated network delays. Results show that even when deployed with learned models of modest accuracy, SPO reduces network-induced idle time by over 60% compared to blocking remote inference. 
Furthermore, SPO discards approximately 60% fewer cloud predictions than static caching baselines. Ultimately, SPO enables fluid, real-time cloud-robotic control while maintaining bounded physical safety. 2026-03-19T19:24:14Z 9 pages, 7 figures, conference submission Chanh Nguyen Shutong Jin Florian T. Pokorny Erik Elmroth http://arxiv.org/abs/2603.19406v1 The Bilateral Efficiency of Ethernet: Recalibrating Metcalfe and Boggs After Fifty Years 2026-03-19T18:57:48Z In July 1976, Metcalfe and Boggs published their foundational paper on Ethernet in Communications of the ACM. Their efficiency model -- E = (P/C)/(P/C + W*T) -- measures the fraction of Ether time carrying good forward packets under contention. For fifty years this model has defined how the networking community thinks about Ethernet performance. We argue that the model, while correct for its intended purpose, measures only the forward channel and is silent on the question that matters for modern distributed systems: bilateral transaction efficiency -- the fraction of link time that produces committed agreements between sender and receiver. We show that Metcalfe and Boggs themselves understood this distinction intuitively. Their EFTP "end-dally" protocol (Section 7.2.2 of the original paper) is a three-phase bilateral handshake that attempts to achieve mutual knowledge of transfer completion -- precisely the property that their efficiency model cannot capture. We connect this observation to the Open Atomic Ethernet's bilateral transaction primitive, to the back-to-back Shannon channel formulation with Perfect Information Feedback, and to the Two-State Vector Formalism (TSVF) from physics, which provides the theoretical framework for understanding why both boundary conditions -- sender and receiver -- must be specified for a transaction to have definite value. The correction to Table 1 of Metcalfe and Boggs is not a different set of numbers. It is a different question. 
2026-03-19T18:57:48Z 10 pages, ACM sigconf format. 50th anniversary of Metcalfe and Boggs (1976) Paul Borrill http://arxiv.org/abs/2603.19163v1 cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization 2026-03-19T17:19:21Z Combinatorial optimization problems arise in logistics, scheduling, and resource allocation, yet existing approaches face a fundamental trade-off among generality, performance, and usability. We present cuGenOpt, a GPU-accelerated general-purpose metaheuristic framework that addresses all three dimensions simultaneously. At the engine level, cuGenOpt adopts a "one block evolves one solution" CUDA architecture with a unified encoding abstraction (permutation, binary, integer), a two-level adaptive operator selection mechanism, and hardware-aware resource management. At the extensibility level, a user-defined operator registration interface allows domain experts to inject problem-specific CUDA search operators. At the usability level, a JIT compilation pipeline exposes the framework as a pure-Python API, and an LLM-based modeling assistant converts natural-language problem descriptions into executable solver code. Experiments across five thematic suites on three GPU architectures (T4, V100, A800) show that cuGenOpt outperforms general MIP solvers by orders of magnitude, achieves competitive quality against specialized solvers on instances up to n=150, and attains 4.73% gap on TSP-442 within 30s. Twelve problem types spanning five encoding variants are solved to optimality. Framework-level optimizations cumulatively reduce pcb442 gap from 36% to 4.73% and boost VRPTW throughput by 75-81%. Code: https://github.com/L-yang-yang/cugenopt 2026-03-19T17:19:21Z 28 pages, 9 figures. 
Code available at https://github.com/L-yang-yang/cugenopt Yuyang Liu http://arxiv.org/abs/2603.19101v1 FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning 2026-03-19T16:26:13Z Federated Learning (FL) has emerged as a transformative paradigm for Intelligent Transportation Systems (ITS), notably camera-based Road Condition Classification (RCC). However, by enabling collaboration, FL-based RCC exposes the system to adversarial participants launching Targeted Label-Flipping Attacks (TLFAs). Malicious clients (vehicles) can relabel their local training data (e.g., labeling an actually uneven road as smooth), consequently compromising global model predictions and jeopardizing transportation safety. Existing countermeasures against such poisoning attacks fail to maintain resilient model performance near the necessary attack-free levels in various attack scenarios due to: 1) not tailoring poisoned local model detection to TLFAs, 2) not excluding malicious vehicular clients based on historical behavior, and 3) not remedying the already-corrupted global model after exclusion. To close this research gap, we propose FedTrident, which introduces: 1) neuron-wise analysis for local model misbehavior detection (notably including attack goal identification, critical feature extraction, and GMM-based model clustering and filtering); 2) adaptive client rating for client exclusion according to the local model detection results in each FL round; and 3) machine unlearning for corrupted global model remediation once malicious clients are excluded during FL. Extensive evaluation across diverse FL-RCC models, tasks, and configurations demonstrates that FedTrident can effectively thwart TLFAs, achieving performance comparable to that in attack-free scenarios and outperforming eight baseline countermeasures by 9.49% and 4.47% for the two most critical metrics. 
Moreover, FedTrident is resilient to various malicious client rates, data heterogeneity levels, complicated multi-task, and dynamic attacks. 2026-03-19T16:26:13Z Sheng Liu Panos Papadimitratos http://arxiv.org/abs/2603.19099v1 Why Synchronized Time is a Fiction: Daylight Saving Time, Leap Seconds, and the Guillotine Sharpened for Nothing 2026-03-19T16:25:13Z Civilization maintains an elaborate infrastructure devoted to the maintenance of synchronized time. Governments mandate daylight saving time. Standards bodies insert leap seconds into Coordinated Universal Time. Engineers debate leap milliseconds and leap nanoseconds. The Global Positioning System applies relativistic corrections at the nanosecond level. All of these adjustments attempt to preserve an assumption: that a single global time exists and that clocks can be made to agree upon it. This paper argues that this assumption constitutes a category mistake in the sense of Ryle (1949). We show that special and general relativity prohibit absolute simultaneity, that the one-way speed of light is conventionally defined rather than measured, and that recent experiments on indefinite causal order demonstrate nature admits correlations with no well-defined temporal sequence. We trace the consequences of this category mistake through distributed computing, where it manifests as the Forward-In-Time-Only (FITO) assumption that underlies Lamport's logical clocks (1978), the impossibility results of Fischer-Lynch-Paterson (1985), and the CAP theorem (2000). From this perspective, daylight saving time and leap seconds are not corrections to time but corrections to conventions -- they sharpen the guillotine of synchronization in preparation for executing something that does not exist. 
2026-03-19T16:25:13Z 18 pages, 24 references Paul Borrill http://arxiv.org/abs/2603.19016v1 Literature Study on Operational Data Analytics Frameworks in Large-scale Computing Infrastructures 2026-03-19T15:20:30Z As of 2025, zettabytes of data are generated every year. The size and complexity of modern large-scale computing infrastructures like High-Performance Computing (HPC) systems continue to grow, raising concerns about their manageability and sustainability. For this reason, such complex systems are provided with fine-grained monitoring and Operational Data Analytics (ODA) capabilities to optimise their efficiency. In this literature study, we list the fundamental pillars of large-scale computing infrastructures that enable their ODA capabilities, and conduct a study of the popular ODA frameworks operating in various such environments (predominantly HPC). Based on that, we propose a more holistic ODA framework matching the various layers of a large-scale graph-processing distributed ecosystem proposed by Sherif Sak et al., which extends the ODA functionalities presented in an existing novel ODA framework proposed by Netti et al. We compare our holistic ODA framework to some of the state-of-the-art frameworks covered in this study to highlight its novelty, which we hope will draw more attention to extensive research in this field. To raise awareness, we highlight the significant operational efficiencies observed as a result of implementing state-of-the-art ODA frameworks, to show the benefit of the study to readers, and lastly, discuss the trending research work ongoing in this field. 
2026-03-19T15:20:30Z Shekhar Suman Xiaoyu Chu Alexandru Iosup http://arxiv.org/abs/2603.18897v1 Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution 2026-03-19T13:36:50Z LLM-powered agents are emerging as a dominant paradigm for autonomous task solving. Unlike standard inference workloads, agents operate in a strictly serial "LLM-tool" loop, where the LLM must wait for external tool execution at every step. This execution model introduces severe latency bottlenecks. To address this problem, we propose PASTE, a Pattern-Aware Speculative Tool Execution method designed to hide tool latency through speculation. PASTE is based on the insight that although agent requests are semantically diverse, they exhibit stable application-level control flows (recurring tool-call sequences) and predictable data dependencies (parameter passing between tools). By exploiting these properties, PASTE improves agent serving performance through speculative tool execution. Experimental results against state-of-the-art baselines show that PASTE reduces average task completion time by 48.5% and improves tool execution throughput by 1.8x. 2026-03-19T13:36:50Z Yifan Sui Han Zhao Rui Ma Zhiyuan He Hao Wang Jianxun Li Yuqing Yang http://arxiv.org/abs/2604.00028v1 Sequence-Aware Split Heuristic to Mitigate SM Underutilization in FlashAttention-3 Low-Head-Count Decoding 2026-03-19T11:44:20Z The standard FlashAttention-3 heuristic exhibits a GPU occupancy bottleneck in low-head-count decoding configurations because it disables sequence splitting based on sequence length alone, underutilizing the Streaming Multiprocessors of Hopper GPUs. Our proposed sequence-aware split policy mitigates this by allowing sequence-level parallelism in low-head-count regimes, improving hardware utilization to deliver roughly a 21 to 24% improvement in decoder kernel efficiency on metadata-enabled inference paths, with no observed regressions. 
2026-03-19T11:44:20Z Martí Llopart Font Javier Hernando Cristina España-Bonet http://arxiv.org/abs/2507.06542v3 On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning 2026-03-19T10:05:02Z Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research. 
2025-07-09T04:56:56Z We discover and theoretically explain why and when a single global parameter merging in decentralized learning can recover the performance of federated learning, even in highly heterogeneous and communication-constrained environments ICLR 2026 (Oral Presentation) Tongtian Zhu Tianyu Zhang Mingze Wang Zhanpeng Zhou Can Wang http://arxiv.org/abs/2603.18695v1 High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia 2026-03-19T09:53:32Z Portable GPU frameworks such as Kokkos and RAJA reduce the burden of cross-architecture development but typically incur measurable overhead on fundamental parallel primitives relative to vendor-optimized libraries. We present KernelForge.jl, a Julia library that implements scan, mapreduce, and matrix-vector primitives through a two-layer portable architecture: KernelIntrinsics.jl provides backend-agnostic abstractions for warp-level shuffles, memory fences, and vectorized memory access, while KernelForge.jl builds high-performance algorithms exclusively on top of these interfaces. Evaluated on an NVIDIA A40 and an AMD MI300X, KernelForge.jl matches or exceeds CUB kernel execution time on scan and mapreduce on the A40, and matches cuBLAS throughput on matrix-vector operations across most tested configurations, demonstrating, as a proof of concept, that portable JIT-compiled abstractions can achieve vendor-level throughput without sacrificing generality. 2026-03-19T09:53:32Z Emmanuel Pilliat ENSAI http://arxiv.org/abs/2603.28783v1 Comprehensive Plugin-Based Monitoring of Nextflow Workflow Executions 2026-03-19T09:41:22Z Nextflow is a workflow management system commonly used in fields like bioinformatics and earth observation. It coordinates distributed data processing of various tools as an acyclic sequence of tasks while using containerization (e.g., Docker), orchestration (e.g., Kubernetes), or batch processing (e.g., SLURM). 
Monitoring such workflow executions can be challenging but aids performance analysis, debugging, and data provenance. Besides Nextflow's basic built-in monitoring, the wf-commons tool for creating wf-instances is widely regarded as the standard in the Nextflow community. The monitoring plugin we developed provides a more detailed and flexible alternative that is compatible with wf-instances. It removes the need for a custom Nextflow fork by using Nextflow's plug-in mechanism (version 21.10), optionally supports direct .jar file changes of static artifacts without recompilation, and allows online monitoring during execution. 2026-03-19T09:41:22Z Accepted as poster to the SCA/HPC Asia 2026 Conference. The poster is provided as an ancillary file Sami Kharma Tobias Wies Florian Schintke http://arxiv.org/abs/2506.02009v2 STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds 2026-03-19T04:55:11Z In cloud-scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be more frequent. The demand for autonomous, AI-driven reliability engineering continues to grow, as existing human-in-the-loop practices can hardly keep up with the scale of modern clouds. This paper presents STRATUS, an LLM-based multi-agent system for realizing autonomous Site Reliability Engineering (SRE) of cloud services. STRATUS consists of multiple specialized agents (e.g., for failure detection, diagnosis, mitigation), organized in a state machine to assist system-level safety reasoning and enforcement. We formalize a key safety specification of agentic SRE systems like STRATUS, termed Transactional No-Regression (TNR), which enables safe exploration and iteration. We show that TNR can effectively improve autonomous failure mitigation. 
STRATUS significantly outperforms state-of-the-art SRE agents in terms of success rate of failure mitigation problems in AIOpsLab and ITBench (two SRE benchmark suites), by at least 1.5 times across various models. STRATUS shows a promising path toward practical deployment of agentic systems for cloud reliability. 2025-05-27T19:15:19Z 10 pages for main text Yinfang Chen Jiaqi Pan Jackson Clark Yiming Su Noah Zheutlin Bhavya Bhavya Rohan Arora Yu Deng Saurabh Jha Tianyin Xu