http://arxiv.org/abs/2602.09937v2Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?2026-03-04T10:13:10ZFailures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final answer correctness without revealing why the agent's reasoning failed. This paper presents a process-level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLM models, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points.
The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.2026-02-10T16:14:05ZTaeyoon KimWoohyeok ParkHoyeong YunKyungyong Leehttp://arxiv.org/abs/2603.03899v1A framework to reason about consistency and atomicity guarantees in a sparsely-connected, partially-replicated peer-to-peer system2026-03-04T09:59:25ZFor an offline-first collaborative application to operate in true peer-to-peer fashion, its collaborative features must function even in environments where internet connectivity is limited or unavailable. Each peer may only be interested in a subset of the application data relevant to its workload, and this subset can overlap in different ways with those of other peers. Limitations imposed by access control and mesh network technologies often result in peers being sparsely connected. Reasoning about consistency in these systems is hard, especially when considering transactional updates that may alter different sets of data in the same transaction. We present \textsc{IntersectionAtomicity} and \textsc{IntersectionCC} as models to reason about offline-first collaborative applications that are sparsely-connected and rely on partially replicating different subsets of a broader set of data. We then use these models to propose a set of guidelines to help developers design their application with atomicity and consistency guarantees.2026-03-04T09:59:25Z7 pages, 1 figureSreeja S. NairNicholas E. MarinoNick PascucciRussell BrownArthur P. R. SilvaTim CummingsConnor M. Powerhttp://arxiv.org/abs/2512.03685v2Distributed Quantum Computing with Fan-Out Operations and Qudits: the Case of Distributed Global Gates2026-03-04T07:43:07ZMuch recent work on distributed quantum computing has focused on the use of entangled pairs and distributed two qubit gates.
But there has also been work on efficient schemes for achieving multipartite entanglement between nodes in a single shot, removing the need to generate multipartite entangled states using many entangled pairs. This paper looks at how multipartite entanglement resources (e.g., GHZ states) can be useful for distributed fan-out operations; we also consider the use of qudits of dimension four for distributed quantum circuit compression. In particular, we consider how such fan-out operations and qudits can be used to implement circuits which are challenging for distributed quantum computation, involving pairwise qubit interactions, i.e., what has been called global gates (a.k.a. global Mølmer-Sørensen gates). Such gates have been explored to possibly yield more efficient computations via reduced circuit depth, and can be carried out efficiently in some types of quantum hardware (e.g., trapped-ion quantum computers); we consider this as an exploration of an ``extreme'' case for distribution given the global qubit-qubit interactions. We also conclude with some implications for future work on quantum circuit compilation and quantum data centre design.2025-12-03T11:26:47Z8 pages, 10 figures; preliminary version (if mistakes found - please contact the author); accepted at QCNC 2026Seng W. Lokehttp://arxiv.org/abs/2603.03743v1The Semantic Arrow of Time, Part II: The Semantics of Open Atomic Ethernet2026-03-04T05:29:23ZThis is the second of five papers comprising The Semantic Arrow of Time. Part I established that computing's arrow of time is semantic rather than thermodynamic, and that the Forward-In-Time-Only (FITO) assumption constitutes a category mistake. This paper develops the constructive alternative. We present the semantics of Open Atomic Ethernet (OAE) links as a concrete realization of a non-FITO protocol architecture. 
The key insight is that causal order is not assumed a priori but created through transaction structure: the link state machine progresses through TENTATIVE to REFLECTING to COMMITTED, with the option to abort at any point before commitment. Delivery does not imply commitment; commitment requires reflective acknowledgment -- proof that information has round-tripped and been semantically validated by both endpoints. We formalize this through three frameworks. First, the OAE link state machine, a six-state finite automaton whose normative invariants guarantee that semantic corruption cannot occur at the link level. Second, Indefinite Logical Timestamps (ILT), a four-valued causal structure that admits a genuinely indefinite relation between concurrent events, resolving only after symmetric link-level exchange. Third, the Slowdown Theorem applied to links, which establishes that round-trip measurement is the minimum interaction required to establish causal order. We show that ILT is strictly more expressive than Definite Causal Order systems for reversible link protocols. We connect these results to the Knowledge Balance Principle from quantum information theory. The paper concludes with a comparative analysis showing that OAE achieves infinite consensus number while RDMA, NVLink, and UALink remain limited to finite consensus numbers due to their FITO semantics.2026-03-04T05:29:23ZPaul Borrillhttp://arxiv.org/abs/2603.03738v1Exploring Challenges in Developing Edge-Cloud-Native Applications Across Multiple Business Domains2026-03-04T05:15:33ZAs the convergence of cloud computing and advanced networking continues to reshape modern software development, edge-cloud-native paradigms have become essential for enabling scalable, resilient, and agile digital services that depend on high-performance, low-latency, and reliable communication. 
This study investigates the practical challenges of developing, deploying, and maintaining edge-cloud-native applications through in-depth interviews with professionals from diverse domains, including IT, finance, healthcare, education, and industry. Despite significant advancements in cloud technologies, practitioners, particularly those from non-technical backgrounds, continue to encounter substantial complexity stemming from fragmented toolchains, the operational overhead of managing distributed networking and computing, the difficulty of ensuring consistent performance across hybrid environments, and steep learning curves at the cloud-network boundary. Across sectors, participants consistently prioritized productivity, Quality of Service, and usability over conventional concerns such as cost or migration. These findings highlight the need for operationally simplified, SLA-aware, and developer-friendly platforms that streamline the full application lifecycle. This study contributes a practice-informed perspective to support the alignment of edge-cloud-native systems with the realities and needs of modern enterprises, offering critical insights for the advancement of seamless cloud-network convergence.2026-03-04T05:15:33ZPawissanutt LertpongrujikornHai Duc NguyenJuahn KwonMohsen Amini Salehihttp://arxiv.org/abs/2603.03736v1The Ghost in the Datacenter: Link Flapping, Topology Knowledge Failures, and the FITO Category Mistake2026-03-04T05:12:40ZEvery link disconnection or flap in a datacenter corrupts the network's self-knowledge -- its graph. We call this corruption a ghost: a node that appears reachable but is not, a link that reports "up" but silently drops traffic, or an IP address that resolves to a partitioned machine.
Ghosts arise at every scale -- chiplet-to-chiplet (PCIe, UCIe), GPU-to-GPU (NVLink, NVSwitch), node-to-node (Ethernet, Thunderbolt), and cluster-to-cluster (IP, BGP) -- because all these protocols inherit Shannon's forward-in-time-only (FITO) channel model and use Timeout And Retry (TAR) as their failure detector. TAR cannot distinguish "slow" from "dead," which is precisely the ambiguity that Fischer--Lynch--Paterson proved unresolvable in asynchronous systems. We survey the problem using production data from Meta (419 interruptions in 54 days of LLaMA 3 training), ByteDance (38,236 explicit and 5,948 implicit failures in three months), Google (TPUv4 optical circuit switching), and Alibaba (0.057% NIC--ToR link failures per month). At 2025 cluster scale (${\sim}3$ million GPUs, ${>}10$ million optical links), a link flap occurs every 48 seconds. We show that every existing mitigation -- Phi Accrual failure detectors, SWIM, BFD, OSPF/ISIS fast convergence, SmartNIC offload, lossless Ethernet (RoCE/PFC), and Kubernetes pod eviction -- still creates ghosts because each is fundamentally timeout-based. We connect ghosts to gray failures (Huang et al., HotOS 2017) and metastable failures (Bronson et al., HotOS 2021; validated across 22 failures at 11 organizations, OSDI 2022). We argue that Open Atomic Ethernet eliminates ghosts at the link layer through a Reliable Link Failure Detector, Perfect Information Feedback, triangle failover, and atomic token transfer -- making topology knowledge transactional.2026-03-04T05:12:40ZPaul Borrillhttp://arxiv.org/abs/2603.03731v1HyperParallel: A Supernode-Affinity AI Framework2026-03-04T05:03:33ZThe emergence of large-scale, sparse, multimodal, and agentic AI models has coincided with a shift in hardware toward supernode architectures that integrate hundreds to thousands of accelerators with ultra-low-latency interconnects and unified memory pools. 
However, existing AI frameworks are not designed to exploit these architectures efficiently, leading to high programming complexity, load imbalance, and poor memory utilization. In this paper, we propose a supernode-affinity AI framework that treats the supernode as a single logical computer and embeds hardware-aware orchestration into the framework. Implemented in MindSpore, our HyperParallel architecture comprises HyperOffload for automated hierarchical memory management, HyperMPMD for fine-grained MPMD parallelism across heterogeneous workloads, and HyperShard for declarative parallel strategy specification. Together, these techniques significantly improve training and inference efficiency while reducing parallel programming and system tuning overhead, demonstrating the necessity of supernode affinity for next-generation AI frameworks.2026-03-04T05:03:33ZXin ZhangBeilei SunTeng SuQinghua ZhangChong BaoLei ChenXuefeng Jinhttp://arxiv.org/abs/2603.03592v1SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training2026-03-03T23:51:10ZDecentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. While existing Byzantine-tolerant literature addresses data parallel (DP) training through robust aggregation methods, pipeline parallelism (PP) presents fundamentally distinct challenges. In PP, model layers are distributed across workers where the activations and their gradients flow between stages rather than being aggregated, making traditional DP approaches inapplicable. We propose SENTINEL, a verification mechanism for PP training without computation duplication. SENTINEL employs lightweight momentum-based monitoring using exponential moving averages (EMAs) to detect corrupted inter-stage communication. 
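As an aside on the SENTINEL abstract, the idea of EMA-based monitoring of inter-stage traffic can be illustrated with a minimal sketch. This is not SENTINEL's actual algorithm: the class name `EmaMonitor`, the decay `beta`, the threshold `tol`, and the relative-distance test are all assumptions chosen for illustration only.

```python
import math


class EmaMonitor:
    """Hypothetical detector: flags a vector that strays too far from a running EMA."""

    def __init__(self, beta=0.9, tol=3.0):
        self.beta = beta      # EMA decay (assumed value)
        self.tol = tol        # allowed relative deviation (assumed value)
        self.ema = None       # running average of accepted vectors

    def check(self, vec):
        """Return True if `vec` looks consistent with history, else False."""
        if self.ema is None:
            self.ema = list(vec)          # first observation seeds the EMA
            return True
        dist = math.sqrt(sum((v - m) ** 2 for v, m in zip(vec, self.ema)))
        scale = math.sqrt(sum(m * m for m in self.ema)) + 1e-12
        if dist / scale > self.tol:
            return False                  # suspicious: do not fold into the EMA
        self.ema = [self.beta * m + (1 - self.beta) * v
                    for m, v in zip(self.ema, vec)]
        return True


monitor = EmaMonitor()
honest = [[1.0, 2.0, 3.0], [1.1, 1.9, 3.2], [0.9, 2.1, 2.9]]
results = [monitor.check(v) for v in honest]          # all accepted
corrupted_ok = monitor.check([100.0, -50.0, 80.0])    # rejected as an outlier
```

The sketch keeps one running average per monitored channel, so it adds no duplicated computation; a rejected vector is deliberately kept out of the average so that a corrupted stage cannot drag the reference toward itself.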
Unlike existing Byzantine-tolerant approaches for DP that aggregate parameter gradients across replicas, our approach verifies sequential activation/gradient transmission between layers. We provide theoretical convergence guarantees for this new setting that recover classical convergence rates when relaxed to standard training. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.2026-03-03T23:51:10Z70 pages, 22 figures, 20 tablesHadi Mohaghegh DolatabadiThalaiyasingam AjanthanSameera RamasingheChamin P Hewa KoneputugodageGil AvrahamYan ZuoVioletta ShevchenkoAlexander Longhttp://arxiv.org/abs/2510.12469v2Proof of Cloud: Data Center Execution Assurance for Confidential VMs2026-03-03T21:53:51ZConfidential Virtual Machines (CVMs) protect data in use by running workloads within hardware-enforced Trusted Execution Environments (TEEs). However, existing CVM attestation mechanisms only certify what code is running, not where it is running. Commercial TEEs mitigate passive physical attacks through memory encryption but explicitly exclude active hardware tampering (memory interposers, physical side channels, ...). Yet current attestations provide no cryptographic evidence that a CVM executes on hardware residing within a trusted data center where such attacks would not take place. This gap enables proxy attacks in which valid attestations are combined across machines to falsely attest trusted execution.
To bridge this gap, we introduce Data Center Execution Assurance (DCEA), a design that generates a cryptographic Proof of Cloud by binding CVM attestation to platform-level Trusted Platform Module (TPM) evidence. DCEA combines two independent roots of trust, the TEE manufacturer and the infrastructure provider, by cross-linking runtime TEE measurements with the vTPM-measured boot CVM state. This binding ensures that CVM execution, vTPM quotes, and platform provenance all originate from the same physical chassis.
We formalize the environment's provenance and show that DCEA prevents advanced relay attacks, including a novel mix-and-match proxy attack. Using the AGATE framework in the Universal Composability model, we prove that DCEA emulates an ideal location-aware TEE even under a malicious host software stack. We implement DCEA on Google Cloud bare-metal Intel TDX instances using Intel TXT and evaluate its performance, demonstrating practical overheads and deployability. DCEA refines the CVM threat model and enables verifiable execution-location guarantees for privacy-sensitive workloads.2025-10-14T13:01:48ZFilip RezabekMoe MahhoukAndrew MillerQuintus KilbournGeorg CarleJonathan Passerat-Palmbachhttp://arxiv.org/abs/2603.03470v1Bisynchronous FIFOs and the FITO Category Mistake: Silicon-Proven Interaction Primitives for Distributed Coordination2026-03-03T19:28:59ZBisynchronous FIFOs -- hardware buffers that mediate data transfer between independent clock domains without a shared global timebase -- have been designed, formally verified, and commercially deployed in silicon for over four decades. We survey this literature from Chapiro's 1984 GALS thesis through Cummings's Gray-code pointer techniques, Chelcea and Nowick's mixed-timing interfaces, Greenstreet's STARI protocol, and the 2015 NVIDIA pausible bisynchronous FIFO, and argue that this body of work constitutes a silicon-proven existence proof against the Forward-In-Time-Only (FITO) assumption that pervades distributed systems. The central claim is that interaction-based synchronization primitives -- handshakes, mutual exclusion, and causal flow control -- can replace timestamp-based coordination at the most demanding levels of digital engineering, directly undermining the FITO assumption in protocols such as PTP, TSN, and conventional Ethernet. 
We draw a structural parallel between on-chip bisynchronous coordination and the Open Atomic Ethernet (OAE) architecture, and identify the handshake -- not the timestamp -- as the fundamental primitive for coordination between independent causal domains.2026-03-03T19:28:59ZPaul Borrillhttp://arxiv.org/abs/2603.00766v2Black Hole Search: Dynamics, Distribution, and Emergence2026-03-03T19:25:54ZA black hole is a malicious node in a graph that destroys resources entering into it without leaving any trace. The problem of Black Hole Search (BHS) using mobile agents requires that at least one agent survives and terminates after locating the black hole. Recently, this problem has been studied on 1-bounded 1-interval connected dynamic graphs \cite{BHS_gen}, where there is a footprint graph, and at most one edge can disappear from the footprint in a round, provided that the graph remains connected. In this setting, the authors in \cite{BHS_gen} proposed an algorithm that solves the BHS problem when all agents start from a single node (rooted initial configuration). They also proved that at least $2δ_{BH} + 1$ agents are necessary to solve the problem when agents are initially placed arbitrarily across the nodes of the graph (scattered initial configuration), where $δ_{BH}$ denotes the degree of the black hole. In this work, we present an algorithm that solves the BHS problem using $2δ_{BH} + 17$ initially scattered agents. Our result matches asymptotically with the rooted algorithm of \cite{BHS_gen} under the same model assumptions.
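The two agent counts quoted in the paragraph above, the scattered-configuration lower bound $2δ_{BH} + 1$ and the $2δ_{BH} + 17$ agents used by the algorithm, differ only by a constant, which a one-line calculation makes explicit:

```latex
\[
(2\delta_{BH} + 17) - (2\delta_{BH} + 1) = 16,
\qquad
\lim_{\delta_{BH}\to\infty} \frac{2\delta_{BH} + 17}{2\delta_{BH} + 1} = 1 .
\]
```

The algorithm therefore uses 16 agents more than the stated lower bound, independent of the degree of the black hole.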
Further, we study the Eventual Black Hole Search (\textsc{Ebhs}) problem, in which the black hole may appear at any node and at any time during the execution of the algorithm, destroying all agents located on that node at the time of its appearance. However, the black hole cannot emerge at the home base in round~0, where the home base is the node at which all agents are initially co-located. Once the black hole appears, it remains active at that node for the rest of the execution. This problem has been studied on static rings~\cite{Bonnet25}; here we extend it to arbitrary static graphs and provide a solution using four agents. Moreover, the solution does not require any knowledge of global parameters or additional model assumptions.2026-02-28T18:22:48ZTanvir KaurAshish SaxenaPartha Sarathi MandalKaushik Mondalhttp://arxiv.org/abs/2603.03089v1Serverless Abstractions for Short-Running, Lightweight Streams2026-03-03T15:31:42ZServerless computing and stream processing represent two dominant paradigms for event-driven data processing, yet both make assumptions that render them inefficient for short-running, lightweight, and unpredictable streams that require stateful processing. We propose stream functions as a novel extension of the Function-as-a-Service model that treats short streams as the unit of execution, state, and scaling. Stream functions process streams via an iterator-based interface, enabling seamless inter-event logic while retaining the elasticity and scale-to-zero capabilities offered by serverless platforms. Our evaluation shows that stream functions reduce the processing overhead by ~99% compared to a mature stream processing engine in a video-processing use case.
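To make the iterator-based interface mentioned in the serverless abstract more concrete, here is a minimal sketch of what a stream function could look like. It is an illustration under assumptions, not the authors' API: the function name `running_average` and the choice of plain Python generators are hypothetical.

```python
from typing import Iterable, Iterator


def running_average(events: Iterable[float]) -> Iterator[float]:
    """Hypothetical stream function: one invocation handles one whole stream.

    State (count, total) lives in local variables for the stream's lifetime,
    so inter-event logic needs no external state store.
    """
    count = 0
    total = 0.0
    for value in events:          # events are pulled one at a time
        count += 1
        total += value
        yield total / count       # emit one result per event


outputs = list(running_average(iter([2.0, 4.0, 6.0])))
```

Because the whole stream is consumed by a single call, the invocation ends when the iterator is exhausted, which is the point at which a platform could release the instance and scale to zero.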
By providing comparable performance to serverless functions with stream semantics, stream functions provide an effective and efficient abstraction for a class of workloads underserved by existing models.2026-03-03T15:31:42ZAccepted for publication at the 4th Workshop on SErverless Systems, Applications and MEthodologies (SESAME '26)Natalie CarlNiklas KowallikConstantin StahlTrever SchirmerTobias PfandzelterDavid Bermbachhttp://arxiv.org/abs/2603.03023v1Dynamic Contract Analysis for Parallel Programming Models2026-03-03T14:15:29ZParallel programming in high-performance computing depends on low-level APIs such as MPI, requiring users to manage synchronization and resources manually. Several correctness-checking tools exist to support bug-free code development, though most target a single programming model, limiting their applicability. Our previous work, the static analysis tool CoVer, leverages a contract-based approach enabling users to specify custom error-checking rules and support emerging or unconventional programming models without requiring extensive new tooling. However, static analysis cannot fully reason about runtime-dependent behavior such as pointer aliasing or indirect control flow. To address this, we present CoVer-Dynamic, a dynamic analysis extension that reuses CoVer's contract language to provide a unified static-dynamic verification framework. By enforcing the same contracts at runtime, CoVer-Dynamic improves classification accuracy and eliminates false positives on standardized MPI and OpenSHMEM benchmarks, while detecting errors beyond the reach of static analysis alone. Our evaluation shows that CoVer-Dynamic consistently outperforms the state-of-the-art correctness checker MUST, averaging a 2x speedup. Finally, our results show limitations in the expressiveness of the contract language, motivating future work to support more error classes.2026-03-03T14:15:29ZA peer-reviewed version is to be published by IEEE as part of the IPDPS HIPS workshop proceedings.
This is the originally submitted articleYussur Mustafa OrajiAlexander HückChristian Bischofhttp://arxiv.org/abs/2512.08725v2Spatio-Temporal Shifting to Reduce Carbon, Water, and Land-Use Footprints of Cloud Workloads2026-03-03T14:14:03ZIn this paper, we investigate the potential of spatial and temporal cloud workload shifting to reduce carbon, water, and land use footprints. Specifically, we perform a simulation study leveraging publicly available data on the cloud infrastructure of major providers (AWS and Azure) as well as real-world workload traces (big data analytics and FaaS) and grid mix data to consider two different scenarios. Our simulation results indicate that spatial shifting can substantially lower carbon, water, and land use footprints. In the FaaS applications, shifting the spatiotemporal workload achieves carbon savings of up to 85%, water savings of around 50%, and reductions in land use of up to 45%, all while optimizing for the respective factors. Mixed optimization yields results comparable to those of land use alone. For big data workloads, spatiotemporal shifting delivers reductions of up to 45% in carbon emissions, 40% in water consumption, and nearly 40% in land use when optimized for the respective factors. Temporal shifting also decreases the footprint, though to a lesser extent. When applied together, the two strategies yield the greatest overall reduction, driven mainly by spatial shifting with temporal adjustments providing an additional, incremental benefit. 
Sensitivity analysis demonstrates that such shifting is robust to prediction errors in grid mix data and to variations across different seasons.2025-12-09T15:39:06ZThis is a pre-print of our paper currently under reviewGiulio AttenniYoussef MoawadNovella BartoliniLauritz Thamsenhttp://arxiv.org/abs/2603.03007v1Breaking the Prototype Bias Loop: Confidence-Aware Federated Contrastive Learning for Highly Imbalanced Clients2026-03-03T14:01:08ZLocal class imbalance and data heterogeneity across clients often trap prototype-based federated contrastive learning in a prototype bias loop: biased local prototypes induced by imbalanced data are aggregated into biased global prototypes, which are repeatedly reused as contrastive anchors, accumulating errors across communication rounds. To break this loop, we propose Confidence-Aware Federated Contrastive Learning (CAFedCL), a novel framework that improves the prototype aggregation mechanism and strengthens the contrastive alignment guided by prototypes. CAFedCL employs a confidence-aware aggregation mechanism that leverages predictive uncertainty to downweight high-variance local prototypes. In addition, generative augmentation for minority classes and geometric consistency regularization are integrated to stabilize the structure between classes. From a theoretical perspective, we provide an expectation-based analysis showing that our aggregation reduces estimation variance, thereby bounding global prototype drift and ensuring convergence. Extensive experiments under varying levels of class imbalance and data heterogeneity demonstrate that CAFedCL consistently outperforms representative federated baselines in both accuracy and client fairness.2026-03-03T14:01:08ZTian-Shuang WuShen-Huan LyuNing ChenYi-Xiao HeBing TangBaoliu YeQingfu Zhang
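The confidence-aware aggregation described in the CAFedCL abstract above, in which high-variance local prototypes are downweighted, can be illustrated with a small sketch. It uses plain inverse-variance weighting as a stand-in; the abstract does not give CAFedCL's actual weighting rule, and the function name `aggregate_prototypes`, the `eps` constant, and the toy data are all assumptions.

```python
def aggregate_prototypes(prototypes, variances, eps=1e-8):
    """Weighted mean of client prototypes with weights proportional to 1/variance.

    Inverse-variance weighting is an assumed stand-in for the paper's
    uncertainty-based weights, not the published method.
    """
    weights = [1.0 / (v + eps) for v in variances]
    total = sum(weights)
    dim = len(prototypes[0])
    return [
        sum(w * p[d] for w, p in zip(weights, prototypes)) / total
        for d in range(dim)
    ]


# Two confident clients roughly agree; one high-variance client is an outlier.
protos = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]]
variances = [0.1, 0.1, 10.0]
weighted = aggregate_prototypes(protos, variances)
uniform = aggregate_prototypes(protos, [1.0, 1.0, 1.0])
```

With these toy numbers the weighted result stays close to the two confident clients, while the uniform average is pulled toward the outlier, which is the kind of drift the abstract says the aggregation is meant to bound.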