Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions

2026-06-02T20:21:58Z

Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent's call signs a receipt of what it observed using its own key, encrypts the receipt to the agent's owner, and publishes it to a public transparency log. The owner reconstructs a tamper-evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver-side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness-cosigned Merkle log, and (P4) owner-side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt-protocol work (Signet, AgentROA, Agent Passport System, draft-farley-acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption-incentive problem.

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

2026-06-02T19:03:39Z

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
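
The stopping rule described in the abstract (terminate on k consecutive eval-score declines while keeping the best checkpoint) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `EvalStopMonitor`, the function `run_job`, and the reading of "decline" as "strictly lower than the previous eval score" are assumptions made for illustration, and GPU release and delegation to a base scheduler are left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class EvalStopMonitor:
    """Hypothetical per-job monitor: stop after k consecutive eval declines."""
    k: int = 3
    best_score: float = float("-inf")
    best_step: Optional[int] = None   # step whose checkpoint would be preserved
    prev_score: Optional[float] = None
    declines: int = 0                 # current run of consecutive declines

    def observe(self, step: int, score: float) -> bool:
        """Record one eval score; return True if the job should be terminated."""
        if score > self.best_score:
            self.best_score, self.best_step = score, step
        # Assumption: a "decline" means strictly lower than the previous eval.
        if self.prev_score is not None and score < self.prev_score:
            self.declines += 1
        else:
            self.declines = 0
        self.prev_score = score
        return self.declines >= self.k


def run_job(scores: Iterable[float], k: int = 3) -> Tuple[Optional[int], Optional[int]]:
    """Return (stop_step, best_step); stop_step is None if the job is never stopped."""
    monitor = EvalStopMonitor(k=k)
    for step, score in enumerate(scores):
        if monitor.observe(step, score):
            return step, monitor.best_step
    return None, monitor.best_step


if __name__ == "__main__":
    hacking = [0.50, 0.60, 0.65, 0.63, 0.61, 0.58, 0.70]
    healthy = [0.50, 0.55, 0.54, 0.60, 0.59, 0.65]
    print(run_job(hacking))   # stopped once three declines in a row are seen
    print(run_job(healthy))   # isolated dips reset the counter
```

Resetting the counter on any non-declining eval is what keeps isolated noisy dips from triggering a stop; a larger k trades detection delay for fewer false positives.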

Fides: Secure and Scalable Asynchronous DAG Consensus via Trusted Components

2026-06-02T18:27:38Z

DAG-based BFT consensus has attracted growing interest in distributed data management systems for consistent replication in untrusted settings due to its high throughput and resilience to asynchrony. However, existing protocols still suffer from high communication overhead and long commit latency. In parallel, introducing minimal hardware trust has proven effective in reducing the complexity of BFT consensus. Inspired by these works, we present Fides, an asynchronous DAG-based BFT consensus protocol that, to our knowledge, is among the first to leverage TEEs to enhance both scalability and efficiency. Fides tolerates a minority of Byzantine replicas and achieves $O(κn^2 + n^3)$ metadata communication complexity through a customized TEE-assisted Reliable Broadcast (T-RBC) primitive with linear communication complexity in one-step broadcast. Building on T-RBC, Fides redefines the DAG construction rules by reducing the reference requirement from $2f+1$ to $f+1$ between consecutive vertices. This new structure weakens DAG connectivity and invalidates traditional commit rules, so we formally abstract the problem and derive new theoretical bounds of liveness. We further propose a four-round commit rule that achieves the theoretically minimal commit latency. Besides, we design two additional primitives, T-RoundCert and T-Coin, to efficiently certify DAG references and replace the costly cryptographic common coin used in prior protocols. Comprehensive evaluations on geo-distributed and local testbeds show that Fides substantially outperforms state-of-the-art protocols, including Tusk, Bullshark, Mysticeti, RCC, Damysus, Achilles and HybridSet, achieving lower latency and higher throughput while preserving strong safety and liveness guarantees.

P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8

2026-06-02T17:29:15Z

FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challenges when the softmax probability matrix~$P$ is cast to FP8 before the $P \cdot V$ matrix multiplication. We analyze two implementation choices that affect output precision under the \emph{Attention Sink} phenomenon: (1)~the KV block iteration order, and (2)~the static scaling factor applied to $P$ before casting. We show that forward KV iteration causes \emph{P-collapse} -- to leading order a fraction $Φ(Δ + δ_k - 6.93 - \ln S)$ of non-sink $P$ values underflow to zero, where the small shift $δ_k \approx 1$ (for $k_{\text{sink}}{=}4$) is the expected within-sink-block score maximum -- and that reverse iteration removes it, with a zero-underflow guarantee when reverse is combined with $S{=}256$. We further give a constructive characterization of $S = 256 = 2^8$ as the static scale that simultaneously satisfies (i)~bit-exact IEEE 754 scaling, (ii)~the lower envelope of a sawtooth function $dp(S)$ over the E4M3 number line ($dp = 2^{-4}$, the minimum worst-case quantization step), and (iii)~the maximum normal-range coverage \emph{among bit-exact ($2^k$) scales} (a non-bit-exact scale such as $448$ attains slightly higher coverage; Sec. 5). Both optimizations are already deployed in FlashAttention-3/4 on engineering grounds; our contribution is a quantitative account of \emph{why} these choices are good and a closed-form threshold $Δ_c = 6.93 + \ln S - δ_k$ for predicting kernel-level precision loss. Kernel-faithful experiments ($Q, K, V$ in FP32 to isolate the P-cast effect) show $3$-$10\times$ MSE improvement at moderate sink strengths, and paired tests confirm both fixes saturate to the same precision floor when combined -- which motivated updating the hpc-ops kernel from $S{=}1$ to $S{=}256$.
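
As a worked illustration of the closed-form threshold quoted in the abstract, the following sketch evaluates $Δ_c = 6.93 + \ln S - δ_k$ and the leading-order underflow fraction $Φ(Δ + δ_k - 6.93 - \ln S)$ for the two scales the abstract mentions. Two points are assumptions rather than statements from the source: $Φ$ is taken to be the standard normal CDF, and $Δ$ is treated as a free sink-strength parameter, since the abstract does not define either. The function names are hypothetical.

```python
import math

# Constant and default shift copied from the abstract's formula.
UNDERFLOW_CONST = 6.93
DEFAULT_DELTA_K = 1.0  # the abstract gives delta_k ~ 1 for k_sink = 4


def delta_c(scale: float, delta_k: float = DEFAULT_DELTA_K) -> float:
    """Threshold Delta_c = 6.93 + ln(S) - delta_k from the abstract."""
    return UNDERFLOW_CONST + math.log(scale) - delta_k


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF (assumed meaning of Phi)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def underflow_fraction(delta: float, scale: float,
                       delta_k: float = DEFAULT_DELTA_K) -> float:
    """Leading-order fraction Phi(Delta + delta_k - 6.93 - ln S)."""
    return std_normal_cdf(delta + delta_k - UNDERFLOW_CONST - math.log(scale))


if __name__ == "__main__":
    for scale in (1, 256):
        print(f"S={scale}: Delta_c = {delta_c(scale):.3f}")
    # Sweep an illustrative range of Delta values for both scales.
    for delta in (4.0, 6.0, 8.0, 10.0):
        f1 = underflow_fraction(delta, 1)
        f256 = underflow_fraction(delta, 256)
        print(f"Delta={delta}: S=1 -> {f1:.4f}, S=256 -> {f256:.4f}")
```

Because the argument of $Φ$ equals $Δ - Δ_c$, the predicted fraction is exactly one half at $Δ = Δ_c$, and moving from $S{=}1$ to $S{=}256$ shifts the threshold up by $\ln 256 = 8 \ln 2$.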

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

2026-06-02T17:06:57Z

Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.
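
A per-request greedy that scans the decode set once and consults a network cost oracle might look like the sketch below. It is only an illustration of the idea, not NetKV itself: the cost model (queueing delay plus estimated transfer time), the `DecodeInstance` fields, the oracle signature, the assumption that a cached prefix reduces the bytes to transfer, and the bandwidth figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class DecodeInstance:
    name: str
    queue_delay_s: float        # hypothetical compute-load estimate, in seconds
    cached_prefix_tokens: int   # tokens of this request's prefix already cached


# Hypothetical oracle: (prefill name, decode name, bytes) -> estimated seconds.
NetworkCostOracle = Callable[[str, str, int], float]


def select_decode_instance(prefill: str,
                           instances: Sequence[DecodeInstance],
                           context_tokens: int,
                           kv_bytes_per_token: int,
                           oracle: NetworkCostOracle) -> Optional[DecodeInstance]:
    """Single O(|D|) pass: pick the instance with the lowest estimated cost."""
    best, best_cost = None, float("inf")
    for inst in instances:
        # Assumption: only KV for tokens not already cached must be transferred.
        missing = max(context_tokens - inst.cached_prefix_tokens, 0)
        transfer_s = oracle(prefill, inst.name, missing * kv_bytes_per_token)
        cost = inst.queue_delay_s + transfer_s
        if cost < best_cost:
            best, best_cost = inst, cost
    return best


# Illustrative oracle backed by a static bandwidth table (bytes per second).
BANDWIDTH = {("p0", "same-rack"): 50e9, ("p0", "cross-pod"): 5e9}


def table_oracle(src: str, dst: str, num_bytes: int) -> float:
    return num_bytes / BANDWIDTH[(src, dst)]


if __name__ == "__main__":
    pool = [DecodeInstance("same-rack", 0.05, 0),
            DecodeInstance("cross-pod", 0.01, 0)]
    for ctx in (1_000, 100_000):
        pick = select_decode_instance("p0", pool, ctx, 100_000, table_oracle)
        print(ctx, pick.name)
```

With these made-up numbers the lightly loaded cross-pod instance wins for the short context, while the same-rack instance wins for the long one, which mirrors the abstract's point that the network term dominates as context length grows.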

Characterizing Metastable Faults and Failures

2026-06-02T16:15:24Z

Metastable failures are hard to detect, prevent, and mitigate. During a metastable failure, a system exhibits self-sustaining bad behavior even in the absence of adversarial conditions. Prior work focuses on symptoms and has portrayed metastable failures as instances of self-sustaining overload. This characterization leaves the underlying failure causes and dynamics unknown, and does not account for metastable failures that do not manifest as overload. We present the first causal characterization of metastable failures by identifying their origin in metastable faults, i.e., structural destabilizing cycles of interaction among system components that, in isolation, are stabilizing. Metastable failures arise when scheduling decisions let these destabilizing interactions gain the upper hand over the individual components' stabilizing tendencies. We then derive a methodology to predict metastable failures, and to build metastable-fault-tolerant (MFT) systems. We apply our methodology to three case studies, showcasing the generality of our results.

eAID: Elastic Asynchronous Information Dispersal with Post-Dissemination Pruning

2026-06-02T15:26:58Z

Spreading and storing erasure-coded data effectively in distributed systems is challenging in practical settings. The dissemination of erasure-coded information is typically designed to complete only after receiving messages from $(N-F)$ nodes, thereby preparing for the worst-case, but rare, scenario of $F$ failures. In steady state, the remaining $F$ nodes may in fact be healthy, but their resources are not counted. This leads to over-provisioning of storage for encoded data. This paper introduces eAID, a novel elastic information dispersal algorithm that addresses this conundrum through a two-stage approach. First, the core protocol estimates the actual number $f$ of faulty nodes, rather than assuming the worst-case bound $F$. Dissemination completes quickly when messages are received from $(N-f)$ nodes, and more gradually when fewer nodes respond. Second, after initial dissemination completes, eAID continues monitoring for additional responses. As responses arrive from up to $N$ nodes, the system prunes the information stored at responding nodes accordingly. A key technique enabling this seamless elasticity is an agile encoding scheme that varies the number of disseminated fragments while keeping both fragment size and the recovery threshold $(F+1)$ fixed. Not only does this enable varying the number of disseminated fragments on the fly, it also allows nodes to discard encoded fragments autonomously. Crucially, this is achieved without maintaining complex metadata, without requiring nodes to reconstruct or re-encode information, and without global coordination for storage decisions. We demonstrate the practicality of eAID by integrating it with a replicated key-value store, and evaluating it in network environments with unpredictable latencies. The results show that eAID improves overall performance while significantly reducing long-term storage consumption.

E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments

2026-06-02T15:23:28Z

Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and optimal resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, which does not hold in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource-limited settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role, PREFILL or DECODER, based on its efficiency in handling input and output tokens. This separation leverages the inherent differences between these two phases of LLM inference. To effectively organize devices, we utilize a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.

CADET: A Modular Platform for Evaluating Distributed Cooperative Autonomy in Connected Autonomous Vehicles

2026-06-02T15:13:13Z

Deep learning models are increasingly central to autonomous vehicle (AV) pipelines, yet their integration has traditionally followed a monolithic design where perception, planning, and control execute on a single onboard computer. This design overlooks the emerging paradigm of cooperative autonomy, where vehicles interact with roadside units (RSUs), edge servers, and cloud-hosted intelligence through vehicle-to-everything (V2X) connectivity. Cooperative perception and control improve safety and efficiency, but also introduce systems-level challenges: network latency, compute heterogeneity, and multi-tenant contention all critically affect real-time decision-making. These challenges are further amplified by the increasing reliance on large foundation models, whose scale necessitates cloud deployment. We present CADET (Cooperative Autonomy through Distributed Experimentation Toolkit), a modular platform for systematic and reproducible evaluation of distributed cooperative autonomy systems under realistic deployment conditions. CADET decouples the AV stack into composable modules that can be flexibly deployed across vehicles, infrastructure, and edge/cloud tiers. The framework integrates state-of-the-art models, incorporates trace-driven network and workload emulation, and provides synchronized model-, system-, and task-level instrumentation. Through V2V and V2I experiments, we show that distributed deployment choices fundamentally shape safety, with V2V intent packets outperforming cloud-based perception and RSU-assisted perception sustaining safety until overloaded by concurrent requests. Although designed for AV pipelines, CADET also supports dataset-driven experimentation, enabling systems and ML researchers to benchmark distributed inference workloads independently of full vehicle simulation. CADET is open source, with code and demo available at https://nesl.github.io/cadet-web.

Fast TetraBFT: Optimizing Latency Where It Matters

2026-06-02T15:03:25Z

Unauthenticated Byzantine consensus protocols achieve optimal failure resilience while relying only on authenticated point-to-point channels, not authenticated messages. They are an attractive building block for blockchains that do not mandate symmetric trust assumptions as well as for future post-quantum settings. We consider unauthenticated Byzantine consensus in partially synchronous networks and focus on optimizing its good-case latency - the worst-case time for correct processes to reach a decision under favorable conditions. A recently proposed ForgetIT protocol achieves an optimal good-case latency of 3 message delays but employs a highly complex design. We show that this complexity is unnecessary. To this end, we present Fast TetraBFT - an unauthenticated Byzantine consensus protocol that achieves optimal good-case latency by augmenting an existing TetraBFT protocol with a simple fast-path wrapper. Our solution lowers the good-case latency of TetraBFT from 5 to 3 message delays while preserving its bounded space requirements and low communication complexity.

Distributed Local Verification using Proofs with(out) Errors

2026-06-02T14:44:52Z

We study local verification of graph properties in distributed networks under the framework of \emph{locally checkable proofs} (LCPs). In an LCP, a prover assigns proof labels to nodes, and a distributed verifier must make all nodes accept if the graph satisfies the property, while at least one node rejects otherwise. Each node bases its decision on a local neighborhood, called its \emph{view distance}. Our focus is twofold. First, we study cycle existence, i.e., whether a graph contains a cycle (as opposed to cycle-freeness). We show that cycle existence admits verification with only $3$ proof labels and view distance $1$, and establish a matching lower bound. More importantly, inspired by direction-encoding techniques based on BFS distances, we introduce a novel gadget that encodes direction using only $2$ labels and view distance $3$ through repeated occurrences of the string $001101$. Although developed for cycle existence, this gadget may be useful for other verification tasks. Second, we introduce an \emph{erroneous proof} model in which an adversary may corrupt proof labels of at most $i$ nodes within the $(2i+1)$-hop neighborhood of each node. We present an algorithmic framework, called \textbf{\texttt{refix}}, that transforms an error-free verifier into one that tolerates such errors at the cost of a view distance of $2i+1$. We demonstrate the framework on cycle existence, cycle-freeness, and bipartiteness, and establish lower bounds relating the number of errors to the required view distance. Finally, we show that our $2$-label, view-distance-$3$ verifier for cycle existence admits a $3$-round implementation in the \textsc{CONGEST} model, providing a first step toward implementing LCPs under communication constraints.

Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

2026-06-02T14:16:41Z

Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL) has emerged as a powerful method for controlling AVs, but scaling to fleets (essential for multi-target tracking or rapidly moving targets) is challenging. Multi-Agent RL (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo's LRAUV provide up to 100x faster-than-real-time single-robot simulations, they offer little speedup in multi-vehicle scenarios, making MARL training impractical. Yet, high-fidelity simulation is crucial to test complex policies and close the sim-to-real gap. To address these limitations, we develop a GPU-accelerated environment that achieves up to 30,000x speedup over Gazebo while preserving its dynamics. This enables fast, end-to-end GPU training and seamless transfer to Gazebo for evaluation. We also introduce a Transformer-based architecture (TransfMAPPO) that learns policies invariant to fleet size and number of targets, enabling curriculum learning to train larger fleets on increasingly complex scenarios. After large-scale GPU training, we perform extensive evaluations in Gazebo, showing our method maintains tracking errors below 5m even with multiple fast-moving targets.

Deterministic Distance Approximation in MPC via Improved Hitting Sets

2026-06-02T13:59:26Z

In this paper, we provide the first deterministic algorithms with sublogarithmic round complexity for spanners and approximate shortest paths in various MPC models. Moreover, we significantly improve upon the state of the art in the deterministic Congested Clique. In particular, we obtain the following four results on undirected graphs: 1. In both linear MPC and Congested Clique, we obtain an $O(k)$-stretch spanner of a weighted graph of size $O(n^{1+1/k})$ in $O(1)$ rounds, for some parameter $k\ge 0$. For $k=O(\log{n})$, this leads to an $O(\log n)$ approximation of APSP in constant rounds in both models. 2. In sublinear MPC, we obtain an $O(k^{1+\varepsilon})$-stretch spanner of a weighted graph of size $O(n^{1+1/k})$ in $O(\log k)$ rounds, for any fixed constant $\varepsilon>0$. 3. In Congested Clique, we obtain $O(1)$-approximate APSP for weighted graphs in $O(\log \log \log n)$ rounds. 4. In near-linear MPC, we obtain $(1+\varepsilon)$-approximate single-source shortest paths and $O(1)$-approximate all-pairs shortest paths for unweighted graphs in $\textsf{poly}\log \log n$ rounds. Our algorithm only requires a single near-linear memory machine, where the rest can have sublinear memory. Our deterministic algorithms obtain similar guarantees to the state-of-the-art randomized algorithms without incurring additional factors in the round complexity. To obtain these results, we inspect the randomized algorithms and isolate a randomized sampling routine. Then we derandomize these sampling routines by using a deterministic hitting set. To this end, we develop a versatile deterministic hitting set algorithm, which we hope will have further derandomization applications.

Stability of local tip pool sizes

2026-06-02T13:47:32Z

In directed acyclic graph (DAG)-based distributed ledgers, unreferenced blocks (tips) form the backlog of a distributed queueing system. Each new block creates one tip and attempts to remove up to $k$ existing tips by referencing them. With heterogeneous propagation delays, these service decisions are made from delayed local information, so nodes may disagree on the backlog and some reference attempts are wasted. We study a continuous-time Poisson model with bounded heterogeneous delays and uniform tip selection. We prove that the embedded tip-configuration chain is irreducible, aperiodic, and positive Harris recurrent, and hence admits a unique stationary regime. The observer and local tip-pool sizes have stationary exponential moments, converge to their stationary limits, and satisfy almost-sure ergodic averages. We also derive a Little-type identity relating the stationary mean observer tip count to the mean time until a typical block is first referenced. Simulations are included as qualitative illustrations of the effects of delay variability and issuance heterogeneity.

SIGMA: A Versatile Streaming Graph Partitioner for Vertex- and Edge-Balanced Distributed GNN Training

2026-06-02T11:39:43Z

Distributed Graph Neural Network (GNN) training depends critically on how the underlying graph is partitioned across compute resources. Existing graph partitioners focus either on vertex partitioning or edge partitioning and typically optimize only a single communication objective (edge cut or vertex cut) under a single balance constraint (vertex balance or edge balance). We present SIGMA (Streaming Integrated Graph Partitioning with Multi-objective Awareness), a versatile streaming graph partitioner that supports both vertex and edge partitioning within a unified multi-objective, multi-constraint framework. Depending on the target distributed GNN system, SIGMA can be configured for edge-cut-oriented vertex partitioning or vertex-cut-oriented edge partitioning while simultaneously accounting for both vertex and edge balancing. A clustering-based preprocessing stage incorporates global graph structure to improve partition quality while preserving the efficiency and scalability advantages of streaming partitioning. We evaluate SIGMA on six benchmark graphs spanning diverse domains and scales using two distributed GNN training systems: Dist-GNN (edge-partitioned) and DistDGL (vertex-partitioned). Across both settings, SIGMA consistently achieves strong performance, showing its ability to navigate complex trade-offs between partition quality, training efficiency, and memory consumption, frequently outperforming streaming baselines while remaining competitive with high-quality in-memory partitioners such as METIS, KaHIP and HEP. These results demonstrate that a unified streaming partitioner can effectively address the communication, compute, and memory challenges of distributed GNN training across fundamentally different system architectures.