When Agents Control Robots: A Zero Trust Policy Model for Agentic Cyber-Physical Systems

2026-05-25T10:00:34Z

Multi-agent systems powered by large foundation models (LFMs) are increasingly deployed to control industrial robots through natural language, creating deployments in which security failures produce physical consequences. We analyse this threat landscape through Cobot-Claw, a deployed four-agent system for UR3e robotic arm control, and identify five attack classes specific to agentic cyber-physical systems. We propose ZTPM, a Zero Trust Policy Model comprising 25 typed primitives across five enforcement domains with Physical Impact Tiers as a runtime policy dimension. An empirical evaluation across 60 execution traces on two LFM backends provides initial evidence that actuation parameter selection is model-dependent and non-deterministic, motivating the need for policy-level enforcement at the physical actuation boundary.

DisagFusion: Asynchronous Pipeline Parallelism and Elastic Scheduling for Disaggregated Diffusion Serving

2026-05-25T08:07:47Z

Diffusion-based generation is increasingly powering production content pipelines; however, deploying these models at scale remains a significant challenge. Model weights frequently exceed the memory capacity of commodity GPUs, while the encoder, diffusion transformer (DiT), and decoder stages exhibit highly imbalanced computational and memory footprints. A natural remedy is disaggregated serving-running stages as separate services on heterogeneous GPUs-yet this introduces new bottlenecks, including stage handoff overheads and fast-changing workloads that make cross-stage provisioning and scheduling brittle. This paper presents DisagFusion, enabling asynchronous pipeline parallelism and elastic scheduling for disaggregated diffusion serving. First, DisagFusion introduces asynchronous pipeline parallelism that overlaps computation and stage-to-stage communication to reduce pipeline bubbles and mitigate network jitter. Second, DisagFusion employs a hybrid instance scheduling strategy that combines lightweight performance prediction with runtime feedback to continuously rebalance instance ratio across stages under workload shifts. We implement DisagFusion and evaluate it with modern diffusion models. Compared to a monolithic baseline, DisagFusion improves throughput by 3.4x-20.5x and reduces end-to-end latency by 18.5x, while enabling flexible, cost-efficient deployment across heterogeneous GPUs.

PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding

2026-05-25T08:06:41Z

Speculative decoding can significantly accelerate LLM inference, especially given that its cloud-edge collaborative deployment offers cloud workload offloading, offline robustness, and privacy enhancement. However, existing collaborative inference frameworks with speculative decoding are constrained by (i) sequential token generation and communication with low resource utilization, and (ii) inflexible cloud non-autoregressive verification (NAV) triggering that induces premature verification or costly rollbacks. In this paper, we propose PipeSD, an efficient cloud-edge collaborative pipeline inference framework with speculative decoding. PipeSD overlaps token generation and communication by a token-batch pipeline scheduling mechanism optimized by dynamic programming, and improves verification flexibility through a dual-threshold NAV triggering mechanism with a lightweight Bayesian optimization autotuner. We implement PipeSD using llama-cpp-python, PyTorch, and FastAPI, and evaluate it on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios. Results show that PipeSD consistently outperforms state-of-the-art baselines, achieving 1.16x-2.16x speedup and reducing energy consumption by 14.3%-25.3%.

Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission

2026-05-25T03:16:45Z

To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between SLM's uncertainty and LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206$\times$ higher token throughput by skipping 74.8% transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.

Bandwidth-Aware and Cost-Efficient Pipeline Parallel Scheduling in Geo-Distributed LLM Training

2026-05-25T02:59:12Z

The rapid evolution of large language models (LLMs) has made geographically distributed training necessary due to GPU scarcity within a single cloud region. In such cross-region settings, Pipeline Parallelism (PP) is communication-efficient, yet scheduling PP remains challenging under heterogeneous inter-region bandwidth and regional electricity prices. Existing schedulers are either delay-first, incurring high electricity cost, or cost-first, relying on rigid resource allocation that prolongs Job Completion Time (JCT). They are also ineffective at optimizing execution order in multi-tenant environments, where long-running and bandwidth-intensive jobs can cause head-of-line (HoL) blocking and degrade overall performance. To this end, we propose BACE-Pipe, a bandwidth-aware and cost-efficient pipeline scheduling framework for LLM training across geo-distributed clusters. BACE-Pipe first introduces a dynamic job prioritization mechanism that optimizes execution order by jointly considering job characteristics (e.g., computation time) and real-time network utilization. It then employs a bandwidth-aware pathfinder to identify feasible cross-region pipeline paths that satisfy communication constraints, thereby preventing communication from stalling the pipeline. Among all feasible paths, a cost-minimizing allocator determines the optimal GPU placement strategy by preferentially assigning resources to regions with lower electricity prices. Consequently, BACE-Pipe mitigates HoL blocking, improves resource utilization, and simultaneously reduces both JCT and total electricity cost. Extensive simulations show that BACE-Pipe reduces average JCT by 27.9%--64.7% and total electricity cost by 12.6%--30.6% compared with state-of-the-art baselines.

An Empirical Evaluation of Quantum-Inspired QUBO Methods for Heterogeneous HPC Workflow Mapping and Scheduling

2026-05-25T02:09:49Z

Heterogeneous HPC workflow scheduling under multiple hard constraints poses a challenging combinatorial optimization problem. Classical exact solvers guarantee optimality but face scalability limits, motivating interest in quantum-inspired Quadratic Unconstrained Binary Optimization (QUBO) as an alternative optimization paradigm. This work presents a systematic empirical evaluation of QUBO-based scheduling methods against classical baselines including MILP, CP-SAT, GA, and HEFT. We evaluate three QUBO variants, single-run simulated annealing, multi-attempt annealing, and a layered QAOA-inspired schedule, with hybrid enhancement strategies on validation workflows (3-4 tasks) and synthetic scaling instances (5-20 tasks). All solvers are assessed through a unified pipeline tracking feasibility, makespan, and resource utilization under progressive constraint activation and controlled penalty sweeps. All approaches recover the expected optimal makespan on validation instances, confirming formulation correctness. However, feasibility degradation emerges for specific QUBO variants as constraint interactions intensify, particularly when communication costs are introduced. Penalty analysis reveals a sharp feasibility threshold for QUBO-SA, where insufficient penalties consistently fail and moderate-to-strong penalties restore feasibility. Scaling experiments show that classical solvers remain robust across all tested sizes, while QUBO-SA loses feasibility beyond 15 tasks and the QAOA-inspired variant beyond 10 tasks. The study provides a clear empirical characterization of the reliability boundaries of quantum-inspired QUBO formulations for HPC scheduling and identifies regimes where classical approaches remain preferable under current solver capabilities.

Multithreaded Fine-Grained Asynchronous BSP for Integer Sorting with LCI and OpenMP

2026-05-25T01:54:03Z

The bulk synchronous parallel (BSP) model struggles with irregular workloads due to rigid global communication. While fine-grained asynchronous BSP (FA-BSP) improves overlap, existing implementations typically rely on a limiting one-process-per-core model. This paper proposes a multithreaded FA-BSP approach combining Lightweight Communication Interface (LCI) and OpenMP to fully exploit multicore architectures. We evaluate this design using the NAS Parallel Benchmark Integer Sort (IS), retaining the original irregular Gaussian distribution to rigorously test load balancing. By replacing synchronous MPI collectives with OpenMP multithreading and LCI's fine-grained, zero-copy active messages, we enable efficient computation-communication overlap. Our evaluation demonstrates that multithreaded FA-BSP significantly outperforms traditional bulk-synchronous MPI implementations, offering a scalable solution for irregular scientific applications.

TIP: A Decentralized Intent-Based Protocol for Declarative IoT Interoperability and Sandboxed Schema Adaptation

2026-05-25T01:28:12Z

Heterogeneous Internet of Things (IoT) systems suffer from fragmentation across hardware architectures, networking stacks, and data serialization formats. Existing standards (such as MQTT, COAP, and DDS) rely on address-bound, imperative routing models that require hardcoded configurations and leave no flexibility for runtime schema translation. This paper presents TIP (The Intent Protocol), a decentralized, declarative network protocol. Instead of addressing specific physical endpoints, nodes submit abstract intents specifying desired capabilities, schemas, and Quality of Service (QoS) constraints. The TIP Engine resolves matching nodes using a hybrid discovery mechanism combining local multicast DNS (mDNS) with Kademlia Distributed Hash Tables (DHT). Selection is optimized via a multi-criteria scoring algorithm incorporating network latency, historical reputation, and contract compliance. Mismatched data representations are reconciled on-the-fly inside isolated WebAssembly (WASM) sandboxes compiled dynamically from TOML specifications. Security is enforced through Ed25519 signatures, X25519 key exchanges, and ChaCha20-Poly 1305 payload encryption. Evaluation of our reference implementation in Rust and C++ shows sub-millisecond translation overhead and robust resilience under industrial conditions.

Intelligent Offloading in Vehicular Edge Computing: A Comprehensive Review of Deep Reinforcement Learning Approaches and Architectures

2026-05-25T01:11:10Z

The increasing complexity of Intelligent Transportation Systems (ITS) has led to significant interest in computational offloading to external infrastructures such as edge servers, vehicular nodes, and UAVs. These dynamic and heterogeneous environments pose challenges for traditional offloading strategies, prompting the exploration of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) as adaptive decision-making frameworks. This survey presents a comprehensive review of recent advances in DRL-based offloading for vehicular edge computing (VEC). We classify and compare existing works based on learning paradigms (e.g., single-agent, multi-agent), system architectures (e.g., centralized, distributed, hierarchical), and optimization objectives (e.g., latency, energy, fairness). Furthermore, we analyze how Markov Decision Process (MDP) formulations are applied and highlight emerging trends in reward design, coordination mechanisms, and scalability. Finally, we identify open challenges and outline future research directions to guide the development of robust and intelligent offloading strategies for next-generation ITS.

Beyond Thread States: Diagnosing Performance Degradation with eBPF and Thread Dynamics

2026-05-24T23:29:24Z

Online Data-Intensive applications face performance degradation from load variability and resource interference. While Thread State Analysis (TSA) based approaches enable identifying constrained subsystems, they lack the granularity to reveal the inter-thread dependencies that propagate degradation. In this paper, we present an application-agnostic performance degradation analysis method that extends TSA by capturing fine-grained thread dynamics. We implemented $16$ eBPF-based metrics across six kernel subsystems, including scheduling, VFS, networking, futex, multiplexing IO, and block IO which enables tracing thread interactions with specific resources like futexes, sockets, and disks. Our method leverages the fact that performance degradation propagates along inter-thread dependencies, and a subset of thread-resource interactions can enable capturing common degradation patterns. To this end, we employ a selective thread tracking algorithm that traces performance issues from entry-point threads to constrained resources. Experimentation with diverse applications under variable workloads and resource contention shows our method successfully diagnoses CPU, disk, lock, and external service contention with minimal overhead, while also revealing internal application constraints.

DECICE: AI-Driven Scheduling and Digital Twin Integration for the Cloud-HPC-Edge Compute Continuum

2026-05-24T23:12:55Z

This paper presents the DECICE project (Device Edge Cloud Intelligent Collaboration framEwork), a Horizon Europe Research and Innovation Action (Grant No. 101092582, December 2022 to November 2025) that developed an open-source framework for intelligent workload scheduling across the cloud-HPC-edge compute continuum. A consortium of 12 partners across 6 European countries organized the work into six work packages covering AI-driven scheduling, digital twin infrastructure, system architecture and integration, monitoring, use case validation, and dissemination. The two core technical contributions are an Integrated AI Scheduler (IAIS) employing RNN-based prediction and formal workflow modeling for constraint-aware workload mapping, and a Digital Twin aggregating real-time metrics with carbon intensity and anomaly prediction for energy-aware scheduling. The framework operates within Kubernetes environments, supports unified workflow ingestion from multiple formats, and bridges cloud-native and HPC orchestration through a Slurm integration layer. We present the project vision, the overall architecture, contributions from each work package, quantitative evaluation results, and the open-source release.

NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning

2026-05-24T22:58:50Z

The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior works often rely on heuristic or topology-agnostic search, handling communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices, increasing synchronization, inflating communication, and underutilizing compute-limiting scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. NEST's DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show NEST achieves up to 2.43 times higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure. The source code of NEST is available at: https://github.com/scai-tech/Nest

Optimistic, Signature-Free Reliable Broadcast and Its Applications

2026-05-24T21:46:39Z

Reliable broadcast (RBC) is a key primitive in fault-tolerant distributed systems, and improving its efficiency can benefit a wide range of applications. This work focuses on signature-free RBC protocols, which are particularly attractive due to their computational efficiency. Existing protocols in this setting incur an optimal 3 steps to reach a decision while tolerating up to $f < n/3$ Byzantine faults, where $n$ is the number of parties. In this work, we propose an optimistic RBC protocol that maintains the $f < n/3$ fault tolerance but achieves termination in just 2 steps under certain optimistic conditions--when at least $\lceil \frac{n+2f-2}{2} \rceil$ non-broadcaster parties behave honestly. We also prove a matching lower bound on the number of honest parties required for 2-step termination. We show that our latency-reduction technique generalizes beyond RBC and applies to other primitives such as asynchronous verifiable secret sharing (AVSS) and asynchronous verifiable information dispersal (AVID), enabling them to complete in 2 steps under similar optimistic conditions. To highlight the practical impact of our RBC protocol, we integrate it into Sailfish++, a new signature-free, post-quantum secure DAG-based Byzantine fault-tolerant (BFT) consensus protocol. Under optimistic conditions, this protocol achieves a commit latency of 3 steps--matching the performance of the best signature-based protocols. Our experimental evaluation shows that our protocol significantly outperforms existing post-quantum secure and signature-based protocols, even on machines with limited CPU resources. In contrast, signature-based protocols require high CPU capacity to achieve comparable performance. We have open-sourced our Rust implementation of Sailfish++ to facilitate reproducible results.

Kavier: Exploring Performance, Sustainability, and Efficiency of LLM Ecosystems under Inference through Cache-Aware Discrete-Event Simulation

2026-05-24T20:14:17Z

Large Language Models (LLMs) are widely used by our increasingly digitalized society, but raise sustainability, performance, and financial concerns, especially as inference workloads grow. To improve the design and operation of LLM ecosystems, we envision simulators and simulation-based digital twins becoming primary decision-making tools. LLM ecosystems leverage many heterogeneous components, making simulation a non-trivial, yet critical operation. The simulation challenge is exacerbated by the absence of a comprehensive reference architecture of LLM ecosystems; the lack of such a conceptual model can be costly and could misguide the designers and engineers. Without a reference architecture, even the most experienced stakeholders could tinker in researching, engineering, or maintaining LLM ecosystems. In this work, we bring a three-fold contribution to the scientific community. Firstly, we synthesize, propose, and validate a reference architecture (RA) of LLM ecosystems under inference. Then, adhering to the reference architecture, we design Kavier, the first simulation instrument able to predict the performance, sustainability, and efficiency of LLM ecosystems under inference, through discrete-event and cache-aware simulation, focusing on Key-Value-(KV-)Caching and prompt prefix caching policies. Through experiments with a Kavier prototype and real-world traces, (i) we measure the accuracy of Kavier and its performance in massive-scale simulations, (ii) we compare the performance of different KV-Caching policies, and (iii) we analyze the performance, sustainability, and efficiency of LLM ecosystems under various prefix caching policies. Overall, we show that Kavier enables operators, researchers, and engineers to predict LLM ecosystems in a time, performance, and cost-efficient way.

Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference

2026-05-24T19:15:33Z

Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and serving workloads, we find that multi-GPU performance frequently degrades not because GPUs are saturated, but because CPUs fail to keep the GPUs busy. Under limited CPU allocations, systems exhibit symptoms such as delayed kernel launch, stalled communication, and increased tokenization latency, leading to severe GPU underutilization even when ample GPU resources are available. These bottlenecks persist even in serving stacks that employ process-level separation and modern GPU-side optimizations such as CUDA Graphs. Since the marginal cost of additional CPU cores is small relative to GPU instance pricing, our evaluation indicates that increasing the number of CPU cores can substantially improve performance and stability at minimal additional cost. Under moderate serving load, we observe that CPU-starved configurations frequently time out, while providing adequate CPU resources restores responsiveness and reduces time-to-first-token (TTFT) latency by 1.47--5.15x across configurations, all without requiring additional GPUs.