https://arxiv.org/api/5z4IxYj7WD/A5qzHAsYxIJLh7MQ 2026-06-21T07:41:45Z 1379 60 15 http://arxiv.org/abs/2605.09735v1 KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving 2026-05-10T20:10:26Z Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV mappings into a small number of large transfer groups before a fixed-shape attention kernel consumes them. Optional bounded far-history summaries can be enabled under the same interface, but the core design does not depend on them. On a 2-GPU NVIDIA A100 node, KV-RM improves mixed-length decoding throughput and tail latency relative to a static-graph baseline, reduces reserved KV memory across workload families, and removes severe burst-time latency spikes under production-trace replay. These results suggest that KV-cache movement, rather than kernel shape, can be an effective boundary for recovering runtime flexibility in static-graph LLM serving. 2026-05-10T20:10:26Z 14 pages, 7 figures, 7 tables Zhiqing Zhong Zhijing Ye Jian Zhang Weijian Zheng Bolun Sun Xiaodong Yu http://arxiv.org/abs/2602.11476v2 Bounded Local Generator Classes for Deterministic State Evolution 2026-05-07T23:12:01Z We define a bounded local generator class (BLGC) for deterministic state evolution on graph-indexed systems. The construction consists of finite-range generators operating on bounded local state under deterministic composition. Each update acts only on a bounded-radius neighborhood and applies a bounded local transformation with projection onto a compact state domain. Under the BLGC constraints, per-step operator work remains independent of total system size M. Specifically, incremental update cost satisfies $W_t = O(1)$ with respect to $M \to \infty$ for fixed interaction radius $r$. The framework admits a Hilbert-space embedding in $\ell^2(V)\otimes \mathbb{R}^d$ and yields bounded operators under composition on admissible subspaces. The result establishes a structural decoupling between global state capacity and incremental computational work. The claims apply specifically to the bounded local generator class defined in this paper. 2026-02-12T01:24:27Z 42 pages, 1 figure. Introduces bounded local generator classes BLGC for deterministic locality-preserving state evolution with dimension-work decoupling under bounded interaction radius R. Jay Martin http://arxiv.org/abs/2605.07008v1 Pomegranate: A Lightweight Compartmentalization Architecture using Virtualization Extensions 2026-05-07T22:44:40Z The monolithic nature of widely used commodity operating systems means that vulnerabilities in one software component potentially compromise the entire kernel. Formally verifying these systems, or redesigning them altogether as microkernels, according to the principle of least privilege, requires significant effort. Researchers have therefore considered compartmentalization techniques that minimize or totally avoid changes to existing systems. However, current approaches use techniques such as Memory Protection Keys (MPKs), necessitating extensive code analysis to ensure security, or use virtualization by instrumenting the kernel with calls to the glue code that switches compartments. In this work, we present Pomegranate, a framework that uses hardware-assisted virtualization to securely compartmentalize an existing system with minimal to no modifications to its source code. Allowed interactions between compartments are defined using an access-control policy and strictly enforced using Extended Page Tables. Using special sentry functions, Pomegranate is able to check all cross-compartment transitions without trapping into the hypervisor. We demonstrate the efficacy of Pomegranate on a compartmentalized Linux network stack using the igc NIC driver. Experiments show the overheads of our approach are negligible at MTU-sized packets when compartment boundaries are carefully established to avoid excessive inter-compartment communication. 2026-05-07T22:44:40Z Shriram Raja Zhiyuan Ruan Richard West http://arxiv.org/abs/2512.01594v5 CAEC: Confidential, Attestable, and Efficient Inter-CVM Communication with Arm CCA 2026-05-06T14:26:16Z Confidential Virtual Machines (CVMs) are increasingly adopted to protect sensitive workloads from privileged adversaries such as the hypervisor. While they provide strong isolation guarantees, existing CVM architectures lack first-class mechanisms for inter-CVM data sharing due to their disjoint memory model, making inter-CVM data exchange a performance bottleneck in compartmentalized or collaborative multi-CVM systems. Under this model, a CVM's accessible memory is either shared with the hypervisor or protected from both the hypervisor and all other CVMs. This design simplifies reasoning about memory ownership; however, it fundamentally precludes plaintext data sharing between CVMs because all inter-CVM communication must pass through hypervisor-accessible memory, requiring costly encryption and decryption to preserve confidentiality and integrity. In this paper, we introduce CAEC, a system that enables protected memory sharing between CVMs. CAEC builds on Arm Confidential Compute Architecture (CCA) and extends its firmware to support Confidential Shared Memory (CSM), a memory region securely shared between multiple CVMs while remaining inaccessible to the hypervisor and all non-participating CVMs. CAEC's design is fully compatible with CCA hardware and introduces only a modest increase (6%) in CCA firmware code size. CAEC delivers substantial performance benefits across a range of workloads. For instance, inter-CVM communication over CAEC achieves up to 209x reduction in CPU cycles compared to encryption-based mechanisms over hypervisor-accessible shared memory. By combining high performance, strong isolation guarantees, and attestable sharing semantics, CAEC provides a practical and scalable foundation for the next generation of trusted multi-CVM services across both edge and cloud environments. 2025-12-01T12:10:43Z Sina Abdollahi Amir Al Sadi David Kotz Marios Kogias Hamed Haddadi http://arxiv.org/abs/2604.18231v2 AgenTEE: Confidential LLM Agent Execution on Edge Devices 2026-05-06T14:26:07Z Large Language Model (LLM) agents provide powerful automation capabilities, but they also create a substantially broader attack surface than traditional applications due to their tight integration with non-deterministic models and third-party services. While current deployments primarily rely on cloud-hosted services, emerging designs increasingly execute agents directly on edge devices to reduce latency and enhance user privacy. However, securely hosting such complex agent pipelines on edge devices remains challenging. These deployments must protect proprietary assets (e.g., system prompts and model weights) and sensitive runtime state on heterogeneous platforms that are vulnerable to software attacks and potentially controlled by malicious users. To address these challenges, we present AgenTEE, a system for deploying confidential agent pipelines on edge devices. AgenTEE places the agent runtime, inference engine, and third-party applications into independently attested confidential virtual machines (cVMs) and mediates their interaction through explicit, verifiable communication channels. Built on Arm Confidential Compute Architecture (CCA), a recent extension to Arm platforms, AgenTEE enforces strong system-level isolation of sensitive assets and runtime state. Our evaluation shows that such multi-cVMs system is practical, achieving near-native performance with less than 5.15% runtime overhead compared to commodity OS multi-process deployments. 2026-04-20T13:13:31Z Sina Abdollahi Mohammad M Maheri Javad Forough Amir Al Sadi Josh Millar David Kotz Marios Kogias Hamed Haddadi 10.1145/3805621.3807660 http://arxiv.org/abs/2605.04837v1 Shedding Light onto Safety Integrity Level and Basic Software Constraints in a Real-World Automotive Application: Case Study with Driverator Framework 2026-05-06T12:34:32Z Automotive electronic control units (ECUs) are intricate systems with hundreds of individual functions, numerous software components, and multiple interdependent tasks. A prevalent structural pattern in these systems are so-called cause-effect chains. While significant research efforts have been dedicated to the temporal analysis and optimization of these chains, particularly minimizing data age and function response times, other crucial non-functional properties remain relatively underexplored. In particular, the safety integrity level (SIL) classification substantially influences the system design by determining task colocation strategies. Improper sharing of functions or interweaving tasks with different safety levels can compromise the integrity of critical functions. Additionally, AUTOSAR basic software (BSW) (e.g. OS, runtime environment, communication stacks, or diagnostics) introduces complexity that varies based on task characteristics and SIL categories. Furthermore, memory requirements present another critical challenge, given the diversity of memory architectures and SIL-specific dependencies that strongly constrain task allocations. This paper thoroughly characterizes a real-world automotive application, describing an automotive application based on SIL constraints, the impact of basic software, and memory requirements. In this context, the Driverator configuration framework is introduced for scalable system analysis. 2026-05-06T12:34:32Z 8 pages, 2 figures, 6 tables. Preprint. Driverator framework: https://doi.org/10.17877/TUDODATA-2026-MOR12ARE Tobias Denzinger CARIAD SE Matthias Becker KTH Royal Institute of Technology Peter Ulbrich TU Dortmund University http://arxiv.org/abs/2605.04226v1 ipc_shared_ptr: A Publish/Subscribe-Aware Smart Pointer for Cross-Process Object Lifetime Management 2026-05-05T19:09:18Z True zero-copy Inter-Process Communication (IPC) in publish/subscribe (pub/sub) middleware such as Robot Operating System 2 (ROS 2) requires subscribers to reference message objects in publisher-owned shared memory. Objects must not be reclaimed while referenced, yet must eventually be reclaimed, with correct handling of crash recovery and Transient Local QoS retention requirements. We propose ipc_shared_ptr, a pub/sub-aware smart pointer for cross-process message lifetime management. ipc_shared_ptr exploits pub/sub structural properties to specialize Birrell's reference listing, limiting global metadata updates to per-subscriber 0<->1 transitions and achieving an order-of-magnitude reduction in global communication over general-purpose distributed reference counting. We analyze the key metadata management tradeoff: scalability versus implementation simplicity. Owner-driven reclaim offers greater scalability, but concurrent membership changes and reclamation decisions produce races that widen the correctness-verification state space. Single-writer achieves structural atomicity, eliminating this complexity at the cost of a centralized bottleneck. iceoryx2 (owner-driven reclaim) and Agnocast -- a true zero-copy ROS 2 IPC middleware sharing the publisher's heap with subscribers and adopting ipc_shared_ptr with single-writer -- embody each architecture. Comparative evaluation at the scale of Autoware -- the largest open-source ROS 2 application -- confirms that single-writer achieves sufficient scalability: at 200 topics, two subscribers per topic and 100 Hz, Agnocast's E2E p99.9 is 2.9x lower than iceoryx2's, justifying implementation simplicity over owner-driven reclaim. 2026-05-05T19:09:18Z Accepted for publication in the 2026 IEEE 29th International Symposium on Real-Time Distributed Computing (ISORC); 10 pages, 8 figures Takahiro Ishikawa-Aso Atsushi Yano Koichi Imai Takuya Azumi Shinpei Kato http://arxiv.org/abs/2605.03375v1 Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving 2026-05-05T05:33:11Z LLM serving relies on prefix caching to improve inference performance. As growing contexts push key-value (KV) cache footprint far beyond GPU HBM and CPU DRAM capacity, KV cache is increasingly offloaded to NVMe SSDs. Unfortunately, restoring KV cache from SSDs suffers from poor I/O performance and incurs significant GPU stalls. This is primarily because the fragmented GPU memory layout results in a massive number of tiny random I/Os, rendering the low-parallelism CPU a severe bottleneck even with GPU Direct Storage (GDS), which still relies on CPU intervention to initiate each I/O and thus remains CPU-centric. This paper presents Tutti, an efficient SSD-backed KV caching solution that eliminates CPU intervention from the critical data and I/O control paths between HBM and SSDs. At the core of Tutti is a GPU-centric KV cache object store, in which the CPU is only responsible for asynchronously loading I/O kernels once per layer to the GPU. Tutti saturates NVMe SSD bandwidth and reduces GPU stalls to near zero through the following designs: (i) we provide a GPU-native object abstraction that enables bulk KV cache transfers and management; (ii) we re-architect the GPU storage stack by introducing GPU io_uring to support asynchronous GPU direct object I/O; and (iii) we propose slack-aware I/O scheduling to avoid GPU resource contention. We have implemented Tutti and integrated it to vLLM. Extensive evaluation shows that compared to the state-of-the-art GDS-enabled, SSD-backed LMCache, Tutti reduces TTFT by 78.3% under strict SLO constraints and improves the achievable request rate by 2x. The serving cost is reduced by 27%. Tutti achieves nearly the same inference performance as DRAM-backed LMCache, while providing almost infinite capacity. 2026-05-05T05:33:11Z Shi Qiu Yifan Hu Xintao Wang Wenhao Zhu Jianqin Yan Hao Chen Kaiqiang Xu Kai Chen Yiming Zhang http://arxiv.org/abs/2605.02886v1 CityOS: Privacy Architecture for Urban Sensing 2026-05-04T17:54:45Z Cities are rapidly deploying sensing infrastructure -- cameras, environmental sensors, and connected kiosks -- that continuously observe public spaces, yet they lack a system architecture governing how applications access, aggregate, and retain this data, creating privacy risks and preventing consistent policy enforcement. We present CityOS, an operating system for urban sensing that mediates application access to sensor data through a three-tier API inspired by structured, privacy-conscious web interfaces. The tiers expand the spatial scope of data access while imposing progressively stronger privacy constraints: On-Scene supports real-time sensing with raw data confined to the local context; Single-Locality Aggregation enables differentially private longitudinal statistics at a fixed location; and Cross-Locality Aggregation supports citywide analytics via aggregation across locations, with user devices enforcing per-user privacy budgets. CityOS runs as an edge runtime that executes untrusted applications in ephemeral containers, enforcing these policies and providing transparency via broadcasts of differential privacy loss. We implement CityOS and applications across all tiers -- including pedestrian safety alerts, real-time and forecast parking availability, traffic dashboards, and subway trajectory measurement -- and show that it supports practical streetscape applications while enforcing strong privacy. 2026-05-04T17:54:45Z Giorgio Cavicchioli Mark Chen Navid Salami Pargoo Shuren Xia Xiaotian Zhou Roxana Geambasu Jason Nieh Jorge Ortiz http://arxiv.org/abs/2603.11438v2 NCCLbpf: Verified, Composable Policy Execution for GPU Collective Communication 2026-05-04T14:55:04Z NCCL is the de facto standard for collective GPU communication in large-scale distributed training, relying heavily on plugins to customize runtime behavior. However, these plugins execute as unverified native code within NCCL's address space, risking job crashes, silent state corruption, and downtime from restarts during policy updates. Inspired by kernel extensibility models, we introduce NCCLbpf, a verified, high-performance extension framework embedding a userspace eBPF runtime directly into NCCL's existing plugin interfaces, without modifying NCCL itself. NCCLbpf offers load-time static verification to prevent unsafe plugin execution, structured cross-plugin maps enabling composable policies and closed-loop adaptation, and atomic policy hot-reloads eliminating downtime previously required for policy updates. Evaluations on 8x NVIDIA B300 GPUs connected via NVLink demonstrate that NCCLbpf imposes just 80-130 ns overhead per tuner decision (less than 0.03% of collective latency), prevents all tested unsafe plugin behaviors at load-time, and enables a message-size-aware eBPF policy that improves AllReduce throughput by up to 27% over NCCL's default in the 4-128 MiB range. 2026-03-12T02:03:55Z Yusheng Zheng http://arxiv.org/abs/2605.01614v1 CvxCluster: Solving Large, Complex, Granular Resource Allocation Problems 100-1000x Faster 2026-05-02T21:28:24Z Cluster resource allocation is a multidimensional search problem that finds the best allocation of tasks to servers. Because the search space grows exponentially, modern approaches frame it as a mixed integer program (MIP) or a complex set of search heuristics. This paper proposes using a different approach: convex optimization, which has extremely fast solution methods. The research challenge is devising how to transform cluster resource allocation into a convex problem that generates good placements. We describe CvxCluster, which allocates cluster resources with a two-stage algorithm. The first stage solves a convex relaxation of the placement problem to yield a principled set of per-machine resource prices. The second stage uses these prices to drive a lightweight greedy procedure to place tasks. Experimental results with Azure traces find that CvxCluster scales to 100,480 servers under proportional workload growth and sustains arrival rates up to 500,000x the baseline trace. CvxCluster runs 100 to 2,500x faster than a state-of-the-art MIP solver while remaining within 3% of the optimal objective. CvxCluster can support complex constraints such as job anti-affinity, machine types, and GPU servers. The key insight behind CvxCluster is that reformulating placement as a continuous rather than discrete problem enables much faster methods that find solutions just as good or better than prior heuristics. 2026-05-02T21:28:24Z 13 pages, 5 figures, 2 tables. Submitted to SOSP 2026 Obi Nnorom Stephen Boyd Philip Levis http://arxiv.org/abs/2411.10612v4 Contextualizing Security and Privacy of Software-Defined Vehicles: A Literature Review and Industry Perspectives 2026-05-02T15:44:51Z The growing reliance on software in road vehicles has led to the emergence of Software-Defined Vehicles (SDV). This work analyzes SDV security and privacy through a systematic literature review complemented by an industry questionnaire across the automotive supply chain. The analysis is structured as four research questions and results in a security framework serving as a roadmap for SDV protection. The findings emphasize addressing mixed-criticality architectural challenges, deploying layered security mechanisms, and integrating privacy-preserving techniques. The results highlight the need to harmonize in-vehicle and cloud-based defenses to strengthen cybersecurity and V2X resilience in Intelligent Transportation Systems (ITS). 2024-11-15T22:30:07Z ACM Computing Surveys, 2026 Marco De Vincenzi Mert D. Pesé Chiara Bodei Ilaria Matteucci Richard R. Brooks Monowar Hasan Andrea Saracino Mohammad Hamad Sebastian Steinhorst 10.1145/3814955 http://arxiv.org/abs/2605.01352v1 VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU 2026-05-02T09:54:20Z GPU-based simulation environments for embodied AI interleave physics simulation (CUDA) and photorealistic rendering (Vulkan) on a single device. We observe that two foundational scenarios -- simulation data generation and RL training -- can be naturally adapted to execute their simulation and rendering phases concurrently, presenting a significant opportunity to improve GPU utilization through spatial multiplexing. However, a fundamental obstacle we term execution isolation prevents this: CUDA and Vulkan create separate GPU contexts whose channels are bound to different scheduling groups, confining compute and graphics to mutually exclusive time slices. Existing spatial-sharing techniques are limited to the CUDA ecosystem, while temporal-sharing approaches underutilize available resources. This paper presents VUDA, a system that breaks execution isolation to enable spatial parallelism between CUDA compute and Vulkan graphics workloads. VUDA is built on two key observations: although CUDA and Vulkan expose different programming abstractions, their execution paths converge to a common channel primitive at the driver and hardware level; meanwhile, their virtual-address spaces are inherently disjoint, making safe page-table merging feasible without remapping. VUDA exposes a thin API for developers to annotate co-schedulable CUDA streams, and realizes spatial sharing through channel redirection into Vulkan's scheduling domain and page-table grafting to unify address spaces, eliminating all data copying on the critical path. Experiments on representative embodied-AI workloads show that VUDA delivers up to 85% higher throughput than temporal-sharing baselines, while improving GPU utilization and reducing end-to-end latency. 2026-05-02T09:54:20Z Bin Xu Pengfei Hu Wenxin Zheng Jinyu Gu Haibo Chen http://arxiv.org/abs/2605.00528v1 SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters 2026-05-01T09:05:28Z AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving. 2026-05-01T09:05:28Z 15 pages, 3 figures, 11 tables. Accepted to HPDC '26 (35th International Symposium on High-Performance Parallel and Distributed Computing), July 13-16, 2026, Cleveland, OH, USA Dongxin Guo Jikun Wu Siu Ming Yiu 10.1145/3806645.3807598 http://arxiv.org/abs/2604.28138v1 Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes 2026-04-30T17:20:19Z Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout branching, and safe rollback-yet existing approaches fall into two extremes: application-level recovery preserves chat history but misses OS-side effects, while full per-turn checkpointing is correct but too expensive under dense co-location. The root cause is an agent-OS semantic gap: agent frameworks see tool calls but not their OS effects; the OS sees state changes but lacks turn-level context to judge recovery relevance. This gap hides massive sparsity: over 75% of agent turns produce no recovery-relevant state, so most checkpoints are unnecessary. Crab (Checkpoint-and-Restore for Agent SandBoxes) is a transparent host-side runtime that bridges this gap without modifying agents or C/R backends. An eBPF-based inspector classifies each turn's OS-visible effects to decide checkpoint granularity; a coordinator aligns checkpoints with turn boundaries and overlaps C/R with LLM wait time; and a host-scoped engine schedules checkpoint traffic across co-located sandboxes. On shell-intensive and code-repair workloads, Crab raises recovery correctness from 8% (chat-only) to 100%, cuts checkpoint traffic by up to 87%, and stays within 1.9% of fault-free execution time. 2026-04-30T17:20:19Z 15 pages, 21 figures Tianyuan Wu Chaokun Chang Lunxi Cao Wei Gao Wei Wang