http://arxiv.org/abs/2603.21466v1
GateANN: I/O-Efficient Filtered Vector Search on SSDs
Updated: 2026-03-23T01:03:51Z
We present GateANN, an I/O-efficient SSD-based graph ANNS system that supports filtered vector search on an unmodified graph index. Existing SSD-based systems either waste I/O by post-filtering, or require expensive filter-aware index rebuilds. GateANN avoids both by decoupling graph traversal from vector retrieval. Our key insight is that traversing a node requires only its neighbor list and an approximate distance, neither of which needs the full-precision vector on SSD. Based on this, GateANN introduces graph tunneling. It checks each node's filter predicate in memory before issuing I/O and routes through non-matching nodes entirely in memory, preserving graph connectivity without any SSD read for non-matching nodes. Our experimental results show that it reduces SSD reads by up to 10x and improves throughput by up to 7.6x.
Published: 2026-03-23T01:03:51Z
Authors: Nakyung Lee, Soobin Cho, Jiwoong Park, Gyuyeong Kim

http://arxiv.org/abs/2512.01594v4
Confidential, Attestable, and Efficient Inter-CVM Communication with Arm CCA
Updated: 2026-03-21T13:31:25Z
Confidential Virtual Machines (CVMs) are increasingly adopted to protect sensitive workloads from privileged adversaries such as the hypervisor. While they provide strong isolation guarantees, existing CVM architectures lack first-class mechanisms for inter-CVM data sharing due to their disjoint memory model, making inter-CVM data exchange a performance bottleneck in compartmentalized or collaborative multi-CVM systems. Under this model, a CVM's accessible memory is either shared with the hypervisor or protected from both the hypervisor and all other CVMs.
This design simplifies reasoning about memory ownership; however, it fundamentally precludes plaintext data sharing between CVMs because all inter-CVM communication must pass through hypervisor-accessible memory, requiring costly encryption and decryption to preserve confidentiality and integrity. In this paper, we introduce CAEC, a system that enables protected memory sharing between CVMs. CAEC builds on Arm Confidential Compute Architecture (CCA) and extends its firmware to support Confidential Shared Memory (CSM), a memory region securely shared between multiple CVMs while remaining inaccessible to the hypervisor and all non-participating CVMs. CAEC's design is fully compatible with CCA hardware and introduces only a modest increase (4%) in CCA firmware code size. CAEC delivers substantial performance benefits across a range of workloads. For instance, inter-CVM communication over CAEC achieves up to 209× reduction in CPU cycles compared to encryption-based mechanisms over hypervisor-accessible shared memory. By combining high performance, strong isolation guarantees, and attestable sharing semantics, CAEC provides a practical and scalable foundation for the next generation of trusted multi-CVM services across both edge and cloud environments.
Published: 2025-12-01T12:10:43Z
Authors: Sina Abdollahi, Amir Al Sadi, David Kotz, Marios Kogias, Hamed Haddadi

http://arxiv.org/abs/2509.09525v2
TrEnv-X: Transparently Share Serverless Execution Environments Across Different Functions and Nodes
Updated: 2026-03-21T12:02:04Z
Serverless computing is renowned for its computation elasticity, yet its full potential is often constrained by the requirement for functions to operate within local and dedicated background environments, resulting in limited memory elasticity. To address this limitation, this paper introduces TrEnv-X, a co-designed integration of the serverless platform with the operating system and CXL/RDMA-based remote memory pools.
TrEnv-X's core innovations are repurposable sandboxes, which can be shared across different functions to decrease the associated creation overhead, and OS-level memory templates, which enable rapid state restoration from CXL/RDMA-based remote memory pools. To further demonstrate TrEnv-X's versatility, we generalize its design from traditional containers to microVM-based agent workloads and introduce new optimizations, including browser sharing and a page cache bypassing mechanism. Our evaluation shows that TrEnv-X achieves up to 7x reduction in P99 latency and 48% memory savings for container-based functions. When applied to LLM agents, it reduces the P99 latency by up to 58% and memory usage by 61% compared to state-of-the-art systems like E2B.
Published: 2025-09-11T15:06:03Z
Comment: Accepted by ACM Transactions on Computer Systems (TOCS)
Authors: Jialiang Huang, Teng Ma, Zheng Liu, Sixing Lin, Kang Chen, Jinlei Jiang, Xia Liao, Yingdi Shan, Yongwei Wu, Ning Zhang, Mengting Lu, Tao Ma, Haifeng Gong, Mingxing Zhang

http://arxiv.org/abs/2603.19971v1
2DIO: A Cache-Accurate Storage Microbenchmark
Updated: 2026-03-20T14:13:54Z
We introduce 2DIO, a microbenchmark creating cache-accurate, stressful I/O traces. While existing tools are limited to generating traces with well-behaved, concave hit ratio curves, 2DIO produces ones with tunable complex cache behaviors, particularly performance cliffs and plateaus.
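To make the term "performance cliff" concrete, here is a minimal sketch of a non-concave hit ratio curve. It is not 2DIO's generator and does not use its parameter triplet; the workload (`loop_trace`) and the helper `lru_hit_ratio` are hypothetical names introduced only for illustration. It replays a cyclic scan through a plain LRU cache, a textbook pattern for which the hit ratio stays at zero until the cache holds the whole loop and then jumps.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, cache_size):
    """Replay `trace` through an LRU cache and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used key
    return hits / len(trace)


# Hypothetical workload: 10 passes over 100 distinct blocks, always in the same order.
loop_trace = list(range(100)) * 10

for size in (50, 99, 100):
    print(size, lru_hit_ratio(loop_trace, size))
# 50 0.0
# 99 0.0
# 100 0.9
```

Growing the cache from 50 to 99 entries buys nothing here, while one more entry takes the hit ratio from 0 to 0.9. A benchmark limited to concave curves cannot exercise this kind of step.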
Our framework encodes a workload as a compact parameter triplet, capturing both short-term recency and long-term frequency. This parsimonious parameterization allows researchers to easily translate individual adjustments into predictable cache effects across various eviction policies, and enables the parameter space to be "swept" for exhaustive exploration of desired cache behavior, or to mimic real traces by calibrating parameters to match observed behaviors.
The tuned parameters are portable, meaning if the scale of the system under evaluation changes, so too will the footprint and length of the trace, while the relative cache behaviors are preserved.
Evaluations demonstrate 2DIO's ability to generate traces across a continuum of "what-if" cache behaviors and to reproduce real-world ones with high accuracy.
Published: 2026-03-20T14:13:54Z
Comment: To appear in EuroSys'26
Authors: Yirong Wang, Isaac Khor, Peter Desnoyers
DOI: 10.1145/3767295.3769391

http://arxiv.org/abs/2602.08199v2
Fork, Explore, Commit: OS Primitives for Agentic Exploration
Updated: 2026-03-19T04:38:30Z
AI agents increasingly perform agentic exploration: pursuing multiple solution paths in parallel and committing only the successful one. Because each exploration path may modify files and spawn processes, agents require isolated environments with atomic commit and rollback semantics for both filesystem state and process state. We introduce the branch context, a new OS abstraction that provides: (1) copy-on-write state isolation with independent filesystem views and process groups, (2) a structured lifecycle of fork, explore, and commit/abort, (3) first-commit-wins resolution that automatically invalidates sibling branches, and (4) nestable contexts for hierarchical exploration. We realize branch contexts in Linux through two complementary components. First, BranchFS is a FUSE-based filesystem that gives each branch context an isolated copy-on-write workspace, with O(1) creation, atomic commit to the parent, and automatic sibling invalidation, all without root privileges. BranchFS is open-sourced at https://github.com/multikernel/branchfs, along with a Python integration library, BranchContext, that provides ready-to-use agent exploration patterns. Second, branch() is a proposed Linux syscall that spawns processes into branch contexts with reliable termination, kernel-enforced sibling isolation, and first-commit-wins coordination.
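The fork / explore / commit lifecycle with first-commit-wins resolution can be sketched with a toy in-memory model. This is not BranchFS, the BranchContext library, or the proposed branch() syscall: the classes `ToyBranchGroup` and `ToyBranch` are hypothetical, a dictionary stands in for the filesystem, and process state, abort, and nesting are left out.

```python
import threading


class ToyBranchGroup:
    """In-memory stand-in for a parent state plus its sibling branches."""

    def __init__(self, base):
        self.base = dict(base)          # committed parent state
        self._lock = threading.Lock()   # serializes commits so each one is atomic
        self._winner = None             # id of the branch that committed first
        self._next_id = 0

    def fork(self):
        """Create a branch with an empty write overlay; nothing is copied."""
        branch = ToyBranch(self, self._next_id)
        self._next_id += 1
        return branch


class ToyBranch:
    def __init__(self, group, branch_id):
        self.group = group
        self.id = branch_id
        self.overlay = {}               # private writes, invisible to siblings

    def read(self, key):
        # Copy-on-write view: own writes first, then the shared parent state.
        return self.overlay.get(key, self.group.base.get(key))

    def write(self, key, value):
        self.overlay[key] = value

    def commit(self):
        """Merge the overlay into the parent; only the first committer succeeds."""
        with self.group._lock:
            if self.group._winner is not None:
                return False            # a sibling already won, so this branch is invalid
            self.group.base.update(self.overlay)
            self.group._winner = self.id
            return True


group = ToyBranchGroup({"config": "v1"})
a, b = group.fork(), group.fork()
a.write("config", "v2-a")
b.write("config", "v2-b")
print(b.read("config"))         # v2-b (b sees its own write, not a's)
print(a.commit(), b.commit())   # True False
print(group.base)               # {'config': 'v2-a'}
```

Forking only allocates an empty overlay, which mirrors the O(1) creation claim in spirit, and the work done at commit grows with the number of modified keys rather than with the size of the base state.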
Preliminary evaluation of BranchFS shows sub-350 µs branch creation independent of base filesystem size, and modification-proportional commit overhead (under 1 ms for small changes).
Published: 2026-02-09T01:46:52Z
Authors: Cong Wang, Yusheng Zheng

http://arxiv.org/abs/2603.17259v1
AppFlow: Memory Scheduling for Cold Launch of Large Apps on Mobile and Vehicle Systems
Updated: 2026-03-18T01:35:25Z
GB-scale large apps like on-device LLMs and rich media editors are becoming the next-generation trend, but their heavy memory and I/O demands, especially during multitasking, cause devices to reclaim or kill processes, turning warm apps into cold launches. The challenge lies not in storing them, but in fast, accurate launching. For users, 1s is the usability cliff, yet our measurements show 86.6% of GB-scale cold launches exceed it. Also, Android Vitals flags only ≥ 5s as slow, exposing a large satisfaction gap. Existing optimizations are designed in isolation and conflict. For example, preloading reduces I/O stalls but consumes scarce memory and is undone by reclamation, while reclamation and killing free memory but sacrifice background survivability, leading to repeated cold relaunches. Our key insight is that, although multitasking makes runtime behavior complex, each app's file access pattern remains predictable. The challenge lies in exploiting this predictability, i.e., preloading without exhausting memory, reclaiming without undoing gains, and killing selectively to preserve background survivability. We introduce AppFlow, a prediction-based system-wide scheduler that integrates a Selective File Preloader, an Adaptive Memory Reclaimer, and a Context-Aware Process Killer.
Implemented across the Android framework and Linux kernel without app changes, AppFlow cuts GB-scale cold-launch latency by 66.5% (e.g., 2s → 690ms) and sustains 95% of launches within 1s over a 100-day test, significantly improving responsiveness and multitasking experience.
Published: 2026-03-18T01:35:25Z
Comment: 13 pages, 21 figures, Mobicom 2026
Authors: Xiaochen Li, Sicong Liu, Bin Guo, Yu Ouyang, Fengmin Wu, Yuan Xu, Zhiwen Yu
DOI: 10.1145/3795866.3796690

http://arxiv.org/abs/2603.15042v2
Guaranteeing Semantic and Performance Determinism in Flexible GPU Sharing
Updated: 2026-03-17T08:51:40Z
GPU sharing is critical for maximizing hardware utilization in modern data centers. However, existing approaches present a stark trade-off: coarse-grained temporal multiplexing incurs severe tail-latency spikes for interactive services, while fine-grained spatial partitioning often necessitates invasive kernel modifications that compromise behavioral equivalence.
We present DetShare, a novel GPU sharing system that prioritizes determinism and transparency. DetShare ensures semantic determinism (unmodified kernels yield identical results) and performance determinism (predictable tail latency), all while maintaining complete transparency (zero code modification). DetShare introduces GPU coroutines, a new abstraction that decouples logical execution contexts from physical GPU resources. This decoupling enables flexible, fine-grained resource allocation via lightweight context migration.
Our evaluation demonstrates that DetShare improves training throughput by up to 79.2% compared to temporal sharing. In co-location scenarios, it outperforms state-of-the-art baselines, reducing P99 tail latency by 15.1% without compromising throughput. Furthermore, through workload-aware placement and our TPOT-First scheduling policy, DetShare decreases average inference latency by 69.1% and reduces Time-Per-Output-Token (TPOT) SLO violations by 21.2% relative to default policies.
Published: 2026-03-16T09:48:34Z
Authors: Zhenyuan Yang, Wenxin Zheng, Mingyu Li

http://arxiv.org/abs/2603.15202v1
LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling
Updated: 2026-03-16T12:43:32Z
High-quality LLM request scheduling requires achieving two key objectives: whether the routed instance has KV$ to accelerate the request execution and whether the workload is balanced across instances. Achieving both objectives is challenging because pursuing one objective may compromise the other. Current approaches adopt various combinators (e.g., linear combinations) to compute a scheduling score combining indicators for the two objectives, which are complex in that they either require significant workload-specific hyperparameter tuning or model-hardware-aware simulator development, and could still lead to suboptimal performance. In this paper, we show that using a simple multiplication of two carefully chosen indicators, one KV$-aware (new prefill tokens if routed to an instance) and one load-balancing-aware (current batch size of the instance), as the scheduling score can simultaneously achieve both objectives well without any hyperparameter tuning. The key idea is that the multiplied score considers both objectives in a manner similar to a linear combination, with a nice property that the original hyperparameters are canceled out during comparison so we don't need tuning to find the best parameters.
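A minimal sketch of the multiplicative score described above follows, under stated assumptions. The `Instance` fields, the way new prefill tokens are derived from a cached prefix length, and the choice to route to the lowest score are all assumptions made for illustration; the abstract names only the two indicators and the multiplication.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    cached_prefix_tokens: int   # assumed: prompt tokens already present in this instance's KV cache
    batch_size: int             # requests currently batched on this instance


def score(inst, prompt_tokens):
    """Product of the two indicators: new prefill tokens times current batch size."""
    new_prefill = max(prompt_tokens - inst.cached_prefix_tokens, 0)
    return new_prefill * inst.batch_size


def route(instances, prompt_tokens):
    """Assumed policy: send the request to the instance with the lowest score."""
    return min(instances, key=lambda inst: score(inst, prompt_tokens))


instances = [
    Instance("A", cached_prefix_tokens=800, batch_size=8),  # best cache hit, busiest
    Instance("B", cached_prefix_tokens=0, batch_size=2),    # idle, no cache hit
    Instance("C", cached_prefix_tokens=500, batch_size=3),  # middle on both
]
print([score(inst, 1000) for inst in instances])   # [1600, 2000, 1500]
print(route(instances, 1000).name)                 # C
```

Neither the instance with the best cache hit nor the least-loaded one wins in this example. One property that is easy to check is that rescaling either indicator by a positive constant leaves the ranking unchanged, which is unlike a weighted sum; whether that is exactly the cancellation the abstract refers to is an assumption.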
The two indicators are chosen based on our analysis of LLM characteristics, and our extensive experiments show that this simple approach can reduce TTFT by 92% and 52%, and TPOT by 21% and 20%, compared to vLLM-v1 and a production scheduler on real-world workloads covering chatbots, API calls, and coding agents. We also mathematically derive the conditions under which multiplication may fail, and find that such conditions are extremely rare in practice and can be detected (and mitigated) beforehand.
Published: 2026-03-16T12:43:32Z
Authors: Dingyan Zhang, Jinbo Han, Kaixi Zhang, Xingda Wei, Sijie Shen, Chenguang Fang, Wenyuan Yu, Jingren Zhou, Rong Chen

http://arxiv.org/abs/2603.14357v1
Idiosyncrasies of Programmable Caching Engines
Updated: 2026-03-15T12:47:06Z
Programmable caching engines like CacheLib are widely used in production systems to support diverse workloads in multi-tenant environments. CacheLib's design focuses on performance, portability, and configurability, allowing applications to inherit caching improvements with minimal implementation effort. However, its behavior under dynamic and evolving workloads remains largely unexplored. This paper presents an empirical study of CacheLib with multi-tenant settings under dynamic and volatile environments. Our evaluation across multiple CacheLib configurations reveals several limitations that hinder its effectiveness under such environments, including rigid configurations, limited runtime adaptability, lack of quality-of-service support and coordination, which lead to suboptimal performance, inefficient memory usage, and tenant starvation. Based on these findings, we outline future research directions to improve the adaptability, fairness, and programmability of future caching engines.
Published: 2026-03-15T12:47:06Z
Comment: Paper accepted at the Workshop on Reliable Large-scale Data Management (co-located with IEEE SRDS 2025).
Preliminary version of the paper "Holpaca: Holistic and Adaptable Cache Management for Shared Environments", accepted at the 17th ACM/SPEC International Conference on Performance Engineering (ICPE 2026)
Authors: José Peixoto, Alexis Gonzalez, Janki Bhimani, Raju Rangaswami, Cláudia Brito, João Paulo, Ricardo Macedo

http://arxiv.org/abs/2509.21550v2
A Target-Agnostic Protocol-Independent Interface for the Transport Layer
Updated: 2026-03-14T22:13:13Z
Transport protocols continue to evolve to meet the demands of new applications, workloads, and network environments, yet implementing and evolving transport protocols remains difficult and costly. High-performance transport stacks tightly interweave protocol behavior with system-level mechanisms such as packet I/O, memory management, and concurrency control, resulting in large code bases where protocol logic is scattered and hard to modify -- an issue exacerbated by modern heterogeneous execution environments.
This paper introduces transport programs, a target-independent abstraction that precisely and centrally captures a transport protocol's reactions to relevant transport events using abstract instructions for key transport operations such as data reassembly, packet generation and scheduling, and timer manipulation, while leaving execution strategy and low-level mechanisms to the target. We show that transport programs can express a diverse set of transport protocols, be efficiently realized on targets built over DPDK and Linux XDP, achieve performance comparable to hand-optimized implementations, and enable protocol changes and portability across targets without modifying underlying infrastructure.
Published: 2025-09-25T20:34:52Z
Authors: Pedro Mizuno, Kimiya Mohammadtaheri, Linfan Qian, Joshua Johnson, Danny Akbarzadeh, Chris Neely, Mario Baldi, Nachiket Kapre, Mina Tahmasbi Arashloo

http://arxiv.org/abs/2603.13945v1
A Case for CATS: A Conductor-driven Asymmetric Transport Scheme for Semantic Prioritization
Updated: 2026-03-14T13:36:15Z
Standard transport protocols like TCP operate as a blind, FIFO conveyor belt for data, a model that is increasingly suboptimal for latency-sensitive and interactive applications. This paper challenges this model by introducing CATS (Conductor-driven Asymmetric Transport Scheme), a framework that provides TCP with the semantic awareness necessary to prioritize critical content. By centralizing scheduling intelligence in a transport-native "Conductor", CATS significantly improves user-perceived performance by delivering essential data first. This architecture directly confronts a cascade of historical performance workarounds and their limitations, including the high overhead of parallel connections in HTTP/1.1, the transport-layer Head-of-Line blocking in HTTP/2, and the observed implementation heterogeneity of prioritization in HTTP/3 over QUIC.
Built upon TCP BBR, our ns-3 implementation demonstrates this principle by reducing the First Contentful Paint by over 78% in a representative webpage download configured as a deliberate worst-case scenario, with no penalty to total page load time compared to the baseline.
Published: 2026-03-14T13:36:15Z
Comment: 2025 6th International Conference on Innovative Computing (ICIC)
Authors: Syed Muhammad Aqdas Rizvi
DOI: 10.1109/ICIC68258.2025.11413235

http://arxiv.org/abs/2603.13110v1
AgentRM: An OS-Inspired Resource Manager for LLM Agent Systems
Updated: 2026-03-13T16:07:20Z
Large Language Model (LLM) agent systems have experienced rapid adoption across diverse domains, yet they suffer from critical user experience problems that limit their practical deployment. Through an empirical analysis of over 40,000 GitHub issues from six major agent frameworks (OpenClaw, AutoGen, CrewAI, LangGraph, Codex, Claude Code), we identify two fundamental resource management challenges: (1) scheduling failures leading to system unresponsiveness due to blocking, zombie processes, and rate limit cascades, and (2) context degradation causing agent "amnesia" from unbounded memory growth and poor retention policies. Drawing inspiration from decades of operating systems research, we present AgentRM, a middleware resource manager that treats agent resources analogously to OS resources. AgentRM employs a Multi-Level Feedback Queue (MLFQ) scheduler with zombie reaping and rate-limit-aware admission control, coupled with a three-tier Context Lifecycle Manager that implements adaptive compaction and hibernation mechanisms. Our evaluation demonstrates significant improvements: AgentRM-MLFQ reduces P95 latency by 86%, decreases lane waste by 96%, and increases throughput by 168% while eliminating zombie agents (0 vs. 29 baseline). AgentRM-CLM achieves 100% key information retention with 95% quality score compared to 65.1% retention and 87% quality for existing approaches, albeit with higher compaction costs (34,330 vs. 
17,212 tokens).
Published: 2026-03-13T16:07:20Z
Authors: Jianshu She

http://arxiv.org/abs/2603.11438v1
NCCLbpf: Verified, Composable Policy Execution for GPU Collective Communication
Updated: 2026-03-12T02:03:55Z
NCCL is the de facto standard for collective GPU communication in large-scale distributed training, relying heavily on plugins to customize runtime behavior. However, these plugins execute as unverified native code within NCCL's address space, risking job crashes, silent state corruption, and downtime from restarts during policy updates. Inspired by kernel extensibility models, we introduce NCCLbpf, a verified, high-performance extension framework embedding a userspace eBPF runtime directly into NCCL's existing plugin interfaces, without modifying NCCL itself. NCCLbpf offers load-time static verification to prevent unsafe plugin execution, structured cross-plugin maps enabling composable policies and closed-loop adaptation, and atomic policy hot-reloads eliminating downtime previously required for policy updates. Evaluations on 8x NVIDIA B300 GPUs connected via NVLink demonstrate that NCCLbpf imposes just 80-130 ns overhead per tuner decision (less than 0.03% of collective latency), prevents all tested unsafe plugin behaviors at load-time, and enables a message-size-aware eBPF policy that improves AllReduce throughput by up to 27% over NCCL's default in the 4-128 MiB range.
Published: 2026-03-12T02:03:55Z
Authors: Yusheng Zheng

http://arxiv.org/abs/2602.13692v2
ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System
Updated: 2026-03-10T20:57:47Z
Large language models (LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow.
This leads to sub-optimal management of KV cache and tool execution environments. To address these challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Built upon this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5-3.6x throughput improvements in serving, 1.8-3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the full ThunderAgent system implementation at: https://github.com/Agentic-Kinetics/ThunderAgent.
Published: 2026-02-14T09:26:41Z
Authors: Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, Simran Arora

http://arxiv.org/abs/2603.09738v1
Ensuring Data Freshness in Multi-Rate Task Chains Scheduling
Updated: 2026-03-10T14:45:16Z
In safety-critical autonomous systems, data freshness presents a fundamental design challenge. While the Logical Execution Time (LET) paradigm ensures compositional determinism, it often does so at the cost of injected latency, degrading the phase margin of high-frequency control loops. Furthermore, mapping heterogeneous, multi-rate sensor fusion requirements onto rigid task-centric schedules typically results in resource-inefficient oversampling. This paper proposes a task-based scheduling framework extended with data freshness constraints. Unlike traditional models, scheduling decisions are driven by the lifespan of data.
We introduce task offsets based on data freshness constraints to order data production in a Just-in-Time (JIT) fashion: the completion of the production of the data with the strictest freshness constraint is delayed to the instant its consumers are ready to use it. This allows for flexible task release offsets. We introduce a formal methodology to decompose Data Dependency Graphs into Dominant Paths by tracing the strictest data freshness constraints backward from the actuators. Based on this decomposition, we propose a Consensus Offset Search algorithm that synchronizes shared producers and private predecessors. This approach enforces end-to-end data freshness without the artificial latency of LET buffering. We formally prove that this offset-based alignment preserves the 100% schedulability capacity of Global EDF, ensuring data freshness while eliminating the computational overhead of redundant sampling.
Published: 2026-03-10T14:45:16Z
Authors: José Luis Conradi Hoffmann, Antônio Augusto Fröhlich
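
The just-in-time idea in the data freshness abstract above can be illustrated for a single producer and a single consumer. This is a deliberately simplified sketch, not the paper's Consensus Offset Search: it assumes the producer runs without preemption for exactly its worst-case execution time, and the helpers `jit_offset` and `data_age` are hypothetical names introduced here.

```python
def jit_offset(consumer_ready, producer_wcet):
    """Latest release offset that still lets the producer finish by `consumer_ready`."""
    return max(consumer_ready - producer_wcet, 0)


def data_age(consumer_ready, producer_offset, producer_wcet):
    """Age of the data when the consumer reads it, or None if it is not ready in time."""
    finish = producer_offset + producer_wcet
    if finish > consumer_ready:
        return None
    return consumer_ready - finish


# Hypothetical numbers in milliseconds: the consumer is ready at t=10, the producer needs 3.
print(data_age(10, 0, 3))                    # 7 (released at t=0, data waits 7 ms)
print(jit_offset(10, 3))                     # 7
print(data_age(10, jit_offset(10, 3), 3))    # 0 (released at t=7, data is fresh)
```

If the producer finishes earlier than its worst case, the data is older than this sketch suggests, and preemption under a real scheduler such as Global EDF would shift the finish time as well.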