https://arxiv.org/api/V/ZHXj7FpQznvla8ZGU3NY2K3Uk2026-06-21T14:59:54Z137915015http://arxiv.org/abs/2602.09345v2AgentCgroup: Understanding and Controlling OS Resources of AI Agents2026-02-21T07:48:58ZAI agents are increasingly deployed in multi-tenant cloud environments, where they execute diverse tool calls within sandboxed containers, each call with distinct resource demands and rapid fluctuations. We present a systematic characterization of OS-level resource dynamics in sandboxed AI coding agents, analyzing 144 software engineering tasks from the SWE-rebench benchmark across two LLMs. Our measurements reveal that (1) OS-level execution (tool calls, container and agent initialization) accounts for 56-74% of end-to-end task latency; (2) memory, not CPU, is the concurrency bottleneck; (3) memory spikes are tool-call-driven, with a peak-to-average ratio of up to 15.4x; and (4) resource demands are highly unpredictable across tasks, runs, and models. Comparing these characteristics against serverless, microservice, and batch workloads, we identify three mismatches in existing resource controls: a granularity mismatch (container-level policies vs. tool-call-level dynamics), a responsiveness mismatch (user-space reaction vs. sub-second unpredictable bursts), and an adaptability mismatch (history-based prediction vs. non-deterministic stateful execution). We propose AgentCgroup, an intent-driven eBPF-based resource controller that exploits agents' ability to declare resource needs and reconstruct execution strategies, using hierarchical cgroup structures aligned with tool-call boundaries, in-kernel enforcement via sched_ext and memcg_bpf_ops, and runtime-adaptive policies. Preliminary evaluation demonstrates improved multi-tenant isolation and reduced resource waste. 
AgentCgroup is open-source at https://github.com/eunomia-bpf/agentcgroup2026-02-10T02:37:42ZYusheng ZhengJiakun FanQuanzhi FuYiwei YangWei ZhangAndi Quinnhttp://arxiv.org/abs/2503.09663v3BYOS: Knowledge-driven Large Language Models Bring Your Own Operating System More Excellent2026-02-12T09:08:06ZOperating system (OS) kernel tuning is a critical yet challenging problem for performance optimization, due to the large configuration space, complex interdependencies among configuration options, and the rapid evolution of kernel versions. Recent work has explored large language models (LLMs) for automated kernel tuning, but existing approaches often suffer from hallucinated configurations, limited interpretability, and poor robustness across workloads and kernel versions. We propose BYOS, a knowledge-driven framework that grounds LLM-based Linux kernel tuning in structured domain knowledge. BYOS incorporates three key components: (1) structured knowledge construction and mapping to bridge the semantic gap, (2) knowledge-driven configuration generation to refine the search space, and (3) continuous knowledge maintenance to adapt to kernel evolution. We evaluate BYOS on diverse workloads across multiple Linux distributions and kernel versions. Experimental results show that BYOS consistently outperforms state-of-the-art tuning baselines, achieving 7.1%-155.4% performance improvement while substantially reducing invalid configurations. These results demonstrate the effectiveness of integrating structured knowledge with LLMs for robust and scalable system optimization. 
The code of BYOS is available at https://github.com/LHY-24/BYOS.2025-03-12T15:50:16ZHongyu LinYuchen LiHaoran LuoKaichun YaoLibo ZhangZhenghong LinMingjie XingYanjun WuCarl Yanghttp://arxiv.org/abs/2602.11445v1Hardening the OSv Unikernel with Efficient Address Randomization: Design and Performance Evaluation2026-02-11T23:47:45ZUnikernels are single-purpose library operating systems that run the kernel and application in one address space, but often omit security mitigations such as address space layout randomization (ASLR). In OSv, boot, program loading, and thread creation select largely deterministic addresses, leading to near-identical layouts across instances and more repeatable exploitation. To reduce layout predictability, this research introduces ASLR-style diversity into OSv by randomizing the application base and thread stack regions through targeted changes to core memory-management and loading routines. The implementation adds minimal complexity while preserving OSv's lightweight design goals. Evaluation against an unmodified baseline finds comparable boot time, application runtime, and memory usage. Analysis indicates that the generated addresses exhibit a uniform distribution. These results show that layout-randomization defenses can be efficiently and effectively integrated into OSv unikernels, improving resistance to reliable exploitation.2026-02-11T23:47:45Z6 pages, 3 tables2026 IEEE 14th International Symposium on Digital Forensics and Security (ISDFS)Alex WollmanJohn Hastings10.1109/ISDFS69419.2026.11459124http://arxiv.org/abs/2604.09559v1Interferences within a certifiable design methodology for high-performance multi-core platforms2026-02-11T10:59:37ZThe adoption of high-performance multi-core platforms in avionics and automotive systems introduces significant challenges in ensuring predictable execution, primarily due to shared resource interferences. 
Many existing approaches study interference from a single angle, for example through hardware-level analysis or by monitoring software execution. However, no single abstraction level is sufficient on its own. Hardware behavior, program structure, and system configuration all interact, and a complete view is needed to understand where interferences come from and how to reduce them. In this paper, we present a methodology that brings together several tools that operate at different abstraction levels. At the lowest level, PHYLOG provides a formal model of the hardware and identifies possible interference channels using micro-architectural transactions. At the program level, machine learning analysis locates the exact parts of the code that are most sensitive to shared-resource contention. At the compilation level, MLIR-based transformations use this information to reshape memory access patterns and reduce pressure on shared resources. Finally, at the system level, Linux cgroups enforce static execution constraints to prevent highly interfering tasks from running together. The goal of our approach is to reduce memory interference and improve the system's predictability, thereby easing the certification process of multi-core systems in safety-critical domains.2026-02-11T10:59:37Z13th European Congress of Embedded Real Time Systems (ERTS), Feb 2026, Toulouse, FranceMohamed Amine KhelassiLECAFelix SuchertTU DresdenAbderaouf AmalouNantes Univ - ECN, LS2NBenjamin LesageTU DresdenAnika ChristmannTU DresdenRobin HapkaTU DresdenJeronimo CastrillonTU DresdenMihail AsavoaeLECAMathieu JanLECAClaire PagettiSelma Saidihttp://arxiv.org/abs/2510.18756v2Hazel: Secure and Efficient Disaggregated Storage2026-02-10T17:49:39ZDisaggregated storage with NVMe-over-Fabrics (NVMe-oF) has emerged as the standard solution in modern supercomputers and data center clusters, achieving superior performance, resource utilization, and power efficiency. 
Simultaneously, confidential computing (CC) is becoming the de facto security paradigm, enforcing stronger isolation and protection for sensitive workloads. However, securing state-of-the-art storage with traditional CC methods struggles to scale and compromises performance or security. To address these issues, we introduce Hazel, a storage management system that extends the NVMe-oF protocol capabilities and adheres to the CC threat model, providing confidentiality, integrity, and freshness guarantees. Hazel offers an appropriate control path with novel concepts such as counter-leasing. Hazel also optimizes data path performance by leveraging NVMe metadata and introducing a new disaggregated Hazel Merkle Tree (HMT), all while remaining compatible with NVMe-oF. For additional efficiency, Hazel also supports offloading to CC-capable smart NIC accelerators. We prototype Hazel on an NVIDIA BlueField-3 and demonstrate that it can achieve as little as 1-2% performance degradation for synthetic patterns, AI training, IO500, and YCSB.2025-10-21T16:01:36ZMarcin ChrapekMeni OrenbachAhmad AtamliMarcin CopikMikhail KhalilovFritz AlderTorsten Hoeflerhttp://arxiv.org/abs/2512.13047v4Sharpen the Spec, Cut the Code: A Case for Generative File System with SYSSPEC2026-02-10T03:07:58ZFile systems are critical OS components that require constant evolution to support new hardware and emerging application needs. However, the traditional paradigm of developing features, fixing bugs, and maintaining the system incurs significant overhead, especially as systems grow in complexity. This paper proposes a new paradigm, generative file systems, which leverages Large Language Models (LLMs) to generate and evolve a file system from prompts, effectively addressing the need for robust evolution. Despite the widespread success of LLMs in code generation, attempts to create a functional file system have thus far been unsuccessful, mainly due to the ambiguity of natural language prompts.
This paper introduces SYSSPEC, a framework for developing generative file systems. Its key insight is to replace ambiguous natural language with principles adapted from formal methods. Instead of imprecise prompts, SYSSPEC employs a multi-part specification that accurately describes a file system's functionality, modularity, and concurrency. The specification acts as an unambiguous blueprint, guiding LLMs to generate expected code flexibly. To manage evolution, we develop a DAG-structured patch that operates on the specification itself, enabling new features to be added without violating existing invariants. Moreover, the SYSSPEC toolchain features a set of LLM-based agents with mechanisms to mitigate hallucination during construction and evolution. We demonstrate our approach by generating SPECFS, a concurrent file system. SPECFS achieves a level of correctness equivalent to that of a manually coded baseline across hundreds of regression tests. We further confirm its evolvability by seamlessly integrating 10 real-world features from Ext4. Our work shows that a specification-guided approach makes generating and evolving complex systems not only feasible but also highly effective.2025-12-15T07:15:01ZQingyuan LiuMo ZouHengbin ZhangDong DuYubin XiaHaibo Chenhttp://arxiv.org/abs/2508.08438v2Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference2026-02-09T21:48:43ZGlobal KV-cache sharing is an effective optimization for accelerating large language model (LLM) inference, yet it introduces an API-visible timing side channel that lets adversaries infer sensitive user inputs from shared entries, leading to cross-tenant privacy risks. To address this problem, we introduce SafeKV (Secure and Flexible KV-cache Sharing), a system-level co-design of privacy enforcement and KV-cache management. 
SafeKV integrates lightweight detection and isolation directly into the serving runtime to eliminate cross-tenant reuse of sensitive KV-cache blocks under our threat model, while recovering most of the performance benefits of global sharing. Our key contributions are: (1) a three-tier asynchronous detection pipeline that decouples privacy classification from inference and supports streaming workloads, (2) a unified radix-tree-based memory manager with path compression and sensitivity-aware eviction for scalable selective isolation, and (3) an RDR-guided (Reuse Diversity Ratio) runtime safeguard that detects and bounds residual leakage. On large LLM backends, SafeKV reduces the time-to-first-token (TTFT) overhead compared to full isolation by up to 40.58% and raises throughput by up to 2.66x. Overall, SafeKV restores the efficiency of KV reuse while enforcing strong, practical privacy for multi-tenant LLM inference.2025-08-11T19:55:44Z14 pages,15 figuresKexin ChuZecheng LinDawei XiangZixu ShenJianchang SuCheng ChuYiwei YangWenhui ZhangWenfei WuWei Zhanghttp://arxiv.org/abs/2502.05413v2Flare: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale2026-02-09T08:40:06ZThe rapid proliferation of large language models has driven the need for efficient GPU training clusters. However, it is challenging due to the frequent occurrence of training anomalies. Since existing diagnostic tools are narrowly tailored to specific issues, there are gaps in their ability to address anomalies spanning the entire training stack. In response, we introduce Flare, a diagnostic framework designed for distributed LLM training at scale. Flare first integrates a lightweight tracing daemon for full-stack and backend-extensible tracing. Additionally, it features a diagnostic engine that automatically diagnoses anomalies, with a focus on performance regressions. 
The deployment of Flare across 6,000 GPUs has demonstrated significant improvements in pinpointing deficiencies in real-world scenarios, with continuous operation for over eight months.2025-02-08T02:43:10ZWeihao CuiJi ZhangHan ZhaoChao LiuJian ShaBingsheng HeMinyi GuoQuan Chenhttp://arxiv.org/abs/2504.19058v3Scaling Data Center TCP to Terabits with Laminar2026-02-07T17:27:58ZWe present Laminar, the first TCP stack that delivers ASIC-class performance and energy efficiency on programmable Reconfigurable Match-Action Table (RMT) pipelines, providing flexibility while retaining standard TCP semantics and POSIX socket compatibility. The key challenge to Laminar is reconciling TCP's complex dependent state updates with RMT's unidirectional, lock-step execution model. To overcome this challenge, Laminar introduces three novel techniques: optimistic concurrency (speculative updates validated downstream), pseudo-segment injection (circular dependency resolution without stalls), and bump-in-the-wire processing (single-pass segment handling). Together, these enable TCP processing, including retransmission, reassembly, flow, and congestion control, as a pipeline of simple match-action operations.
Our Intel Tofino 2 prototype demonstrates Laminar's scalability to terabit speeds, flexibility, and robustness to network dynamics. Laminar matches RDMA performance and efficiency for both RPC and streaming workloads (including NVMe-oF with SPDK), while maintaining TCP/POSIX compatibility. Laminar saves up to 16 host CPU cores versus state-of-the-art kernel-bypass TCP, while achieving 5$\times$ lower 99.99p tail latency and 2$\times$ better throughput-per-watt for key-value stores. At scale, Laminar drives nearly $1$ Bpps at 20 $\mu$s RPC tail latency. Unlike fixed-function offloads, Laminar supports transport evolution through in-data-path extensions (selective ACKs, congestion control variants, application co-design for shared logs). Finally, Laminar generalizes to FPGA SmartNICs, outperforming ToNIC's monolithic design by $3\times$ under equal timing.2025-04-27T00:13:02Z16 pages, 14 figures, 3 TablesRajath ShashidharaAntoine KaufmannSimon Peterhttp://arxiv.org/abs/2504.14489v3Towards High-Goodput LLM Serving with Prefill-decode Multiplexing2026-02-07T09:18:21ZLarge Language Model (LLM) serving must meet stringent Service Level Objectives (SLOs) for both the prefill and decode phases. Some existing solutions disaggregate the two phases, causing potential resource idleness or compute redundancy. Others split the prefill phase into chunks and fuse it with decode iteration, creating a dilemma between SLO compliance and high utilization. To address these issues, an efficient serving system should dynamically adapt compute allocation, decouple compute from memory management, and execute prefill and decode independently. We present MuxWise, an LLM serving framework that adopts a new paradigm, intra-GPU prefill-decode multiplexing, to meet these requirements. To fully exploit the paradigm, MuxWise integrates a bubble-less multiplex engine, a contention-tolerant estimator, and an SLO-aware dispatcher. 
Evaluation shows that MuxWise improves peak throughput under SLO guarantees by an average of 2.20x (up to 3.06x) over state-of-the-art baselines.2025-04-20T04:46:34ZYukang ChenWeihao CuiHan ZhaoZiyi XuXiaoze FanXusheng ChenYangjie ZhouShixuan SunBingsheng HeQuan Chenhttp://arxiv.org/abs/2602.07191v1HALO: A Fine-Grained Resource Sharing Quantum Operating System2026-02-06T20:54:00ZAs quantum computing enters the cloud era, thousands of users must share access to a small number of quantum processors. Users wait minutes to days for their jobs to start, even though each job takes only a few seconds to execute. Current quantum cloud platforms employ a fair-share scheduler, as there is no way to multiplex a quantum computer among multiple programs at the same time, leaving many qubits idle and significantly under-utilizing the hardware. This imbalance between high user demand and scarce quantum resources has become a key barrier to scalable and cost-effective quantum computing.
We present HALO, the first quantum operating system design that supports fine-grained resource-sharing. HALO introduces two complementary mechanisms. First, a hardware-aware qubit-sharing algorithm that places shared helper qubits on regions of the quantum computer that minimize routing overhead and avoid cross-talk noise between different users' processes. Second, a shot-adaptive scheduler that allocates execution windows according to each job's sampling requirements, improving throughput and reducing latency. Together, these mechanisms transform the way quantum hardware is scheduled and achieve more fine-grained parallelism.
We evaluate HALO on the IBM Torino quantum computer on helper-qubit-intensive benchmarks. Compared to state-of-the-art systems such as HyperQ, HALO improves overall hardware utilization by up to 2.44x, increases throughput by 4.44x, and keeps fidelity loss within 33%, demonstrating the practicality of resource-sharing in quantum computing.2026-02-06T20:54:00ZJohn Zhuoyang YeJiyuan WangYifan QiaoJens Palsberghttp://arxiv.org/abs/2602.05540v1Taking the Leap: Efficient and Reliable Fine-Grained NUMA Migration in User-space2026-02-05T10:58:27ZModern multi-socket architectures offer a single virtual address space, but physically divide main memory across multiple regions, where each region is attached to a CPU and its cores. While this simplifies usage, developers must be aware of non-uniform memory access (NUMA), where an access by a thread running on a core-local NUMA region is significantly cheaper than an access from a core-remote region. If query answering is parallelized across the cores of multiple regions, then the portion of the database on which the query operates should be distributed across the same regions to ensure local accesses. As the present data placement might not fit this, pages can be migrated from one NUMA region to another to improve the situation. Different options exist for doing so: One option is to rely on the automatic NUMA balancing integrated in Linux, which is steered by the observed access patterns and frequency. Another option is to actively trigger migration via the system call move_pages(). Unfortunately, both variants have significant downsides in terms of their feature set and performance. As an alternative, we propose a new user-space migration method called page_leap() that can perform page migration asynchronously at high performance by exploiting features of the virtual memory subsystem. 
The method is (a) actively triggered by the user, (b) ensures that all pages are eventually migrated, (c) handles concurrent writes correctly, (d) supports pooled memory, (e) adaptively adjusts its migration granularity based on the workload, and (f) supports both small pages and huge pages.2026-02-05T10:58:27ZFelix SchuhknechtNick Rassauhttp://arxiv.org/abs/2602.02579v3ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation2026-02-05T03:13:02ZThe prefill stage of long-context Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble pre-calculated KV caches of retrieved RAG documents (by a user query) and reprocess selected tokens to recover cross-attention between these pre-calculated KV caches. However, we identify a fundamental "crowding-out effect" in current token selection criteria: globally salient but user-query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering the user query and degrading inference accuracy.
We propose ProphetKV, a user-query-driven KV Cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens based on their semantic relevance to the user query and employs a dual-stage recomputation pipeline to fuse layer-wise attention metrics into a high-utility set. By ensuring the recomputation budget is dedicated to bridging the informational gap between retrieved context and the user query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Our extensive evaluation results show that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio, while achieving accuracy improvements of 8.8%-24.9% on RULER and 18.6%-50.9% on LongBench over the state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare).2026-01-31T09:53:31ZShihao WangJiahao ChenYanqi PanHao HuangYichen HaoXiangyu ZouWen XiaWentao ZhangChongyang QiuPengfei Wanghttp://arxiv.org/abs/2602.03757v1Mitigating Timing-Based Attacks in Real-Time Cyber-Physical Systems2026-02-03T17:21:26ZReal-time cyber-physical systems depend on deterministic task execution to guarantee safety and correctness. Unfortunately, this determinism can unintentionally expose timing information that enables adversaries to infer task execution patterns and carry out timing-based attacks targeting safety-critical control tasks. While prior defenses aim to obscure schedules through randomization or isolation, they typically neglect the implications of such modifications on closed-loop control behavior and real-time feasibility. This work studies the problem of securing real-time control workloads against timing inference attacks while explicitly accounting for both schedulability constraints and control performance requirements. We present a scheduling-based mitigation approach that introduces bounded timing perturbations to control task executions in a structured manner, reducing adversarial opportunities without violating real-time guarantees. 
The framework jointly considers worst-case execution behavior and the impact of execution delays on control performance, enabling the system to operate within predefined safety and performance limits. Through experimental evaluation on representative task sets and control scenarios, the proposed approach demonstrates that exposure to timing-based attacks can be significantly reduced while preserving predictable execution and acceptable control quality.2026-02-03T17:21:26Z12 pages, 10 figuresArkaprava SainSunandan AdhikarySoumyajit Deyhttp://arxiv.org/abs/2601.16448v2Ringmaster: How to juggle high-throughput host OS system calls from TrustZone TEEs2026-02-03T03:44:36ZMany safety-critical systems require timely processing of sensor inputs to avoid potential safety hazards. Additionally, to support useful application features, such systems increasingly have a large rich operating system (OS) at the cost of potential security bugs. Thus, if a malicious party gains supervisor privileges, they could cause real-world damage by denying service to time-sensitive programs. Many past approaches to this problem completely isolate time-sensitive programs with a hypervisor; however, this prevents the programs from accessing useful OS services. We introduce Ringmaster, a novel framework that enables enclaves or TEEs (Trusted Execution Environments) to asynchronously access rich, but potentially untrusted, OS services via Linux's io_uring. When service is denied by the untrusted OS, enclaves continue to operate on Ringmaster's minimal ARM TrustZone kernel with access to small, critical device drivers. This approach balances the need for secure, time-sensitive processing with the convenience of rich OS services. Additionally, Ringmaster supports large unmodified programs as enclaves, offering lower overhead compared to existing systems. We demonstrate how Ringmaster helps us build a working highly-secure system with minimal engineering. 
In our experiments with an unmanned aerial vehicle, Ringmaster achieved nearly 1 GiB/s of data transfer into the enclave on a Raspberry Pi 4B, with 0-3% throughput overhead compared to non-enclave tasks.2026-01-23T05:01:45ZRichard HabeebMan-Ki YoonHao ChenZhong Shao