https://arxiv.org/api/V/ZHXj7FpQznvla8ZGU3NY2K3Uk 2026-06-21T14:59:54Z 1379 150 15 http://arxiv.org/abs/2602.09345v2 AgentCgroup: Understanding and Controlling OS Resources of AI Agents 2026-02-21T07:48:58Z

AI agents are increasingly deployed in multi-tenant cloud environments, where they execute diverse tool calls within sandboxed containers, each call with distinct resource demands and rapid fluctuations. We present a systematic characterization of OS-level resource dynamics in sandboxed AI coding agents, analyzing 144 software engineering tasks from the SWE-rebench benchmark across two LLM models. Our measurements reveal that (1) OS-level execution (tool calls, container and agent initialization) accounts for 56-74% of end-to-end task latency; (2) memory, not CPU, is the concurrency bottleneck; (3) memory spikes are tool-call-driven with a up to 15.4x peak-to-average ratio; and (4) resource demands are highly unpredictable across tasks, runs, and models. Comparing these characteristics against serverless, microservice, and batch workloads, we identify three mismatches in existing resource controls: a granularity mismatch (container-level policies vs. tool-call-level dynamics), a responsiveness mismatch (user-space reaction vs. sub-second unpredictable bursts), and an adaptability mismatch (history-based prediction vs. non-deterministic stateful execution). We propose AgentCgroup, an intent-driven eBPF-based resource controller that exploits agents ability to declare resource needs and reconstruct execution strategies, using hierarchical cgroup structures aligned with tool-call boundaries, in-kernel enforcement via sched_ext and memcg_bpf_ops, and runtime-adaptive policies. Preliminary evaluation demonstrates improved multi-tenant isolation and reduced resource waste. AgentCgroup is open-source at https://github.com/eunomia-bpf/agentcgroup

2026-02-10T02:37:42Z Yusheng Zheng Jiakun Fan Quanzhi Fu Yiwei Yang Wei Zhang Andi Quinn http://arxiv.org/abs/2503.09663v3 BYOS: Knowledge-driven Large Language Models Bring Your Own Operating System More Excellent 2026-02-12T09:08:06Z

Operating system (OS) kernel tuning is a critical yet challenging problem for performance optimization, due to the large configuration space, complex interdependencies among configuration options, and the rapid evolution of kernel versions. Recent work has explored large language models (LLMs) for automated kernel tuning, but existing approaches often suffer from hallucinated configurations, limited interpretability, and poor robustness across workloads and kernel versions. We propose BYOS, a knowledge-driven framework that grounds LLM-based Linux kernel tuning in structured domain knowledge. BYOS incorporates three key components: (1) structured knowledge construction and mapping to bridge the semantic gap, (2) knowledge-driven configuration generation to refine the search space, and (3) continuous knowledge maintenance to adapt to kernel evolution. We evaluate BYOS on diverse workloads across multiple Linux distributions and kernel versions. Experimental results show that BYOS consistently outperforms state-of-the-art tuning baselines, achieving 7.1%-155.4% performance improvement while substantially reducing invalid configurations. These results demonstrate the effectiveness of integrating structured knowledge with LLMs for robust and scalable system optimization. The code of BYOS is available at https://github.com/LHY-24/BYOS.

2025-03-12T15:50:16Z Hongyu Lin Yuchen Li Haoran Luo Kaichun Yao Libo Zhang Zhenghong Lin Mingjie Xing Yanjun Wu Carl Yang http://arxiv.org/abs/2602.11445v1 Hardening the OSv Unikernel with Efficient Address Randomization: Design and Performance Evaluation 2026-02-11T23:47:45Z

Unikernels are single-purpose library operating systems that run the kernel and application in one address space, but often omit security mitigations such as address space layout randomization (ASLR). In OSv, boot, program loading, and thread creation select largely deterministic addresses, leading to near-identical layouts across instances and more repeatable exploitation. To reduce layout predictability, this research introduces ASLR-style diversity into OSv by randomizing the application base and thread stack regions through targeted changes to core memory-management and loading routines. The implementation adds minimal complexity while preserving OSv's lightweight design goals. Evaluation against an unmodified baseline finds comparable boot time, application runtime, and memory usage. Analysis indicates that the generated addresses exhibit a uniform distribution. These results show that layout-randomization defenses can be efficiently and effectively integrated into OSv unikernels, improving resistance to reliable exploitation.

2026-02-11T23:47:45Z 6 pages, 3 tables 2026 IEEE 14th International Symposium on Digital Forensics and Security (ISDFS) Alex Wollman John Hastings 10.1109/ISDFS69419.2026.11459124 http://arxiv.org/abs/2604.09559v1 Interferences within a certifiable design methodology for high-performance multi-core platforms 2026-02-11T10:59:37Z

The adoption of high-performance multi-core platforms in avionics and automotive systems introduces significant challenges in ensuring predictable execution, primarily due to shared resource interferences. Many existing approaches study interference from a single angle-for example, through hardware-level analysis or by monitoring software execution. However, no single abstraction level is sufficient on its own. Hardware behavior, program structure, and system configuration all interact, and a complete view is needed to understand where interferences come from and how to reduce them. In this paper, we present a methodology that brings together several tools that operate at different abstraction levels. At the lowest level, PHYLOG provides a formal model of the hardware and identifies possible interference channels using micro-architectural transactions. At the program level, machine learning analysis locates the exact parts of the code that are most sensitive to shared-resource contention. At the compilation level, MLIR-based transformations use this information to reshape memory access patterns and reduce pressure on shared resources. Finally, at the system level, Linux cgroups enforce static execution constraints to prevent highly interfering tasks from running together. The goal of our approach is to reduce memory interference and improve the system's predictability, thereby easing the certification process of multi-core systems in safety-critical domains.

2026-02-11T10:59:37Z 13th European Congress of Embedded Real Time Systems (ERTS), Feb 2026, Toulouse, France Mohamed Amine Khelassi LECA Felix Suchert TU Dresden Abderaouf Amalou Nantes Univ - ECN, LS2N Benjamin Lesage TU Dresden Anika Christmann TU Dresden Robin Hapka TU Dresden Jeronimo Castrillon TU Dresden Mihail Asavoae LECA Mathieu Jan LECA Claire Pagetti Selma Saidi http://arxiv.org/abs/2510.18756v2 Hazel: Secure and Efficient Disaggregated Storage 2026-02-10T17:49:39Z

Disaggregated storage with NVMe-over-Fabrics (NVMe-oF) has emerged as the standard solution in modern supercomputers and data center clusters, achieving superior performance, resource utilization, and power efficiency. Simultaneously, confidential computing (CC) is becoming the de facto security paradigm, enforcing stronger isolation and protection for sensitive workloads. However, securing state-of-the-art storage with traditional CC methods struggles to scale and compromises performance or security. To address these issues, we introduce Hazel, a storage management system that extends the NVMe-oF protocol capabilities and adheres to the CC threat model, providing confidentiality, integrity, and freshness guarantees. Hazel offers an appropriate control path with novel concepts such as counter-leasing. Hazel also optimizes data path performance by leveraging NVMe metadata and introducing a new disaggregated Hazel Merkle Tree (HMT), all while remaining compatible with NVMe-oF. For additional efficiency, Hazel also supports offloading to CC-capable smart NIC accelerators. We prototype Hazel on an NVIDIA BlueField-3 and demonstrate that it can achieve as little as 1-2% performance degradation for synthetic patterns, AI training, IO500, and YCSB.

2025-10-21T16:01:36Z Marcin Chrapek Meni Orenbach Ahmad Atamli Marcin Copik Mikhail Khalilov Fritz Alder Torsten Hoefler http://arxiv.org/abs/2512.13047v4 Sharpen the Spec, Cut the Code: A Case for Generative File System with SYSSPEC 2026-02-10T03:07:58Z

File systems are critical OS components that require constant evolution to support new hardware and emerging application needs. However, the traditional paradigm of developing features, fixing bugs, and maintaining the system incurs significant overhead, especially as systems grow in complexity. This paper proposes a new paradigm, generative file systems, which leverages Large Language Models (LLMs) to generate and evolve a file system from prompts, effectively addressing the need for robust evolution. Despite the widespread success of LLMs in code generation, attempts to create a functional file system have thus far been unsuccessful, mainly due to the ambiguity of natural language prompts. This paper introduces SYSSPEC, a framework for developing generative file systems. Its key insight is to replace ambiguous natural language with principles adapted from formal methods. Instead of imprecise prompts, SYSSPEC employs a multi-part specification that accurately describes a file system's functionality, modularity, and concurrency. The specification acts as an unambiguous blueprint, guiding LLMs to generate expected code flexibly. To manage evolution, we develop a DAG-structured patch that operates on the specification itself, enabling new features to be added without violating existing invariants. Moreover, the SYSSPEC toolchain features a set of LLM-based agents with mechanisms to mitigate hallucination during construction and evolution. We demonstrate our approach by generating SPECFS, a concurrent file system. SPECFS demonstrates equivalent level of correctness to that of a manually-coded baseline across hundreds of regression tests. We further confirm its evolvability by seamlessly integrating 10 real-world features from Ext4. Our work shows that a specification-guided approach makes generating and evolving complex systems not only feasible but also highly effective.

2025-12-15T07:15:01Z Qingyuan Liu Mo Zou Hengbin Zhang Dong Du Yubin Xia Haibo Chen http://arxiv.org/abs/2508.08438v2 Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference 2026-02-09T21:48:43Z

Global KV-cache sharing is an effective optimization for accelerating large language model (LLM) inference, yet it introduces an API-visible timing side channel that lets adversaries infer sensitive user inputs from shared entries, leading to cross-tenant privacy risks. To address this problem, we introduce SafeKV (Secure and Flexible KV-cache Sharing), a system-level co-design of privacy enforcement and KV-cache management. SafeKV integrates lightweight detection and isolation directly into the serving runtime to eliminate cross-tenant reuse of sensitive KV-cache blocks under our threat model, while recovering most of the performance benefits of global sharing. Our key contributions are: (1) a three-tier asynchronous detection pipeline that decouples privacy classification from inference and supports streaming workloads, (2) a unified radix-tree-based memory manager with path compression and sensitivity-aware eviction for scalable selective isolation, and (3) an RDR-guided (Reuse Diversity Ratio) runtime safeguard that detects and bounds residual leakage. On large LLM backends, SafeKV reduces the time-to-first-token (TTFT) overhead compared to full isolation by up to 40.58% and raises throughput by up to 2.66x. Overall, SafeKV restores the efficiency of KV reuse while enforcing strong, practical privacy for multi-tenant LLM inference.

2025-08-11T19:55:44Z 14 pages,15 figures Kexin Chu Zecheng Lin Dawei Xiang Zixu Shen Jianchang Su Cheng Chu Yiwei Yang Wenhui Zhang Wenfei Wu Wei Zhang http://arxiv.org/abs/2502.05413v2 Flare: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale 2026-02-09T08:40:06Z

The rapid proliferation of large language models has driven the need for efficient GPU training clusters. However, it is challenging due to the frequent occurrence of training anomalies. Since existing diagnostic tools are narrowly tailored to specific issues, there are gaps in their ability to address anomalies spanning the entire training stack. In response, we introduce Flare, a diagnostic framework designed for distributed LLM training at scale. Flare first integrates a lightweight tracing daemon for full-stack and backend-extensible tracing. Additionally, it features a diagnostic engine that automatically diagnoses anomalies, with a focus on performance regressions. The deployment of Flare across 6,000 GPUs has demonstrated significant improvements in pinpointing deficiencies in real-world scenarios, with continuous operation for over eight months.

2025-02-08T02:43:10Z Weihao Cui Ji Zhang Han Zhao Chao Liu Jian Sha Bingsheng He Minyi Guo Quan Chen http://arxiv.org/abs/2504.19058v3 Scaling Data Center TCP to Terabits with Laminar 2026-02-07T17:27:58Z

We present Laminar, the first TCP stack that delivers ASIC-class performance and energy efficiency on programmable Reconfigurable Match-Action Table (RMT) pipelines, providing flexibility while retaining standard TCP semantics and POSIX socket compatibility. The key challenge to Laminar is reconciling TCP's complex dependent state updates with RMT's unidirectional, lock-step execution model. To overcome this challenge, Laminar introduces three novel techniques: optimistic concurrency (speculative updates validated downstream), pseudo-segment injection (circular dependency resolution without stalls), and bump-in-the-wire processing (single-pass segment handling). Together, these enable TCP processing, including retransmission, reassembly, flow, and congestion control, as a pipeline of simple match-action operations. Our Intel Tofino 2 prototype demonstrates Laminar's scalability to terabit speeds, flexibility, and robustness to network dynamics. Laminar matches RDMA performance and efficiency for both RPC and streaming workloads (including NVMe-oF with SPDK), while maintaining TCP/POSIX compatibility. Laminar saves up to 16 host CPU cores versus state-of-the-art kernel-bypass TCP, while achieving 5$\times$ lower 99.99p tail latency and 2$\times$ better throughput-per-watt for key-value stores. At scale, Laminar drives nearly $1$ Bpps at 20 $μ$s RPC tail latency. Unlike fixed-function offloads, Laminar supports transport evolution through in-data-path extensions (selective ACKs, congestion control variants, application co-design for shared logs). Finally, Laminar generalizes to FPGA SmartNICs, outperforming ToNIC's monolithic design by $3\times$ under equal timing.

2025-04-27T00:13:02Z 16 pages, 14 figures, 3 Tables Rajath Shashidhara Antoine Kaufmann Simon Peter http://arxiv.org/abs/2504.14489v3 Towards High-Goodput LLM Serving with Prefill-decode Multiplexing 2026-02-07T09:18:21Z

Large Language Model (LLM) serving must meet stringent Service Level Objectives (SLOs) for both the prefill and decode phases. Some existing solutions disaggregate the two phases, causing potential resource idleness or compute redundancy. Others split the prefill phase into chunks and fuse it with decode iteration, creating a dilemma between SLO compliance and high utilization. To address these issues, an efficient serving system should dynamically adapt compute allocation, decouple compute from memory management, and execute prefill and decode independently. We present MuxWise, an LLM serving framework that adopts a new paradigm, intra-GPU prefill-decode multiplexing, to meet these requirements. To fully exploit the paradigm, MuxWise integrates a bubble-less multiplex engine, a contention-tolerant estimator, and an SLO-aware dispatcher. Evaluation shows that MuxWise improves peak throughput under SLO guarantees by an average of 2.20x (up to 3.06x) over state-of-the-art baselines.

2025-04-20T04:46:34Z Yukang Chen Weihao Cui Han Zhao Ziyi Xu Xiaoze Fan Xusheng Chen Yangjie Zhou Shixuan Sun Bingsheng He Quan Chen http://arxiv.org/abs/2602.07191v1 HALO: A Fine-Grained Resource Sharing Quantum Operating System 2026-02-06T20:54:00Z

As quantum computing enters the cloud era, thousands of users must share access to a small number of quantum processors. Users need to wait minutes to days to start their jobs, which only takes a few seconds for execution. Current quantum cloud platforms employ a fair-share scheduler, as there is no way to multiplex a quantum computer among multiple programs at the same time, leaving many qubits idle and significantly under-utilizing the hardware. This imbalance between high user demand and scarce quantum resources has become a key barrier to scalable and cost-effective quantum computing. We present HALO, the first quantum operating system design that supports fine-grained resource-sharing. HALO introduces two complementary mechanisms. First, a hardware-aware qubit-sharing algorithm that places shared helper qubits on regions of the quantum computer that minimize routing overhead and avoid cross-talk noise between different users' processes. Second, a shot-adaptive scheduler that allocates execution windows according to each job's sampling requirements, improving throughput and reducing latency. Together, these mechanisms transform the way quantum hardware is scheduled and achieve more fine-grained parallelism. We evaluate HALO on the IBM Torino quantum computer on helper qubit intense benchmarks. Compared to state-of-the-art systems such as HyperQ, HALO improves overall hardware utilization by up to 2.44x, increasing throughput by 4.44x, and maintains fidelity loss within 33%, demonstrating the practicality of resource-sharing in quantum computing.

2026-02-06T20:54:00Z John Zhuoyang Ye Jiyuan Wang Yifan Qiao Jens Palsberg http://arxiv.org/abs/2602.05540v1 Taking the Leap: Efficient and Reliable Fine-Grained NUMA Migration in User-space 2026-02-05T10:58:27Z

Modern multi-socket architectures offer a single virtual address space, but physically divide main-memory across multiple regions, where each region is attached to a CPU and its cores. While this simplifies the usage, developers must be aware of non-uniform memory access (NUMA), where an access by a thread running on a core-local NUMA region is significantly cheaper than an access from a core-remote region. Obviously, if query answering is parallelized across the cores of multiple regions, then the portion of the database on which the query is operating should be distributed across the same regions to ensure local accesses. As the present data placement might not fit this, migrating pages from one NUMA region to another can be performed to improve the situation. To do so, different options exist: One option is to rely on automatic NUMA balancing integrated in Linux, which is steered by the observed access patterns and frequency. Another option is to actively trigger migration via the system call move_pages(). Unfortunately, both variants have significant downsides in terms of their feature set and performance. As an alternative, we propose a new user-space migration method called page_leap() that can perform page migration asynchronously at a high performance by exploiting features of the virtual memory subsystem. The method is (a) actively triggered by the user, (b) ensures that all pages are eventually migrated, (c) handles concurrent writes correctly, (d) supports pooled memory, (e) adaptively adjusts its migration granularity based on the workload, and (f) supports both small pages and huge pages.

2026-02-05T10:58:27Z Felix Schuhknecht Nick Rassau http://arxiv.org/abs/2602.02579v3 ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation 2026-02-05T03:13:02Z

The prefill stage of long-context Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble pre-calculated KV caches of retrieved RAG documents (by a user query) and reprocess selected tokens to recover cross-attention between these pre-calculated KV caches. However, we identify a fundamental "crowding-out effect" in current token selection criteria: globally salient but user-query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering the user query and degrading inference accuracy. We propose ProphetKV, a user-query-driven KV Cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens based on their semantic relevance to the user query and employs a dual-stage recomputation pipeline to fuse layer-wise attention metrics into a high-utility set. By ensuring the recomputation budget is dedicated to bridging the informational gap between retrieved context and the user query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Our extensive evaluation results show that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio, while achieving accuracy improvements of 8.8%-24.9% on RULER and 18.6%-50.9% on LongBench over the state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare).

2026-01-31T09:53:31Z Shihao Wang Jiahao Chen Yanqi Pan Hao Huang Yichen Hao Xiangyu Zou Wen Xia Wentao Zhang Chongyang Qiu Pengfei Wang http://arxiv.org/abs/2602.03757v1 Mitigating Timing-Based Attacks in Real-Time Cyber-Physical Systems 2026-02-03T17:21:26Z

Real-time cyber-physical systems depend on deterministic task execution to guarantee safety and correctness. Unfortunately, this determinism can unintentionally expose timing information that enables adversaries to infer task execution patterns and carry out timing-based attacks targeting safety-critical control tasks. While prior defenses aim to obscure schedules through randomization or isolation, they typically neglect the implications of such modifications on closed-loop control behavior and real-time feasibility. This work studies the problem of securing real-time control workloads against timing inference attacks while explicitly accounting for both schedulability constraints and control performance requirements. We present a scheduling-based mitigation approach that introduces bounded timing perturbations to control task executions in a structured manner, reducing adversarial opportunities without violating real-time guarantees. The framework jointly considers worst-case execution behavior and the impact of execution delays on control performance, enabling the system to operate within predefined safety and performance limits. Through experimental evaluation on representative task sets and control scenarios, the proposed approach demonstrates that exposure to timing-based attacks can be significantly reduced while preserving predictable execution and acceptable control quality.

2026-02-03T17:21:26Z 12 pages, 10 figures Arkaprava Sain Sunandan Adhikary Soumyajit Dey http://arxiv.org/abs/2601.16448v2 Ringmaster: How to juggle high-throughput host OS system calls from TrustZone TEEs 2026-02-03T03:44:36Z

Many safety-critical systems require timely processing of sensor inputs to avoid potential safety hazards. Additionally, to support useful application features, such systems increasingly have a large rich operating system (OS) at the cost of potential security bugs. Thus, if a malicious party gains supervisor privileges, they could cause real-world damage by denying service to time-sensitive programs. Many past approaches to this problem completely isolate time-sensitive programs with a hypervisor; however, this prevents the programs from accessing useful OS services. We introduce Ringmaster, a novel framework that enables enclaves or TEEs (Trusted Execution Environments) to asynchronously access rich, but potentially untrusted, OS services via Linux's io_uring. When service is denied by the untrusted OS, enclaves continue to operate on Ringmaster's minimal ARM TrustZone kernel with access to small, critical device drivers. This approach balances the need for secure, time-sensitive processing with the convenience of rich OS services. Additionally, Ringmaster supports large unmodified programs as enclaves, offering lower overhead compared to existing systems. We demonstrate how Ringmaster helps us build a working highly-secure system with minimal engineering. In our experiments with an unmanned aerial vehicle, Ringmaster achieved nearly 1GiB/sec of data into enclave on a Raspberry Pi4b, 0-3% throughput overhead compared to non-enclave tasks.

2026-01-23T05:01:45Z Richard Habeeb Man-Ki Yoon Hao Chen Zhong Shao