http://arxiv.org/abs/2606.09643v1 FMplex: Model Virtualization for Serving Extensible Foundation Models 2026-06-08T15:38:16Z Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applications. Yet existing model-serving systems deploy each customized task as an independent model instance, thereby replicating heavyweight backbones, wasting accelerator memory, and losing opportunities to amortize batching and loading costs. This paper presents FMplex, a serving system that treats FM backbones as a virtualization substrate for deployment sharing. FMplex presents each task with a virtual foundation model (vFM), a logically private FM instance backed by a shared physical FM. This abstraction lets independently customized tasks share a backbone while preserving task-specific extensions, independent lifecycles, and task-level isolation. In addition, we propose a batch-aware fair-queueing scheduler that combines weighted task-level sharing with inter- and intra-task batching across colocated tasks. We implement an FMplex-based serving stack spanning task construction, sharing-aware deployment, and runtime execution. Across 7 FM backbones (16 variants) and 92 downstream tasks, FMplex reduces latency by up to 80% over spatial partitioning and 33.3% over best-effort co-location, while hosting up to 6x more tasks at cluster scale. 2026-06-08T15:38:16Z Hetvi Shastri Pragya Sharma Walid A.
Hanafy David Irwin Mani Srivastava Prashant Shenoy http://arxiv.org/abs/2605.22781v2 DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback 2026-06-08T12:58:55Z LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory and contexts). Existing mechanisms duplicate the entire state, causing hundreds of milliseconds to seconds of latency per C/R, which severely bottlenecks deep search and large-scale fan-outs. This paper observes that subsequent checkpoints in AI agents are highly similar. The key insight is that, instead of full duplication, a sandbox should duplicate only the changes between consecutive checkpoints. Realizing this idea is non-trivial, however, mainly because the required OS support is missing. This paper proposes a new OS-level abstraction, DeltaState, to enable change-based transactional C/R for AI agents with two co-designed OS mechanisms. First, DeltaFS enables change-based filesystem C/R by organizing the file states into layers and dynamically freezing the writable layer and inserting a new one during checkpoint, reducing file updates to copy-on-write, and making rollback a simple layer switch. Second, DeltaCR enables change-based process state C/R using incremental dumps, and accelerates rollback by bypassing traditional pipelines to directly fork() from a frozen template process. We then present DeltaBox, a novel agent sandbox achieving millisecond-level C/R through the two new mechanisms. Evaluations on SWE-bench and RL micro-benchmarks show DeltaBox completes checkpoint and rollback in millisecond-level latency (14ms and 5ms, respectively), empowering agents to explore substantially more nodes under fixed time budgets.
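The layered checkpoint/rollback idea described for DeltaFS can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not the DeltaBox implementation: the `LayeredStore` class, its method names, and the use of Python dictionaries as "layers" are all hypothetical, and a real filesystem would operate on on-disk layers inside the OS.

```python
from typing import Dict, List, Optional

# Hypothetical marker recording that a path was deleted in a given layer.
_TOMBSTONE = object()


class LayeredStore:
    """Toy stand-in for layered file state (illustrative only)."""

    def __init__(self) -> None:
        # layers[-1] is the single writable layer; all earlier ones are frozen.
        self.layers: List[Dict[str, object]] = [{}]

    def write(self, path: str, data: bytes) -> None:
        # Writes only ever touch the top layer, so frozen layers stay intact.
        self.layers[-1][path] = data

    def delete(self, path: str) -> None:
        self.layers[-1][path] = _TOMBSTONE

    def read(self, path: str) -> Optional[bytes]:
        # Search from the newest layer down; the first hit wins.
        for layer in reversed(self.layers):
            if path in layer:
                value = layer[path]
                return None if value is _TOMBSTONE else value
        return None

    def checkpoint(self) -> int:
        """Freeze the writable layer by stacking a fresh one on top.

        Returns the number of frozen layers, used here as the checkpoint id.
        """
        self.layers.append({})
        return len(self.layers) - 1

    def rollback(self, checkpoint_id: int) -> None:
        """Drop every layer above the checkpoint and open a new writable one."""
        if not 1 <= checkpoint_id < len(self.layers):
            raise ValueError(f"unknown checkpoint {checkpoint_id}")
        del self.layers[checkpoint_id:]
        self.layers.append({})


fs = LayeredStore()
fs.write("a.txt", b"v1")
cp = fs.checkpoint()
fs.write("a.txt", b"v2")
fs.write("b.txt", b"new")
fs.rollback(cp)
```

In this sketch a checkpoint copies no file data, and rollback discards only the layers written after the checkpoint, which is the property the abstract attributes to change-based C/R.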
2026-05-21T17:36:17Z Yunpeng Dong Jingkai He Shiqi Liu Yuze Hou Dong Du Zhonghu Xu Si Yu Baochuan Yang Yubin Xia Haibo Chen http://arxiv.org/abs/2606.09225v1 TinyContainer: Container Runtime Middleware Enabling Multi-tenant Microcontrollers with Built-in Security 2026-06-08T08:57:42Z Software containerization technologies for resource-limited devices enable multi-tenant microcontrollers, which allow running multiple applications with different permission levels. However, current solutions lack runtime configuration of container scheduling and of container permissions on host resources. This limits the applicability of constrained containerization in dynamic and heterogeneous environments. This paper introduces TinyContainer, a lightweight software container management middleware designed for multi-tenant microcontrollers. TinyContainer provides per-container configurable scheduling and fine-grained access control to host resources through a metadata-driven approach, supporting multiple runtimes via a runtime abstraction layer. We analyze the performance of TinyContainer with a small WebAssembly runtime, CS4WAMR, and RIOT OS, a common RTOS. We report on experiments using popular IoT boards based on various Cortex-M microcontrollers. We show that the endpoint system introduced by TinyContainer regulates container access to host resources and provides host services to containers with an overhead of up to 4 ms per call. In particular, we showcase a TinyML use case, whereby containers retain data and model weights, while model inference is delegated to native host RTOS services.
2026-06-08T08:57:42Z ACM WiSec 2026 Bastien Buil Chrystel Gaber Samuel Legouix Emmanuel Baccelli Samia Bouzefrane http://arxiv.org/abs/2606.08119v1 Policy Description Language for Authorization using Logic-Based Programming 2026-06-06T11:48:00Z Recently, with the impossibility of eradicating the vulnerabilities of information systems, we must prepare for the occurrence of the security incident by the multi-layer defense called the Defense-in-Depth strategy. In the multi-layer defense, it is important to authorize accesses in fine-grained granularity to compose each layer effectively, and many access control models are proposed to follow them. However, policy description languages proposed so far cannot express the models appropriately in proper granularity. In this paper, we propose a policy description language which can designate many kinds of conditions for access control, such as the dynamic status of an application process, as an element of decision data, and implement it in Datalog. Using the proposed language, we compose the policy of SELinux, which is a major implementation achieving the multi-layer defense, and we confirm the advantages of the proposed language by evaluating its validity and expressiveness. 2026-06-06T11:48:00Z Masaki Hashimoto Mira Kim Hidenori Tsuji Hidehiko Tanaka http://arxiv.org/abs/2606.08060v1 TOMOYO Linux: A Mandatory Access Control Method Based on Application Execution State 2026-06-06T08:47:14Z Existing access control methods grant access requests based on the combinations of applications as subject and files as objects. Therefore intents of applications and the possible effects caused by granting the access requests have not been taken into consideration. In this paper, we propose a new access control method based on application history and intents. With our access control method, system administrators can reduce the risks caused by malicious access attempts and wrong operations. 
In this paper, the concept and implementation design will be explained as well as the brief evaluation report of TOMOYO Linux, our implementation of the new access control method to Linux. 2026-06-06T08:47:14Z Toshiharu Harada Tetsuo Handa Masaki Hashimoto Hidehiko Tanaka http://arxiv.org/abs/2603.15202v3 Simple is Better: Multiplication May Be All You Need for LLM Request Scheduling 2026-06-05T04:55:52Z High-quality LLM request scheduling requires meeting two key objectives: ensuring the routed instance has KVCache to accelerate request execution, and ensuring that the workload is balanced across instances. Achieving both objectives is challenging because pursuing one may compromise the other. Current approaches use various combinators (e.g., linear combinations) to compute a scheduling score that combines indicators for the two objectives. These approaches are complex: they either require significant workload-specific hyperparameter tuning or model-hardware-aware simulator development, yet could still lead to suboptimal performance. In this paper, we show that using a simple multiplication of two carefully chosen indicators: one KVCache-aware (new prefill tokens if routed to an instance) and one load-balancing-aware (current batch size of the instance), as the scheduling score (LMETRIC) can achieve both objectives simultaneously without any hyperparameter tuning. The key idea is that the simply multiplied score considers both objectives in a manner similar to a linear combination, but the original hyperparameters cancel out during comparison, so no tuning is needed to find the best parameters. The two indicators are chosen based on our analysis of LLM characteristics. Our extensive experiments show that this simple approach can reduce TTFT by 92% and 39%, and TPOT by 24% and 51%, compared to vLLM-v1 and an in-production scheduler on real-world workloads covering chatbots and coding agents. 
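The multiplicative score described above can be sketched in a few lines. This is a minimal illustration, not the LMETRIC implementation: the `Candidate` record, its field names, and the tie-breaking rule are assumptions introduced here, and the way an instance's cached prefix length is obtained is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """State of one serving instance with respect to one incoming request."""
    name: str
    batch_size: int            # current batch size of the instance
    cached_prefix_tokens: int  # hypothetical: prompt tokens already in its KV cache


def score(c: Candidate, prompt_tokens: int) -> int:
    # KVCache-aware indicator: tokens that would still need prefill here.
    new_prefill = max(prompt_tokens - c.cached_prefix_tokens, 0)
    # Scheduling score: the plain product of the two indicators.
    return new_prefill * c.batch_size


def route(candidates: List[Candidate], prompt_tokens: int) -> Candidate:
    if not candidates:
        raise ValueError("no candidate instances")
    # Lowest score wins; ties fall back to batch size, then name
    # (the tie-break is an assumption of this sketch).
    return min(
        candidates,
        key=lambda c: (score(c, prompt_tokens), c.batch_size, c.name),
    )


pool = [
    Candidate("A", batch_size=8, cached_prefix_tokens=900),
    Candidate("B", batch_size=2, cached_prefix_tokens=0),
    Candidate("C", batch_size=4, cached_prefix_tokens=500),
]
chosen = route(pool, prompt_tokens=1000)
```

Because the score is a product, multiplying either indicator by a positive constant rescales every candidate's score by the same factor and leaves the ranking unchanged, which is one way to read the abstract's remark that the hyperparameters cancel out during comparison.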
We also derive the mathematical conditions under which multiplication may fail, and find that such conditions are extremely rare in practice and can be detected (and mitigated) beforehand. LMETRIC has been deployed in production, and a canary release confirms its effectiveness. 2026-03-16T12:43:32Z To appear in the Proceedings of 20th USENIX Symposium on Operating Systems Design and Implementation (OSDI'26) Dingyan Zhang Jinbo Han Kaixi Zhang Xingda Wei Sijie Shen Chenguang Fang Wenyuan Yu Jingren Zhou Rong Chen http://arxiv.org/abs/2505.07833v2 Harmonia: End-to-End RAG Serving Optimization 2026-06-04T21:46:57Z Retrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG pipelines efficiently is challenging because requests traverse heterogeneous components spanning LLM inference, databases, and CPU-side processing. We present Harmonia, an end-to-end RAG serving framework that addresses these bottlenecks through (i) a flexible pipeline specification interface for composing custom workflows, (ii) heterogeneity-aware deployment that provisions and configures components as a distributed inference system, and (iii) a closed-loop runtime controller that monitors load and execution progress and reduces SLO violations through request prioritization and auto-scaling. Across four RAG applications, Harmonia outperforms commercial alternatives, improving throughput by more than 2.04x while reducing SLO violations by up to 78.4 percent. 2025-05-01T18:58:26Z Saurabh Agarwal Bodun Hu Luis Pabon Myungjin Lee Jayanth Srinivasa Aditya Akella http://arxiv.org/abs/2606.06697v1 AgileOS: A GPU Operating System Layer for Protected CUDA Services 2026-06-04T20:34:56Z Modern GPU applications increasingly interact with storage systems, network devices, vendor libraries, and GPU-resident services rather than executing only isolated compute kernels.
This shift creates a need for operating-system-like protection around GPU services, where service metadata, device queues, memory-mapped I/O regions, and library-internal state should not be directly exposed to untrusted application kernels. However, today's CUDA programming model, by default, still gives each application direct ownership of its CUDA context, device pointers, runtime handles, module loading path, and kernel launches, leaving protected GPU services to build their own ad hoc interfaces and isolation mechanisms. This paper presents the initial design and prototype scope of AgileOS, a GPU operating-system layer for protected CUDA services. AgileOS virtualizes CUDA at the library boundary: applications link against client-side CUDA Runtime, Driver, and selected library shims, while a trusted runtime worker owns the real CUDA context and mediates supported operations. To protect service state and module interfaces, AgileOS also defines a GPU memory-management model that separates user allocations from protected module/MMIO ranges, using pointer validation and memory access guards via PTX injection. AgileOS is modularized and flexible, supporting a range of protected services and existing libraries such as cuFFT and PyTorch. The prototype includes client-side interceptors, worker-side CUDA handlers, virtualized CUDA object tables, protected AgileOS modules, a GPU memory manager that separates user allocations from protected module/MMIO ranges, selected trusted library adapters, and the PTX-level kernel memory guard. 2026-06-04T20:34:56Z Zhuoping Yang Yiyu Shi Alex Jones Peipei Zhou http://arxiv.org/abs/2606.06438v1 CarbonSim: A Lifecycle-Aware Framework for Evaluating Carbon Tradeoffs in Hardware Upgrade Decisions 2026-06-04T17:40:13Z As the demand for information and communication technologies (ICT) continues to rise, the environmental impact of computing systems is becoming an increasingly critical concern. 
Although newer hardware often improves performance and energy efficiency, these gains do not always offset the carbon cost of premature replacement, particularly under low-utilization workloads or low-carbon electricity grids. We present CarbonSim, a lifecycle-aware simulation framework for evaluating carbon tradeoffs in hardware upgrade decisions. CarbonSim combines workload execution profiles, machine-level power characteristics, embodied carbon inventories, scheduling policies, and time-varying grid carbon intensity to estimate total emissions under alternative deployment scenarios. The framework supports multiple embodied-carbon accounting strategies, including uniform amortization and front-loaded lifecycle attribution, enabling analysis under different hardware lifespan assumptions. Using heterogeneous CPU generations as calibration platforms, we demonstrate that newer machines do not always minimize total emissions: under lightly loaded workloads or cleaner electricity mixes, extending the useful life of existing hardware can reduce lifecycle carbon despite lower operational efficiency. These results highlight that hardware refresh decisions should be workload-aware, location-aware, and lifecycle-aware. 2026-06-04T17:40:13Z Kartik Hans Kaiwen Zhao Stephen Lee http://arxiv.org/abs/2606.04908v1 GNStor: Design of GPU-Native High-Performance Remote All-Flash Array 2026-06-03T14:06:04Z GPU has become the leading computing device for a wide range of data-intensive applications, which tightly collaborates with remote all-flash array (AFA) to accommodate ever-expanding datasets, facilitate multi-client data sharing, and guarantee fault tolerance. Although GPU is the center of computation, all I/O processes in existing GPU-AFA systems are still CPU-centric. CPU orchestrates remote I/O requests and executes a centralized AFA engine to take charge of AFA-level functionalities (e.g., access control and metadata persistence). 
This design disparity suffers from substantial CPU-GPU interaction overhead and I/O traffic amplification, compromising end-to-end I/O performance. In this work, we present GNStor, a GPU-native AFA system that enables GPU to directly access remote AFA without CPU intervention in the I/O path, thereby fully exploiting the performance of AFA. Specifically, GNStor first proposes a GPU-centric NVMe over RDMA (NoR) software stack (named GNoR), paving a fast path for GPUs to directly initiate NoR I/O requests to SSDs within remote AFA. GNoR employs an atomic-operation-based I/O orchestration design and follows the single-instruction-multiple-thread (SIMT) execution model of GPU, fully exploiting the massive parallelism of GPU architectures. To facilitate essential AFA functionalities in a CPU-bypass I/O path, GNStor further designs deEngine, a decentralized AFA engine that seamlessly decomposes and integrates AFA-level tasks into each SSD firmware, thereby achieving efficient AFA access at low cost. Evaluation results show that GNStor achieves 3.2x higher I/O throughput and reduces application execution time by 31.1%, compared to state-of-the-art AFA systems. 2026-06-03T14:06:04Z Shushu Yi Wenbo Wu Guoci Chen Junrong Zhu Shengwen Liang Mao Bo Chenying Huan Chen Tian Jie Zhang http://arxiv.org/abs/2606.03895v1 Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents 2026-06-02T16:53:24Z Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resumed and audited. This paper presents Agent libOS, a library-OS-inspired runtime substrate for LLM agents.
Agent libOS runs above a conventional host operating system; it does not implement hardware drivers, kernel-mode isolation, or a POSIX-compatible operating system. Instead, it treats an agent as an AgentProcess: a schedulable execution subject with process identity, parent-child lineage, lifecycle state, a tool table derived from an AgentImage, typed Object Memory, explicit capabilities, human queues, checkpoints, events, and audit records. Its central design rule is tools are libc-like wrappers; runtime primitives are the authority boundary. Filesystem access, object access, sleeps, human approval, JIT tool registration, and external side effects are checked at primitive boundaries under explicit capabilities and policy. We describe the design, threat model, Python prototype, and safety-oriented evaluation. The current prototype implements async scheduling, namespace-local Object Memory, runtime-integrated human approval, one-shot permission grants, per-process working directories, shell and image-registration primitives, Deno/TypeScript JIT tools over a libOS syscall broker, filesystem/object bridge tools, an injectable Resource Provider Substrate, deterministic demos, real-model smoke scripts, and 123 regression tests at the time of writing. Rather than improving planner accuracy, Agent libOS demonstrates a runtime substrate in which long-running LLM agents can be scheduled, authorized, resumed, and audited without treating tool dispatch as the trust boundary. 2026-06-02T16:53:24Z 14 pages, 1 figure, 2 tables Yingqi Zhang http://arxiv.org/abs/2606.00942v2 Characterizing Metastable Faults and Failures 2026-06-02T16:15:24Z Metastable failures are hard to detect, prevent, and mitigate. During a metastable failure, a system exhibits self-sustaining bad behavior even in the absence of adversarial conditions. Prior work focuses on symptoms and has portrayed metastable failures as instances of self-sustaining overload. 
This characterization leaves the underlying failure causes and dynamics unknown, and does not account for metastable failures that do not manifest as overload. We present the first causal characterization of metastable failures by identifying their origin in metastable faults, i.e., structural destabilizing cycles of interaction among system components that, in isolation, are stabilizing. Metastable failures arise when scheduling decisions let these destabilizing interactions gain the upper hand over the individual components' stabilizing tendencies. We then derive a methodology to predict metastable failures, and to build metastable-fault-tolerant (MFT) systems. We apply our methodology to three case studies, showcasing the generality of our results. 2026-05-31T01:04:27Z 19 pages, 5 figures, submitted to SOSP 2026 Ali Farahbakhsh Qingjie Lu Lorenzo Alvisi Andreas Haeberlen Robbert Van Renesse http://arxiv.org/abs/2604.19275v2 Scheduling Analysis of UAV Flight Control Workloads on PREEMPT_RT Linux Using a Raspberry Pi 5 2026-06-02T06:53:47Z Modern UAV architectures increasingly aim to unify high-level autonomy and low-level flight control on a single General-Purpose Operating System (GPOS). However, complex multi-core System-on-Chips (SoCs) introduce significant timing indeterminism due to shared resource contention. This paper performs an architectural analysis of the PREEMPT_RT Linux kernel on a Raspberry Pi 5, specifically isolating the impact of kernel activation paths (deferred execution SoftIRQs versus real-time direct activation) on a 250 Hz control loop. Results show that under heavy stress, the standard kernel is unsuitable, exhibiting worst-case latencies exceeding 9 ms. In contrast, PREEMPT_RT reduced the worst-case latency by nearly 88 percent to under 225 microseconds, enforcing a direct wake-up path that mitigates OS noise.
These findings demonstrate that while PREEMPT_RT resolves scheduling variance, the residual jitter on modern SoCs is primarily driven by hardware memory contention. 2026-04-21T09:46:41Z 9 pages, 8 figures, conference Luiz Giacomossi Håkan Forsberg Ivan Tomasic Baran Çürüklü Tommaso Cucinotta http://arxiv.org/abs/2508.12551v2 TuneAgent: Agentic Operating System Kernel Tuning with Reinforcement Learning 2026-05-31T06:10:12Z Linux kernel tuning is essential for optimizing operating system (OS) performance, yet remains challenging due to the complex kernel space, sparse performance feedback, and strong workload sensitivity. We present TuneAgent, an agentic Linux kernel tuning framework powered by rule-based reinforcement learning (RL). TuneAgent formulates the kernel space as a constrained RL environment, enabling large language models (LLMs) to autonomously explore the kernel while enforcing valid and precise configuration modifications. To address sparse performance feedback, we design structured reward functions that jointly promote reasoning standardization, configuration correctness, and performance awareness. Furthermore, we propose a two-phase training strategy that first ensures format and semantic correctness and then transitions to performance-driven exploration, accelerating convergence and reducing overhead. Experimental results show that TuneAgent consistently outperforms existing baselines, achieving up to 5.6% relative overall performance improvement while maintaining high configuration validity. We further demonstrate its robustness across multiple real-world applications, highlighting its practicality and adaptability in diverse deployment environments.
2025-08-18T01:09:57Z Hongyu Lin Yuchen Li Haoran Luo Zhenghong Lin Libo Zhang Mingjie Xing Yanjun Wu http://arxiv.org/abs/2606.00866v1 Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI 2026-05-30T19:44:25Z Modern LLM serving systems increasingly host agentic workloads, whose sessions issue tens of model invocations interleaved with tool calls, accumulating KV cache that can be reused across steps. As requests' total KV cache size easily exceeds GPU HBM capacity, researchers offload them to CPU DRAM. However, tool-call durations span orders of magnitude, and the cost of transferring KV cache between tiers makes it impractical to re-place entries on every call. We observe that agentic programs exhibit a two-phase structure: busy phases of rapid short tool calls and idle phases dominated by long-running calls. Current eviction policies such as LRU fail to capture this property. A binary busy/idle label also falls short because the ratio of busy to idle programs may not match the hardware's GPU-to-CPU capacity ratio. When it does not, one tier sits underutilized while the other is oversubscribed, wasting memory or forcing unnecessary evictions. We present MORI, an agent serving system that solves the above problem. Our key insight is that idleness is a continuous, relative spectrum. MORI ranks all active programs by idleness, assigns the busiest to GPU HBM and the most idle to CPU DRAM, dynamically shifts the partition boundary to match hardware capacity, and enforces admission control at each memory tier. Evaluated on real coding agent workloads collected from Claude Code across four GPU and model pairs, MORI delivers 20--71% higher throughput and 18--43% lower TTFT than the best baseline with offloading. 2026-05-30T19:44:25Z Tian Xia Hanchen Li Zhifei Li Xiaokun Chen Hao Kang Yifan Qiao Yi Xu Ion Stoica
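The rank-then-partition placement described in the MORI abstract can be sketched as follows. This is an illustrative model only, not MORI's algorithm: the abstract does not define how idleness is measured, so the `idleness` score, the `Program` record, the byte-based capacities, and the rule that the boundary stays fixed once the GPU tier first fills are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Program:
    name: str
    idleness: float  # hypothetical score: higher means more idle
    kv_bytes: int    # size of the program's KV cache


def place(programs: List[Program], gpu_capacity: int,
          cpu_capacity: int) -> Dict[str, List[str]]:
    """Assign the busiest programs to GPU HBM and more idle ones to CPU DRAM."""
    placement: Dict[str, List[str]] = {"gpu": [], "cpu": [], "rejected": []}
    gpu_free, cpu_free = gpu_capacity, cpu_capacity
    gpu_open = True
    # Busiest first, so the GPU tier fills with the least idle programs.
    for p in sorted(programs, key=lambda p: p.idleness):
        if gpu_open and p.kv_bytes <= gpu_free:
            placement["gpu"].append(p.name)
            gpu_free -= p.kv_bytes
            continue
        # The partition boundary sits where the GPU tier first fills up;
        # every program ranked after it is considered for CPU DRAM only.
        gpu_open = False
        if p.kv_bytes <= cpu_free:
            placement["cpu"].append(p.name)
            cpu_free -= p.kv_bytes
        else:
            # Admission control: neither tier has room for this program.
            placement["rejected"].append(p.name)
    return placement


result = place(
    [
        Program("a", idleness=0.10, kv_bytes=40),
        Program("b", idleness=0.50, kv_bytes=40),
        Program("c", idleness=0.90, kv_bytes=40),
        Program("d", idleness=0.95, kv_bytes=100),
    ],
    gpu_capacity=80,
    cpu_capacity=60,
)
```

Because the boundary is determined by capacity rather than by a fixed busy/idle threshold, changing `gpu_capacity` or `cpu_capacity` moves the split point along the same ranking.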