https://arxiv.org/api/RaHtGr59ZkPAzIFkpkcEKV/CWbg 2026-06-21T12:38:18Z 1379 120 15 http://arxiv.org/abs/2510.04607v2 From Imperative to Declarative: Towards LLM-friendly OS Interfaces for Boosted Computer-Use Agents 2026-03-25T11:29:08Z

Computer-use agents (CUAs) powered by large language models (LLMs) have emerged as a promising approach to automating computer tasks, yet they struggle with the existing human-oriented OS interfaces - graphical user interfaces (GUIs). GUIs force LLMs to decompose high-level goals into lengthy, error-prone sequences of fine-grained actions, resulting in low success rates and an excessive number of LLM calls. We propose Declarative Model Interface (DMI), an abstraction that transforms existing GUIs into three declarative primitives: access, state, and observation, thereby providing novel OS interfaces tailored for LLM agents. Our key idea is policy-mechanism separation: LLMs focus on high-level semantic planning (policy) while DMI handles low-level navigation and interaction (mechanism). DMI does not require modifying the application source code or relying on application programming interfaces (APIs). We evaluate DMI with Microsoft Office Suite (Word, PowerPoint, Excel) on Windows. Integrating DMI into a leading GUI-based agent baseline improves task success rates by 67% and reduces interaction steps by 43.5%. Notably, DMI completes over 61% of successful tasks with a single LLM call.

2025-10-06T09:14:58Z Yuan Wang Mingyu Li Haibo Chen 10.1145/3767295.3803576 http://arxiv.org/abs/2603.28795v1 StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving 2026-03-24T17:19:26Z

We address LLM serving workloads where repeated requests share a common solution structure but differ in localized constraints, such as output schema, variable names, or numeric constants. Prior caching approaches typically reuse either full responses (semantic caching) or model-internal KV/prefix states, which are respectively brittle under partial changes or tightly coupled to specific backends. We present StepCache, a backend-agnostic step-level reuse layer that segments outputs into ordered steps, retrieves the best-matching cached request, verifies steps using lightweight task-aware checks, and regenerates only failing regions via selective patching. StepCache additionally supports strict structured-output enforcement for JSON, including single-step extraction, required-key constraints, and one-shot repair, as well as conservative skip-reuse fallbacks for semantic changes. For linear equations, StepCache promotes verification into correction via a bounded repair loop with a deterministic fallback that guarantees correctness when the backend model fails. In a CPU-only perturbation-heavy micro-benchmark on math and JSON variants, averaged over three seeds, StepCache reduces mean latency from 2.13 s to 0.67 s, median latency from 2.42 s to 0.01 s, and p95 latency from 3.38 s to 3.30 s. It also reduces total token usage from 36.1k to 27.3k and improves end-to-end correctness from 72.5% to 100% under task-specific checks and a stitched-output integrity check. Across requests, 79.7% take the reuse-only fast path, 5.4% require patching, and 14.9% trigger skip-reuse.

2026-03-24T17:19:26Z 9 pages, 1 figure Azam Nouri http://arxiv.org/abs/2603.23425v1 Wayfinder: Automated Operating System Specialization 2026-03-24T16:55:41Z

Specializing an OS to optimize the performance of a particular application is typically a manual process that requires great expertise. Specialization through configuration lends itself well to automation; however, it is challenging due to the sheer size of the configuration space of modern OSes, the difficulty to quantify that space, the long time it takes to evaluate a configuration, and the large number of invalid configurations. Hence, existing attempts at specializing OSes automatically are limited to switching features on and off to minimize memory consumption or attack surface, and cannot target metrics such as performance. We present Wayfinder, a framework specializing the configuration of OSes completely automatically and without expert knowledge. It can specialize all aspects of an OS configuration (compile-/boot-/run-time) towards any quantifiable performance, resource consumption, or security metric, for an application processing a given workload on a given hardware setup. Wayfinder consists of an automated OS benchmarking platform, and a neural network-based search algorithm driving the specialization process. This is achieved by learning on the fly which configuration parameters and values impact performance the most, and which ones lead to runtime failures. Optionally, a model pre-trained on one application can be reused to accelerate the specialization of related applications. We evaluate Wayfinder on two OSes, four applications, and two target metrics: Wayfinder fully automatically identifies specialized configurations with up to 24% application performance improvement and 8.5% memory usage reduction compared to default configurations. We highlight the benefits of our neural network, reaching good solutions faster than competing approaches (random and Bayesian), and successfully transferring knowledge between related applications.

2026-03-24T16:55:41Z Accepted to appear in EuroSys'26 Alexander Jung Cezar Crăciunoiu Nikolaos Karaolidis Hugo Lefeuvre Daniel Oñoro Rubio Felipe Huici Charalampos Rotsos Pierre Olivier 10.1145/3767295.3803589 http://arxiv.org/abs/2603.22585v1 Tock: From Research to Securing 10 Million Computers 2026-03-23T21:24:19Z

Tock began 10 years ago as a research operating system developed by academics to help other academics build urban sensing applications. By leveraging a new language (Rust) and new hardware protection mechanisms, Tock enabled Multiprogramming a 64 kB Computer Safely and Efficiently. Today, it is an open source project with a vibrant community of users and contributors. It is deployed on root of trust hardware in data center servers and on millions of laptops; it is used to develop automotive and space products, wearable electronics, and hardware security tokens--all while remaining a platform for operating systems research. This paper focuses on the impact of Tock's technical design on its adoption, the challenges and unexpected benefits of using a type safe language (Rust)--particularly in security sensitive settings--and the experience of supporting a production open4source operating system from academia.

2026-03-23T21:24:19Z In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP '25) Leon Schuermann Brad Campbell Branden Ghena Philip Levis Amit Levy Pat Pannuto 10.1145/3731569.3764828 http://arxiv.org/abs/2509.09525v2 TrEnv-X: Transparently Share Serverless Execution Environments Across Different Functions and Nodes 2026-03-21T12:02:04Z

Serverless computing is renowned for its computation elasticity, yet its full potential is often constrained by the requirement for functions to operate within local and dedicated background environments, resulting in limited memory elasticity. To address this limitation, this paper introduces TrEnv-X, a co-designed integration of the serverless platform with the operating system and CXL/RDMA-based remote memory pools. TrEnv-X's core innovations are repurposable sandboxes, which can be shared across different functions to decrease the associated creation overhead, and OS-level memory templates, which enable rapid state restoration from CXL/RDMA-based remote memory pools. To further demonstrate TrEnv-X's versatility, we generalize its design from traditional containers for microVM-based agent workloads and introduce new optimizations, including browser sharing and a page cache bypassing mechanism. Our evaluation shows that TrEnv-X achieves up to 7x reduction in P99 latency and 48% memory savings for container-based functions. When applied to LLM agents, it reduces the P99 latency by up to 58% and memory usage by 61% compared to state-of-the-art systems like E2B.

2025-09-11T15:06:03Z Accepted by ACM Transactions on Computer Systems (TOCS) Jialiang Huang Teng Ma Zheng Liu Sixing Lin Kang Chen Jinlei Jiang Xia Liao Yingdi Shan Yongwei Wu Ning Zhang Mengting Lu Tao Ma Haifeng Gong Mingxing Zhang http://arxiv.org/abs/2603.19971v1 2DIO: A Cache-Accurate Storage Microbenchmark 2026-03-20T14:13:54Z

We introduce 2DIO, a microbenchmark creating cache-accurate, stressful I/O traces. While existing tools are limited to generating traces with well-behaved, concave hit ratio curves, 2DIO produces ones with tunable complex cache behaviors, particularly performance cliffs and plateaus. Our framework encodes a workload as a compact parameter triplet, capturing both short-term recency and long-term frequency. This parsimonious parameterization allows researchers to easily translate individual adjustments into predictable cache effects across various eviction policies, and enables the parameter space to be "swept" for exhaustive exploration of desired cache behavior, or to mimic real traces by calibrating parameters to match observed behaviors. The tuned parameters are portable, meaning if the scale of the system under evaluation changes, so too will the footprint and length of the trace, while the relative cache behaviors are preserved. Evaluations demonstrate 2DIO's ability to generate traces across a continuum of "what-if" cache behaviors and to reproduce real-world ones with high accuracy.

2026-03-20T14:13:54Z To appear in EuroSys'26 Yirong Wang Isaac Khor Peter Desnoyers 10.1145/3767295.3769391 http://arxiv.org/abs/2603.26722v1 Brain-inspired AI for Edge Intelligence: a systematic review 2026-03-19T08:13:49Z

While Spiking Neural Networks (SNNs) promise to circumvent the severe Size, Weight, and Power (SWaP) constraints of edge intelligence, the field currently faces a "Deployment Paradox" where theoretical energy gains are frequently negated by the inefficiencies of mapping asynchronous, event-driven dynamics onto traditional von Neumann substrates. Transcending the reductionism of algorithm-only reviews, this survey adopts a rigorous system-level hardware-software co-design perspective to examine the 2020-2025 trajectory, specifically targeting the "last mile" technologies - from quantization methodologies to hybrid architectures - that translate biological plausibility into silicon reality. We critically dissect the interplay between training complexity (the dichotomy of direct learning vs. conversion), the "memory wall" bottlenecking stateful neuronal updates, and the critical software gap in neuromorphic compilation toolchains. Finally, we envision a roadmap to reconcile the fundamental "Sync-Async Mismatch," proposing the development of a standardized Neuromorphic OS as the foundational layer for realizing a ubiquitous, energy-autonomous Green Cognitive Substrate.

2026-03-19T08:13:49Z Yingchao Cheng Meijia Wang Zhifeng Hao Rajkumar Buyya http://arxiv.org/abs/2602.08199v2 Fork, Explore, Commit: OS Primitives for Agentic Exploration 2026-03-19T04:38:30Z

AI agents increasingly perform agentic exploration: pursuing multiple solution paths in parallel and committing only the successful one. Because each exploration path may modify files and spawn processes, agents require isolated environments with atomic commit and rollback semantics for both filesystem state and process state. We introduce the branch context, a new OS abstraction that provides: (1) copy-on-write state isolation with independent filesystem views and process groups, (2) a structured lifecycle of fork, explore, and commit/abort, (3) first-commit-wins resolution that automatically invalidates sibling branches, and (4) nestable contexts for hierarchical exploration. We realize branch contexts in Linux through two complementary components. First, BranchFS is a FUSE-based filesystem that gives each branch context an isolated copy-on-write workspace, with O(1) creation, atomic commit to the parent, and automatic sibling invalidation, all without root privileges. BranchFS is open sourced in https://github.com/multikernel/branchfs, along with a Python integration library, BranchContext, that provides ready-to-use agent exploration patterns. Second, branch() is a proposed Linux syscall that spawns processes into branch contexts with reliable termination, kernel-enforced sibling isolation, and first-commit-wins coordination. Preliminary evaluation of BranchFS shows sub-350 us branch creation independent of base filesystem size, and modification-proportional commit overhead (under 1 ms for small changes).

2026-02-09T01:46:52Z Cong Wang Yusheng Zheng http://arxiv.org/abs/2603.17259v1 AppFlow: Memory Scheduling for Cold Launch of Large Apps on Mobile and Vehicle Systems 2026-03-18T01:35:25Z

GB-scale large apps like on-device LLMs and rich media editors are becoming the next-generation trend, but their heavy memory and I/O demands, especially during multitasking, cause devices to reclaim or kill processes, turning warm apps into cold launches. The challenge lies not in storing them, but in fast, accurate launching. For users, 1s is the usability cliff, yet our measurements show 86.6\% of GB-scale cold launches exceed it. Also, Android Vitals flags only $\geq$ 5s as slow, exposing a large satisfaction gap. Existing optimizations are designed in isolation and conflict. For example, preloading reduces I/O stalls but consumes scarce memory and is undone by reclamation, while reclamation and killing free memory but sacrifice background survivability, leading to repeated cold relaunches. Our key insight is that, although multitasking makes runtime behavior complex, each app's file access pattern remains predictable. The challenge lies in exploiting this predictability, i.e., preloading without exhausting memory, reclaiming without undoing gains, and killing selectively to preserve background survivability. We introduce AppFlow, a prediction-based system-wide scheduler that integrates a Selective File Preloader, an Adaptive Memory Reclaimer, and a Context-Aware Process Killer. Implemented across the Android framework and Linux kernel without app changes, AppFlow cuts GB-scale cold-launch latency by 66.5\% (e.g., 2s$\rightarrow$690ms) and sustains 95\% of launches within 1s over a 100-day test, significantly improving responsiveness and multitasking experience.

2026-03-18T01:35:25Z 13 page, 21 figures, Mobicom 2026 Xiaochen Li Sicong Liu Bin Guo Yu Ouyang Fengmin Wu Yuan Xu Zhiwen Yu 10.1145/3795866.3796690 http://arxiv.org/abs/2603.14357v1 Idiosyncrasies of Programmable Caching Engines 2026-03-15T12:47:06Z

Programmable caching engines like CacheLib are widely used in production systems to support diverse workloads in multi-tenant environments. CacheLib's design focuses on performance, portability, and configurability, allowing applications to inherit caching improvements with minimal implementation effort. However, its behavior under dynamic and evolving workloads remains largely unexplored. This paper presents an empirical study of CacheLib with multi-tenant settings under dynamic and volatile environments. Our evaluation across multiple CacheLib configurations reveals several limitations that hinder its effectiveness under such environments, including rigid configurations, limited runtime adaptability, lack of quality-of-service support and coordination, which lead to suboptimal performance, inefficient memory usage, and tenant starvation. Based on these findings, we outline future research directions to improve the adaptability, fairness, and programmability of future caching engines.

2026-03-15T12:47:06Z Paper accepted at the Workshop on Reliable Large-scale Data Management (co-located with IEEE SRDS 2025). Preliminary version of the paper "Holpaca: Holistic and Adaptable Cache Management for Shared Environments", accepted at 17th ACM/SPEC International Conference on Performance Engineering (ICPE 2026) José Peixoto Alexis Gonzalez Janki Bhimani Raju Rangaswami Cláudia Brito João Paulo Ricardo Macedo http://arxiv.org/abs/2509.21550v2 A Target-Agnostic Protocol-Independent Interface for the Transport Layer 2026-03-14T22:13:13Z

Transport protocols continue to evolve to meet the demands of new applications, workloads, and network environments, yet implementing and evolving transport protocols remains difficult and costly. High-performance transport stacks tightly interweave protocol behavior with system-level mechanisms such as packet I/O, memory management, and concurrency control, resulting in large code bases where protocol logic is scattered and hard to modify -- an issue exacerbated by modern heterogeneous execution environments. This paper introduces transport programs, a target-independent abstraction that precisely and centrally captures a transport protocol's reactions to relevant transport events using abstract instructions for key transport operations such as data reassembly, packet generation and scheduling, and timer manipulation, while leaving execution strategy and low-level mechanisms to the target. We show that transport programs can express a diverse set of transport protocols, be efficiently realized on targets built over DPDK and Linux XDP, achieve performance comparable to hand-optimized implementations, and enable protocol changes and portability across targets without modifying underlying infrastructure.

2025-09-25T20:34:52Z Pedro Mizuno Kimiya Mohammadtaheri Linfan Qian Joshua Johnson Danny Akbarzadeh Chris Neely Mario Baldi Nachiket Kapre Mina Tahmasbi Arashloo http://arxiv.org/abs/2603.13110v1 AgentRM: An OS-Inspired Resource Manager for LLM Agent Systems 2026-03-13T16:07:20Z

Large Language Model (LLM) agent systems have experienced rapid adoption across diverse domains, yet they suffer from critical user experience problems that limit their practical deployment. Through an empirical analysis of over 40,000 GitHub issues from six major agent frameworks (OpenClaw, AutoGen, CrewAI, LangGraph, Codex, Claude Code), we identify two fundamental resource management challenges: (1) scheduling failures leading to system unresponsiveness due to blocking, zombie processes, and rate limit cascades, and (2) context degradation causing agent "amnesia" from unbounded memory growth and poor retention policies. Drawing inspiration from decades of operating systems research, we present AgentRM, a middleware resource manager that treats agent resources analogously to OS resources. AgentRM employs a Multi-Level Feedback Queue (MLFQ) scheduler with zombie reaping and rate-limit-aware admission control, coupled with a three-tier Context Lifecycle Manager that implements adaptive compaction and hibernation mechanisms. Our evaluation demonstrates significant improvements: AgentRM-MLFQ reduces P95 latency by 86%, decreases lane waste by 96%, and increases throughput by 168% while eliminating zombie agents (0 vs. 29 baseline). AgentRM-CLM achieves 100% key information retention with 95% quality score compared to 65.1% retention and 87% quality for existing approaches, albeit with higher compaction costs (34,330 vs. 17,212 tokens).

2026-03-13T16:07:20Z Jianshu She http://arxiv.org/abs/2602.13692v2 ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System 2026-03-10T20:57:47Z

Large language models(LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV cache and tool execution environments. To address the challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Built upon this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5-3.6x throughput improvements in serving, 1.8-3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the system implementations of the whole ThunderAgent at: https://github.com/Agentic-Kinetics/ThunderAgent.

2026-02-14T09:26:41Z Hao Kang Ziyang Li Xinyu Yang Weili Xu Yinfang Chen Junxiong Wang Beidi Chen Tushar Krishna Chenfeng Xu Simran Arora http://arxiv.org/abs/2603.09023v1 The Missing Memory Hierarchy: Demand Paging for LLM Context Windows 2026-03-09T23:38:32Z

The context window of a large language model is not memory. It is L1 cache: a small, fast, expensive resource that the field treats as the entire memory system. There is no L2, no virtual memory, no paging. Every tool definition, every system prompt, and every stale tool result occupies context for the lifetime of the session. The result is measurable: across 857 production sessions and 4.45 million effective input tokens, 21.8% is structural waste. We present Pichay, a demand paging system for LLM context windows. Implemented as a transparent proxy between client and inference API, Pichay interposes on the message stream to evict stale content, detect page faults when the model re-requests evicted material, and pin working-set pages identified by fault history. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%. In live production deployment over 681turns, the system reduces context consumption by up to 93% (5,038KB to 339KB); under extreme sustained pressure, the system remains operational but exhibits the expected thrashing pathology, with repeated fault-in of evicted content. The key observation is that the problems the field faces, such as context limits, attention degradation, cost scaling, lost state across sessions, are virtual memory problems wearing different clothes. The solutions exist: working set theory (Denning, 1968), demand paging, fault-driven replacement policies, and memory hierarchies with multiple eviction-managed levels. We describe the architecture of a full memory hierarchy for LLM systems (L1 through persistent storage), report on the first three levels deployed in production use (L1 eviction, L2 fault-driven pinning, L3 model-initiated conversation compaction), and identify cross-session memory as the remaining frontier.

2026-03-09T23:38:32Z Tony Mason http://arxiv.org/abs/2603.08400v1 Trust Nothing: RTOS Security without Run-Time Software TCB (Extended Version) 2026-03-09T13:59:27Z

Embedded devices face an ever-expanding threat landscape: vulnerabilities in application software, operating system kernels, and peripherals threaten the embedded device integrity. Existing computer-architectural defenses fully consider at most two of these threat vectors in their security model. This paper aims at addressing this gap using a novel capability architecture. To this end, we combine a token capability approach suitable for building an untrusted operating system with protection against malicious devices without requiring hardware changes to peripherals. First, we develop and evaluate a full FPGA implementation of our capability architecture around legacy hardware components. Further, we present a soft real-time operating system based on Zephyr that has no run-time software TCB. To this end, we disaggregate Zephyr's subsystems into small, mutually isolated components. All subsystems that exist at run time, including scheduler, allocator and DMA drivers, and all peripherals are fully untrusted. We believe that our work offers a foundation for more rigorous security-by-design in tomorrow's security-critical embedded devices.

2026-03-09T13:59:27Z Eric Ackermann Sven Bugiel