http://arxiv.org/abs/2510.04607v2
From Imperative to Declarative: Towards LLM-friendly OS Interfaces for Boosted Computer-Use Agents
2026-03-25T11:29:08Z
Computer-use agents (CUAs) powered by large language models (LLMs) have emerged as a promising approach to automating computer tasks, yet they struggle with the existing human-oriented OS interfaces - graphical user interfaces (GUIs). GUIs force LLMs to decompose high-level goals into lengthy, error-prone sequences of fine-grained actions, resulting in low success rates and an excessive number of LLM calls.
We propose Declarative Model Interface (DMI), an abstraction that transforms existing GUIs into three declarative primitives: access, state, and observation, thereby providing novel OS interfaces tailored for LLM agents. Our key idea is policy-mechanism separation: LLMs focus on high-level semantic planning (policy) while DMI handles low-level navigation and interaction (mechanism). DMI does not require modifying the application source code or relying on application programming interfaces (APIs).
We evaluate DMI with Microsoft Office Suite (Word, PowerPoint, Excel) on Windows. Integrating DMI into a leading GUI-based agent baseline improves task success rates by 67% and reduces interaction steps by 43.5%. Notably, DMI completes over 61% of successful tasks with a single LLM call.
2025-10-06T09:14:58Z
Yuan Wang, Mingyu Li, Haibo Chen
10.1145/3767295.3803576

http://arxiv.org/abs/2603.28795v1
StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
2026-03-24T17:19:26Z
We address LLM serving workloads where repeated requests share a common solution structure but differ in localized constraints, such as output schema, variable names, or numeric constants. Prior caching approaches typically reuse either full responses (semantic caching) or model-internal KV/prefix states, which are respectively brittle under partial changes or tightly coupled to specific backends. We present StepCache, a backend-agnostic step-level reuse layer that segments outputs into ordered steps, retrieves the best-matching cached request, verifies steps using lightweight task-aware checks, and regenerates only failing regions via selective patching. StepCache additionally supports strict structured-output enforcement for JSON, including single-step extraction, required-key constraints, and one-shot repair, as well as conservative skip-reuse fallbacks for semantic changes. For linear equations, StepCache promotes verification into correction via a bounded repair loop with a deterministic fallback that guarantees correctness when the backend model fails.
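The retrieve, verify, and patch flow described in this abstract can be illustrated with a small sketch. This is not the StepCache implementation: the class `StepReuseCache`, the string-similarity retrieval via `difflib`, the 0.6 threshold, and the toy addition task with its `check_sum_step`, `patch_sum_step`, and `generate_sum` helpers are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional, Tuple


class StepReuseCache:
    """Toy step-level reuse layer (hypothetical; not the StepCache code)."""

    def __init__(self, min_similarity: float = 0.6) -> None:
        self.entries: Dict[str, List[str]] = {}
        self.min_similarity = min_similarity  # assumed retrieval threshold

    def put(self, request: str, steps: List[str]) -> None:
        self.entries[request] = list(steps)

    def best_match(self, request: str) -> Optional[List[str]]:
        # Retrieve the steps of the most similar cached request, if close enough.
        best_steps, best_score = None, 0.0
        for cached_request, steps in self.entries.items():
            score = SequenceMatcher(None, request, cached_request).ratio()
            if score > best_score:
                best_steps, best_score = steps, score
        return best_steps if best_score >= self.min_similarity else None

    def answer(
        self,
        request: str,
        check: Callable[[str, str], bool],   # (request, step) -> still valid?
        patch: Callable[[str, int], str],    # (request, step index) -> new step
        generate: Callable[[str], List[str]],  # full generation fallback
    ) -> Tuple[List[str], str]:
        cached = self.best_match(request)
        if cached is None:
            # No usable match: generate from scratch and remember the result.
            steps = generate(request)
            self.put(request, steps)
            return steps, "skip-reuse"
        steps, patched = [], False
        for index, step in enumerate(cached):
            if check(request, step):
                steps.append(step)  # verified step is reused as-is
            else:
                steps.append(patch(request, index))  # regenerate only this step
                patched = True
        return steps, ("patch" if patched else "reuse-only")


# Toy task: "add A and B", with a final step that must state the right sum.
def operands(request: str) -> List[int]:
    return [int(n) for n in re.findall(r"\d+", request)]


def check_sum_step(request: str, step: str) -> bool:
    if not step.startswith("result ="):
        return True  # structural steps do not depend on the constants
    return int(step.split("=")[1]) == sum(operands(request))


def patch_sum_step(request: str, index: int) -> str:
    return f"result = {sum(operands(request))}"


def generate_sum(request: str) -> List[str]:
    return ["parse operands", "compute sum", patch_sum_step(request, 2)]
```

In this sketch a repeated request takes the reuse-only path, a request that differs only in a constant has its final step patched, and a request with no sufficiently similar cache entry falls back to full generation.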
In a CPU-only perturbation-heavy micro-benchmark on math and JSON variants, averaged over three seeds, StepCache reduces mean latency from 2.13 s to 0.67 s, median latency from 2.42 s to 0.01 s, and p95 latency from 3.38 s to 3.30 s. It also reduces total token usage from 36.1k to 27.3k and improves end-to-end correctness from 72.5% to 100% under task-specific checks and a stitched-output integrity check. Across requests, 79.7% take the reuse-only fast path, 5.4% require patching, and 14.9% trigger skip-reuse.
2026-03-24T17:19:26Z
9 pages, 1 figure
Azam Nouri

http://arxiv.org/abs/2603.23425v1
Wayfinder: Automated Operating System Specialization
2026-03-24T16:55:41Z
Specializing an OS to optimize the performance of a particular application is typically a manual process that requires great expertise. Specialization through configuration lends itself well to automation; however, it is challenging due to the sheer size of the configuration space of modern OSes, the difficulty of quantifying that space, the long time it takes to evaluate a configuration, and the large number of invalid configurations. Hence, existing attempts at specializing OSes automatically are limited to switching features on and off to minimize memory consumption or attack surface, and cannot target metrics such as performance.
We present Wayfinder, a framework specializing the configuration of OSes completely automatically and without expert knowledge. It can specialize all aspects of an OS configuration (compile-/boot-/run-time) towards any quantifiable performance, resource consumption, or security metric, for an application processing a given workload on a given hardware setup. Wayfinder consists of an automated OS benchmarking platform, and a neural network-based search algorithm driving the specialization process. This is achieved by learning on the fly which configuration parameters and values impact performance the most, and which ones lead to runtime failures. Optionally, a model pre-trained on one application can be reused to accelerate the specialization of related applications. We evaluate Wayfinder on two OSes, four applications, and two target metrics: Wayfinder fully automatically identifies specialized configurations with up to 24% application performance improvement and 8.5% memory usage reduction compared to default configurations. We highlight the benefits of our neural network, reaching good solutions faster than competing approaches (random and Bayesian), and successfully transferring knowledge between related applications.
2026-03-24T16:55:41Z
Accepted to appear in EuroSys'26
Alexander Jung, Cezar Crăciunoiu, Nikolaos Karaolidis, Hugo Lefeuvre, Daniel Oñoro Rubio, Felipe Huici, Charalampos Rotsos, Pierre Olivier
10.1145/3767295.3803589

http://arxiv.org/abs/2603.22585v1
Tock: From Research to Securing 10 Million Computers
2026-03-23T21:24:19Z
Tock began 10 years ago as a research operating system developed by academics to help other academics build urban sensing applications. By leveraging a new language (Rust) and new hardware protection mechanisms, Tock enabled Multiprogramming a 64 kB Computer Safely and Efficiently. Today, it is an open source project with a vibrant community of users and contributors.
It is deployed on root of trust hardware in data center servers and on millions of laptops; it is used to develop automotive and space products, wearable electronics, and hardware security tokens--all while remaining a platform for operating systems research. This paper focuses on the impact of Tock's technical design on its adoption, the challenges and unexpected benefits of using a type safe language (Rust)--particularly in security sensitive settings--and the experience of supporting a production open-source operating system from academia.
2026-03-23T21:24:19Z
In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP '25)
Leon Schuermann, Brad Campbell, Branden Ghena, Philip Levis, Amit Levy, Pat Pannuto
10.1145/3731569.3764828

http://arxiv.org/abs/2509.09525v2
TrEnv-X: Transparently Share Serverless Execution Environments Across Different Functions and Nodes
2026-03-21T12:02:04Z
Serverless computing is renowned for its computation elasticity, yet its full potential is often constrained by the requirement for functions to operate within local and dedicated background environments, resulting in limited memory elasticity. To address this limitation, this paper introduces TrEnv-X, a co-designed integration of the serverless platform with the operating system and CXL/RDMA-based remote memory pools. TrEnv-X's core innovations are repurposable sandboxes, which can be shared across different functions to decrease the associated creation overhead, and OS-level memory templates, which enable rapid state restoration from CXL/RDMA-based remote memory pools. To further demonstrate TrEnv-X's versatility, we generalize its design from traditional containers to microVM-based agent workloads and introduce new optimizations, including browser sharing and a page cache bypassing mechanism. Our evaluation shows that TrEnv-X achieves up to 7x reduction in P99 latency and 48% memory savings for container-based functions.
When applied to LLM agents, it reduces the P99 latency by up to 58% and memory usage by 61% compared to state-of-the-art systems like E2B.
2025-09-11T15:06:03Z
Accepted by ACM Transactions on Computer Systems (TOCS)
Jialiang Huang, Teng Ma, Zheng Liu, Sixing Lin, Kang Chen, Jinlei Jiang, Xia Liao, Yingdi Shan, Yongwei Wu, Ning Zhang, Mengting Lu, Tao Ma, Haifeng Gong, Mingxing Zhang

http://arxiv.org/abs/2603.19971v1
2DIO: A Cache-Accurate Storage Microbenchmark
2026-03-20T14:13:54Z
We introduce 2DIO, a microbenchmark creating cache-accurate, stressful I/O traces. While existing tools are limited to generating traces with well-behaved, concave hit ratio curves, 2DIO produces ones with tunable complex cache behaviors, particularly performance cliffs and plateaus.
Our framework encodes a workload as a compact parameter triplet, capturing both short-term recency and long-term frequency. This parsimonious parameterization allows researchers to easily translate individual adjustments into predictable cache effects across various eviction policies, and enables the parameter space to be "swept" for exhaustive exploration of desired cache behavior, or to mimic real traces by calibrating parameters to match observed behaviors.
The tuned parameters are portable, meaning if the scale of the system under evaluation changes, so too will the footprint and length of the trace, while the relative cache behaviors are preserved.
Evaluations demonstrate 2DIO's ability to generate traces across a continuum of "what-if" cache behaviors and to reproduce real-world ones with high accuracy.
2026-03-20T14:13:54Z
To appear in EuroSys'26
Yirong Wang, Isaac Khor, Peter Desnoyers
10.1145/3767295.3769391

http://arxiv.org/abs/2603.26722v1
Brain-inspired AI for Edge Intelligence: a systematic review
2026-03-19T08:13:49Z
While Spiking Neural Networks (SNNs) promise to circumvent the severe Size, Weight, and Power (SWaP) constraints of edge intelligence, the field currently faces a "Deployment Paradox" where theoretical energy gains are frequently negated by the inefficiencies of mapping asynchronous, event-driven dynamics onto traditional von Neumann substrates. Transcending the reductionism of algorithm-only reviews, this survey adopts a rigorous system-level hardware-software co-design perspective to examine the 2020-2025 trajectory, specifically targeting the "last mile" technologies - from quantization methodologies to hybrid architectures - that translate biological plausibility into silicon reality. We critically dissect the interplay between training complexity (the dichotomy of direct learning vs. conversion), the "memory wall" bottlenecking stateful neuronal updates, and the critical software gap in neuromorphic compilation toolchains. Finally, we envision a roadmap to reconcile the fundamental "Sync-Async Mismatch," proposing the development of a standardized Neuromorphic OS as the foundational layer for realizing a ubiquitous, energy-autonomous Green Cognitive Substrate.
2026-03-19T08:13:49Z
Yingchao Cheng, Meijia Wang, Zhifeng Hao, Rajkumar Buyya

http://arxiv.org/abs/2602.08199v2
Fork, Explore, Commit: OS Primitives for Agentic Exploration
2026-03-19T04:38:30Z
AI agents increasingly perform agentic exploration: pursuing multiple solution paths in parallel and committing only the successful one.
Because each exploration path may modify files and spawn processes, agents require isolated environments with atomic commit and rollback semantics for both filesystem state and process state. We introduce the branch context, a new OS abstraction that provides: (1) copy-on-write state isolation with independent filesystem views and process groups, (2) a structured lifecycle of fork, explore, and commit/abort, (3) first-commit-wins resolution that automatically invalidates sibling branches, and (4) nestable contexts for hierarchical exploration. We realize branch contexts in Linux through two complementary components. First, BranchFS is a FUSE-based filesystem that gives each branch context an isolated copy-on-write workspace, with O(1) creation, atomic commit to the parent, and automatic sibling invalidation, all without root privileges. BranchFS is open sourced at https://github.com/multikernel/branchfs, along with a Python integration library, BranchContext, that provides ready-to-use agent exploration patterns. Second, branch() is a proposed Linux syscall that spawns processes into branch contexts with reliable termination, kernel-enforced sibling isolation, and first-commit-wins coordination. Preliminary evaluation of BranchFS shows sub-350 us branch creation independent of base filesystem size, and modification-proportional commit overhead (under 1 ms for small changes).
2026-02-09T01:46:52Z
Cong Wang, Yusheng Zheng

http://arxiv.org/abs/2603.17259v1
AppFlow: Memory Scheduling for Cold Launch of Large Apps on Mobile and Vehicle Systems
2026-03-18T01:35:25Z
GB-scale large apps like on-device LLMs and rich media editors are becoming the next-generation trend, but their heavy memory and I/O demands, especially during multitasking, cause devices to reclaim or kill processes, turning warm apps into cold launches. The challenge lies not in storing them, but in fast, accurate launching.
For users, 1s is the usability cliff, yet our measurements show 86.6% of GB-scale cold launches exceed it. Also, Android Vitals flags only ≥ 5s as slow, exposing a large satisfaction gap. Existing optimizations are designed in isolation and conflict. For example, preloading reduces I/O stalls but consumes scarce memory and is undone by reclamation, while reclamation and killing free memory but sacrifice background survivability, leading to repeated cold relaunches. Our key insight is that, although multitasking makes runtime behavior complex, each app's file access pattern remains predictable. The challenge lies in exploiting this predictability, i.e., preloading without exhausting memory, reclaiming without undoing gains, and killing selectively to preserve background survivability. We introduce AppFlow, a prediction-based system-wide scheduler that integrates a Selective File Preloader, an Adaptive Memory Reclaimer, and a Context-Aware Process Killer. Implemented across the Android framework and Linux kernel without app changes, AppFlow cuts GB-scale cold-launch latency by 66.5% (e.g., 2s → 690ms) and sustains 95% of launches within 1s over a 100-day test, significantly improving responsiveness and multitasking experience.
2026-03-18T01:35:25Z
13 pages, 21 figures, Mobicom 2026
Xiaochen Li, Sicong Liu, Bin Guo, Yu Ouyang, Fengmin Wu, Yuan Xu, Zhiwen Yu
10.1145/3795866.3796690

http://arxiv.org/abs/2603.14357v1
Idiosyncrasies of Programmable Caching Engines
2026-03-15T12:47:06Z
Programmable caching engines like CacheLib are widely used in production systems to support diverse workloads in multi-tenant environments. CacheLib's design focuses on performance, portability, and configurability, allowing applications to inherit caching improvements with minimal implementation effort. However, its behavior under dynamic and evolving workloads remains largely unexplored.
This paper presents an empirical study of CacheLib in multi-tenant settings under dynamic and volatile environments. Our evaluation across multiple CacheLib configurations reveals several limitations that hinder its effectiveness under such environments, including rigid configurations, limited runtime adaptability, lack of quality-of-service support and coordination, which lead to suboptimal performance, inefficient memory usage, and tenant starvation. Based on these findings, we outline future research directions to improve the adaptability, fairness, and programmability of future caching engines.
2026-03-15T12:47:06Z
Paper accepted at the Workshop on Reliable Large-scale Data Management (co-located with IEEE SRDS 2025). Preliminary version of the paper "Holpaca: Holistic and Adaptable Cache Management for Shared Environments", accepted at 17th ACM/SPEC International Conference on Performance Engineering (ICPE 2026)
José Peixoto, Alexis Gonzalez, Janki Bhimani, Raju Rangaswami, Cláudia Brito, João Paulo, Ricardo Macedo

http://arxiv.org/abs/2509.21550v2
A Target-Agnostic Protocol-Independent Interface for the Transport Layer
2026-03-14T22:13:13Z
Transport protocols continue to evolve to meet the demands of new applications, workloads, and network environments, yet implementing and evolving transport protocols remains difficult and costly. High-performance transport stacks tightly interweave protocol behavior with system-level mechanisms such as packet I/O, memory management, and concurrency control, resulting in large code bases where protocol logic is scattered and hard to modify -- an issue exacerbated by modern heterogeneous execution environments.
This paper introduces transport programs, a target-independent abstraction that precisely and centrally captures a transport protocol's reactions to relevant transport events using abstract instructions for key transport operations such as data reassembly, packet generation and scheduling, and timer manipulation, while leaving execution strategy and low-level mechanisms to the target. We show that transport programs can express a diverse set of transport protocols, be efficiently realized on targets built over DPDK and Linux XDP, achieve performance comparable to hand-optimized implementations, and enable protocol changes and portability across targets without modifying underlying infrastructure.
2025-09-25T20:34:52Z
Pedro Mizuno, Kimiya Mohammadtaheri, Linfan Qian, Joshua Johnson, Danny Akbarzadeh, Chris Neely, Mario Baldi, Nachiket Kapre, Mina Tahmasbi Arashloo

http://arxiv.org/abs/2603.13110v1
AgentRM: An OS-Inspired Resource Manager for LLM Agent Systems
2026-03-13T16:07:20Z
Large Language Model (LLM) agent systems have experienced rapid adoption across diverse domains, yet they suffer from critical user experience problems that limit their practical deployment. Through an empirical analysis of over 40,000 GitHub issues from six major agent frameworks (OpenClaw, AutoGen, CrewAI, LangGraph, Codex, Claude Code), we identify two fundamental resource management challenges: (1) scheduling failures leading to system unresponsiveness due to blocking, zombie processes, and rate limit cascades, and (2) context degradation causing agent "amnesia" from unbounded memory growth and poor retention policies. Drawing inspiration from decades of operating systems research, we present AgentRM, a middleware resource manager that treats agent resources analogously to OS resources.
AgentRM employs a Multi-Level Feedback Queue (MLFQ) scheduler with zombie reaping and rate-limit-aware admission control, coupled with a three-tier Context Lifecycle Manager that implements adaptive compaction and hibernation mechanisms. Our evaluation demonstrates significant improvements: AgentRM-MLFQ reduces P95 latency by 86%, decreases lane waste by 96%, and increases throughput by 168% while eliminating zombie agents (0 vs. 29 baseline). AgentRM-CLM achieves 100% key information retention with 95% quality score compared to 65.1% retention and 87% quality for existing approaches, albeit with higher compaction costs (34,330 vs. 17,212 tokens).
2026-03-13T16:07:20Z
Jianshu She

http://arxiv.org/abs/2602.13692v2
ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System
2026-03-10T20:57:47Z
Large language models (LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV cache and tool execution environments. To address these challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Built upon this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation.
Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5-3.6x throughput improvements in serving, 1.8-3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the system implementations of the whole ThunderAgent at: https://github.com/Agentic-Kinetics/ThunderAgent.
2026-02-14T09:26:41Z
Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, Simran Arora

http://arxiv.org/abs/2603.09023v1
The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
2026-03-09T23:38:32Z
The context window of a large language model is not memory. It is L1 cache: a small, fast, expensive resource that the field treats as the entire memory system. There is no L2, no virtual memory, no paging. Every tool definition, every system prompt, and every stale tool result occupies context for the lifetime of the session. The result is measurable: across 857 production sessions and 4.45 million effective input tokens, 21.8% is structural waste.
We present Pichay, a demand paging system for LLM context windows. Implemented as a transparent proxy between client and inference API, Pichay interposes on the message stream to evict stale content, detect page faults when the model re-requests evicted material, and pin working-set pages identified by fault history. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%. In live production deployment over 681 turns, the system reduces context consumption by up to 93% (5,038KB to 339KB); under extreme sustained pressure, the system remains operational but exhibits the expected thrashing pathology, with repeated fault-in of evicted content.
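The evict, fault, and pin cycle described above can be illustrated with a minimal sketch. It is not Pichay's implementation: the `ContextPager` class, the turn-based staleness rule, and the `stale_after` and `pin_after_faults` thresholds are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ContextPager:
    """Toy demand pager for context entries (hypothetical; not Pichay's code)."""

    stale_after: int = 3        # assumed: turns without use before eviction
    pin_after_faults: int = 2   # assumed: faults before a page is pinned
    turn: int = 0
    store: Dict[str, str] = field(default_factory=dict)      # backing store
    last_used: Dict[str, int] = field(default_factory=dict)  # resident pages
    faults: Dict[str, int] = field(default_factory=dict)     # fault history
    pinned: Set[str] = field(default_factory=set)

    def add(self, page_id: str, content: str) -> None:
        # New content enters the backing store and starts out resident.
        self.store[page_id] = content
        self.last_used[page_id] = self.turn

    def access(self, page_id: str) -> str:
        if page_id not in self.last_used:
            # Page fault: the content was evicted and is requested again.
            self.faults[page_id] = self.faults.get(page_id, 0) + 1
            if self.faults[page_id] >= self.pin_after_faults:
                self.pinned.add(page_id)  # treat as part of the working set
        self.last_used[page_id] = self.turn  # fault-in or refresh
        return self.store[page_id]

    def end_turn(self) -> List[str]:
        # Advance one turn and evict unpinned pages that have gone stale.
        self.turn += 1
        evicted = [
            page_id
            for page_id, used in self.last_used.items()
            if page_id not in self.pinned and self.turn - used >= self.stale_after
        ]
        for page_id in evicted:
            del self.last_used[page_id]
        return evicted

    def resident(self) -> List[str]:
        return sorted(self.last_used)
```

A page that is evicted and then requested again is counted as a fault and brought back; once its fault count reaches the threshold it is pinned and no longer evicted, which is one simple way to approximate a working set from fault history.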
The key observation is that the problems the field faces, such as context limits, attention degradation, cost scaling, lost state across sessions, are virtual memory problems wearing different clothes. The solutions exist: working set theory (Denning, 1968), demand paging, fault-driven replacement policies, and memory hierarchies with multiple eviction-managed levels. We describe the architecture of a full memory hierarchy for LLM systems (L1 through persistent storage), report on the first three levels deployed in production use (L1 eviction, L2 fault-driven pinning, L3 model-initiated conversation compaction), and identify cross-session memory as the remaining frontier.
2026-03-09T23:38:32Z
Tony Mason

http://arxiv.org/abs/2603.08400v1
Trust Nothing: RTOS Security without Run-Time Software TCB (Extended Version)
2026-03-09T13:59:27Z
Embedded devices face an ever-expanding threat landscape: vulnerabilities in application software, operating system kernels, and peripherals threaten the embedded device integrity. Existing computer-architectural defenses fully consider at most two of these threat vectors in their security model.
This paper aims at addressing this gap using a novel capability architecture. To this end, we combine a token capability approach suitable for building an untrusted operating system with protection against malicious devices without requiring hardware changes to peripherals.
First, we develop and evaluate a full FPGA implementation of our capability architecture around legacy hardware components. Further, we present a soft real-time operating system based on Zephyr that has no run-time software TCB. To this end, we disaggregate Zephyr's subsystems into small, mutually isolated components. All subsystems that exist at run time, including scheduler, allocator and DMA drivers, and all peripherals are fully untrusted. We believe that our work offers a foundation for more rigorous security-by-design in tomorrow's security-critical embedded devices.
2026-03-09T13:59:27Z
Eric Ackermann, Sven Bugiel