https://arxiv.org/api/oggwDY26YjsvmbKTuRIcxYOUCbo2026-06-21T08:59:04Z13797515http://arxiv.org/abs/2604.27915v1Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale2026-04-30T14:21:50ZModern large multicore systems often run multiple workloads that share CPUs under schedulers such as Linux CFS. To keep CPUs busy, these schedulers load-balance runnable work, causing each workload to execute on many cores. This weakens locality at the microarchitectural level: workloads lose reuse in caches, branch predictors, and prefetchers, and interfere more with one another - especially on chiplet-based systems, where spreading execution across cores also spreads it across LLC boundaries. A natural alternative is strict CPU partitioning, but hard partitions leave capacity idle when workloads do not fully use their reserved CPUs.
We present Affinity Tailor, a userspace-guided kernel scheduling system built on a key insight: the kernel can preserve locality for workloads that share CPUs by treating demand-sized, topologically compact CPU sets as affinity hints rather than hard partitions. A userspace controller estimates each workload's CPU demand online and assigns a preferred CPU set sized to that demand, chosen to be as disjoint as possible from other workloads while spanning as few LLC domains as possible. The kernel then uses this set as an affinity hint, steering threads toward those CPUs while still allowing execution elsewhere when needed to preserve utilization. Deployed at Google, Affinity Tailor delivers geometric-mean per-CPU throughput gains of 12% on chiplet-based systems and 3% on non-chiplet systems over Linux CFS. Furthermore, faster execution reduces memory residency, yielding per-GB throughput gains of 3-7%. Our findings suggest that future schedulers should treat spatial locality as a first-class objective, even at the expense of work-conservation.2026-04-30T14:21:50ZJin Xin NgOri LivnehRichard O'GradyJosh DonPeng DingSamuel GrossmanLuis OteroChris KennellyDavid LoCarlos Villaviejahttp://arxiv.org/abs/2604.27570v1treVM: Tiny Rust Embedded Virtual Machines with WASM on Variable Resource-Constrained Hardware2026-04-30T08:23:55ZSoftware stacks embedded on microcontroller-based hardware typically provide rudimentary APIs programmed in C/C++, basic connectivity and, sometimes, a firmware update mechanism. Such coarse mechanisms contrast with widely used APIs and more advanced networked interaction expected from software stacks deployed on less resource-constrained hardware (microprocessor-based). In this paper, we aim to bridge this gap by designing treVM, a generic scheme to host high-level WebAssembly code capsules, bolted on a general-purpose Rust embedded software platform, able to run on a large variety of 32-bit microcontrollers. Not only can treVM capsules host highly customizable business logic, but capsules can also be securely updated on demand over the network, on devices already deployed in the field. We implement treVM in Rust, on top of Ariel OS, a general-purpose RTOS, and we publish the code as open source. Based on our implementation, we validate the feasibility of treVM on commonly available boards, and we report on extensive benchmarks we performed on heterogeneous hardware including Arm Cortex-M, RISC-V, and Xtensa microcontroller architectures. As such, treVM provides a promising new framework to secure continuous deployment of embedded software on low-power networked devices.2026-04-30T08:23:55ZAntoine LavandierBastien BuilChrystel GaberEmmanuel Baccellihttp://arxiv.org/abs/2604.22603v1Chamelio: A Fast Shared Cloud Network Stack for Isolated Tenant-Defined Protocols2026-04-24T14:28:37ZConventional cloud network virtualization sends packets through multiple guest and host layers, inflating CPU cost and tail latency. Shared host datapaths collapse this layering into one optimized path across tenants, but existing shared stacks are fixed-function: tenants cannot specialize their protocols. eBPF is the natural vehicle for restoring programmability to a shared datapath, but today's extensions are hook-sized, and its verifier provides safety -- not performance isolation: one tenant's per-packet work can inflate every other tenant's tail latency.
Chamelio is a programmable shared network stack that lets tenants implement full protocols through a bounded eBPF fast path and a tenant slow path, while approaching the performance and preserving the strong isolation of fixed shared stacks. It combines three ideas: a shared-stack architecture for tenant-defined protocols; joint optimisation of tenant handlers with provider infrastructure and co-resident tenants in the shared fast path; and a bounded fast path contract with runtime cycle accounting that keeps tenant programmability compatible with strong performance isolation. A tenant programmable TCP on Chamelio reaches 9.2 Mreq/s, matching the hand-tuned TAS stack; joint compilation shrinks the programmability tax from 23.9% to 3.8%; and under a scaling TCP adversary that drives uninstrumented stacks to 154 microseconds, Chamelio bounds victim tail latency at 46 microseconds.2026-04-24T14:28:37ZMatheus StoletSimon PeterAntoine Kaufmannhttp://arxiv.org/abs/2603.09046v2FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation2026-04-22T06:47:52ZDevice-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.2026-03-10T00:31:25Z13 pages, 11 figuresYinpeng WuYitong ChenLixiang WangJinyu GuZhichao HuaYubin Xiahttp://arxiv.org/abs/2604.19958v1Equinox: Decentralized Scheduling for Hardware-Aware Orbital Intelligence2026-04-21T20:10:03ZEarth-observation satellites are emerging as distributed edge platforms for time-critical tasks, yet orbital scheduling remains challenged by intermittent energy harvesting and temporal coupling where eager execution risks future battery depletion. Existing schedulers rely on static priorities and lack mechanisms to adaptively shed work. We present Equinox, a lightweight, decentralized runtime for resource-constrained orbital systems. Equinox enables adaptive scheduling by compressing time-varying constraints, including battery charge, thermal headroom, and queue backlog, into a single state-dependent marginal cost of execution. Derived from a barrier function that rises sharply near safety limits, this cost encodes both instantaneous pressure and future risk. This local signal serves as a constellation-wide coordination primitive. Tasks execute only when their value exceeds the current cost, enabling value-ordered load shedding without explicit policies. If local costs exceed a neighbor's, tasks are dynamically offloaded over inter-satellite links, achieving distributed load balancing without routing protocols or global state. We evaluate Equinox using a multi-day simulation of a 143-satellite constellation grounded in physical Jetson Orin Nano measurements. Equinox improves scientific goodput by 20% and image-processing throughput by 31% over priority-based scheduling while maintaining 2.2x higher mean battery reserves. Under high demand, Equinox achieves 5.2x the execution rate of static scheduling by gracefully shedding work rather than collapsing under contention.2026-04-21T20:10:03ZAnsel Kaplan ErolDivya Mahajanhttp://arxiv.org/abs/2604.19657v1An AI Agent Execution Environment to Safeguard User Data2026-04-21T16:45:30ZAI agents promise to serve as general-purpose personal assistants for their users, which requires them to have access to private user data (e.g., personal and financial information). This poses a serious risk to security and privacy. Adversaries may attack the AI model (e.g., via prompt injection) to exfiltrate user data. Furthermore, sharing private data with an AI agent requires users to trust a potentially unscrupulous or compromised AI model provider with their private data.
This paper presents GAAP (Guaranteed Accounting for Agent Privacy), an execution environment for AI agents that guarantees confidentiality for private user data. Through dynamic and directed user prompts, GAAP collects permission specifications from users describing how their private data may be shared, and GAAP enforces that the agent's disclosures of private user data, including disclosures to the AI model and its provider, comply with these specifications. Crucially, GAAP provides this guarantee deterministically, without trusting the agent with private user data, and without requiring any AI model or the user prompt to be free of attacks.
GAAP enforces the user's permission specification by tracking how the AI agent accesses and uses private user data. It augments Information Flow Control with novel persistent data stores and annotations that enable it to track the flow of private information both across execution steps within a single task, and also over multiple tasks separated in time. Our evaluation confirms that GAAP blocks all data disclosure attacks, including those that make other state-of-the-art systems disclose private user data to untrusted parties, without a significant impact on agent utility.2026-04-21T16:45:30ZRobert StanleyAvi VermaLillian TsaiKonstantinos KallasSam Kumarhttp://arxiv.org/abs/2604.19494v1DPC: A Distributed Page Cache over CXL2026-04-21T14:12:56ZModern distributed file systems rely on uncoordinated, per node page caches that replicate hot data locally across the cluster. While ensuring fast local access, this architecture underutilizes aggregate cluster DRAM capacity through massive data redundancy and incurs prohibitive coherence overhead via heavyweight, lock-based protocols. In this paper, we focus on the design of a distributed page cache that treats the entire cluster's main memory as a single cache budget while preserving standard file-system interfaces and semantics. We present Distributed Page Cache (DPC), an OS-level, distributed page cache built on top of Compute Express Link (CXL) 3.0 memory semantics. DPC enforces a single-copy invariant at page granularity: each file page has exactly one owner node holding the sole resident DRAM copy, and other nodes access it via CXL-based remote mappings rather than creating replicas of the page. DPC is implemented end-to-end on a CXL-based emulation framework that models multi-host CXL 3.0 memory fabrics, enabling detailed evaluation in the absence of widespread hardware. Across real-world and representative data-sharing workloads, DPC delivers speedups of up to 12.4X, with a geometric-mean speedup of 5.6X.2026-04-21T14:12:56ZShai BergmanZhe YangJulien EudineGiorgio NegroOnur MutluArash TavakkolJi Zhanghttp://arxiv.org/abs/2602.08800v2Equilibria: Fair Multi-Tenant CXL Memory Tiering At Scale2026-04-20T15:31:04ZMemory dominates datacenter system cost and power. Memory expansion via Compute Express Link (CXL) is an effective way to provide additional memory at lower cost and power, but its effective use requires software-level tiering for hyperscaler workloads. Existing tiering solutions, including current Linux support, face fundamental limitations in production deployments. First, they lack multi-tenancy support, failing to handle stacked homogeneous or heterogeneous workloads. Second, limited control-plane flexibility leads to fairness violations and performance variability. Finally, insufficient observability prevents operators from diagnosing performance pathologies at scale.
We present Equilibria, an OS framework enabling fair, multi-tenant CXL tiering at datacenter scale. Equilibria provides per-container controls for memory fair-share allocation and fine-grained observability of tiered-memory usage and operations. It further enforces flexible, user-specified fairness policies through regulated promotion and demotion, and mitigates noisy-neighbor interference by suppressing thrashing.
Evaluated in a large hyperscaler fleet using production workloads and benchmarks, Equilibria helps workloads meet service level objectives (SLOs) while avoiding performance interference. It improves performance over the state-of-the-art Linux solution, TPP, by up to 52% for production workloads and 1.7x for benchmarks. All Equilibria patches have been released to the Linux community.2026-02-09T15:39:56ZDe-anonymized "corpX" in the paperKaiyang ZhaoNeha GholkarHasan MarufAbhishek DhanotiaJohannes WeinerGregory PriceNing SunBhavya DwivediStuart ClarkDimitrios Skarlatoshttp://arxiv.org/abs/2604.18120v1Proxics: an efficient programming model for far memory accelerators2026-04-20T11:38:54ZThe use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such accelerators are appearing, but there lack clean, portable OS abstractions for programming them.
We propose a programming model for NDP devices based on familiar OS abstractions: virtual processors (processes) and inter-process communication channels (like Unix pipes).
While appealing from a user perspective, a naive implementation of such abstractions is inappropriate for NDP accelerators: the paucity of processing power in some hardware designs makes classical processes overly heavyweight, and IPC based on shared buffers makes no sense in a system designed to reduce memory bandwidth.
Accordingly, we show how to implement these abstractions in a lightweight and efficient manner by exploiting compilation and interconnect protocols. We demonstrate them with a real hardware platform runing applications with a range of memory access patterns, including bulk memory operations, in-memory databases and graph applications.
Crucially, we show not only the benefits over CPU-only implementations, but also the critical importance of efficient, low-latency communication channels between CPU and NDP accelerators, a feature largely neglected in existing proposals.2026-04-20T11:38:54ZZikai LiuNiels PresselJasmin SchultRoman MeierPengcheng XuTimothy Roscoehttp://arxiv.org/abs/2510.27485v2Sockeye: a language for analyzing hardware documentation2026-04-20T11:34:41ZThe ever increasing complexity of hardware platforms poses a challenge to systems programmers. Correctly programming a multitude of components, providing functionality and security, is difficult: semantics of individual units are described in prose, underspecified, and prone to inaccuracies. Rigorous statements about platform security are often impossible.
We introduce a domain-specific language to describe hardware semantics, assumptions about software behavior, and desired security properties. We then create machine-readable specifications for a diverse set of eight platforms from their reference manuals, and formally prove their (in-)security. In addition to security proofs about memory confidentiality and integrity, we discover a handful of documentation errors. Finally, our analysis also revealed a vulnerability on a real-world server chip, which was confirmed by the vendor to apply to a wide family of deployed network appliances. Our tooling offers system integrators a way of formally describing security properties for whole platforms, and the means to find counterexamples, or proving them correct.2025-10-31T14:04:01ZBen FiedlerSamuel GruetterTimothy Roscoehttp://arxiv.org/abs/2604.17861v1GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion2026-04-20T06:19:16ZModern deep learning workloads often consist of many small tensor operations, especially in inference, attention, and micro-batched training. In these settings, kernel launch overhead can become a major bottleneck, sometimes exceeding the actual computation time.
We present GPUOS, a GPU runtime JIT system that reduces launch overhead using a persistent kernel architecture with runtime operator injection. GPUOS runs a single long-lived GPU kernel that continuously processes tasks from a host-managed work queue, eliminating repeated kernel launches. To support diverse operations, GPUOS uses NVIDIA NVRTC to just-in-time compile operators at runtime and inject them into the running kernel through device function pointer tables. This design enables operator updates without restarting the kernel or recompiling the system.
GPUOS introduces four key ideas: (1) a persistent worker kernel with atomic task queues, (2) a runtime operator injection mechanism based on NVRTC and relocatable device code, (3) a dual-slot aliasing scheme for safe concurrent operator updates, and (4) transparent PyTorch integration through TorchDispatch that batches micro-operations into unified submissions. The system supports arbitrary tensor shapes, strides, data types, and broadcasting through a generic tensor abstraction.
Experiments show that GPUOS achieves up to 15.3x speedup over standard PyTorch on workloads dominated by small operations, including micro-batched inference and attention patterns. GPUOS improves utilization while remaining compatible with the PyTorch ecosystem.2026-04-20T06:19:16ZYiwei YangXiangyu GaoYuan ZhouYuhang GanYusheng ZhengAndi Quinnhttp://arxiv.org/abs/2604.16870v1Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives2026-04-18T06:40:20ZAI agents increasingly call external tools (file system, network, APIs) through the Model Context Protocol (MCP). These tool calls are the agent's syscalls -- privileged operations with side effects on shared state -- yet today's safety enforcement lives entirely in userspace, where a 10-line script can bypass it. I propose Governed MCP, a kernel-resident tool governance gateway built on a logit-based safety primitive (ProbeLogits, companion paper: arXiv:2604.11943). The gateway interposes on every MCP tool call in a 6-layer pipeline: schema validation, trust tier check, rate limit, adversarial pre-filter, ProbeLogits gate (the load-bearing semantic check), and constitutional policy match, with a Blake3-hashed audit chain.
I implement Governed MCP in Anima OS, a bare-metal x86_64 OS in approximately 86,000 lines of Rust. The five non-inference layers add 65.3 microseconds of overhead per call; ProbeLogits adds 65 ms (per-token-class semantic decision) on 7B Q4_0. A 4-config ablation on a 101-prompt MCP-domain benchmark shows that removing the ProbeLogits layer collapses F1 from 0.773 to 0.327 (Delta F1 = -0.446) -- hand-rule firewalling alone is insufficient. All 15 WASM-to-system host functions in the runtime route through the gateway (complete mediation of the WASM ABI surface; the scope and caveats of this claim are stated in Section 4.6); a 10-LoC userspace bypass that defeats existing guardrail libraries is structurally impossible against the kernel-resident gate.2026-04-18T06:40:20Z12 pages. Companion paper to arXiv:2604.11943 (ProbeLogits)Daeyeon Son10.5281/zenodo.19639122http://arxiv.org/abs/2604.11943v2ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems2026-04-18T06:28:19ZAn OS kernel that runs LLM inference internally can read logit distributions before any text is generated and act on them as a governance primitive. This paper presents ProbeLogits, a kernel-level operation that performs a single forward pass and reads specific token logits to classify agent actions as safe or dangerous, with zero learned parameters.
I evaluate ProbeLogits across three base models (Qwen 2.5-7B, Llama 3 8B, Mistral 7B) on three external benchmarks: HarmBench, XSTest, and ToxicChat. On HarmBench non-copyright (n=300), all three models reach 97-99% block rate with the right verbalizer. On ToxicChat (n=1,000), ProbeLogits achieves F1 parity-or-better against Llama Guard 3 in the same hosted environment: the strongest configuration (Qwen 2.5-7B Safe/Dangerous, alpha=0.0) reaches F1=0.812 with bootstrap 95% CIs disjoint from LG3 (+13.7pp significant); Llama 3 S/D matches LG3 within CI (+0.4pp, parity); Mistral Y/N exceeds by +4.4pp. Latency is approximately 2.5x faster than LG3 in the same hosted environment because the primitive reads a single logit position instead of generating tokens; in the bare-metal native runtime ProbeLogits drops to 65 ms.
A key design contribution is the calibration strength alpha, which serves as a deployment-time policy knob rather than a learned hyperparameter. Contextual calibration corrects verbalizer prior asymmetry, with bias magnitude varying by (model, verbalizer) pair.
I implement ProbeLogits within Anima OS, a bare-metal x86_64 OS written in approximately 86,000 lines of Rust. Because agent actions must pass through 15 kernel-mediated host functions, ProbeLogits enforcement operates below the WASM sandbox boundary, making it significantly harder to circumvent than application-layer classifiers.2026-04-13T18:32:02Z15 pages, 13 tablesDaeyeon Son10.5281/zenodo.19639032http://arxiv.org/abs/2510.02323v2NetCAS: Dynamic Cache and Backend Device Management in Networked Environments2026-04-17T23:00:19ZModern storage systems often combine fast cache with slower backend devices to accelerate I/O. As performance gaps narrow, concurrently accessing both devices, rather than relying solely on cache hits, can improve throughput. However, in data centers, remote backend storage accessed over networks suffers from unpredictable contention, complicating this split. We present NetCAS, a framework that dynamically splits I/O between cache and backend devices based on real-time network feedback and a precomputed Perf Profile. Unlike traditional hit-rate-based policies, NetCAS adapts split ratios to workload configuration and networking performance. NetCAS employs a low-overhead batched round-robin scheduler to enforce splits, avoiding per-request costs. It achieves up to 174% higher performance than traditional caching in remote storage environments and outperforms converging schemes like Orthus by up to 3.5X under fluctuating network conditions.2025-09-25T07:01:44Z12 pages, 12 figures, submitted to IEEE CLOUD 2026Joon Yong HwangChanseo ParkYounghoon Kimhttp://arxiv.org/abs/2501.04580v2Goldilocks Isolation: High Performance VMs with Edera2026-04-16T20:42:03ZOrganizations run applications on cloud infrastructure shared between multiple users and organizations. Popular tooling for this shared infrastructure, including Docker and Kubernetes, supports such multi-tenancy through the use of operating system virtualization. With operating system virtualization (known as containerization), multiple applications share the same kernel, reducing the runtime overhead. However, this shared kernel presents a large attack surface and has led to a proliferation of container escape attacks in which a kernel exploit lets an attacker escape the isolation of operating system virtualization to access other applications or the operating system itself. To address this, some systems have proposed a return to hypervisor virtualization for stronger isolation between applications. However, no existing system has achieved both the isolation of hypervisor virtualization and the performance and usability of operating system virtualization.
We present Edera, an optimized type 1 hypervisor that uses paravirtualization to improve the runtime of hypervisor virtualization. We illustrate Edera's usability and performance through two use cases. First, we create a container runtime compatible with Kubernetes that runs on the Edera hypervisor. This implementation can be used as a drop-in replacement for the Kubernetes runtime and is compatible with all the tooling in the Kubernetes ecosystem. Second, we use Edera to provide driver isolation for hardware drivers, including those for networking, storage, and GPUs. This use of isolation protects the hypervisor and other applications from driver vulnerabilities. We find that Edera has runtime comparable to Docker with .9% slower cpu speeds, an average of 3% faster system call performance, and memory performance 0-7% faster. It achieves this with a 648 millisecond increase in startup time from Docker's 177.4 milliseconds.2025-01-08T15:51:02ZMarina MooreAlex Zenla