https://arxiv.org/api/IUwjKnEQlTtqNEwtpEYcriDaMvw
Feed updated: 2026-06-15T08:17:24Z

http://arxiv.org/abs/2603.00178v3
A TEE-Based Architecture for Confidential and Dependable Process Attestation in Authorship Verification
Updated: 2026-05-23T12:58:28Z
Process attestation systems verify that a continuous physical process, such as human authorship, actually occurred, rather than merely checking system state. These systems face a fundamental dependability challenge: the evidence collection infrastructure must remain available and tamper-resistant even when the attesting party controls the platform. Trusted Execution Environments (TEEs) provide hardware-enforced isolation that can address this challenge, but their integration with continuous process attestation introduces novel resilience requirements not addressed by existing frameworks. We present the first architecture for continuous process attestation evidence collection inside TEEs, providing hardware-backed tamper resistance against trust-inverted adversaries with graduated input assurance from software-channel integrity (Tier 1) through hardware-bound input (Tier 3). We develop a Markov-chain dependability model quantifying Evidence Chain Availability (ECA), Mean Time Between Evidence Gaps (MTBEG), and Recovery Time Objectives (RTO). We introduce a resilient evidence chain protocol maintaining chain integrity across TEE crashes, network partitions, and enclave migration. Our security analysis derives formal bounds under combined threat models including trust inversion and TEE side channels, parameterized by a conjectural side-channel leakage bound esc that requires empirical validation.
Evaluation on Intel SGX demonstrates under 25% per-checkpoint CPU overhead (<0.3% of the 30 s checkpoint interval), >99.5% Evidence Chain Availability (ECA) (the fraction of session time with active evidence collection) in Monte Carlo simulation under Poisson failure models, and sealed-state recovery under 200 ms.
Published: 2026-02-26T20:17:52Z
Comments: 13 pages
Authors: David Condrey

http://arxiv.org/abs/2512.04320v2
VLCs: Managing Parallelism with Virtualized Libraries
Updated: 2026-05-22T19:23:20Z
As the complexity and scale of modern parallel machines continue to grow, programmers increasingly rely on composition of software libraries to encapsulate and exploit parallelism. However, many libraries are not designed with composition in mind and assume they have exclusive access to all resources. Using such libraries concurrently can result in contention and degraded performance. Prior solutions involve modifying the libraries or the OS, which is often infeasible.
We propose Virtual Library Contexts (VLCs), which are process subunits that encapsulate sets of libraries and associated resource allocations. VLCs control the resource utilization of these libraries without modifying library code. This enables the user to partition resources between libraries to prevent contention, or load multiple copies of the same library to allow parallel execution of otherwise thread-unsafe code within the same process.
In this paper, we describe and evaluate C++ and Python prototypes of VLCs. Experiments show VLCs enable a speedup up to 2.85x on benchmarks including applications using OpenMP, OpenBLAS, and LibTorch. Source code of VLCs is available at https://github.com/pecos/Virtual-Library-Context.
Published: 2025-12-03T23:11:02Z
Comments: In Proceedings of the 2025 ACM Symposium on Cloud Computing (SoCC '25)
Journal reference: Proceedings of the 2025 ACM Symposium on Cloud Computing (2025) 629-643
Authors: Yineng Yan, William Ruys, Hochan Lee, Ian Henriksen, Arthur Peters, Sean Stephens, Bozhi You, Henrique Fingler, Martin Burtscher, Milos Gligoric, Keshav Pingali, Mattan Erez, George Biros, Christopher J. Rossbach
DOI: 10.1145/3772052.3772265

http://arxiv.org/abs/2507.12364v2
Tyche: Composable Isolation as a Foundation to Manage Trust in the Cloud
Updated: 2026-05-21T15:36:19Z
Cloud workloads combine software components from different parties to process sensitive data. Each component has its own trust model - it must protect its assets from the rest of the system, yet share sensitive data with components it cannot trust to keep confidential. This tension requires composing isolation boundaries for confidentiality and encapsulation. Unfortunately, the cloud offers no direct way to compose such boundaries, forcing tenants to assemble, deploy, and maintain their own solutions. This paper shifts that burden back to the infrastructure by making composable, attestable isolation a first-class systems abstraction.
We present Tyche, a security monitor that centers isolation around a unified composable abstraction: security domains (SDs). An SD is an execution environment whose access to machine resources - memory, cores, devices - is controlled through explicit capabilities. A small set of capability operations enables SDs to partition, share, and reclaim resources; by nesting recursively, SDs compose attestable trust boundaries for confidentiality and encapsulation. Tyche attests these compositions, providing end-to-end security guarantees for workloads made of mutually distrustful components. As a first-class cloud primitive, this single abstraction subsumes enclaves, sandboxes, CVMs, and their compositions.
Tyche provides composable isolation without sacrificing compatibility with existing hardware and software stacks. It runs on commodity x86_64 hardware without security extensions, and a RISC-V prototype demonstrates portability across platforms. Our SDK composes isolation for unmodified workloads within SDs with minimal overhead. In a confidential LLM inference scenario with mutually distrustful users, model owners, and cloud providers, the slowdown is just 2% compared to bare-metal Linux.
Published: 2025-07-16T16:08:24Z
Authors: Adrien Ghosn, Charly Castes, Neelu S. Kalani, Yuchen Qian, Marios Kogias, Edouard Bugnion

http://arxiv.org/abs/2605.19893v2
SSV: Sparse Speculative Verification for Efficient LLM Inference
Updated: 2026-05-20T15:53:57Z
Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set. Directly combining them, however, exposes a structural mismatch: speculative verification relies on cross-query commonality, whereas dynamic sparse attention assigns query-specific sparse layouts. This mismatch limits KV-block reuse, amplifies NSA's branch-wise overheads, and makes verification strategy selection input- and regime-dependent. We present SSV, a sparse speculative-verification framework that turns dynamic sparse attention into a verification-oriented workload. SSV combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve cross-query reuse, reduce selected-index and branch-fusion overheads, and select effective draft-verification strategies under user-specified precision classes.
Experiments on NVIDIA H100 GPUs show that SSV achieves up to 3.49x end-to-end throughput over autoregressive NSA decoding and up to 6.86x kernel speedups for sparse speculative verification.
Published: 2026-05-19T14:24:27Z
Authors: Zhibin Wang, Ziyu Zhong, Nuo Shen, Yuhang Zhou, Rong Gu, Sheng Zhong

http://arxiv.org/abs/2503.03722v4
Where Linux Breaks Under Radiation: A Cross-Architecture Kernel-Level Characterization of Proton-Induced Failures in COTS SoCs
Updated: 2026-05-20T14:33:36Z
Linux is increasingly deployed in Low Earth Orbit on commercial off the shelf systems on chip that were not designed for space radiation. Ionizing particles can trigger single event functional interrupts that crash the kernel without warning. Prior work mainly measured board level cross sections, leaving unclear which Linux subsystems fail and how a single upset propagates into an operating system wide failure across architectures, stress conditions, and irradiation conditions. We address this gap by subjecting three Linux platforms to proton irradiation in the 20 to 58 MeV range: a Raspberry Pi Zero 2W with a 40 nm planar ARM Cortex A53, an NXP i.MX 8M Plus with a 14 nm FinFET ARM Cortex A53, and an OrangeCrab ECP5 FPGA hosting a VexRiscV RV32I soft core at 40 nm. Through kernel log forensics, we trace all 133 observed Linux failures, most of which have not been previously reported, to their originating kernel handlers. Failure profiles differ sharply across nodes. On the two 40 nm platforms, memory management and driver handlers account for 67 to 78% of events, while on the 14 nm SoC approximately 90% of failures funnel through a single eMMC storage path, comprising 56% filesystem failures and 34% driver failures. This shows that a SEFI susceptible peripheral can strongly dictate system reliability. The 14 nm SoC also shows roughly an order of magnitude lower Linux SEFI cross section, although irradiation geometry and DRAM exposure differences preclude isolating the contribution of process scaling.
Reconstructed propagation chains show that faults can cascade through up to six kernel subsystems before terminal failure in severe events. Rather than motivating blanket redundancy, these results identify the kernel subsystem boundaries where radiation induced faults originate, enabling targeted mitigations for hardening COTS Linux systems for orbit.
Published: 2025-03-05T18:21:34Z
Authors: Saad Memon, Rafal Graczyk, Tomasz Rajkowski, Jan Swakon, Damian Wrobel, Sebastian Kusyk, Seth Roffe, Mike Papadakis

http://arxiv.org/abs/2605.24026v1
A Per-Access Upper Bound for Shared-Resource Interference in Direct-Mapped Multicore Architectures
Updated: 2026-05-20T14:07:45Z
We present a formal bounding analysis for maximum credible interference in multicore processors under strict architectural invariants: direct-mapped L2 cache (1-way associativity), disabled Miss Status Handling Registers (MSHRs), single-bank main memory, deterministic pinned tasks with fixed physical memory mapping, and a pessimistic L2/memory arbitration policy. We prove that, under these invariants, the per-critical-access stall imposed on a target task T is bounded above by (N-1)L_mem, and that this bound is attained by a synchronized adversarial workload of N-1 congruent-different-tag memory requests issued in phase with T's critical accesses. The argument is per-access and direct, requiring no informal multiplicative interference function. The derivation is purely analytical and discussed in the context of DO-178C/CAST-32A certification objectives for airborne software. Limitations and conditions for applicability are explicitly stated. This work provides a traceable method for separating multicore interference from Worst-Case Execution Time (WCET) budgets under fixed architectural constraints.
Published: 2026-05-20T14:07:45Z
Authors: Felipe T. Pedroni

http://arxiv.org/abs/2605.20906v1
ParaCell: Paravirtualized Secure Containers with Lightweight Intra-Container Isolation and Intent-Driven Memory Management
Updated: 2026-05-20T08:53:35Z
Secure containers isolate each container with its own kernel, mitigating shared-kernel attacks prevalent in traditional container systems. However, existing designs still face a fundamental isolation--performance trade-off. Nested-cloud deployments amplify the cost of VM exits and page-table management, while emerging agentic workloads expose bursty memory demand that requires fine-grained elasticity. We attribute this trade-off to two root causes. First, existing designs lack lightweight intra-container isolation primitives for frequent container user--kernel transitions. Second, the host treats container memory management as opaque, forcing reactive secondary faults and coarse-grained huge page mappings to amortize their cost.
This paper presents ParaCell, a paravirtualized secure container runtime built on two insights. First, intra-address-space hardware protection primitives can provide lightweight intra-container isolation. ParaCell uses MPK-based XGates to isolate the container user and container kernel within a single address space, turning frequent user--kernel transitions into direct domain switches. Second, container kernel allocators already encode memory-management intent. ParaCell introduces Pager to interpose on allocation and free events, batch proactive GPA to HPA bindings and unbindings, and avoid reactive shadow page-table faults while preserving fine-grained memory elasticity.
ParaCell is implemented as a drop-in replacement for RunV. Our experiments demonstrate that, across traditional cloud and emerging agent applications, ParaCell reduces latency by up to 57% and 79% over PVM, and by up to 33% and 88% over RunV, in bare-metal and nested setups, respectively. On agent workloads, ParaCell saves up to 35.6% memory compared with the state-of-the-art VM memory reclamation technique, HyperAlloc.
Published: 2026-05-20T08:53:35Z
Authors: Yiyang Wu, Xunjie Wang, Jinyu Gu, Haibo Chen

http://arxiv.org/abs/2603.28777v2
The Computer System Trail
Updated: 2026-05-20T04:39:43Z
No matter how much the world of computing changes, system design remains crucial. While most people try to learn it through quick tutorials or AI-generated summaries, there is no better way to master the field than by studying the original research papers. This book serves as a roadmap through those foundational texts, covering seminal papers in distributed systems, operating systems, and big data. It doesn't just look at what these systems do; it digs deep into why they were built that way.
Built from years of notes taken during discussions at top universities and industry meetups, this guide helps readers understand how systems work under the hood. It is for those who are tired of surface-level content and want to develop the technical patience to wrestle with complex problem-solving. Readers will find the journey long and challenging but highly rewarding, as it enables them to elevate their engineering craft to a truly professional level.
Published: 2026-02-09T02:55:30Z
Comments: 663 pages, 199 figures
Authors: Sushant Kumar Gupta

http://arxiv.org/abs/2605.20370v1
Clove: Object-Level CXL Memory Management in Managed Runtimes
Updated: 2026-05-19T18:21:07Z
Object-level management of tiered memory has been studied to address the inefficiencies in page-based systems. However, object-level management for CXL-tiered memory remains underexplored due to CXL's tight performance budget and load/store interface. As a result, existing approaches remain limited in scope, primarily targeting unmanaged-language applications with bespoke runtimes or compiler support.
This paper identifies and explores a new design point for object-level CXL management: managed languages and their runtimes. The key observation is that existing managed runtimes already provide highly optimized mechanisms for problems closely related to object-level management, including object relocation and dynamic code generation. However, they still lack the features needed for tiered memory management, such as hotness tracking and relocation policies, and thus must be carefully extended to fully realize this direction.
We present Clove, a system that extends existing managed runtimes to support object-level CXL management for managed-language applications. Clove combines profile-guided object hotness tracking with object relocation techniques and policies. Our JVM prototype demonstrates that this extension enables high utilization of fast-tier memory while bounding runtime overhead, reducing application slowdown by 22-84% compared to page-based systems.
Published: 2026-05-19T18:21:07Z
Comments: 12 pages (15 pages including references), 13 figures
Authors: Sam Son, Zhihong Luo, Wen Zhang, Sylvia Ratnasamy, Scott Shenker

http://arxiv.org/abs/2605.16565v2
Skim: Speculative Execution for Fast and Efficient Web Agents
Updated: 2026-05-19T13:45:27Z
Skim is a speculative execution framework for web agents that exploits the predictable structure of purpose-built websites. Today's web-agent expense is not intrinsic to the tasks but a property of how agents are composed: frontier-model inference, browser rendering, and ReAct-style planning are applied to every step of every task regardless of complexity. Skim's key observation is that websites enforce stable URL patterns, answer formats, and task-to-trajectory mappings across queries of the same type, so most queries can bypass these heavyweight components entirely. An offline profiler captures these patterns once per site. At runtime, Skim matches each query to a template, synthesizes the destination URL, and extracts the answer with a small model. A lightweight verifier gates each fast-path output against the query and schema; rare misspeculations cascade to the full agent, warm-started by the fast path's final URL to preserve upstream trajectory progress.
Across standard web-agent benchmarks paired with three backbone agents (WebVoyager, AgentOccam, BrowserUse), Skim reduces median per-task cost by 1.9x and latency by 33.4% with no accuracy loss.
Published: 2026-05-15T19:12:43Z
Comments: 14 pages, 21 figures
Authors: Mike Wong, Kevin Hsieh, Suman Nath, Ravi Netravali

http://arxiv.org/abs/2605.19481v1
C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG
Updated: 2026-05-19T07:34:08Z
Modern LLM serving is increasingly serverless in shape: large model catalogs, long-tail invocations, and multi-tenant demand. Existing GPU serving systems face a tradeoff: dedicated-GPU allocation wastes scarce HBM under sparse traffic, while GPU time sharing places model initialization and weight loading on the cold-start path. Spatial GPU sharing such as multi-instance GPU (MIG) provides isolation and accounting, but each slice has too little HBM for modern LLM weights.
We observe that high-bandwidth CPU--GPU interconnects, such as NVLink-C2C (C2C) in NVIDIA GH200 and GB200 Superchips, change the memory constraint: model weights can reside in CPU memory and be streamed on demand to MIG instances, shifting model residency from scarce HBM to abundant host memory. Leveraging this capability, we present C2CServe, a request-granularity serverless LLM serving system that allows MIG instances to switch models across requests without reloading weights into HBM. C2CServe introduces HybridGEMM, a heterogeneous-memory-aware GEMM kernel that adapts data access patterns to balance HBM and C2C bandwidth across MIG partitions using a single tuning knob. To mitigate shared-C2C contention, C2CServe further uses a hierarchical scheduler that coordinates model placement, input chunking, and kernel selection with online feedback control. On GH200, C2CServe reduces cold-start latency by up to 7.1x for dense models and 4.6x for MoE models compared with state-of-the-art serverless LLM serving systems, while maintaining over 95% TTFT and TPOT attainment under C2C contention.
Published: 2026-05-19T07:34:08Z
Authors: Shutian Luo, Ali Zafar Sadiq, Rui Yang, Mingye Zhang, Haiying Shen, Wei Wang, Yue Cheng

http://arxiv.org/abs/2604.25679v2
Embedded Rust or C Firmware? Lessons from an Industrial Microcontroller Use Case with Ariel OS
Updated: 2026-05-18T21:18:57Z
As Rust gains traction for developing safer systems software, a reality check for the microcontroller hardware segment becomes necessary. How ready is the Rust ecosystem for this segment? Can Rust compete with C in practice? This paper reports on an IoT industrial case study that contributes to answering these questions. Two teams concurrently developing the same functionality (one in C, one in Rust) are analyzed over a period of several months. A comparative analysis of their approaches, results, and iterative efforts is provided.
The analysis and measurements on hardware indicate no strong reason to prefer C over Rust for microcontroller firmware on the basis of memory footprint or execution speed. Furthermore, Ariel OS is shown to provide an efficient and portable system runtime in Rust whose footprint is smaller than that of the state-of-the-art bare-metal C stack traditionally used in this context. It is concluded that Rust is a sound choice today for firmware development in this domain.
Published: 2026-04-28T14:09:11Z
Authors: Bipin Thapa, Daniele Alfonso, Lorenzo Bini, Licio Mapelli, Kaspar Schleiser, Romain Fouquet, Emmanuel Baccelli

http://arxiv.org/abs/2506.16042v2
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
Updated: 2026-05-18T17:55:59Z
Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task.
We evaluate 16 agents on their efficiency using OSWorld-Human and find that even the best agents take 2.7-4.3x more steps than necessary.
Published: 2025-06-19T05:26:40Z
Authors: Reyna Abhyankar, Qi Qi, Yiying Zhang

http://arxiv.org/abs/2605.18066v1
TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics
Updated: 2026-05-18T08:49:16Z
Cloud Virtual Disk (CVD) placement in Cloud Block Storage (CBS) is critical for resource efficiency and performance isolation. Existing schemes prioritize spatial load balancing by dispersing disks across pods based on configuration-derived load estimates. However, overload risk in CBS is fundamentally temporal. Even when average load is balanced, pods can still suffer transient congestion when the peaks of co-located disks align in time. Achieving complementary placement, which co-locates CVDs with offset peaks, is hard at provisioning time because new disks have no history from which to infer temporal phase. We present TIDAL, a CVD placement framework that recovers phase-aware signals for cold-start placement from an underused source: tenant-provided names and identifiers in provisioning metadata. TIDAL first uses LLMs to recover application semantics from noisy metadata such as project, VM, and disk names. It then translates these semantics into phase-aware temporal signals to guide complementary placement. To satisfy control-plane constraints, TIDAL adopts an offline-to-online design with teacher-student distillation, regex-based filtering, and prefix-aware caching, enabling CPU-only inference with millisecond-level latency.
Evaluations driven by production traces show that TIDAL reduces overload frequency by 79.1% and P95 overload duration by 73.7% compared with the strongest baselines.
Published: 2026-05-18T08:49:16Z
Authors: Difan Tan, Changlin Wan, Jiawen Liu, Hua Wang, Ke Zhou

http://arxiv.org/abs/2605.17992v1
PipeANN-Filter: An Efficient Filtered Vector Search System on SSD
Updated: 2026-05-18T07:48:53Z
We propose PipeANN-Filter, an efficient filtered vector search system on SSD. Unlike existing systems that explore only valid vectors (i.e., those satisfying the attribute constraints) during search, PipeANN-Filter explores a superset of valid vectors, and performs attribute verification after getting the top-k closest result vectors. This allows PipeANN-Filter to leverage probabilistic data structures (e.g., Bloom filters) to identify the superset, trading off a small number of false-positive vector explorations for a massive reduction in SSD I/O for attribute reading. Evaluations show that PipeANN-Filter improves search latency and throughput compared to state-of-the-art systems. PipeANN-Filter is open-source at https://github.com/thustorage/PipeANN
Published: 2026-05-18T07:48:53Z
Authors: Hao Guo, Jiwu Shu, Youyou Lu
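
The superset-then-verify idea in the PipeANN-Filter abstract can be illustrated with a small Python sketch. This is not the authors' implementation: a brute-force distance scan stands in for the SSD-resident index, an in-memory list stands in for attributes stored on SSD, and the names `BloomFilter`, `filtered_search`, `num_bits`, and `num_hashes` are hypothetical choices made for this example. It only shows why a Bloom filter yields a superset of valid vectors (it has false positives but no false negatives) and how exact attribute checks on the closest candidates restore correctness.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter over hashable items (illustrative, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # True for every added item; occasionally True for others.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def filtered_search(query, vectors, attrs, bloom, label, k):
    """Return (ids of the k closest vectors with attrs[id] == label,
    number of exact attribute reads performed)."""
    # Cheap in-memory pass: keep every id the Bloom filter cannot rule out.
    # This is a superset of the valid ids.
    candidates = [i for i in range(len(vectors)) if bloom.might_contain(i)]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))

    results, attr_reads = [], 0
    for i in candidates:
        attr_reads += 1  # stands in for one attribute read from SSD
        if attrs[i] == label:  # exact check discards false positives
            results.append(i)
            if len(results) == k:
                break
    return results, attr_reads
```

In this sketch the exact attribute store is consulted only for the closest candidates, so the number of attribute reads is k plus the false positives encountered along the way, rather than one read per explored vector.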