http://arxiv.org/abs/2603.13945v1
A Case for CATS: A Conductor-driven Asymmetric Transport Scheme for Semantic Prioritization
2026-03-14T13:36:15Z
Standard transport protocols like TCP operate as a blind, FIFO conveyor belt for data, a model that is increasingly suboptimal for latency-sensitive and interactive applications. This paper challenges this model by introducing CATS (Conductor-driven Asymmetric Transport Scheme), a framework that provides TCP with the semantic awareness necessary to prioritize critical content. By centralizing scheduling intelligence in a transport-native "Conductor", CATS significantly improves user-perceived performance by delivering essential data first. This architecture directly confronts a cascade of historical performance workarounds and their limitations, including the high overhead of parallel connections in HTTP/1.1, the transport-layer Head-of-Line blocking in HTTP/2, and the observed implementation heterogeneity of prioritization in HTTP/3 over QUIC. Built upon TCP BBR, our ns-3 implementation demonstrates this principle by reducing the First Contentful Paint by over 78% in a representative webpage download configured as a deliberate worst-case scenario, with no penalty to total page load time compared to the baseline.
2026-03-14T13:36:15Z
2025 6th International Conference on Innovative Computing (ICIC)
Syed Muhammad Aqdas Rizvi
10.1109/ICIC68258.2025.11413235

http://arxiv.org/abs/2603.13110v1
AgentRM: An OS-Inspired Resource Manager for LLM Agent Systems
2026-03-13T16:07:20Z
Large Language Model (LLM) agent systems have experienced rapid adoption across diverse domains, yet they suffer from critical user experience problems that limit their practical deployment.
Through an empirical analysis of over 40,000 GitHub issues from six major agent frameworks (OpenClaw, AutoGen, CrewAI, LangGraph, Codex, Claude Code), we identify two fundamental resource management challenges: (1) scheduling failures leading to system unresponsiveness due to blocking, zombie processes, and rate limit cascades, and (2) context degradation causing agent "amnesia" from unbounded memory growth and poor retention policies. Drawing inspiration from decades of operating systems research, we present AgentRM, a middleware resource manager that treats agent resources analogously to OS resources. AgentRM employs a Multi-Level Feedback Queue (MLFQ) scheduler with zombie reaping and rate-limit-aware admission control, coupled with a three-tier Context Lifecycle Manager that implements adaptive compaction and hibernation mechanisms. Our evaluation demonstrates significant improvements: AgentRM-MLFQ reduces P95 latency by 86%, decreases lane waste by 96%, and increases throughput by 168% while eliminating zombie agents (0 vs. 29 baseline). AgentRM-CLM achieves 100% key information retention with 95% quality score compared to 65.1% retention and 87% quality for existing approaches, albeit with higher compaction costs (34,330 vs. 17,212 tokens).
2026-03-13T16:07:20Z
Jianshu She

http://arxiv.org/abs/2603.11438v1
NCCLbpf: Verified, Composable Policy Execution for GPU Collective Communication
2026-03-12T02:03:55Z
NCCL is the de facto standard for collective GPU communication in large-scale distributed training, relying heavily on plugins to customize runtime behavior. However, these plugins execute as unverified native code within NCCL's address space, risking job crashes, silent state corruption, and downtime from restarts during policy updates. Inspired by kernel extensibility models, we introduce NCCLbpf, a verified, high-performance extension framework embedding a userspace eBPF runtime directly into NCCL's existing plugin interfaces, without modifying NCCL itself.
NCCLbpf offers load-time static verification to prevent unsafe plugin execution, structured cross-plugin maps enabling composable policies and closed-loop adaptation, and atomic policy hot-reloads eliminating downtime previously required for policy updates. Evaluations on 8x NVIDIA B300 GPUs connected via NVLink demonstrate that NCCLbpf imposes just 80-130 ns overhead per tuner decision (less than 0.03% of collective latency), prevents all tested unsafe plugin behaviors at load-time, and enables a message-size-aware eBPF policy that improves AllReduce throughput by up to 27% over NCCL's default in the 4-128 MiB range.
2026-03-12T02:03:55Z
Yusheng Zheng

http://arxiv.org/abs/2602.13692v2
ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System
2026-03-10T20:57:47Z
Large language models (LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV cache and tool execution environments. To address these challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Built upon this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation.
Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5-3.6x throughput improvements in serving, 1.8-3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the system implementations of the whole ThunderAgent at: https://github.com/Agentic-Kinetics/ThunderAgent.
2026-02-14T09:26:41Z
Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, Simran Arora

http://arxiv.org/abs/2603.09738v1
Ensuring Data Freshness in Multi-Rate Task Chains Scheduling
2026-03-10T14:45:16Z
In safety-critical autonomous systems, data freshness presents a fundamental design challenge. While the Logical Execution Time (LET) paradigm ensures compositional determinism, it often does so at the cost of injected latency, degrading the phase margin of high-frequency control loops. Furthermore, mapping heterogeneous, multi-rate sensor fusion requirements onto rigid task-centric schedules typically results in resource-inefficient oversampling. This paper proposes a task-based scheduling framework extended with data freshness constraints. Unlike traditional models, scheduling decisions are driven by the lifespan of data. We introduce task offsets based on the data freshness constraint to order data production in a Just-in-Time (JIT) fashion: the completion of the production of data with the strictest data freshness constraint is delayed to the instant its consumers will be ready to use it. This allows for flexible task release offsets. We introduce a formal methodology to decompose Data Dependency Graphs into Dominant Paths by tracing the strictest data freshness constraints backward from the actuators. Based on this decomposition, we propose a Consensus Offset Search algorithm that synchronizes shared producers and private predecessors.
This approach enforces end-to-end data freshness without the artificial latency of LET buffering. We formally prove that this offset-based alignment preserves the 100% schedulability capacity of Global EDF, ensuring data freshness while eliminating the computational overhead of redundant sampling.
2026-03-10T14:45:16Z
José Luis Conradi Hoffmann, Antônio Augusto Fröhlich

http://arxiv.org/abs/2603.09046v1
FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation
2026-03-10T00:31:25Z
Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs.
The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.
2026-03-10T00:31:25Z
13 pages, 11 figures
Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia

http://arxiv.org/abs/2603.09023v1
The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
2026-03-09T23:38:32Z
The context window of a large language model is not memory. It is L1 cache: a small, fast, expensive resource that the field treats as the entire memory system. There is no L2, no virtual memory, no paging. Every tool definition, every system prompt, and every stale tool result occupies context for the lifetime of the session. The result is measurable: across 857 production sessions and 4.45 million effective input tokens, 21.8% is structural waste.
We present Pichay, a demand paging system for LLM context windows. Implemented as a transparent proxy between client and inference API, Pichay interposes on the message stream to evict stale content, detect page faults when the model re-requests evicted material, and pin working-set pages identified by fault history. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%. In live production deployment over 681 turns, the system reduces context consumption by up to 93% (5,038 KB to 339 KB); under extreme sustained pressure, the system remains operational but exhibits the expected thrashing pathology, with repeated fault-in of evicted content.
The key observation is that the problems the field faces, such as context limits, attention degradation, cost scaling, and lost state across sessions, are virtual memory problems wearing different clothes. The solutions exist: working set theory (Denning, 1968), demand paging, fault-driven replacement policies, and memory hierarchies with multiple eviction-managed levels. We describe the architecture of a full memory hierarchy for LLM systems (L1 through persistent storage), report on the first three levels deployed in production use (L1 eviction, L2 fault-driven pinning, L3 model-initiated conversation compaction), and identify cross-session memory as the remaining frontier.
2026-03-09T23:38:32Z
Tony Mason

http://arxiv.org/abs/2603.08400v1
Trust Nothing: RTOS Security without Run-Time Software TCB (Extended Version)
2026-03-09T13:59:27Z
Embedded devices face an ever-expanding threat landscape: vulnerabilities in application software, operating system kernels, and peripherals threaten embedded device integrity. Existing computer-architectural defenses fully consider at most two of these threat vectors in their security model.
This paper aims to address this gap using a novel capability architecture. To this end, we combine a token capability approach, suitable for building an untrusted operating system, with protection against malicious devices, without requiring hardware changes to peripherals.
First, we develop and evaluate a full FPGA implementation of our capability architecture around legacy hardware components. Further, we present a soft real-time operating system based on Zephyr that has no run-time software TCB. To this end, we disaggregate Zephyr's subsystems into small, mutually isolated components. All subsystems that exist at run time, including scheduler, allocator and DMA drivers, and all peripherals are fully untrusted. We believe that our work offers a foundation for more rigorous security-by-design in tomorrow's security-critical embedded devices.
2026-03-09T13:59:27Z
Eric Ackermann, Sven Bugiel

http://arxiv.org/abs/2506.08528v4
EROICA: Online Performance Troubleshooting for Large-scale Model Training
2026-03-09T05:14:48Z
Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present EROICA, the first online troubleshooting system that provides both fine-grained observation based on profiling, and coverage of all machines in GPU clusters, to diagnose performance issues in production, including both hardware and software problems (or the mixture of both). EROICA effectively summarizes runtime behavior patterns of LMT function executions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. EROICA has been deployed as a production service for large-scale GPU clusters of ~100,000 GPUs for 1.5 years.
It has diagnosed a variety of difficult performance issues with 97.5% success.
2025-06-10T07:46:14Z
Yu Guan, Zhiyu Yin, Haoyu Chen, Sheng Cheng, Chaojie Yang, Kun Qian, Tianyin Xu, Pengcheng Zhang, Yang Zhang, Hanyu Zhao, Yong Li, Wei Lin, Dennis Cai, Ennan Zhai

http://arxiv.org/abs/2603.07750v1
Structured Gossip: A Partition-Resilient DNS for Internet-Scale Dynamic Networks
2026-03-08T17:54:36Z
Network partitions pose fundamental challenges to distributed name resolution in mobile ad-hoc networks (MANETs) and edge computing. Existing solutions either require active coordination that fails to scale, or use unstructured gossip with excessive overhead. We present Structured Gossip DNS, exploiting DHT finger tables to achieve partition resilience through passive stabilization. Our approach reduces message complexity from $O(n)$ to $O(n/\log n)$ while maintaining $O(\log^2 n)$ convergence. Unlike active protocols requiring synchronous agreement, our passive approach guarantees eventual consistency through commutative operations that converge regardless of message ordering. The system handles arbitrary concurrent partitions via version vectors, eliminating global coordination and enabling billion-node deployments.
2026-03-08T17:54:36Z
Rejected from ACM SIGMOD 2026 Demo Track
Priyanka Sinha, Dilys Thomas

http://arxiv.org/abs/2603.07683v1
Mitigating the Memory Bottleneck with Machine Learning-Driven and Data-Aware Microarchitectural Techniques
2026-03-08T15:34:25Z
Modern applications process massive data volumes that overwhelm the storage and retrieval capabilities of memory systems, making memory the primary performance and energy-efficiency bottleneck of computing systems. Although many microarchitectural techniques attempt to hide or tolerate long memory access latency, rapidly growing data footprints continue to outpace technology scaling, requiring more effective solutions.
This dissertation shows that modern processors observe large amounts of application and system data during execution, yet many microarchitectural mechanisms make decisions largely independent of this information. Through four case studies, we demonstrate that such data-agnostic design leads to substantial missed opportunities for improving performance and energy efficiency.
To address this limitation, this dissertation advocates shifting microarchitecture design from data-agnostic to data-informed. We propose mechanisms that (1) learn policies from observed execution behavior (data-driven design) and (2) exploit semantic characteristics of application data (data-aware design). We apply lightweight machine learning techniques and previously underexplored data characteristics across four processor components: a reinforcement learning-based hardware data prefetcher that learns memory access patterns online; a perceptron predictor that identifies memory requests likely to access off-chip memory; a reinforcement learning mechanism that coordinates data prefetching and off-chip prediction; and a mechanism that exploits repeatability in memory addresses and loaded values to eliminate predictable load instructions.
Our extensive evaluation shows that the proposed techniques significantly improve performance and energy efficiency compared to prior state-of-the-art approaches.
2026-03-08T15:34:25Z
Rahul Bera

http://arxiv.org/abs/2603.18030v1
Quine: Realizing LLM Agents as Native POSIX Processes
2026-03-08T05:32:46Z
Current LLM agent frameworks often implement isolation, scheduling, and communication at the application layer, even though these mechanisms are already provided by mature operating systems. Instead of introducing another application-layer orchestrator, this paper presents Quine, a runtime architecture and reference implementation that realizes LLM agents as native POSIX processes. The mapping is explicit: identity is PID, interface is standard streams and exit status, state is memory, environment variables, and filesystem, and lifecycle is fork/exec/exit. A single executable implements this model by recursively spawning fresh instances of itself. By grounding the agent abstraction in the OS process model, Quine inherits isolation, composition, and resource control directly from the kernel, while naturally supporting recursive delegation, context renewal via exec, and shell-native composition. The design also exposes where the POSIX process model stops: processes provide a robust substrate for execution, but not a complete runtime model for cognition. In particular, the analysis points toward two immediate extensions beyond process semantics: task-relative worlds and revisable time. A reference implementation of Quine is publicly available on GitHub.
2026-03-08T05:32:46Z
10 pages, 3 figures. Reference implementation available on https://github.com/kehao95/quine
Hao Ke

http://arxiv.org/abs/2602.19433v3
Why iCloud Fails: The Category Mistake of Cloud Synchronization
2026-03-07T21:42:23Z
iCloud Drive presents a filesystem interface but implements cloud synchronization semantics that diverge from POSIX in fundamental ways.
This divergence is not an implementation bug; it is a Category Mistake -- the same one that pervades distributed computing wherever Forward-In-Time-Only (FITO) assumptions are embedded into protocol design. Parker et al. showed in 1983 that network partitioning destroys mutual consistency; iCloud adds a user interface that conceals this impossibility behind a facade of seamlessness. This document presents a unified analysis of why iCloud fails when composed with Time Machine, git, automated toolchains, and general-purpose developer workflows, supported by direct evidence including documented corruption events and a case study involving 366 GB of divergent state accumulated through normal use. We show that the failures arise from five interlocking incompatibilities rooted in a single structural error: the projection of a distributed causal graph onto a linear temporal chain. We then show how the same Category Mistake, when it occurs in network fabrics as link flapping, destroys topology knowledge through epistemic collapse. Finally, we argue that Open Atomic Ethernet (OAE) transactional semantics -- bilateral, reversible, and conservation-preserving -- provide the structural foundation for resolving these failures, not by defeating physics, but by aligning protocol behavior with physical reality.
2026-02-23T02:03:03Z
28 pages, 7 figures, 36 references
Paul Borrill

http://arxiv.org/abs/2603.03403v2
Sharing is caring: Attestable and Trusted Workflows out of Distrustful Components
2026-03-07T13:29:00Z
Confidential computing protects data in use within Trusted Execution Environments (TEEs), but current TEEs provide little support for secure communication between components. As a result, pipelines of independently developed and deployed TEEs must trust one another to avoid the leakage of sensitive information they exchange -- a fragile assumption that is unrealistic for modern cloud workloads.
We present Mica, a confidential computing architecture that decouples confidentiality from trust. Mica provides tenants with explicit mechanisms to define, restrict, and attest all communication paths between components, ensuring that sensitive data cannot leak through shared resources or interactions. We implement Mica on Arm CCA using existing primitives, requiring only modest changes to the trusted computing base. Our extension adds a policy language to control and attest communication paths among Realms and with the untrusted world via shared protected and unprotected memory and control transfers.
Our evaluation shows that Mica supports realistic cloud pipelines with only a small increase to the trusted computing base while providing strong, attestable confidentiality guarantees.
2026-03-03T14:53:48Z
Amir Al Sadi, Sina Abdollahi, Adrien Ghosn, Hamed Haddadi, Marios Kogias

http://arxiv.org/abs/2603.07030v1
Improved Leakage Abuse Attacks in Searchable Symmetric Encryption with eBPF Monitoring
2026-03-07T04:23:46Z
Searchable Symmetric Encryption (SSE) allows users to search over encrypted data stored on untrusted servers, like cloud providers. While SSE hides the content of queries and documents, it still leaks patterns, such as how often a query is made. These leakages have been shown to enable leakage abuse attacks, but recent defenses have made such attacks harder to carry out. In this work, we explore how system-level monitoring using eBPF (Extended Berkeley Packet Filter) can be used to uncover new forms of leakage that go beyond what is typically captured in SSE threat models. By observing low-level system behavior during search operations, we show that an attacker can gain additional insights into query behavior, document access, and processing flow. We define a new leakage pattern based on these observations and demonstrate how it can strengthen existing attacks. Our findings suggest that system-level leakages present a practical threat to SSE deployments and must be considered when designing defenses. This work serves as a step toward bridging the gap between theoretical SSE security and the realities of system-level exposure.
2026-03-07T04:23:46Z
7 pages, 1 figure
Chinecherem Dimobi