https://arxiv.org/api/wrNfXXHnOfoV4khkVItZAPkjUB0 2026-06-21T13:51:38Z 1379 135 15 http://arxiv.org/abs/2506.08528v4 EROICA: Online Performance Troubleshooting for Large-scale Model Training 2026-03-09T05:14:48Z

Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present EROICA, the first online troubleshooting system that provides both fine-grained observation based on profiling, and coverage of all machines in GPU clusters, to diagnose performance issues in production, including both hardware and software problems (or the mixture of both). EROICA effectively summarizes runtime behavior patterns of LMT function executions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. EROICA has been deployed as a production service for large-scale GPU clusters of ~100,000 GPUs for 1.5 years. It has diagnosed a variety of difficult performance issues with 97.5% success.

2025-06-10T07:46:14Z Yu Guan Zhiyu Yin Haoyu Chen Sheng Cheng Chaojie Yang Kun Qian Tianyin Xu Pengcheng Zhang Yang Zhang Hanyu Zhao Yong Li Wei Lin Dennis Cai Ennan Zhai http://arxiv.org/abs/2603.07750v1 Structured Gossip: A Partition-Resilient DNS for Internet-Scale Dynamic Networks 2026-03-08T17:54:36Z

Network partitions pose fundamental challenges to distributed name resolution in mobile ad-hoc networks (MANETs) and edge computing. Existing solutions either require active coordination that fails to scale, or use unstructured gossip with excessive overhead. We present \textit{Structured Gossip DNS}, exploiting DHT finger tables to achieve partition resilience through \textbf{passive stabilization}. Our approach reduces message complexity from $O(n)$ to $O(n/\log n)$ while maintaining $O(\log^2 n)$ convergence. Unlike active protocols requiring synchronous agreement, our passive approach guarantees eventual consistency through commutative operations that converge regardless of message ordering. The system handles arbitrary concurrent partitions via version vectors, eliminating global coordination and enabling billion-node deployments.

2026-03-08T17:54:36Z Rejected from ACM SIGMOD 2026 Demo Track Priyanka Sinha Dilys Thomas http://arxiv.org/abs/2603.07683v1 Mitigating the Memory Bottleneck with Machine Learning-Driven and Data-Aware Microarchitectural Techniques 2026-03-08T15:34:25Z

Modern applications process massive data volumes that overwhelm the storage and retrieval capabilities of memory systems, making memory the primary performance and energy-efficiency bottleneck of computing systems. Although many microarchitectural techniques attempt to hide or tolerate long memory access latency, rapidly growing data footprints continue to outpace technology scaling, requiring more effective solutions. This dissertation shows that modern processors observe large amounts of application and system data during execution, yet many microarchitectural mechanisms make decisions largely independent of this information. Through four case studies, we demonstrate that such data-agnostic design leads to substantial missed opportunities for improving performance and energy efficiency. To address this limitation, this dissertation advocates shifting microarchitecture design from data-agnostic to data-informed. We propose mechanisms that (1) learn policies from observed execution behavior (data-driven design) and (2) exploit semantic characteristics of application data (data-aware design). We apply lightweight machine learning techniques and previously underexplored data characteristics across four processor components: a reinforcement learning-based hardware data prefetcher that learns memory access patterns online; a perceptron predictor that identifies memory requests likely to access off-chip memory; a reinforcement learning mechanism that coordinates data prefetching and off-chip prediction; and a mechanism that exploits repeatability in memory addresses and loaded values to eliminate predictable load instructions. Our extensive evaluation shows that the proposed techniques significantly improve performance and energy efficiency compared to prior state-of-the-art approaches.

2026-03-08T15:34:25Z Rahul Bera http://arxiv.org/abs/2602.19433v3 Why iCloud Fails: The Category Mistake of Cloud Synchronization 2026-03-07T21:42:23Z

iCloud Drive presents a filesystem interface but implements cloud synchronization semantics that diverge from POSIX in fundamental ways. This divergence is not an implementation bug; it is a Category Mistake -- the same one that pervades distributed computing wherever Forward-In-Time-Only (FITO) assumptions are embedded into protocol design. Parker et al. showed in 1983 that network partitioning destroys mutual consistency; iCloud adds a user interface that conceals this impossibility behind a facade of seamlessness. This document presents a unified analysis of why iCloud fails when composed with Time Machine, git, automated toolchains, and general-purpose developer workflows, supported by direct evidence including documented corruption events and a case study involving 366 GB of divergent state accumulated through normal use. We show that the failures arise from five interlocking incompatibilities rooted in a single structural error: the projection of a distributed causal graph onto a linear temporal chain. We then show how the same Category Mistake, when it occurs in network fabrics as link flapping, destroys topology knowledge through epistemic collapse. Finally, we argue that Open Atomic Ethernet (OAE) transactional semantics -- bilateral, reversible, and conservation-preserving -- provide the structural foundation for resolving these failures, not by defeating physics, but by aligning protocol behavior with physical reality.

2026-02-23T02:03:03Z 28 pages, 7 figures, 36 references Paul Borrill http://arxiv.org/abs/2603.03403v2 Sharing is caring: Attestable and Trusted Workflows out of Distrustful Components 2026-03-07T13:29:00Z

Confidential computing protects data in use within Trusted Execution Environments (TEEs), but current TEEs provide little support for secure communication between components. As a result, pipelines of independently developed and deployed TEEs must trust one another to avoid the leakage of sensitive information they exchange -- a fragile assumption that is unrealistic for modern cloud workloads. We present Mica, a confidential computing architecture that decouples confidentiality from trust. Mica provides tenants with explicit mechanisms to define, restrict, and attest all communication paths between components, ensuring that sensitive data cannot leak through shared resources or interactions. We implement Mica on Arm CCA using existing primitives, requiring only modest changes to the trusted computing base. Our extension adds a policy language to control and attest communication paths among Realms and with the untrusted world via shared protected and unprotected memory and control transfers. Our evaluation shows that Mica supports realistic cloud pipelines with only a small increase to the trusted computing base while providing strong, attestable confidentiality guarantees.

2026-03-03T14:53:48Z Amir Al Sadi Sina Abdollahi Adrien Ghosn Hamed Haddadi Marios Kogias http://arxiv.org/abs/2603.07030v1 Improved Leakage Abuse Attacks in Searchable Symmetric Encryption with eBPF Monitoring 2026-03-07T04:23:46Z

Searchable Symmetric Encryption (SSE) allows users to search over encrypted data stored on untrusted servers, like cloud providers. While SSE hides the content of queries and documents, it still leaks patterns, such as how often a query is made. These leakages have been shown to enable leakage abuse attacks, but recent defenses have made such attacks harder to carry out. In this work, we explore how system-level monitoring using eBPF (Extended Berkeley Packet Filter) can be used to uncover new forms of leakage that go beyond what is typically captured in SSE threat models. By observing low-level system behavior during search operations, we show that an attacker can gain additional insights into query behavior, document access, and processing flow. We define a new leakage pattern based on these observations and demonstrate how they can strengthen existing attacks. Our findings suggest that system-level leakages present a practical threat to SSE deployments and must be considered when designing defenses. This work serves as a step toward bridging the gap between theoretical SSE security and the realities of system-level exposure.

2026-03-07T04:23:46Z 7 pages, 1 figure Chinecherem Dimobi http://arxiv.org/abs/2604.09592v1 EdgeWeaver: Accelerating IoT Application Development Across Edge-Cloud Continuum 2026-03-04T05:35:53Z

The rise of complex, latency-sensitive IoT applications across the Edge-Cloud continuum exposes the limitations of current Function-as-a-Service (FaaS) platforms in seamlessly addressing the complexity, heterogeneity, and intermittent connectivity of Edge-Cloud environments. Developers are left to manage integration and Quality of Service (QoS) enforcement manually, rendering application development complicated and costly. To overcome these limitations, we introduce the EdgeWeaver platform that offers a unified "object" abstraction that is seamlessly distributed across the continuum to encapsulate application logic, state, and QoS. EdgeWeaver automates "class" deployment across edge and cloud by composing established distributed algorithms (e.g., Raft, CRDTs)-enabling developers to declaratively express QoS (e.g., availability and consistency) desires that, in turn, guide internal resource allocation, function placement, and runtime adaptation to fulfill them. We implement a prototype of EdgeWeaver and evaluate it under diverse settings and using human subjects. Results show that EdgeWeaver boosts development productivity by 31%, while declaratively enforcing strong consistency and achieving 9 nines availability, 10,000X higher than the current standard, with negligible performance impact.

2026-03-04T05:35:53Z Published in IPDPS 2026 Conference Pawissanutt Lertpongrujikorn Juahn Kwon Hai Duc Nguyen Mohsen Amini Salehi http://arxiv.org/abs/2603.03271v1 Virtual-Memory Assisted Buffer Management In Tiered Memory 2026-03-03T18:56:52Z

Tiered memory architectures have gained significant traction in the database community in recent years. In these architectures, the on-chip DRAM of the host processor is typically referred to as local memory, and forms the primary tier. Additional byte-addressable, cache-coherent memory resources, collectively referred to as remote memory (RMem, for short), form one or more secondary tiers. RMem is slower than local DRAM but faster than disk, e.g., NUMA memory located on a remote socket, chiplet-attached memory, and memory attached via high-performance interconnect protocols, e.g., RDMA and CXL. In this paper, we discuss how traditional two-tier (DRAM-Disk) virtual-memory assisted Buffer Management techniques generalize to an $n$-tier setting (DRAM-RMem-Disk). We present vmcache$^n$, an $n$-tier virtual-memory-assisted buffer pool that leverages the virtual memory subsystem and operating system calls to migrate pages across memory tiers. In this setup, page migration can become a bottleneck. To address this limitation, we introduce the move_pages2 system call that provides vmcache$^n$ with fine-grained control over the page migration process. Experiments show that vmcache$^n$ can achieve up to 4$\times$ higher query throughput over vmcache for TPC-C workloads.

2026-03-03T18:56:52Z Yeasir Rayhan Walid G. Aref http://arxiv.org/abs/2603.02145v1 Machine Learning (ML) library in Linux kernel 2026-03-02T18:07:35Z

Linux kernel is a huge code base with enormous number of subsystems and possible configuration options that results in unmanageable complexity of elaborating an efficient configuration. Machine Learning (ML) is approach/area of learning from data, finding patterns, and making predictions without implementing algorithms by developers that can introduce a self-evolving capability in Linux kernel. However, introduction of ML approaches in Linux kernel is not easy way because there is no direct use of floating-point operations (FPU) in kernel space and, potentially, ML models can be a reason of significant performance degradation in Linux kernel. Paper suggests the ML infrastructure architecture in Linux kernel that can solve the declared problem and introduce of employing ML models in kernel space. Suggested approach of kernel ML library has been implemented as Proof Of Concept (PoC) project with the goal to demonstrate feasibility of the suggestion and to design the interface of interaction the kernel-space ML model proxy and the ML model user-space thread.

2026-03-02T18:07:35Z Viacheslav Dubeyko http://arxiv.org/abs/2603.00378v1 OBASE: Object-Based Address-Space Engineering to Improve Memory Tiering 2026-02-27T23:35:24Z

Hardware and OS mechanisms for memory tiering are widely deployed, yet datacenters still overprovision DRAM. The root cause is hotness fragmentation: allocators place objects by size rather than access pattern, so hot and cold objects become interleaved within the same pages. A single hot object marks its page as active, trapping surrounding cold data in expensive DRAM. Our analysis of Google production workloads shows that up to 97% of the bytes in active pages are cold and unreclaimable. We propose address-space engineering: dynamically reorganizing virtual memory so that hot objects cluster into uniformly hot pages and cold objects into uniformly cold pages. We present OBASE, a compiler-runtime system for unmanaged languages that serves as an object-aware frontend for page-aware OS backends. OBASE tracks accesses via lightweight pointer instrumentation and migrates objects at runtime using a lock-free protocol that is safe under concurrency. By reorganizing the address space, OBASE enables unmodified backends (kswapd, TMO, TPP, Memtis) to tier memory effectively. Across ten concurrent data structures, six backends, and production traces from Meta and Twitter, OBASE improves page utilization by 2-4x and reduces memory footprint by up to 70%, with only 2-5% overhead.

2026-02-27T23:35:24Z Vinay Banakar Suli Yang Kan Wu Andrea C. Arpaci-Dusseau Remzi H. Arpaci-Dusseau Kimberly Keeton http://arxiv.org/abs/2603.00356v1 Token Management in Multi-Tenant AI Inference Platforms 2026-02-27T22:44:09Z

Multi-tenant AI inference platforms must balance resource utilization against service-level guarantees under variable demand. Conventional approaches fail to achieve this balance: dedicated endpoints strand capacity on idle models, while rate limits ignore the heterogeneous cost of inference requests. We introduce \emph{token pools}, a control-plane abstraction that represents inference capacity as explicit entitlements expressed in inference-native units (token throughput, KV cache, concurrency). Unlike rate limits, which govern request admission without regard to execution cost, token pools authorize both admission and autoscaling from the same capacity model, ensuring consistency between what is promised and what is provisioned. The abstraction captures burst modes across multiple dimensions invisible to conventional throttling. Dynamic per-entitlement limits on each burst dimension enable fine-grained control over resource consumption while permitting work-conserving backfill by low-priority traffic. The design supports priority-aware allocation, service tiers with differentiated guarantees, and debt-based fairness mechanisms, all without modifying the underlying inference runtime or cluster scheduler. In experiments on a Kubernetes cluster with vLLM backends, token pools maintain a bounded P99 latency for guaranteed workloads during overload by selectively throttling spot traffic, while a baseline without admission control experiences unbounded latency degradation across all workloads. A second experiment demonstrates debt-based fair-share convergence among elastic workloads with heterogeneous SLO requirements during capacity scarcity.

2026-02-27T22:44:09Z 10 pages, 6 figures William J. Cunningham http://arxiv.org/abs/2601.01265v3 CounterPoint: Using Hardware Event Counters to Refute and Refine Microarchitectural Assumptions (Extended Version) 2026-02-26T22:04:31Z

Hardware event counters offer the potential to reveal not only performance bottlenecks but also detailed microarchitectural behavior. In practice, this promise is undermined by their vague specifications, opaque designs, and multiplexing noise, making event counter data hard to interpret. We introduce CounterPoint, a framework that tests user-specified microarchitectural models - expressed as $μ$path Decision Diagrams - for consistency with performance counter data. When mismatches occur, CounterPoint pinpoints plausible microarchitectural features that could explain them, using multi-dimensional counter confidence regions to mitigate multiplexing noise. We apply CounterPoint to the Haswell Memory Management Unit as a case study, shedding light on multiple undocumented and underdocumented microarchitectural behaviors. These include a load-store queue-side TLB prefetcher, merging page table walkers, abortable page table walks, and more. Overall, CounterPoint helps experts reconcile noisy hardware performance counter measurements with their mental model of the microarchitecture - uncovering subtle, previously hidden hardware features along the way.

2026-01-03T19:24:00Z This is an extended version of a paper which has been accepted to the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems conference (ASPLOS, March 2026). 20 pages, 20 figures, 8 tables Nick Lindsay Yale University Caroline Trippel Stanford University Anurag Khandelwal Yale University Abhishek Bhattacharjee Yale University 10.1145/3779212.3790145 http://arxiv.org/abs/2602.22402v1 Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents 2026-02-25T20:52:52Z

As large language models engage in extended reasoning tasks, they accumulate significant state -- architectural mappings, trade-off decisions, codebase conventions -- within the context window. This understanding is lost when sessions reach context limits and undergo lossy compaction. We propose Contextual Memory Virtualisation (CMV), a system that treats accumulated LLM understanding as version-controlled state. Borrowing from operating system virtual memory, CMV models session history as a Directed Acyclic Graph (DAG) with formally defined snapshot, branch, and trim primitives that enable context reuse across independent parallel sessions. We introduce a three-pass structurally lossless trimming algorithm that preserves every user message and assistant response verbatim while reducing token counts by a mean of 20% and up to 86% for sessions with significant overhead by stripping mechanical bloat such as raw tool outputs, base64 images, and metadata. A single-user case-study evaluation across 76 real-world coding sessions demonstrates that trimming remains economically viable under prompt caching, with the strongest gains in mixed tool-use sessions, which average 39% reduction and reach break-even within 10 turns. A reference implementation is available at https://github.com/CosmoNaught/claude-code-cmv.

2026-02-25T20:52:52Z 11 pages. 6 figures. Introduces a DAG-based state management system for LLM agents. Evaluation on 76 coding sessions shows up to 86% token reduction (mean 20%) while remaining economically viable under prompt caching. Includes reference implementation for Claude Code Cosmo Santoni http://arxiv.org/abs/2602.20826v1 Exploiting Dependency and Parallelism: Real-Time Scheduling and Analysis for GPU Tasks 2026-02-24T12:01:57Z

With the rapid advancement of Artificial Intelligence, the Graphics Processing Unit (GPU) has become increasingly essential across a growing number of safety-critical application domains. Applying a GPU is indispensable for parallel computing; however, the complex data dependencies and resource contention across kernels within a GPU task may unpredictably delay its execution time. To address these problems, this paper presents a scheduling and analysis method for Directed Acyclic Graph (DAG)-structured GPU tasks. Given a DAG representation, the proposed scheduling scales the kernel-level parallelism and establishes inter-kernel dependencies to provide a reduced and predictable DAG response time. The corresponding timing analysis yields a safe yet nonpessimistic makespan bound without any assumption on kernel priorities. The proposed method is implemented using the standard CUDA API, requiring no additional software or hardware support. Experimental results under synthetic and real-world benchmarks demonstrate that the proposed approach effectively reduces the worst-case makespan and measured task execution time compared to the existing methods up to 32.8% and 21.3%, respectively.

2026-02-24T12:01:57Z Yuanhai Zhang Songyang He Ruizhe Gou Mingyue Cui Boyang Li Shuai Zhao Kai Huang http://arxiv.org/abs/2602.20214v1 Right to History: A Sovereignty Kernel for Verifiable AI Agent Execution 2026-02-23T07:09:36Z

AI agents increasingly act on behalf of humans, yet no existing system provides a tamper-evident, independently verifiable record of what they did. As regulations such as the EU AI Act begin mandating automatic logging for high-risk AI systems, this gap carries concrete consequences -- especially for agents running on personal hardware, where no centralized provider controls the log. Extending Floridi's informational rights framework from data about individuals to actions performed on their behalf, this paper proposes the Right to History: the principle that individuals are entitled to a complete, verifiable record of every AI agent action on their own hardware. The paper formalizes this principle through five system invariants with structured proof sketches, and implements it in PunkGo, a Rust sovereignty kernel that unifies RFC 6962 Merkle tree audit logs, capability-based isolation, energy-budget governance, and a human-approval mechanism. Adversarial testing confirms all five invariants hold. Performance evaluation shows sub-1.3 ms median action latency, ~400 actions/sec throughput, and 448-byte Merkle inclusion proofs at 10,000 log entries.

2026-02-23T07:09:36Z 22 pages, 3 figures, 7 tables. Open-source: https://github.com/PunkGo/punkgo-kernel Jing Zhang