GoCkpt: Gradient-Assisted Multi-Step overlapped Checkpointing for Efficient LLM Training

2025-11-10T12:31:54Z

The accuracy of large language models (LLMs) improves with increasing model size, but increasing model complexity also poses significant challenges to training stability. Periodic checkpointing is a key mechanism for fault recovery and is widely used in LLM training. However, traditional checkpointing strategies often pause or delay GPU computation while the checkpoint is transferred from GPU to CPU, resulting in significant training interruptions and reduced training throughput. To address this issue, we propose GoCkpt, a method to overlap checkpoint saving with multiple training steps and restore the final checkpoint on the CPU. We transfer the checkpoint across multiple steps: each step transfers part of the checkpoint state, together with some of the gradient data used for parameter updates. After the transfer is complete, each partial checkpoint state is updated to a consistent version on the CPU, thus avoiding the checkpoint state inconsistency problem caused by transferring checkpoints across multiple steps. Furthermore, we introduce a transfer optimization strategy to maximize GPU-CPU bandwidth utilization and SSD persistence throughput. This dual optimization, overlapping saves across steps and maximizing I/O efficiency, significantly reduces invalid training time. Experimental results show that GoCkpt can increase training throughput by up to 38.4% compared to traditional asynchronous checkpoint solutions in the industry. We also find that GoCkpt can reduce training interruption time by 86.7% compared to the state-of-the-art checkpoint transfer methods, which results in a 4.8% throughput improvement.
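
The multi-step transfer and CPU-side reconstruction described above can be illustrated with a toy example. This is only a sketch under strong assumptions: it uses plain SGD on lists of floats, and every name in it (`sgd_update`, `toy_gradient`, `train_with_overlapped_checkpoint`, `LR`) is hypothetical rather than taken from GoCkpt. Shard k is copied to the CPU during step k, the gradients of already-copied shards are copied in later steps, and the CPU replays those gradients to bring every shard to the same version.

```python
LR = 0.1  # assumed learning rate for this sketch


def sgd_update(params, grads, lr=LR):
    """One plain-SGD step on a list of floats."""
    return [p - lr * g for p, g in zip(params, grads)]


def toy_gradient(shard, step):
    """Stand-in for a real backward pass: deterministic fake gradients."""
    return [0.5 * p + step for p in shard]


def train_with_overlapped_checkpoint(shards, num_steps):
    """Train for num_steps while copying shard k to the CPU during step k.

    Returns (gpu_shards_after_training, cpu_checkpoint).
    """
    assert num_steps >= len(shards)
    cpu_copy = {}   # shard index -> stale parameter snapshot
    cpu_grads = {}  # shard index -> gradients received since the snapshot

    for step in range(num_steps):
        grads = [toy_gradient(s, step) for s in shards]
        # Copy one shard's parameters as they are before this step's update.
        if step < len(shards):
            cpu_copy[step] = list(shards[step])
            cpu_grads[step] = []
        # Copy gradients for every shard whose snapshot is already on the CPU.
        for idx in cpu_copy:
            cpu_grads[idx].append(grads[idx])
        # Training continues without waiting for the full checkpoint.
        shards = [sgd_update(s, g) for s, g in zip(shards, grads)]

    # CPU side: replay the saved gradients onto each stale snapshot.
    checkpoint = []
    for idx in range(len(shards)):
        state = cpu_copy[idx]
        for g in cpu_grads[idx]:
            state = sgd_update(state, g)
        checkpoint.append(state)
    return shards, checkpoint


gpu_state, cpu_checkpoint = train_with_overlapped_checkpoint(
    [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]], num_steps=5
)
print(gpu_state == cpu_checkpoint)
```

In this toy setting the replayed checkpoint matches the GPU state after the last step, because the CPU applies the same updates in the same order. A real optimizer with extra state (for example, momentum) would need that state transferred and replayed as well.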

Preemption-Enhanced Benchmark Suite for FPGAs

2025-11-10T05:58:45Z

Field-Programmable Gate Arrays (FPGAs) have become essential in cloud computing due to their reconfigurability, energy efficiency, and ability to accelerate domain-specific workloads. As FPGA adoption grows, research into task scheduling and preemption techniques has intensified. However, the field lacks a standardized benchmarking framework for consistent and reproducible evaluation. Many existing studies propose innovative scheduling or preemption mechanisms but often rely on proprietary or synthetic benchmarks, limiting generalizability and making comparison difficult. This methodological fragmentation hinders effective evaluation of scheduling strategies and preemption in multi-tenant FPGA environments. This paper presents the first open-source preemption-enabled benchmark suite for evaluating FPGA preemption strategies and testing new scheduling algorithms, without requiring users to create preemption workloads from scratch. The suite includes 27 diverse applications spanning cryptography, AI/ML, computation-intensive workloads, communication systems, and multimedia processing. Each benchmark integrates comprehensive context-saving and restoration mechanisms, facilitating reproducible research and consistent comparisons. Our suite not only simplifies testing FPGA scheduling policies but also benefits OS research by enabling the evaluation of scheduling fairness, resource allocation efficiency, and context-switching performance in multi-tenant FPGA systems, ultimately supporting the development of better operating systems and scheduling policies for FPGA-based environments. We also provide guidelines for adding new benchmarks, enabling future research to expand and refine FPGA preemption and scheduling evaluation.

Guidelines for Building Indexes on Partially Cache-Coherent CXL Shared Memory

2025-11-09T16:55:00Z

The Partial Cache-Coherence (PCC) model maintains hardware cache coherence only within subsets of cores, enabling large-scale memory sharing with emerging memory interconnect technologies like Compute Express Link (CXL). However, PCC's relaxation of global cache coherence compromises the correctness of existing single-machine software. This paper focuses on building consistent and efficient indexes on PCC platforms. We show that existing indexes designed for cache-coherent platforms can be made consistent on PCC platforms by following the SP guidelines, i.e., we identify sync-data and protected-data according to the index's concurrency control mechanisms, and synchronize them accordingly. However, conversion with the SP guidelines introduces performance overhead. To mitigate the overhead, we identify several unique performance bottlenecks on PCC platforms, and propose the P³ guidelines (i.e., using Out-of-Place update, RePlicated shared variable, SPeculative Reading) to improve the efficiency of converted indexes on PCC platforms. With the SP and P³ guidelines, we convert and optimize two indexes (CLevelHash and BwTree) for PCC platforms. Evaluation shows that the converted indexes' throughput improves by up to 16× when following the P³ guidelines, and the optimized indexes outperform their message-passing-based and disaggregated-memory-based counterparts by up to 16× and 19×, respectively.

Towards Timing Isolation for Mixed-Criticality Communication in Software-Defined Vehicles

2025-11-04T10:13:45Z

As the automotive industry transitions toward centralized Linux-based architectures, ensuring the predictable execution of mixed-criticality applications becomes essential. However, concurrent use of the Linux network stack introduces interference, resulting in unpredictable latency and jitter. To address this challenge, we present a layered software architecture that enforces timing isolation for Ethernet-based data exchange between mixed-criticality applications on Linux-based automotive control units. Our approach integrates traffic prioritization strategies at the middleware layer, the network stack layer, and the hardware layer to achieve isolation across the full software stack. At the middleware layer, we implement a fixed-priority, non-preemptive scheduler to manage publishers of varying criticality. At the network layer, we leverage the express data path (XDP) to route high-priority data directly from the network interface driver into critical application memory, bypassing the standard Linux network stack. At the hardware layer, we dedicate a network interface card (NIC) queue exclusively to real-time traffic. We demonstrate how our architecture performs in a Data Distribution Service (DDS)-based system. Our evaluation shows that the approach leads to consistent and predictable latencies for real-time traffic, even under heavy interference from best-effort applications.
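
The middleware-layer policy named above, a fixed-priority, non-preemptive scheduler, can be sketched as follows. This is an illustration only, not the authors' implementation: the class name, the convention that a lower number means higher criticality, and the publisher names in the usage example are all assumptions of the sketch.

```python
import heapq
from itertools import count


class FixedPriorityScheduler:
    """Non-preemptive, fixed-priority dispatch of publish requests.

    Lower number = higher criticality (an assumption of this sketch).
    """

    def __init__(self):
        self._queue = []
        self._order = count()  # tie-breaker: FIFO within one priority level

    def submit(self, priority, name, duration):
        heapq.heappush(self._queue, (priority, next(self._order), name, duration))

    def run(self, arrivals):
        """arrivals: list of (arrival_time, priority, name, duration).

        Returns a list of (name, start_time, finish_time).
        """
        arrivals = sorted(arrivals)
        now, i, timeline = 0, 0, []
        while i < len(arrivals) or self._queue:
            # Admit every request that has arrived by the current time.
            while i < len(arrivals) and arrivals[i][0] <= now:
                _, prio, name, dur = arrivals[i]
                self.submit(prio, name, dur)
                i += 1
            if not self._queue:
                now = arrivals[i][0]  # idle until the next arrival
                continue
            _, _, name, dur = heapq.heappop(self._queue)
            # Non-preemptive: the request runs to completion even if a more
            # critical one arrives in the meantime.
            timeline.append((name, now, now + dur))
            now += dur
        return timeline


schedule = FixedPriorityScheduler().run([
    (0, 2, "infotainment", 4),
    (1, 0, "brake", 1),
    (1, 1, "adas", 2),
])
print(schedule)
```

In this example the low-criticality request that arrived first is not interrupted, so the most critical request waits until time 4. That bounded blocking is the usual cost of non-preemptive dispatch, and it is one reason the remaining isolation is pushed into the network stack and hardware layers.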

Fast Networks for High-Performance Distributed Trust

2025-11-01T02:10:13Z

Organizations increasingly need to collaborate by performing a computation on their combined dataset, while keeping their data hidden from each other. Certain kinds of collaboration, such as collaborative data analytics and AI, require a level of performance beyond what current cryptographic techniques for distributed trust can provide. This is because the organizations run software in different trust domains, which can require them to communicate over WANs or the public Internet. In this paper, we explore how to instead run such applications using fast datacenter-type LANs. We show that, by carefully redesigning distributed trust frameworks for LANs, we can achieve up to order-of-magnitude better performance than naïvely using a LAN. Then, we develop deployment models for Distributed But Proximate Trust (DBPT) that allow parties to use a LAN while remaining physically and logically distinct. These developments make secure collaborative data analytics and AI significantly more practical and set new research directions for developing systems and cryptographic theory for high-performance distributed trust.

LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS

2025-10-31T21:48:48Z

We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems.

Fix: externalizing network I/O in serverless computing

2025-10-31T19:05:42Z

We describe a system for serverless computing where users, programs, and the underlying platform share a common representation of a computation: a deterministic procedure, run in an environment of well-specified data or the outputs of other computations. This representation externalizes I/O: data movement over the network is performed exclusively by the platform. Applications can describe the precise data needed at each stage, helping the provider schedule tasks and network transfers to reduce starvation. The design suggests an end-to-end argument for outsourced computing, shifting the service model from "pay-for-effort" to "pay-for-results."

Supply Chain Exploitation of Secure ROS 2 Systems: A Proof-of-Concept on Autonomous Platform Compromise via Keystore Exfiltration

2025-10-31T17:27:10Z

This paper presents a proof-of-concept supply chain attack against the Secure ROS 2 (SROS 2) framework, demonstrated on a Quanser QCar2 autonomous vehicle platform. A Trojan-infected Debian package modifies core ROS 2 security commands to exfiltrate newly generated keystore credentials via DNS in base64-encoded chunks to an attacker-controlled nameserver. Possession of these credentials enables the attacker to rejoin the SROS 2 network as an authenticated participant and publish spoofed control or perception messages without triggering authentication failures. We evaluate this capability on a secure ROS 2 Humble testbed configured for a four-stop-sign navigation routine using an Intel RealSense camera for perception. Experimental results show that control-topic injections can cause forced braking, sustained high-speed acceleration, and continuous turning loops, while perception-topic spoofing can induce phantom stop signs or suppress real detections. The attack generalizes to any data distribution service (DDS)-based robotic system using SROS 2, highlighting the need for both supply chain integrity controls and runtime semantic validation to safeguard autonomous systems against insider and impersonation threats.

A Practical-Driven Framework for Transitioning Drive-by-Wire to Autonomous Driving Systems: A Case Study with a Chrysler Pacifica Hybrid Vehicle

2025-10-31T04:58:08Z

Transitioning from a Drive-by-Wire (DBW) system to a fully autonomous driving system (ADS) involves multiple stages of development and demands robust positioning and sensing capabilities. This paper presents a practice-driven framework for facilitating the DBW-to-ADS transition using a 2022 Chrysler Pacifica Hybrid Minivan equipped with cameras, LiDAR, GNSS, and onboard computing hardware configured with the Robot Operating System (ROS) and Autoware.AI. The implementation showcases offline autonomous operations utilizing pre-recorded LiDAR and camera data, point clouds, and vector maps, enabling effective localization and path planning within a structured test environment. The study addresses key challenges encountered during the transition, particularly those related to wireless-network-assisted sensing and positioning. It offers practical solutions for overcoming software incompatibility constraints, sensor synchronization issues, and limitations in real-time perception. Furthermore, the integration of sensing, data fusion, and automation is emphasized as a critical factor in supporting autonomous driving systems in map generation, simulation, and training. Overall, the transition process outlined in this work aims to provide actionable strategies for researchers pursuing DBW-to-ADS conversion. It offers direction for incorporating real-time perception, GNSS-LiDAR-camera integration, and fully ADS-equipped autonomous vehicle operations, thus contributing to the advancement of robust autonomous vehicle technologies.

Oneiros: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving

2025-10-29T21:56:19Z

KV cache accelerates LLM inference by avoiding redundant computation, at the expense of memory. To support larger KV caches, prior work extends GPU memory with CPU memory via CPU-offloading. This involves swapping KV cache between GPU and CPU memory. However, because the cache updates dynamically, such swapping incurs high CPU memory traffic. We make a key observation that model parameters remain constant during runtime, unlike the dynamically updated KV cache. Building on this, we introduce Oneiros, which avoids KV cache swapping by remapping, and thereby repurposing, the memory allocated to model parameters for KV cache. This parameter remapping is especially beneficial in multi-tenant environments, where the memory used for the parameters of inactive models can be more aggressively reclaimed. Exploiting the high CPU-GPU bandwidth offered by modern hardware, such as the NVIDIA Grace Hopper Superchip, we show that Oneiros significantly outperforms state-of-the-art solutions, achieving a reduction of 44.8%-82.5% in tail time-between-token latency, 20.7%-99.3% in tail time-to-first-token latency, and 6.6%-86.7% higher throughput compared to vLLM. Source code of Oneiros is available at https://github.com/UT-SysML/Oneiros/.

Modeling and Scheduling of Fusion Patterns in Autonomous Driving Systems (Extended Version)

2025-10-27T22:05:54Z

In Autonomous Driving Systems (ADS), Directed Acyclic Graphs (DAGs) are widely used to model complex data dependencies and inter-task communication. However, existing DAG scheduling approaches oversimplify data fusion tasks by assuming fixed triggering mechanisms, failing to capture the diverse fusion patterns found in real-world ADS software stacks. In this paper, we propose a systematic framework for analyzing various fusion patterns and their performance implications in ADS. Our framework models three distinct fusion task types: timer-triggered, wait-for-all, and immediate fusion, which comprehensively represent real-world fusion behaviors. Our Integer Linear Programming (ILP)-based approach enables the optimization of multiple real-time performance metrics, including reaction time, time disparity, age of information, and response time, while generating deterministic offline schedules directly applicable to real platforms. Evaluation using real-world ADS case studies, a Raspberry Pi implementation, and randomly generated DAGs demonstrates that our framework handles diverse fusion patterns beyond the scope of existing work, and achieves substantial performance improvements in comparable scenarios.
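
To make the three fusion task types concrete, the sketch below computes when a fusion task would fire under each pattern, given the arrival times of its inputs. The triggering semantics encoded here are one plausible reading of the pattern names, not definitions taken from the paper, and the function names and sensor data are hypothetical.

```python
def timer_triggered(period, horizon):
    """Fire every `period` time units, independent of input arrivals."""
    return list(range(period, horizon + 1, period))


def wait_for_all(arrivals):
    """Fire once a fresh sample has arrived on every input since the last firing."""
    events = sorted((t, ch) for ch, times in arrivals.items() for t in times)
    fresh, firings = set(), []
    for t, ch in events:
        fresh.add(ch)
        if len(fresh) == len(arrivals):
            firings.append(t)
            fresh.clear()
    return firings


def immediate(arrivals):
    """Fire on every arrival once each input has delivered at least one sample."""
    events = sorted((t, ch) for ch, times in arrivals.items() for t in times)
    seen, firings = set(), []
    for t, ch in events:
        seen.add(ch)
        if len(seen) == len(arrivals):
            firings.append(t)
    return firings


# Hypothetical arrival times (in ms) for two sensors feeding one fusion task.
sensor_arrivals = {"lidar": [10, 20, 30], "camera": [3, 6, 9, 12]}

print(timer_triggered(period=8, horizon=30))  # [8, 16, 24]
print(wait_for_all(sensor_arrivals))          # [10, 20]
print(immediate(sensor_arrivals))             # [10, 12, 20, 30]
```

Even on this small input the three patterns produce different firing times, which suggests why a scheduler that assumes a single fixed trigger can misjudge metrics such as time disparity or age of information.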

Unlocking True Elasticity for the Cloud-Native Era with Dandelion

2025-10-27T15:18:44Z

Elasticity is fundamental to cloud computing, as it enables quickly allocating resources to match the demand of each workload as it arrives, rather than pre-provisioning resources to meet performance objectives. However, even serverless platforms -- which boot sandboxes in tens to hundreds of milliseconds -- are not sufficiently elastic to avoid over-provisioning expensive resources. Today's FaaS platforms rely on pre-provisioning many idle sandboxes in memory to reduce the occurrence of slow cold starts. A key obstacle to high elasticity is booting a guest OS and configuring features like networking in sandboxes, which are required to expose an isolated POSIX-like interface to user functions. Our key insight is that redesigning the interface for applications in the cloud-native era enables co-designing a much more efficient and elastic execution system. Now is a good time to rethink cloud abstractions as developers are building applications to be cloud-native. Cloud-native applications typically consist of user-provided compute logic interacting with cloud services (for storage, AI inference, query processing, etc.) exposed over REST APIs. Hence, we propose Dandelion, an elastic cloud platform with a declarative programming model that expresses applications as DAGs of pure compute functions and higher-level communication functions. Dandelion can securely execute untrusted user compute functions in lightweight sandboxes that cold start in hundreds of microseconds, since pure functions do not rely on extra software environments such as a guest OS. Dandelion makes it practical to boot a sandbox on-demand for each request, decreasing performance variability by two to three orders of magnitude compared to Firecracker and reducing committed memory by 96% on average when running the Azure Functions trace.
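
The programming model described above, an application expressed as a DAG of pure compute functions and communication functions, can be sketched in a few lines. This is not Dandelion's actual API: the `APP` table, the `run_dag` executor, and the in-memory dictionary standing in for a REST-backed storage service are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Stand-in for a cloud storage service reached over REST (hypothetical data).
FAKE_OBJECT_STORE = {
    "doc-1": "elastic clouds boot fast",
    "doc-2": "cold starts are slow",
}


def fetch(keys):
    """Communication function: the platform, not user code, performs the I/O."""
    return [FAKE_OBJECT_STORE[k] for k in keys]


def count_words(docs):
    """Pure compute function: the output depends only on the input."""
    return [len(d.split()) for d in docs]


def total(counts):
    """Pure compute function."""
    return sum(counts)


# node -> (function, upstream nodes whose outputs become its arguments)
APP = {
    "fetch": (fetch, ["request"]),
    "count": (count_words, ["fetch"]),
    "total": (total, ["count"]),
}


def run_dag(app, request):
    """Execute nodes in dependency order, passing outputs along the edges."""
    results = {"request": request}
    order = TopologicalSorter({n: set(deps) for n, (_, deps) in app.items()})
    for node in order.static_order():
        if node in results:  # the request itself is an input, not a function
            continue
        func, deps = app[node]
        results[node] = func(*(results[d] for d in deps))
    return results


print(run_dag(APP, ["doc-1", "doc-2"])["total"])  # 8
```

Because the compute nodes touch nothing outside their arguments, an executor of this shape needs no networking or file system inside the compute sandbox, which is the property the abstract credits for the fast cold starts.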

Jenga: Responsive Tiered Memory Management without Thrashing

2025-10-26T23:21:44Z

A heterogeneous memory has a single address space with fast access to some addresses (a fast tier of DRAM) and slow access to other addresses (a capacity tier of CXL-attached memory or NVM). A tiered memory system aims to maximize the number of accesses to the fast tier via page migrations between the fast and capacity tiers. Unfortunately, previous tiered memory systems can perform poorly due to (1) allocating hot and cold objects in the same page and (2) abrupt changes in hotness measurements that lead to thrashing. This paper presents Jenga, a tiered memory system that addresses both problems. Jenga's memory allocator uses a novel context-based page allocation strategy. Jenga's accurate measurements of page hotness enable it to react to memory access behavior changes in a timely manner while avoiding thrashing. Compared to the best previous tiered memory system, Jenga runs memory-intensive applications 28% faster across 10 applications, when the fast tier capacity matches the working set size, at a CPU overhead of <3% of a single core and a memory overhead of <0.3%.

LatticeHashForest: An Efficient Data Structure for Repetitive Data and Operations

2025-10-26T05:05:00Z

Analysis of entire programs as a single unit, or whole-program analysis, involves propagation of large amounts of information through the control flow of the program. This is especially true for pointer analysis, where, unless significant compromises are made in the precision of the analysis, there is a combinatorial blowup of information. One of the key problems we observed in our own efforts to this end is that a lot of duplicate data was being propagated, and many low-level data structure operations were repeated a large number of times. We present what we consider to be a novel and generic data structure, LatticeHashForest (LHF), to store and operate on such data in a manner that eliminates a majority of redundant computations and duplicate data in scenarios similar to those encountered in compilers and program optimization. LHF differs from similar work in this vein, such as hash-consing, ZDDs, and BDDs, by not only providing a way to efficiently operate on large, aggregate structures, but also modifying the elements of such structures in a manner that they can be deduplicated immediately. LHF also provides a way to perform a nested construction of elements such that they can be deduplicated at multiple levels, cutting down the need for additional, nested computations. We provide a detailed structural description, along with an abstract model of this data structure. An entire C++ implementation of LHF is provided as an artifact along with evaluations of LHF using examples and benchmark programs. We also supply API documentation and a user manual for users to make independent applications of LHF. Our main use case in the realm of pointer analysis shows memory usage reduction to an almost negligible fraction, and speedups beyond 4x for input sizes approaching 10 million when compared to other implementations.
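
The kind of redundancy the abstract targets can be illustrated with a much simpler mechanism: interning sets behind integer handles and memoizing operations on those handles. The sketch below is closer to plain hash-consing with an operation cache than to LHF itself, and it omits the nested construction and element-level deduplication the abstract describes. The `SetStore` class and its methods are hypothetical names, not the API of the C++ artifact.

```python
class SetStore:
    """Interns immutable sets and memoizes unions over their integer handles."""

    def __init__(self):
        self._id_of = {}         # frozenset -> handle
        self._sets = []          # handle -> frozenset
        self._union_cache = {}   # (handle, handle) -> handle
        self.union_computations = 0

    def intern(self, items):
        """Return the unique handle for this set, storing it only once."""
        key = frozenset(items)
        if key not in self._id_of:
            self._id_of[key] = len(self._sets)
            self._sets.append(key)
        return self._id_of[key]

    def get(self, handle):
        return self._sets[handle]

    def union(self, a, b):
        if a == b:
            return a                       # idempotence: no work needed
        key = (a, b) if a < b else (b, a)  # commutativity: one cache entry
        if key not in self._union_cache:
            self.union_computations += 1
            self._union_cache[key] = self.intern(self._sets[a] | self._sets[b])
        return self._union_cache[key]


store = SetStore()
a = store.intern({1, 2})
b = store.intern({2, 3})
same_as_a = store.intern([2, 1])  # duplicate data maps to the same handle

u1 = store.union(a, b)
u2 = store.union(b, a)            # served from the cache
print(a == same_as_a, u1 == u2, store.union_computations)  # True True 1
```

With handles in place, equality checks become integer comparisons and a repeated operation costs one dictionary lookup. In a dataflow analysis that propagates the same points-to sets through many program points, both effects compound.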

Tidying Up the Address Space

2025-10-22T16:50:49Z

Memory tiering in datacenters does not achieve its full potential due to hotness fragmentation -- the intermingling of hot and cold objects within memory pages. This fragmentation prevents page-based reclamation systems from distinguishing truly hot pages from pages containing mostly cold objects, fundamentally limiting memory efficiency despite highly skewed accesses. We introduce address-space engineering: dynamically reorganizing application virtual address spaces to create uniformly hot and cold regions that any page-level tiering backend can manage effectively. HADES demonstrates this frontend/backend approach through a compiler-runtime system that tracks and migrates objects based on access patterns, requiring minimal developer intervention. Evaluations across ten data structures show up to 70% memory reduction with 3% performance overhead, demonstrating that address-space engineering enables existing reclamation systems to reclaim memory aggressively without performance degradation.
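
Hotness fragmentation, and the effect of regrouping objects by hotness, can be shown with a toy layout. The page size, the object identifiers, and the functions below are assumptions made for illustration. The real system tracks and migrates objects through a compiler and runtime, which this sketch does not attempt to model.

```python
PAGE_SLOTS = 4  # assumed number of objects per page


def pages_of(layout):
    """Split a flat object layout into fixed-size pages."""
    return [layout[i:i + PAGE_SLOTS] for i in range(0, len(layout), PAGE_SLOTS)]


def hot_pages(layout, hot):
    """Count pages holding at least one hot object; these cannot be reclaimed."""
    return sum(any(obj in hot for obj in page) for page in pages_of(layout))


def regroup(layout, hot):
    """Place hot objects first so hot and cold objects stop sharing pages."""
    return [o for o in layout if o in hot] + [o for o in layout if o not in hot]


layout = list(range(16))   # 16 objects spread over 4 pages
hot = {0, 5, 10, 15}       # one hot object on each page

print(hot_pages(layout, hot))                # 4
print(hot_pages(regroup(layout, hot), hot))  # 1
```

Before regrouping, a single hot object pins every page, so a page-level backend sees nothing to reclaim. After regrouping, three of the four pages hold only cold objects, even though no object changed its access pattern.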