http://arxiv.org/abs/2602.08199v2
Fork, Explore, Commit: OS Primitives for Agentic Exploration
2026-03-19T04:38:30Z
AI agents increasingly perform agentic exploration: pursuing multiple solution paths in parallel and committing only the successful one. Because each exploration path may modify files and spawn processes, agents require isolated environments with atomic commit and rollback semantics for both filesystem state and process state. We introduce the branch context, a new OS abstraction that provides: (1) copy-on-write state isolation with independent filesystem views and process groups, (2) a structured lifecycle of fork, explore, and commit/abort, (3) first-commit-wins resolution that automatically invalidates sibling branches, and (4) nestable contexts for hierarchical exploration. We realize branch contexts in Linux through two complementary components. First, BranchFS is a FUSE-based filesystem that gives each branch context an isolated copy-on-write workspace, with O(1) creation, atomic commit to the parent, and automatic sibling invalidation, all without root privileges. BranchFS is open sourced in https://github.com/multikernel/branchfs, along with a Python integration library, BranchContext, that provides ready-to-use agent exploration patterns. Second, branch() is a proposed Linux syscall that spawns processes into branch contexts with reliable termination, kernel-enforced sibling isolation, and first-commit-wins coordination.
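The fork, explore, and commit/abort lifecycle with first-commit-wins resolution can be illustrated with a toy in-memory model. This is only a sketch: every class and method name below is hypothetical and is not the BranchFS, BranchContext, or branch() interface; a dict stands in for the filesystem, process state is not modeled, and nesting is omitted.

```python
# Toy in-memory model of the fork / explore / commit-or-abort lifecycle.
# All names are hypothetical; this is NOT the BranchFS or BranchContext API.

class BranchInvalidated(Exception):
    """Raised when a branch tries to commit after a sibling already won."""


class Workspace:
    """Parent state: a dict standing in for a filesystem, plus a version counter."""

    def __init__(self, files=None):
        self.files = dict(files or {})
        self.version = 0  # bumped on every successful commit

    def fork(self):
        return Branch(self)


class Branch:
    """Copy-on-write view: writes stay local, reads fall through to the parent.

    Simplification: reads see the parent's current state; a real
    implementation would pin the snapshot taken at fork time.
    """

    def __init__(self, parent):
        self.parent = parent
        self.base_version = parent.version
        self.delta = {}  # only modified paths are stored, so fork copies nothing

    def read(self, path):
        if path in self.delta:
            return self.delta[path]
        return self.parent.files.get(path)

    def write(self, path, data):
        self.delta[path] = data

    def commit(self):
        # First commit wins: a sibling forked from the same version is now stale.
        if self.parent.version != self.base_version:
            raise BranchInvalidated("a sibling branch committed first")
        self.parent.files.update(self.delta)
        self.parent.version += 1

    def abort(self):
        self.delta.clear()


ws = Workspace({"main.py": "v0"})
a, b = ws.fork(), ws.fork()
a.write("main.py", "fix A")
b.write("main.py", "fix B")
a.commit()  # succeeds and implicitly invalidates b
try:
    b.commit()
    b_committed = True
except BranchInvalidated:
    b_committed = False
```

Storing only the modified paths is what makes branch creation independent of the size of the parent state in this toy model; the version check at commit time is one simple way to express sibling invalidation.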
Preliminary evaluation of BranchFS shows sub-350 us branch creation independent of base filesystem size, and modification-proportional commit overhead (under 1 ms for small changes).
2026-02-09T01:46:52Z
Cong Wang, Yusheng Zheng

http://arxiv.org/abs/2405.11440v3
A Model Consistency-Based Countermeasure to GAN-Based Data Poisoning Attack in Federated Learning
2026-03-19T04:19:58Z
In federated learning (FL), although the original intention of available but not visible data is to allay data privacy concerns, it potentially brings new security threats, particularly poisoning attacks that target such not visible local data. Intuitively, such data poisoning attacks have great potential in stealthily degrading global FL outcomes, and are expected to be even stealthier if enhanced by generative models like generative adversarial networks (GANs). However, existing defense methods have not been thoroughly challenged in this regard and generally fail to be aware of a local generation of seemingly legitimate poisoned data. With growing concern about potentially stealthier attacks, in this paper, a cost-effective defense mechanism named Model Consistency-Based Defense (MCD) is proposed, which offers a comprehensive examination of available local models across multiple feature dimensions, providing an indirect yet effective means of identifying hidden data poisoning attackers. To push the limit of MCD against stealthier attacks, we propose a new GAN-based data poisoning attack model named VagueGAN and an unsupervised variant of it, which can be flexibly deployed to generate seemingly legitimate but noisy poisoned data. The consistency of GAN outputs revealed by VagueGAN helps strengthen MCD to work against stealthier GAN-based attacks as well as other mainstream ones.
Extensive experiments on multiple open datasets (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and Mini-Imagenet) indicate that our attack method better balances the trade-off between attack effectiveness and stealthiness with low complexity. More importantly, our defense mechanism is shown to be more competent in identifying a variety of poisoned data, particularly stealthier GAN-poisoned ones.
2024-05-19T04:23:40Z
18 pages, 16 figures
Wei Sun, Bo Gao, Ke Xiong, Yuwei Wang, Pingyi Fan, Khaled Ben Letaief

http://arxiv.org/abs/2603.01179v2
A402: Binding Cryptocurrency Payments to Service Execution for Agentic Commerce
2026-03-19T03:37:28Z
The rapid proliferation of autonomous AI agents is driving a shift toward agentic commerce, where agents are expected to autonomously invoke and pay for services. While blockchain-based payments offer a programmable foundation for such interactions, the recently proposed x402 standard fails to enforce end-to-end atomicity across service execution, payment, and result delivery.
In this paper, we present A402, a trust-minimized payment architecture that securely binds cryptocurrency payments to service execution. A402 introduces Atomic Service Channels (ASCs), a new channel protocol that integrates service execution into payment channels, enabling real-time, high-frequency micropayments for agentic commerce. Within each ASC, A402 employs an atomic exchange protocol based on TEE-assisted adaptor signatures, ensuring that payments are finalized if and only if the requested service is correctly executed and the corresponding result is delivered. To further ensure privacy, A402 incorporates a TEE-based Liquidity Vault that privately manages the lifecycle of ASCs and aggregates their settlements into a single on-chain transaction, revealing only aggregated balances.
We implement A402 and evaluate it against x402 with integrations on both Bitcoin and Ethereum. Our results show that A402 delivers orders-of-magnitude performance and on-chain cost improvements over x402 while providing trust-minimized security guarantees.
2026-03-01T16:45:22Z
Yue Li, Lei Wang, Kaixuan Wang, Zhiqiang Yang, Ke Wang, Zhi Guan, Jianbo Gao

http://arxiv.org/abs/2601.08082v2
Hierarchical Precision and Recursion for Accelerating Symmetric Linear Solves on MXUs
2026-03-19T03:32:53Z
Symmetric linear solves are fundamental to a wide range of scientific and engineering applications, from climate modeling and structural analysis to machine learning and optimization. These workloads often rely on Cholesky (POTRF) decomposition and its supporting operations, triangular solves (TRSM) and symmetric rank-k updates (SYRK), which together form the computational core for solving symmetric positive-definite systems. To accelerate these kernels, we present a portable, mixed-precision solver designed for Matrix Processing Units (MXUs), including NVIDIA Tensor Cores (H200) and AMD Matrix Cores (MI300X). Our algorithm builds on a nested recursive formulation in which Cholesky exposes parallelism through recursive decomposition of its TRSM and SYRK sub-problems. This structure yields a hierarchical recursion that maximizes GEMM throughput while enabling fine-grained control over numerical precision. We introduce a custom recursive data structure that assigns low-precision FP16 arithmetic to large off-diagonal blocks, while preserving high precision on diagonal blocks to ensure numerical stability. The solver is implemented in Julia, leveraging array programming, multiple dispatch, and dynamic type inference to enable seamless expression of mixed-precision computation. This design provides a high-level, hardware-agnostic interface while efficiently interfacing with low-level vendor libraries for backend portability.
On H200, our recursive FP64 SYRK achieves a 14x speedup over cuBLAS, while mixed-precision delivers up to 27x speedup in SYRK and 5x in TRSM over full-precision baselines. This results in a 5x overall speedup for Cholesky versus cuSOLVER FP64, with 100x better accuracy than pure FP16 while retaining 88% of its peak speedup. Comparable performance and accuracy trends are observed on MI300X, demonstrating broad applicability across GPUs.
2026-01-12T23:46:20Z
10 pages, 11 figures
Vicki Carrica, Rabab Alomairy, Evelyne Ringoot, Alan Edelman

http://arxiv.org/abs/2502.00340v2
Unlocking Full Efficiency of Token Filtering in Large Language Model Training
2026-03-19T03:23:04Z
Token filtering has been proposed to enhance the utility of large language models (LLMs) by eliminating inconsequential tokens during training. While using fewer tokens is expected to reduce computational workloads, existing methods have not yet achieved a real-world efficiency boost. This is primarily due to two factors: (1) existing work has inadequate sparsity for speedup, and (2) token filtering operates within a sparsity range that is non-standard in existing machine learning (ML) libraries and thus cannot be efficiently supported. This paper presents Centrifuge, a system that leverages algorithm and system co-design to unleash the full efficiency of token filtering in LLM training. At the algorithm level, Centrifuge filters activations of inconsequential tokens in the attention backward kernel to amplify the sparsity in backward computation. At the system level, Centrifuge proposes an automatic workflow that transforms sparse GEMM into dimension-reduced dense GEMM for optimized efficiency using standard ML libraries. Evaluations on models with various scales--from 1.1B to 40B--demonstrate that Centrifuge reduces backpropagation time by up to 49.9\% and end-to-end training time by up to 34.7\% when filtering 50\% of tokens.
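The general idea of turning a row-sparse GEMM into a smaller dense GEMM can be sketched as gather, multiply, scatter. The example below is a minimal illustration under stated assumptions, not Centrifuge's workflow: plain Python lists stand in for tensors, the sparsity is whole-row (per-token), and all function names are hypothetical.

```python
# Sketch: a row-sparse GEMM rewritten as a dimension-reduced dense GEMM.
# Pure-Python lists stand in for tensors; all names are hypothetical.

def matmul(a, b):
    """Dense matrix product of a (n x d) and b (d x k)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def masked_gemm_reference(x, w, keep):
    """Baseline: zero the filtered token rows, then multiply at full size."""
    masked = [row if k else [0.0] * len(row) for row, k in zip(x, keep)]
    return matmul(masked, w)


def reduced_dense_gemm(x, w, keep):
    """Gather kept rows, run a dense GEMM on the smaller matrix, scatter back."""
    kept_idx = [i for i, k in enumerate(keep) if k]
    compact = [x[i] for i in kept_idx]        # (n_kept x d)
    compact_out = matmul(compact, w)          # dense, but with fewer rows
    out = [[0.0] * len(w[0]) for _ in x]      # filtered rows stay zero
    for i, row in zip(kept_idx, compact_out):
        out[i] = row
    return out


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 tokens, hidden size 2
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]                 # 2 x 3 weight matrix
keep = [True, False, True, False]                       # filter 50% of tokens

reference = masked_gemm_reference(x, w, keep)
reduced = reduced_dense_gemm(x, w, keep)
```

Both paths produce the same output, but the reduced path multiplies only the kept rows, so the dense kernel does work proportional to the number of surviving tokens.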
Utility assessments indicate that Centrifuge preserves the utility benefits of token filtering and significantly enhances model performance by up to 26.6\% compared to standard training. Centrifuge is designed for seamless integration into existing LLM training frameworks, enabling systems already utilizing token filtering to accelerate training with just one line of code.
2025-02-01T06:57:01Z
Di Chai, Pengbo Li, Feiyuan Zhang, Yilun Jin, Han Tian, Kaiqiang Xu, Binhang Yuan, Dian Shen, Junxue Zhang, Kai Chen

http://arxiv.org/abs/2603.18383v1
From Servers to Sites: Compositional Power Trace Generation of LLM Inference for Infrastructure Planning
2026-03-19T01:01:41Z
Datacenter operators and electrical utilities rely on power traces at different spatiotemporal scales. Operators use fine-grained traces for provisioning, facility management, and scheduling, while utilities use site-level load profiles for capacity and interconnection planning. Existing datacenter power models do not capture LLM inference workloads, in which GPUs shift rapidly among compute-intensive prefill, lower-power decode, and idle states, and facility demand depends on how these states evolve and synchronize across many devices. We show that LLM inference power can be represented compositionally through two components: workload-driven transitions among operating states and configuration-specific power distributions within those states. Building on this observation, we develop a trace-generation framework that learns from measured traces and synthesizes power profiles for new traffic conditions and serving configurations. These traces aggregate from GPU servers to rack-, row-, and facility-scale load profiles at the temporal granularity required by the study.
Across multiple LLMs, tensor-parallel settings, and GPU generations, our framework achieves median absolute energy error below 5% for most configurations while preserving temporal autocorrelation structure. The resulting traces support downstream analyses including oversubscription, power modulation, and utility-facing load characterization, enabling infrastructure evaluations that flat nameplate assumptions and static trace replay cannot support.
2026-03-19T01:01:41Z
Grant Wilkins, Fiodar Kazhamiaka, Ram Rajagopal

http://arxiv.org/abs/2412.15411v5
Sparse Checkpointing for Fast and Reliable MoE Training
2026-03-19T00:31:05Z
As large language models scale, training them requires thousands of GPUs over extended durations--making frequent failures an inevitable reality. While checkpointing remains the primary fault-tolerance mechanism, existing methods fall short when applied to Mixture-of-Experts (MoE) models. Due to their substantially larger training state, MoE models exacerbate checkpointing overheads, often causing costly stalls or prolonged recovery that severely degrade training efficiency.
We present MoEvement, a distributed, in-memory checkpointing system tailored for MoE models. MoEvement is built on three key ideas: (1) sparse checkpointing, which incrementally snapshots subsets of experts across iterations to reduce overhead; (2) a sparse-to-dense checkpoint conversion mechanism that incrementally reconstructs consistent dense checkpoints from sparse snapshots; and (3) upstream logging of activations and gradients at pipeline-stage boundaries, enabling localized recovery without re-executing unaffected workers. Evaluations across diverse MoE models with up to 64 experts show that MoEvement reduces checkpointing overhead by up to $4\times$ and recovery overhead by up to $31\times$ compared to state-of-the-art approaches, sustaining ETTR $\ge 0.94$ even under frequent failures (MTBF as low as 10 minutes) and delivering up to $8\times$ overall training speedup, all without compromising synchronous training semantics. Overall, MoEvement offers a robust and scalable fault-tolerance solution for the next generation of sparsely activated models.
2024-12-19T21:34:44Z
NSDI'26 | Camera-Ready
Swapnil Gandhi, Christos Kozyrakis

http://arxiv.org/abs/2603.12214v2
WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows
2026-03-18T21:48:39Z
This work pursues automated planning and scheduling of distributed data pipelines, or workflows. We develop a general workflow and resource graph representation that includes both data processing and sharing components with corresponding network interfaces for scheduling. Leveraging these graphs, we introduce WORKSWORLD, a new domain for numeric domain-independent planners designed for permanently scheduled workflows, like ingest pipelines. Our framework permits users to define data sources, available workflow components, and desired data destinations and formats without explicitly declaring the entire workflow graph as a goal.
The planner solves a joint planning and scheduling problem, producing a plan that both builds the workflow graph and schedules its components on the resource graph. We empirically show that a state-of-the-art numeric planner running on commodity hardware with one hour of CPU time and 30GB of memory can solve linear-chain workflows of up to 14 components across eight sites.
2026-03-12T17:34:04Z
To be published in Proceedings of the International Conference on Automated Planning and Scheduling Volume 36 (2026)
Taylor Paul, William Regli

http://arxiv.org/abs/2505.01821v5
Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey
2026-03-18T13:31:23Z
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems, yet introduce significant challenges in model deployment and resource management. In this survey, we comprehensively examine the intersection of distributed intelligence and model optimization within edge-cloud environments, providing a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. Additionally, we systematically analyze model optimization approaches, including compression, adaptation, and neural architecture search, alongside AI-driven resource management strategies that balance performance, energy efficiency, and latency requirements. We further explore critical aspects of privacy protection and security enhancement within ECCC systems and examine practical deployments through diverse applications, spanning autonomous driving, healthcare, and industrial automation.
Performance analysis and benchmarking techniques are also thoroughly explored to establish evaluation standards for these complex systems. Furthermore, the review identifies critical research directions including LLM deployment, 6G integration, neuromorphic computing, and quantum computing, offering a roadmap for addressing persistent challenges in heterogeneity management, real-time processing, and scalability. By bridging theoretical advancements and practical deployments, this survey offers researchers and practitioners a holistic perspective on leveraging AI to optimize distributed computing environments, fostering innovation in next-generation intelligent systems.
2025-05-03T13:55:38Z
Accepted by IEEE ComST. 45 pages, 13 figures, 10 tables
Jing Liu, Yao Du, Kun Yang, Jiaqi Wu, Yan Wang, Xiping Hu, Zehua Wang, Yang Liu, Peng Sun, Azzedine Boukerche, Victor C. M. Leung
10.1109/COMST.2026.3669216

http://arxiv.org/abs/2210.06154v2
Aergia: Leveraging Heterogeneity in Federated Learning Systems
2026-03-18T13:06:42Z
Federated Learning (FL) is a popular approach for distributed deep learning that prevents the pooling of large amounts of data in a central server. FL relies on clients to update a global model using their local datasets. Classical FL algorithms use a central federator that, for each training round, waits for all clients to send their model updates before aggregating them. In practical deployments, clients might have different computing powers and network capabilities, which might lead slow clients to become performance bottlenecks. Previous works have suggested using a deadline for each learning round so that the federator ignores the late updates of slow clients, or so that clients send partially trained models before the deadline.
To speed up the training process, we instead propose Aergia, a novel approach where slow clients (i) freeze the part of their model that is the most computationally intensive to train; (ii) train the unfrozen part of their model; and (iii) offload the training of the frozen part of their model to a faster client that trains it using its own dataset. The offloading decisions are orchestrated by the federator based on the training speed that clients report and on the similarities between their datasets, which are privately evaluated thanks to a trusted execution environment. We show through extensive experiments that Aergia maintains high accuracy and significantly reduces the training time under heterogeneous settings by up to 27% and 53% compared to FedAvg and TiFL, respectively.
2022-10-12T12:59:18Z
This paper is accepted at the 23rd ACM/IFIP International Middleware Conference (Middleware '22). Updated version has minor textual improvements
Bart Cox, Lydia Y. Chen, Jérémie Decouchant

http://arxiv.org/abs/2603.17614v1
A mechanism design overview of Sedna
2026-03-18T11:25:48Z
Sedna is a coded multi-proposer consensus protocol in which a sender shards a transaction payload into rateless symbols and disseminates them across parallel proposer lanes, providing high throughput and ``until decode'' privacy. This paper studies a sharp incentive failure in such systems. A cartel of lane proposers can withhold the bundles addressed to its lanes, slowing the chain's symbol accumulation while privately pooling the missing symbols. Because finalized symbols become public, the cartel's multi-slot information lead is governed by a chain level delay event where the chain fails to accumulate the $κ$ bundles needed for decoding by the honest horizon $t^\star=\lceil κ/m\rceil$.
We characterize the resulting delay probability with KL-type large deviation bounds and show a knife edge pathology when the slack $Δ=t^\star m-κ$ is zero such that withholding a single bundle suffices to push inclusion into the next slot with high probability.
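The honest horizon and slack defined above are easy to work through numerically. The sketch below is a deterministic simplification, not the paper's probabilistic analysis: it assumes that exactly m bundles accumulate per slot when every lane is honest, and it counts a delay whenever the bundles withheld over the horizon exceed the slack. Function names are illustrative.

```python
# Worked example of the honest horizon t* = ceil(kappa / m) and the slack
# Delta = t* * m - kappa. Deterministic simplification: m bundles land per
# slot when all lanes are honest. Function names are illustrative.

def honest_horizon(kappa: int, m: int) -> int:
    """t* = ceil(kappa / m), computed with integer arithmetic."""
    return -(-kappa // m)


def slack(kappa: int, m: int) -> int:
    """Delta = t* * m - kappa: spare bundles available by the honest horizon."""
    return honest_horizon(kappa, m) * m - kappa


def decode_delayed(kappa: int, m: int, withheld_total: int) -> bool:
    """True if fewer than kappa bundles have accumulated by slot t*."""
    accumulated = honest_horizon(kappa, m) * m - withheld_total
    return accumulated < kappa


# Knife edge: kappa is a multiple of m, so the slack is zero and
# withholding a single bundle already delays decoding.
knife_edge = (honest_horizon(32, 8), slack(32, 8), decode_delayed(32, 8, 1))

# With slack 2, two withheld bundles are absorbed; a third causes a delay.
with_slack = (slack(30, 8), decode_delayed(30, 8, 2), decode_delayed(30, 8, 3))
```

In this simplified model, decoding is delayed exactly when the number of withheld bundles exceeds the slack, which is why zero slack is the fragile case.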
We propose \textsf{PIVOT-$K$}, a Sedna native pivotal bundle bounty that concentrates rewards on the $κ$ bundles that actually trigger decoding, and we derive explicit incentive compatibility conditions against partial and coalition deviations. We further show that an adaptive sender ``ratchet'' that excludes lanes whose tickets were not redeemed collapses multi-slot withholding into a first slot deficit when $t^\star\ge 2$, reducing the required bounty by orders of magnitude. We close by bounding irreducible within slot decode races and providing parameter guidance and numerical illustrations. Our results show that for realistic parameters Sedna can reduce MEV costs to 0.04\% of the transaction value.
2026-03-18T11:25:48Z
Benjamin Marsh, Alejandro Ranchal-Pedrosa

http://arxiv.org/abs/2603.11101v2
Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure
2026-03-18T08:30:14Z
Embodied intelligence is a key step towards Artificial General Intelligence (AGI), yet its development faces multiple challenges including data, frameworks, infrastructure, and evaluation systems. To address these issues, we have, for the first time in the industry, launched a cloud-based, thousand-GPU distributed training platform for embodied intelligence, built upon the widely adopted LeRobot framework, and have systematically overcome bottlenecks across the entire pipeline. At the data layer, we have restructured the data pipeline to optimize the flow of embodied training data. In terms of training, for the GR00T-N1.5 model, utilizing thousand-GPU clusters and data at the scale of hundreds of millions, the single-round training time has been reduced from 15 hours to just 22 minutes, achieving a 40-fold speedup.
At the model layer, by combining variable-length FlashAttention and Data Packing, we have moved from sample redundancy to sequence integration, resulting in a 188% speed increase; π-0.5 attention optimization has accelerated training by 165%; and FP8 quantization has delivered a 140% speedup. On the infrastructure side, relying on high-performance storage, a 3.2T RDMA network, and a Ray-driven elastic AI data lake, we have achieved deep synergy among data, storage, communication, and computation. We have also built an end-to-end evaluation system, creating a closed loop from training to simulation to assessment. This framework has already been fully validated on thousand-GPU clusters, laying a crucial technical foundation for the development and application of next-generation autonomous intelligent robots, and is expected to accelerate the arrival of the era of human-machine integration.
2026-03-11T09:09:35Z
Yongjian Guo, Yunxuan Ma, Haoran Sun, Zhong Guan, Shuai Di, Jing Long, Wanting Xu, Xiaodong Bai, Wen Huang, Yucheng Guo, Chen Zhou, Qiming Yang, Mingxi Luo, Tianyun Zhao, Hedan Yang, Song Wang, Xiaomeng Tian, Xiaolong Xiang, Zhen Sun, Yu Wei, Luqiao Wang, Yuzhen Li, Chenfeng Gu, Junwu Xiong, Yicheng Gong

http://arxiv.org/abs/2603.17456v1
Multi-stage Flow Scheduling for LLM Serving
2026-03-18T07:53:28Z
Meeting stringent Time-To-First-Token (TTFT) requirements is crucial for LLM applications. To improve efficiency, modern LLM serving systems adopt disaggregated architectures with diverse parallelisms, introducing complex multi-stage workflows involving reusable KV-block retrieval, collective communication, and P2D transfer. Flows from dependent stages overlap within and across requests on shared bottleneck links, making TTFT highly susceptible to network contention and necessitating stage-aware scheduling. Unfortunately, most existing works schedule flows in a stage-agnostic manner, leading to uncoordinated contention that constitutes a primary cause of SLO violations.
In this paper, we present MFS, a holistic multi-stage flow scheduling mechanism designed to maximize TTFT SLO attainment. At its core, MFS approximates the Least-Laxity-First (LLF) scheduling policy without requiring precise knowledge of a request's remaining slack. It achieves this through a Defer-and-Promote principle implemented through a Reverse Multi-Level Queue (RMLQ) structure. By dynamically promoting task precedence as effective laxity diminishes, MFS prioritizes flows with less laxity while preventing requests with loose SLOs from prematurely consuming network bandwidth. We implement MFS as a pluggable module integrated into vLLM, and evaluate it on an 8-server, 32-GPU testbed as well as through large-scale simulations. Our results demonstrate that MFS effectively outperforms state-of-the-art baselines, improving the TTFT SLO attainment by 1.2x--2.4x.
2026-03-18T07:53:28Z
18 pages, 14 figures
Yijun Sun (Hong Kong University of Science and Technology), Xudong Liao (Hong Kong University of Science and Technology), Songrun Xie (Hong Kong University of Science and Technology), Hao Chen (Shanghai Jiao Tong University), Han Tian (University of Science and Technology of China), Wenxue Li (Hong Kong University of Science and Technology), Yiming Zhang (Shanghai Jiao Tong University), Kai Chen (Hong Kong University of Science and Technology)

http://arxiv.org/abs/2603.17435v1
ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression
2026-03-18T07:21:21Z
Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic.
We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers. This "load-compressed, compute-decompressed" design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21x kernel-level speedup over NVIDIA's cuBLAS, and expedites end-to-end inference by an average of 1.22x over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.
2026-03-18T07:21:21Z
ASPLOS'26 Accepted Paper
Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo, Qiang Wang, Wei Wang, Xiaowen Chu

http://arxiv.org/abs/2603.17280v1
The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency
2026-03-18T02:15:40Z
How many tokens can a GPU inference cluster deliver per watt? Across deployments of identical hardware, the answer varies by 40x -- not because of software inefficiency, but because of the serving context window. We derive the 1/W law: tokens per watt halves every time the context window doubles. A larger context window shrinks the KV-cache concurrency limit while leaving GPU power draw roughly unchanged. At 64K context, an H100 holds 16 sequences in flight (tok/W = 1.5); at 4K context, the same H100 holds 256 sequences (tok/W = 17.6).
Routing topology -- which determines the effective context window each GPU services -- is a more powerful energy lever than buying newer hardware. Working from published H100 power measurements, a calibrated logistic power model, and a roofline throughput model, we derive these results analytically using the inference-fleet-sim framework; no new hardware experiments were conducted. Two-pool context-length routing (FleetOpt) delivers roughly 2.5x better tok/W over a homogeneous fleet, while upgrading from H100 to B200 delivers roughly 1.7x. The gains are independent: combining FleetOpt with B200 yields 4.25x over the H100 homogeneous baseline. B200/H200 numbers are analytical projections (±20% uncertainty); H100 results are calibrated to published measurements.
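The arithmetic behind the quoted figures can be checked in a few lines. This sketch rests on two assumptions that are simplifications of the abstract's argument: KV-cache capacity is a fixed token budget split evenly across in-flight sequences, and independent gains compose multiplicatively. The function names are illustrative and are not part of inference-fleet-sim.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumptions: a fixed KV-cache token budget shared evenly by in-flight
# sequences, and independent gains that multiply. Names are illustrative.

def concurrency(context_k: int, ref_context_k: int = 64, ref_sequences: int = 16) -> int:
    """Sequences in flight at a given context window (in K tokens),
    scaled from the quoted reference point of 16 sequences at 64K."""
    token_budget_k = ref_context_k * ref_sequences  # fixed KV-cache budget
    return token_budget_k // context_k


def combined_gain(*gains: float) -> float:
    """Independent efficiency gains compose by multiplication."""
    total = 1.0
    for g in gains:
        total *= g
    return total


seqs_at_4k = concurrency(4)                  # quoted: 256 sequences at 4K
seqs_at_32k = concurrency(32)                # halving the window doubles concurrency
fleetopt_on_b200 = combined_gain(2.5, 1.7)   # quoted: roughly 4.25x
```

Under the fixed-budget assumption, each halving of the context window doubles the number of sequences a GPU can hold, which is the mechanism the abstract gives for tokens per watt varying with the serving context window.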
For MoE models, active-parameter weight streaming adds a third lever. Qwen3-235B-A22B (22B active) reaches roughly 37.8 tok/W at 8K context on H100 -- 5.1x better than Llama-3.1-70B -- because decode time scales with activated weights, not total parameters. MoE dispatch overhead is excluded, so this is an upper bound.
2026-03-18T02:15:40Z
Work in progress
Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu