https://arxiv.org/api/36OhdJhU0SpW5AzePKC+7kyK2Hs 2026-06-10T22:14:44Z 28838 360 15 http://arxiv.org/abs/2605.23389v1 AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System 2026-05-22T09:00:45Z

High-throughput inference serving is essential for applications built on large language models (LLMs). Existing serving frameworks reduce request-level and batch-level bubbles through batching and scheduling, but often overlook bubbles within each decode iteration. Tokens generated in the same iteration may incur different costs because they depend on KV caches of different lengths; tokens with long KV caches can become bottlenecks and delay the next iteration. We propose AlignedServe, an LLM serving framework built around prefix-aware batching. It groups requests with similar KV-cache lengths into the same batch to reduce iteration-level bubbles. To support this policy efficiently, AlignedServe uses large CPU memory to maintain sufficient in-flight requests for batching and applies a batch-level scheduling policy to reduce batch-level bubbles. It also introduces a GPU-Prefetch-For-GPU architecture, where one GPU prefetches KV cache for another to reduce CPU-to-GPU transfer latency. Experiments on synthetic and application workloads show that AlignedServe improves decoding throughput by up to 1.98 times and reduces latency by up to 7.4 times over state-of-the-art systems.
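
The batching idea in this abstract can be illustrated with a minimal sketch. It assumes each request is known only by an id and its current KV-cache length; the function names and the sort-then-slice policy are illustrative assumptions, not AlignedServe's actual scheduler.

```python
from typing import List, Tuple

Request = Tuple[str, int]  # (request id, current KV-cache length in tokens)


def group_by_kv_length(requests: List[Request], batch_size: int) -> List[List[str]]:
    """Sort requests by KV-cache length, then cut the order into batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ordered = sorted(requests, key=lambda r: r[1])
    return [
        [rid for rid, _ in ordered[i:i + batch_size]]
        for i in range(0, len(ordered), batch_size)
    ]


def max_length_spread(requests: List[Request], batches: List[List[str]]) -> int:
    """Largest gap between the longest and shortest KV cache inside any batch."""
    lengths = dict(requests)
    return max(
        max(lengths[r] for r in batch) - min(lengths[r] for r in batch)
        for batch in batches
    )


# Hypothetical workload: two short and two long requests.
requests = [("a", 120), ("b", 4000), ("c", 150), ("d", 3900)]
batches = group_by_kv_length(requests, batch_size=2)
```

Batching in arrival order would pair a short request with a long one, so the short request waits on the long one in every decode iteration. Grouping by length keeps the per-batch spread small in this toy workload.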

2026-05-22T09:00:45Z Fengyao Bai Hongbin Zhang Zhitao Chen Jiangsu Du Zhiguang Chen Yutong Lu 10.1145/3802009 http://arxiv.org/abs/2605.17076v2 S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination 2026-05-22T08:53:52Z

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection -- prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus
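
As a rough illustration of reconstructing a read set from observed GET traffic, the sketch below records which version of each key was delivered to each agent and checks those versions at commit time. The class and method names are hypothetical, and the version-comparison check is a generic optimistic-concurrency assumption rather than the ORI rule the paper proves.

```python
from collections import defaultdict
from typing import Dict


class DeliveryLog:
    """Toy server-side log: agents never declare what they have read."""

    def __init__(self) -> None:
        self.versions: Dict[str, int] = defaultdict(int)  # key -> current version
        self.delivered: Dict[str, Dict[str, int]] = defaultdict(dict)  # agent -> key -> version seen

    def on_get(self, agent: str, key: str) -> int:
        """Record the version an agent observed through a GET."""
        version = self.versions[key]
        self.delivered[agent][key] = version
        return version

    def try_commit(self, agent: str, key: str) -> bool:
        """Rebuild the agent's read set from the log; reject if any read is stale."""
        read_set = self.delivered[agent]
        stale = [k for k, seen in read_set.items() if self.versions[k] != seen]
        self.delivered[agent] = {}  # the read set is consumed by the commit attempt
        if stale:
            return False
        self.versions[key] += 1
        return True
```

Two agents that read the same key and then both try to write show the intended behavior: the first commit succeeds and the second is rejected because its observed version is out of date.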

2026-05-16T16:46:27Z v2: LLM judge validated against human annotator (Zahid Hussain, Mindgigs Peshawar) on PH-3 at strict kappa=0.93 (n=93, 96.8% agreement); over-claim refined to 32% (LLM) / 49% (human). Adds Exp.PG-Comparison Rust-Native and Workload-B chi2=1094.98. 24 pages, 23 tables. Annotation data attached as arXiv ancillary files Sajjad Khan http://arxiv.org/abs/2605.23348v1 XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms 2026-05-22T08:08:47Z

AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with high capital expenditure and long-distance transmission losses, yet there is abundant renewable energy at the source, just not matched to demand. This paper proposes a complementary AI infrastructure deployment model, AI Greenferencing, that brings modular AI compute to renewable energy sources, focusing on wind, allowing AI footprint expansion, generating local behind-the-meter demand for renewable sites, and helping ease the growing strain on power utilities. Our feasibility analysis shows that 890+ GW of wind capacity lies within 50 ms network round trip time of Azure data centers, and that site-wise right-sizing combined with spatial complementarity of wind energy keeps aggregate fleet utilization on par with traditional deployments. To serve inference requests under variable wind power, we build XWind, a lightweight, reactive, and workload-agnostic AI inference router that uses only real-time signals: inference latency, KV-cache utilization, and queue depth, to dynamically configure sites and distribute requests. Evaluated on a real 64-GPU A100 testbed emulating three wind-powered sites with Azure production traces, XWind reduces P99 end-to-end latency by up to 52% over the strongest contender (also our idea) and by up to 98% over baselines such as power-capping and GPU idling, with consistent gains across workload types, load levels, and GPU generations.

2026-05-22T08:08:47Z Tella Rajashekhar Reddy Atharva Deshmukh Liangcheng Yu Chaojie Zhang Mike Shepperd Rohan Gandhi Anjaly Parayil Srinivasan Iyengar Ajay Manchepalli Debopam Bhattacherjee http://arxiv.org/abs/2605.23297v1 Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems 2026-05-22T07:14:31Z

AI-enabled services deployed in critical digital infrastructure are subject to governance obligations spanning transparency, accountability, fairness, and traceability. Compliance today remains documentation-centric: obligations are described in prose, audits rely on static checklists, and verification depends on manual review. Such approaches do not scale to automated AI systems. This paper introduces Ontological Knowledge Blocks (OKBs), a programmable governance infrastructure that compiles regulatory obligations into machine-checkable constraints over structured evidence graphs. We formalize an OKB as a 5-tuple that binds normative obligations to an RDF/OWL concept schema, executable SHACL validation rules, explicit evidence requirements, and PROV-O provenance links. A deterministic regulatory compiler translates structured Intermediate Representation (IR) records into composable KB modules, enabling profile-based governance reconfiguration without modifying service code. We implement two prototypes and evaluate them in an AI-assisted HPC resource allocation scenario across 24 validation runs and four governance profiles. Results demonstrate profile-sensitive validation, strictly additive violation accumulation, SHACL validation latency between 12.6 ms and 100.3 ms, and profile equivalence testing confirming Combined as the strictly most comprehensive profile. All artefacts are released as open source.

2026-05-22T07:14:31Z 6 pages, 3 figures. Accepted at the Security, Trust and Privacy for Software and Applications (STPSA) Workshop, IEEE COMPSAC 2026, Madrid, Spain, July 7-10, 2026 Aasish Kumar Sharma Julian M. Kunkel http://arxiv.org/abs/2601.21198v2 ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling 2026-05-22T06:53:25Z

While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large language models, their prohibitive memory footprint severely impedes practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with a provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to $72.77\%$ inference latency reduction and up to $6.76\times$ higher throughput than the state-of-the-art systems. Our code is available at: https://github.com/npnothard/ZipMoE-ICML26.

2026-01-29T02:51:59Z ICML 2026 Yuchen Yang Yaru Zhao Pu Yang Shaowei Wang Zhi-Hua Zhou http://arxiv.org/abs/2605.23162v1 SolarChain: Bridging Physical Law, Verifiable Trust, and Sustainable Markets for Urban Energy Resilience 2026-05-22T02:30:54Z

Urban decarbonization requires scaling rooftop solar across millions of fragmented producers, yet cities face a fundamental tension: energy data is easily manipulated, and economic incentives often reward speculation rather than actual infrastructure deployment. We present SolarChain, a platform that resolves both problems by anchoring digital accountability to the thermodynamic limits of solar energy conversion. Using real-time meteorological data, geospatial coordinates, and first-principles calculations of solar yield, the system establishes a hard physical boundary for every panel's maximum possible output; any reported generation exceeding this limit is automatically rejected before entering the shared ledger. This trustless verification enables a peer-to-peer marketplace with programmatic reward structures that continuously reinvest value into equipment maintenance and market liquidity, preventing the speculative hoarding that typically destabilizes blockchain-based marketplaces. When electricity is consumed, the corresponding digital credits are permanently retired in direct proportion to physical energy dissipation, creating an auditable one-to-one mapping between urban consumption and carbon accounting. Deployed across heterogeneous city nodes, the prototype demonstrates resilience against data injection attacks while lowering capital barriers for community-level solar expansion. Beyond energy, the framework offers a general model for coordinating economic activity with physical law in any domain where distributed infrastructure demands both data integrity and sustainable investment. We release the data and code as open-access on GitHub.
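
The rejection rule described in this abstract can be sketched as a simple bound check. The bound used here (irradiance times panel area times conversion efficiency times interval length) is a simplifying assumption for illustration, and all function and parameter names are hypothetical; the abstract does not give SolarChain's actual yield model.

```python
def max_yield_wh(irradiance_w_m2: float, area_m2: float,
                 efficiency: float, hours: float) -> float:
    """Assumed upper bound on energy (Wh) a panel can produce in the interval."""
    return irradiance_w_m2 * area_m2 * efficiency * hours


def accept_report(reported_wh: float, irradiance_w_m2: float, area_m2: float,
                  efficiency: float, hours: float) -> bool:
    """Accept a generation report only if it is non-negative and within the bound."""
    bound = max_yield_wh(irradiance_w_m2, area_m2, efficiency, hours)
    return 0.0 <= reported_wh <= bound
```

For a hypothetical 10 m² panel at 20% efficiency under 1000 W/m² for one hour, the assumed bound is about 2000 Wh, so a report of 1500 Wh would be accepted and a report of 2500 Wh rejected before it reaches the ledger.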

2026-05-22T02:30:54Z Shilin Ou Yifan Xu Zhenshan Zhang Luyao Zhang Ming-Chun Huang http://arxiv.org/abs/2605.11215v2 ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload 2026-05-22T00:14:51Z

Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) fault-tolerant collectives that prevent faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) a versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks on up to 512 GPUs. ReCoVer successfully preserves the training trajectory of a failure-free reference despite 256 GPUs lost over the course of the run. Compared with checkpoint-and-restart baselines, ReCoVer demonstrates $2.23\times$ higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours, with the gap widening as training continues.
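
The invariant in this abstract (a constant number of microbatches per iteration, with quotas redistributed across survivors) can be sketched as follows. The even-split-with-remainder policy and the function name are assumptions for illustration; the abstract does not specify how ReCoVer assigns quotas.

```python
from typing import Dict, List


def redistribute(total_microbatches: int, survivors: List[int]) -> Dict[int, int]:
    """Split a fixed microbatch count across the surviving replicas.

    The total never changes, so each iteration still accumulates gradients
    over the same number of microbatches as a failure-free run.
    """
    if not survivors:
        raise ValueError("no surviving replicas")
    base, extra = divmod(total_microbatches, len(survivors))
    # The first `extra` survivors take one additional microbatch each.
    return {rank: base + (1 if i < extra else 0) for i, rank in enumerate(survivors)}


before = redistribute(16, [0, 1, 2, 3])  # four healthy replicas
after = redistribute(16, [0, 1, 3])      # replica 2 has failed
```

The surviving replicas do more work per iteration after a failure, but the sum of the quotas is unchanged.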

2026-05-11T20:28:31Z Preprint Ziyue Liu Zhengyang Wang Ruijie Zhang Avinash Maurya Hui Zhou Paul Hovland Sheng Di Franck Cappello Bogdan Nicolae Zheng Zhang http://arxiv.org/abs/2605.23109v1 Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems 2026-05-22T00:05:36Z

AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

2026-05-22T00:05:36Z Shubham Agarwal Alexander Krentsel Shu Liu Mert Cemri Audrey Cheng Rui Meng Tomas Pfister Chun-Liang Li Sylvia Ratnasamy Aditya Parameswaran Matei Zaharia Ion Stoica Mohsen Lesani http://arxiv.org/abs/2605.20532v2 Hybrid Edge-HPC Systems for Low-Latency Data-Driven Inference 2026-05-21T23:53:53Z

Emerging cyber-physical systems increasingly require low-latency inference from streaming sensor data while maintaining models that reflect complex and evolving physical processes. In many domains, however, model updates depend on high-fidelity simulations and training executed on remote high-performance computing (HPC) systems under batch scheduling. This creates a fundamental mismatch between the responsiveness required at the edge and the cost, throughput, and availability of simulation-driven model updates. We present RBF (Reverse Backfill), a hybrid edge-HPC learning and inference architecture that integrates low-latency edge inference with asynchronous, simulation-driven model improvement. RBF targets simulation-bounded settings in which model updates are constrained by simulation throughput and HPC scheduling delays, and reinterprets HPC backfilling by using opportunistic computation to improve model accuracy rather than system utilization. RBF decouples inference from simulation and training by deploying lightweight surrogate models at the edge while incorporating improved models asynchronously as they become available. The architecture supports pluggable surrogate models and orchestrates computation across heterogeneous infrastructure spanning edge devices, private 5G, cloud, and HPC resources. We instantiate RBF using a real-world digital agriculture deployment that couples edge sensing with computational fluid dynamics (CFD) simulations to infer airflow patterns in a large agricultural screenhouse. Our evaluation characterizes end-to-end system behavior under realistic constraints, quantifying simulation latency, training cost, inference throughput, and the impact of delayed model updates on prediction accuracy. Results demonstrate that RBF enables continuous, low-latency inference while improving model fidelity over time despite delayed and irregular model updates.

2026-05-19T22:09:29Z Liubov Kurafeeva Ryan Hartung Benjamin Carter Alan Subedi Avhishek Biswas Michael Fay Shantenu Jha Chandra Krintz Andre Merzky Douglas Thain Memet Can Vuran Rich Wolski http://arxiv.org/abs/2605.22778v1 AI-Driven Multi-Region Provisioning for Cloud Services Using Spot Fleets 2026-05-21T17:34:27Z

Cloud service platforms increasingly rely on elastic infrastructures to support dynamic workloads. Spot instances provide discounted computing resources but introduce uncertainty due to dynamic pricing, resource availability, and interruption risks that vary across geographical regions. In Amazon Web Services, the EC2 Spot Service simplifies fleet provisioning through allocation strategies, but it cannot estimate fleet costs before deployment and restricts provisioning to a single region. This paper presents an AI-driven provisioning service for multi-region spot fleets. The proposed approach combines monitoring of provisioning plans with predictive models to estimate fleet configurations and prices before launch, enabling cost-aware deployment decisions across regions while preserving the operational behavior of the EC2 Spot Service. The system was validated with fleets of up to 1500 vCPUs. Experimental results show a prediction accuracy of 99.79% compared to the EC2 Spot Service and potential cost savings of up to 64% by exploiting regional price variability.

2026-05-21T17:34:27Z Javier Fabra Enrique Molina-Giménez Pedro García-López http://arxiv.org/abs/2605.07985v2 Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation 2026-05-21T13:49:15Z

Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach. We have open-sourced Dooly at https://github.com/dooly-project.
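
The redundancy-aware part of this abstract, profiling only operations that are absent from the latency database, can be sketched with a dictionary keyed by operation name and model-configuration values. The key layout, the function name, and the stand-in profiler are hypothetical; taint propagation and the regression models are not shown.

```python
from typing import Callable, Dict, Hashable, Iterable, List


def profile_missing(ops: Iterable[Hashable],
                    latency_db: Dict[Hashable, float],
                    profile_fn: Callable[[Hashable], float]) -> List[Hashable]:
    """Profile only the operations not yet in the database; return those profiled."""
    profiled = []
    for key in ops:
        if key not in latency_db:
            latency_db[key] = profile_fn(key)
            profiled.append(key)
    return profiled


# Hypothetical keys: (operation name, head size, hidden size).
model_a = [("attention", 128, 4096), ("mlp", 128, 4096)]
model_b = [("attention", 128, 4096), ("mlp", 128, 8192)]

db: Dict[Hashable, float] = {}
fake_profiler = lambda key: 1.0  # stand-in for a real GPU measurement
first = profile_missing(model_a, db, fake_profiler)
second = profile_missing(model_b, db, fake_profiler)
```

Because the two hypothetical models share an attention configuration, the second call profiles only the operation that is new.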

2026-05-08T16:44:47Z Joon Ha Kim Geon-Woo Kim Anoop Rachakonda Daehyeok Kim http://arxiv.org/abs/2605.22491v1 Relay-Based Synchronization of Replicated Data Types in Opportunistic Networks 2026-05-21T13:44:50Z

In Opportunistic Networks (OppNets), the dissemination of information can only rely on transient pairwise radio contacts between mobile devices (peers). Designing distributed applications that can run in such conditions is a challenge, but replicated data types, and in particular Conflict-free Replicated Data Types (CRDTs), can help meet this challenge. A CRDT is an inherently replicated data type whose replicas can be updated locally, yet eventually converge thanks to an anti-entropy algorithm that allows all replicas to synchronize in the background. Whether the replicas of a CRDT can actually converge in an OppNet, and how fast they can converge, depend on the occurrence of radio contacts between mobile devices. In this paper we investigate the idea of using mobile relays as a means to boost the convergence of state-based CRDT replicas in an OppNet. New protocols are presented that allow the synchronization of replicas and relays, and new metrics are defined to observe and characterize the convergence of replicas. Simulation results show that using relays can significantly improve this convergence, and even make it possible in scenarios where the replicas alone would be unable to converge.
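
To make the role of a relay concrete, the sketch below uses a grow-only counter, a standard state-based CRDT, and a relay that holds state but never updates it. This is a generic illustration under the assumption that a relay simply merges with whatever it meets; it is not one of the protocols proposed in the paper.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        """Local update: only this replica's own slot is changed."""
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        """Join of two states; commutative, associative, and idempotent."""
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a, b, relay = GCounter("a"), GCounter("b"), GCounter("relay")
a.increment(2)
b.increment(3)

# Replicas a and b never meet; the relay contacts each of them in turn.
relay.merge(a)                 # contact 1: relay picks up a's state
relay.merge(b); b.merge(relay) # contact 2: relay and b exchange state
a.merge(relay)                 # contact 3: relay brings b's state back to a
```

After the three contacts both replicas report the same value even though they were never in radio range of each other.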

2026-05-21T13:44:50Z 33 pages Frédéric Guidec Yves Mahéo http://arxiv.org/abs/2605.22428v1 Exploiting Multicast for Accelerating Collective Communication 2026-05-21T12:50:20Z

Reducing collective communication latency is a critical goal for large model training and inference in both academia and industry. Many-to-many communications, such as AllGather and AlltoAll (dispatch), are core components of modern parallelization strategies. State-of-the-art implementations of these communications rely on unicast-based writes and transmit duplicate copies of the same data across physical links for multiple receivers. This redundant transmission congests network bottlenecks and degrades end-to-end latency. We present MultiWrite, a novel many-to-many transmission semantic that eliminates redundant packets to directly reduce operator latency. MultiWrite adopts multicast principles while addressing critical limitations of traditional multicast for AI workloads. These limitations include heavy management plane overhead and ecosystem compatibility issues. We implement MultiWrite on Ascend NPUs. Long-term stress tests demonstrate that our MultiWrite-based operators achieve up to 33% latency reduction on commercially deployed devices.
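
The redundancy that this abstract attributes to unicast-based writes can be counted with a small sketch. It assumes a hypothetical topology where each receiver is reached over a known list of links; the function name and data layout are illustrative and say nothing about how MultiWrite itself is implemented.

```python
from collections import Counter
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def link_loads(paths: Dict[str, List[Link]]) -> Tuple[Dict[Link, int], Dict[Link, int]]:
    """Copies of one message carried by each link under unicast and multicast.

    Unicast sends a separate copy along every receiver's path; multicast
    carries a single copy on each link and replicates where paths diverge.
    """
    unicast = Counter(link for path in paths.values() for link in path)
    multicast = {link: 1 for link in unicast}
    return dict(unicast), multicast


# Hypothetical topology: sender "s" reaches three receivers through switch "sw".
paths = {
    "r1": [("s", "sw"), ("sw", "r1")],
    "r2": [("s", "sw"), ("sw", "r2")],
    "r3": [("s", "sw"), ("sw", "r3")],
}
unicast, multicast = link_loads(paths)
```

In this toy topology the shared sender-to-switch link carries three copies under unicast and one under multicast, which is the kind of bottleneck relief the abstract describes.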

2026-05-21T12:50:20Z Chao Xu Xu Zhang Zihang Luo Yuyan Wu Guoxin Qian Yufeng Yao Chihyung Wang Jingbin Zhou http://arxiv.org/abs/2605.22426v1 Monotone Erasure Codes 2026-05-21T12:48:05Z

Erasure codes are a critical component in reliable storage systems today, and many blockchain systems use consensus protocols that involve erasure codes to reduce their communication cost. Existing erasure codes rely on a threshold failure assumption, but recent blockchain systems have departed from this simple model and use generalized failure assumptions. This paper introduces monotone erasure codes that respect arbitrary trust assumptions on a set of nodes. The paper first describes a method for constructing a monotone erasure code from any access structure given by a monotone Boolean formula. Next, the relevant notion of a linear monotone erasure code is introduced, which works on vectors over a finite field and where the encoding is a linear operation. We then focus on constructing linear monotone erasure codes: We give an efficient algorithm to construct linear monotone erasure codes for any access structure, and we show how to efficiently construct linear monotone erasure codes for the special case of partitioned access structures with minimal storage overhead. Last but not least, this work also shows how to use monotone erasure codes to obtain a communication-efficient, generalized version of the well-known asynchronous verifiable information dispersal (AVID) primitive, which is a key building block for developing efficient reliable broadcast and consensus protocols.
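
An access structure given by a monotone Boolean formula can be evaluated with a few lines of code. The nested-tuple encoding and the function name below are assumptions made for illustration; the sketch only decides whether a set of nodes is qualified and does not construct an erasure code.

```python
from typing import Set, Tuple, Union

Formula = Union[str, Tuple]  # a node name, or ("and" | "or", child, child, ...)


def satisfies(formula: Formula, available: Set[str]) -> bool:
    """True if the available nodes satisfy the monotone formula.

    Only AND and OR appear (no negation), so adding nodes to `available`
    can never turn a satisfied formula into an unsatisfied one.
    """
    if isinstance(formula, str):
        return formula in available
    op, *children = formula
    results = [satisfies(child, available) for child in children]
    if op == "and":
        return all(results)
    if op == "or":
        return any(results)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical structure: either {a, b} together, or {c, d, e} together.
structure = ("or", ("and", "a", "b"), ("and", "c", "d", "e"))
```

A threshold assumption is the special case in which every subset of a given size is qualified; the formula form also expresses asymmetric cases like the one above, where two specific nodes suffice but three arbitrary nodes may not.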

2026-05-21T12:48:05Z Vivien Bammert Annalisa Cimatti Orestis Alpos Giuliano Losa Christian Cachin http://arxiv.org/abs/2605.22416v1 Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference 2026-05-21T12:37:34Z

Hybrid language models like Jamba mix attention layers with State Space Models (SSMs), creating two memory cache types with opposite profiles: Key-Value (KV) caches grow linearly with sequence length, while SSM states stay fixed per layer. Current inference engines handle this poorly. Unified pools pad SSM states to attention page sizes, wasting up to 7.3x capacity. Static dual pools cannot adapt when prompt distributions shift between requests. We present Asymmetric Virtual Memory Paging (AVMP). The allocator separates the two cache types into physically distinct pools behind a unified virtual address space, and migrates capacity between pools when one runs out. Migration triggers only on allocation failure, keeping behavior deterministic. We evaluate AVMP across 270 synthetic cells plus 60 cells of ShareGPT trace replay on an RTX 3060 12GB. Out-of-Memory events drop 7.6% and request throughput improves 1.83x to 13.3x across synthetic workloads and 2.36x on ShareGPT. All gains hold under paired-bootstrap 95% confidence intervals. A phase-time breakdown reveals two distinct mechanisms: shorter OOM recovery on capacity-pressured workloads, and faster allocation calls on KV-heavy workloads. Implementation is pure Python; Triton integration is future work.
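
The allocation policy described in this abstract, two separate pools with capacity migrated only when an allocation fails, can be sketched as below. The sketch assumes both pools are measured in the same abstract capacity unit, which hides the differing page sizes; the class and method names are hypothetical and this is not the cachepawl implementation.

```python
class DualPool:
    """Toy allocator with separate KV and SSM pools and on-failure migration."""

    def __init__(self, kv_units: int, ssm_units: int) -> None:
        self.free = {"kv": kv_units, "ssm": ssm_units}
        self.migrations = 0

    def allocate(self, kind: str, units: int) -> bool:
        """Allocate from the pool for `kind`, borrowing from the other pool if short."""
        if kind not in self.free:
            raise ValueError(f"unknown pool: {kind!r}")
        other = "ssm" if kind == "kv" else "kv"
        if self.free[kind] < units:
            # Migration happens only here, on the allocation-failure path.
            shortfall = units - self.free[kind]
            if self.free[other] < shortfall:
                return False  # genuinely out of memory
            self.free[other] -= shortfall
            self.free[kind] += shortfall
            self.migrations += 1
        self.free[kind] -= units
        return True


pool = DualPool(kv_units=4, ssm_units=4)
first = pool.allocate("kv", 3)   # served from the KV pool directly
second = pool.allocate("kv", 3)  # KV pool is short by 2, so capacity migrates
third = pool.allocate("kv", 3)   # neither pool can cover the request
```

Because migration is triggered only by a failed allocation, the same request sequence always produces the same pool states, which matches the determinism argument in the abstract.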

2026-05-21T12:37:34Z 11 pages, 8 figures, 6 tables. Code and reproducibility artifacts at https://github.com/codepawl/cachepawl An Xuan Nguyen