https://arxiv.org/api/nm9PwbT2msKAVq0bfxrHAttvcCU2026-04-07T19:23:17Z2791330015http://arxiv.org/abs/2512.16455v2AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research2026-03-17T09:16:12ZIn this paper, we describe a federated compute platform dedicated to support Artificial Intelligence in scientific workloads. Putting the effort into reproducible deployments, it delivers consistent, transparent access to a federation of physically distributed e-Infrastructures. Through a comprehensive service catalogue, the platform is able to offer an integrated user experience covering the full Machine Learning lifecycle, including model development (with dedicated interactive development environments), training (with GPU resources, annotation tools, experiment tracking, and federated learning support) and deployment (covering a wide range of deployment options all along the Cloud Continuum). The platform also provides tools for traceability and reproducibility of AI models, integrates with different Artificial Intelligence model providers, datasets and storage resources, allowing users to interact with the broader Machine Learning ecosystem. Finally, it is easily customizable to lower the adoption barrier by external communities.2025-12-18T12:20:31ZIgnacio HerediaÁlvaro López GarcíaFernando Aguilar GómezDiego AguirreCaterina Alarcón MarínKhadijeh AlibabaeiLisana BerberiMiguel CaballerAmanda CalatravaPedro CastroAlessandro CostantiniMario DavidJaime Díez Stefan DlugolinskyBorja Esteban SanchisGiacinto DonvitoLeonhard DudaSaúl FernandezAndrés Heredia CanalesValentin KozlovSergio LangaritaJoão MachadoGermán MoltóDaniel San MartínMartin ŠelengGiang NguyenMarcin PłóciennikMarta Obregón RuizSusana Rebolledo RuizVicente RodriguezJudith Sáinz-Pardo DíazViet Tranhttp://arxiv.org/abs/2603.12831v2Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking2026-03-17T03:24:14ZNowadays, service providers often deploy multiple types of LLM services within shared clusters. While the service colocation improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services-which have strict SLO requirements for inference latency-and severely constrain the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically rely on reserving headroom to constrain BE resource usage. However, this approach's coarse granularity compromises the SLO compliance of the latency-sensitive service and unnecessarily restricts the generation potential of the best effort service.
In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which effectively offloads the Attention computation of BE services to CPUs on the fly. This mechanism also facilitates asynchronous communication between CPU and GPU streams, preventing GPUs from being blocked while aggregating Attention results. Additionally, OmniServe incorporates a dynamic batching control policy to adapt to fluctuating request arrivals, facilitating Dense module computation using layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate for LS services by up to $1.48\times$ while enhancing BE serving throughput by up to $9.85\times$ compared to state-of-the-art systems.2026-03-13T09:32:56ZZizhao MoJunlin ChenHuanle XuChengzhong Xu10.1145/3802107http://arxiv.org/abs/2603.16054v1inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference2026-03-17T01:44:04ZSizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, and queueing dynamics that turn ugly under heavy-tailed workloads. Existing tools optimize per-engine configuration for a fixed GPU count; none of them address the upstream question of how many GPUs to buy and how to arrange them.
inference-fleet-sim fills that gap. It combines analytical M/G/c queueing with discrete-event simulation (DES) to find the minimum-cost fleet configuration that empirically meets a P99 TTFT SLO. It includes a physics-informed GPU performance model covering A10G, A100, and H100 across monolithic, two-pool-routed, and disaggregated topologies, all without requiring access to real hardware. We run the tool on seven fleet-planning scenarios drawn from two public workload traces (LMSYS, Azure) and one synthetic agent-heavy trace. Each one surfaces a result that simple analysis gets wrong -- the right split threshold, the cheapest GPU type, whether an apparently idle fleet is actually broken -- and shows why joint simulation of queueing, routing, and hardware is necessary to find it.2026-03-17T01:44:04ZWork in progressHuamin ChenXunzhuo LiuYuhan LiuJunchen JiangBowei HeXue Liuhttp://arxiv.org/abs/2603.15530v1DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages2026-03-16T16:56:01ZLarge language models operate in distinct compute-bound prefill followed by memory bandwidth-bound decode phases. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns prefill and decode phases to specialized packages. The Prefill package utilizes systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package utilizes vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens over the B200 GPU.2026-03-16T16:56:01ZPaper accepted for publication at the Design Automation Conference (DAC) 2026 conferenceAlish KananiSangwan LeeHan LyuJiahao LinJaehyun ParkUmit Y. Ograshttp://arxiv.org/abs/2603.15486v1Cuckoo-GPU: Accelerating Cuckoo Filters on Modern GPUs2026-03-16T16:15:13ZApproximate Membership Query (AMQ) structures are essential for high-throughput systems in databases, networking, and bioinformatics. While Bloom filters offer speed, they lack support for deletions. Existing GPU-based dynamic alternatives, such as the Two-Choice Filter (TCF) and GPU Quotient Filter (GQF), enable deletions but incur severe performance penalties. We present Cuckoo-GPU, an open-source, high-performance Cuckoo filter library for GPUs. Instead of prioritizing cache locality, Cuckoo-GPU embraces the inherently random access pattern of Cuckoo hashing to fully saturate global memory bandwidth. Our design features a lock-free architecture built on atomic compare-and-swap operations, paired with a novel breadth-first search-based eviction heuristic that minimizes thread divergence and bounds sequential memory accesses during high-load insertions. Evaluated on NVIDIA GH200 (HBM3) and RTX PRO 6000 Blackwell (GDDR7) systems, Cuckoo-GPU closes the performance gap between append-only and dynamic AMQ structures. It achieves insertion, query, and deletion throughputs up to 378x (4.1x), 6x (34.7x), and 258x (107x) higher than GQF (TCF) on the same hardware, respectively, and delivers up to a 350x speedup over the fastest available multi-threaded CPU-based Cuckoo filter implementation. Moreover, its query throughput rivals that of the append-only GPU-based Blocked Bloom filter - demonstrating that dynamic AMQ structures can be deployed on modern accelerators without sacrificing performance.2026-03-16T16:15:13ZTim DortmannMarkus ViethBertil Schmidthttp://arxiv.org/abs/2603.15400v1Multi-Objective Load Balancing for Heterogeneous Edge-Based Object Detection Systems2026-03-16T15:15:56ZThe rapid proliferation of the Internet of Things (IoT) and smart applications has led to a surge in data generated by distributed sensing devices. Edge computing is a mainstream approach to managing this data by pushing computation closer to the data source, typically onto resource-constrained devices such as single-board computers (SBCs). In such environments, the unavoidable heterogeneity of hardware and software makes effective load balancing particularly challenging. In this paper, we propose a multi-objective load balancing method tailored to heterogeneous, edge-based object detection systems. We study a setting in which multiple device-model pairs expose distinct accuracy, latency, and energy profiles, while both request intensity and scene complexity fluctuate over time. To handle this dynamically varying environment, our approach uses a two-stage decision mechanism: it first performs accuracy-aware filtering to identify suitable device-model candidates that provide accuracy within the acceptable range, and then applies a weighted-sum scoring function over expected latency and energy consumption to select the final execution target. We evaluate the proposed load balancer through extensive experiments on real-world datasets, comparing against widely used baseline strategies. The results indicate that the proposed multi-objective load balancing method halves energy consumption and achieves an 80% reduction in end-to-end latency, while incurring only a modest, up to 10%, decrease in detection accuracy relative to an accuracy-centric baseline.2026-03-16T15:15:56ZDaghash K. AlqahtaniMaria A. RodriguezMuhammad Aamir CheemaAdel N. Toosihttp://arxiv.org/abs/2603.15183v1Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems2026-03-16T12:20:06ZMulti-agent LLM orchestration incurs synchronization costs scaling as O(n x S x |D|) in agents, steps, and artifact size under naive broadcast -- a regime I term broadcast-induced triply-multiplicative overhead. I argue this pathology is a structural residue of full-state rebroadcast, not an inherent property of multi-agent coordination.
The central claim: synchronization cost explosion in LLM multi-agent systems maps with formal precision onto the cache coherence problem in shared-memory multiprocessors, and MESI-protocol invalidation transfers to artifact synchronization under minimal structural modification.
I construct the Artifact Coherence System (ACS) and prove the Token Coherence Theorem: lazy invalidation attenuates cost by at least S/(n + W(d_i)) when S > n + W(d_i), converting O(n x S x |D|) to O((n + W) x |D|). A TLA+-verified protocol enforces single-writer safety, monotonic versioning, and bounded staleness across ~2,400 explored states.
Simulation across four workload configurations yields token savings of 95.0% +/- 1.3% at V=0.05, 92.3% +/- 1.4% at V=0.10, 88.3% +/- 1.5% at V=0.25, and 84.2% +/- 1.3% at V=0.50 -- each exceeding the theorem's conservative lower bounds. Savings of ~81% persist at V=0.9, contrary to the predicted collapse threshold.
Contributions: (1) formal MESI-to-artifact state mapping; (2) Token Coherence Theorem as savings lower bound; (3) TLA+-verified protocol with three proven invariants; (4) characterization of conditional artifact access semantics resolving the always-read objection; (5) reference Python implementation integrating with LangGraph, CrewAI, and AutoGen via thin adapter layers.2026-03-16T12:20:06Z25 pages. Code and reproduction scripts at https://github.com/hipvlady/agent-coherenceVladyslav Parakhinhttp://arxiv.org/abs/2511.22333v3PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel2026-03-16T10:24:27ZLLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit prefix sharing: one-query-per-CTA execution repeatedly loads shared prefix KV cache, while one-size-fits-all tiling leaves on-chip resources idle and exacerbates bubbles for uneven KV lengths. These choices amplify memory bandwidth pressure and stall memory-bound decode attention.
This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses, runs a customized multi-tile kernel to achieve high resource efficiency. It further applies practical multi-stream forwarding and KV splitting to reduce resource bubbles. The final merge performs online softmax with negligible overhead. We implement PAT as an off-the-shelf plugin for vLLM. Evaluation on both real-world and synthetic workloads shows that PAT reduces attention latency by 53.5% on average and TPOT by 17.0-93.1% under the same configurations against state-of-the-art attention kernels. PAT's source code is publicly available at https://github.com/flashserve/PAT.2025-11-27T11:10:30ZAccepted by ASPLOS'26, code available at https://github.com/flashserve/PATJinjun YiZhixin ZhaoYitao HuKe YanWeiwei SunHao WangLaiping ZhaoYuhao ZhangWenxin LiKeqiu Li10.1145/3779212.3790200http://arxiv.org/abs/2603.26691v1SCALE-TRACK: Asynchronous Euler-Lagrange particle tracking on heterogeneous computing architecture2026-03-16T08:44:56ZEuler-Lagrange (EL) simulations provide a direct and robust framework for modeling disperse multiphase flows. However, they are computationally expensive. While various approaches have attempted to leverage heterogeneous computing architectures, they have encountered scalability limitations. We present SCALE-TRACK, a scalable two-way coupled EL particle tracking algorithm, designed to exploit heterogeneous exascale computing environments. With asynchronous coupling, cache-friendly data structures, and chunk-based partitioning, we address key limitations of existing EL implementations. Validations against an analytical solution and a conventional EL implementation demonstrate the accuracy of the proposed algorithms. On a local workstation, we simulated 1.4 billion particles in a test case featuring a single graphics processing unit (GPU). Scaling runs on an HPC (high-performance computing) cluster show excellent strong and weak scaling, with up to 256 billion particles being tracked on up to 256 GPUs. This represents a significant advancement for EL simulations, enabling high-fidelity simulations on local workstations and pushing the limits on HPC systems. The software is released as open source and is publicly available.2026-03-16T08:44:56Z19 pages, 8 figures. Submitted to Journal of Computational PhysicsSilvio SchmalfußSergey LesnikHenrik RuscheDennis Niedermeierhttp://arxiv.org/abs/2603.14826v1Protecting Distributed Blockchain with Twin-Field Quantum Key Distribution: A Quantum Resistant Approach2026-03-16T05:07:31ZQuantum computing provides the feasible multi-layered security challenges to classical blockchain systems. Whereas, quantum-secured blockchains relied on quantum key distribution (QKD) to establish secure channels can address this potential threat. This paper presents a scalable quantum-resistant blockchain architecture designed to address the connectivity and distance limitations of the QKD integrated quantum networks. By leveraging the twin-field (TF) QKD protocol within a measurement-device-independent (MDI) topology, the proposed framework can optimize the infrastructure complexity from quadratic to linear scaling. This architecture effectively integrates information-theoretic security with distributed consensus mechanisms, allowing the system to overcome the fundamental rate-loss limits inherent in traditional point-to-point links. The proposed scheme offers a theoretically sound and feasible solution for deploying large-scale and long-distance consortium.2026-03-16T05:07:31ZXuan LiYing Guohttp://arxiv.org/abs/2603.14806v1Fold-CP: A Context Parallelism Framework for Biomolecular Modeling2026-03-16T04:20:01ZUnderstanding cellular machinery requires atomic-scale reconstruction of large biomolecular assemblies. However, predicting the structures of these systems has been constrained by hardware memory requirements of models like AlphaFold 3, imposing a practical ceiling of a few thousand residues that can be processed on a single GPU. Here we present NVIDIA BioNeMo Fold-CP, a context parallelism framework that overcomes this barrier by distributing the inference and training pipelines of co-folding models across multiple GPUs. We use the Boltz models as open source reference architectures and implement custom multidimensional primitives that efficiently parallelize both the dense triangular updates and the irregular, data-dependent pattern of window-batched local attention. Our approach achieves efficient memory scaling; for an N-token input distributed across P GPUs, per-device memory scales as $O(N^2/P)$, enabling the structure prediction of assemblies exceeding 30,000 residues on 64 NVIDIA B300 GPUs. We demonstrate the scientific utility of this approach through successful developer use cases: Fold-CP enabled the scoring of over 90% of Comprehensive Resource of Mammalian protein complexes (CORUM) database, as well as folding of disease-relevant PI4KA lipid kinase complex bound to an intrinsically disordered region without cropping. By providing a scalable pathway for modeling massive systems with full global context, Fold-CP represents a significant step toward the realization of a virtual cell.2026-03-16T04:20:01Z23 pages, 10 figuresDejun LinSimon ChuVishanth IyerYouhan LeeJohn St JohnKevin BoydBrian RolandXiaowei RenGuoqing ZhouZhonglin CaoPolina BinderYuliya ZhautouskayaJakub ZakrzewskiMaximilian StadlerKyle GionYuxing PengXi ChenTianjing ZhangPhilipp JunkMichelle DimonPaweł GniewekFabian OrtegaMcKinley PolenIvan GrubisicAli BashirGraham HoltDanny KovtunMatthias GrassLuca NaefRui WangJian PengAnthony CostaSaee PaliwalEddie CallejaTimur RvachovNeha TadimetiRoy TalEmine Kucukbenlihttp://arxiv.org/abs/2303.06324v2Comprehensive Deadlock Prevention for GPU Collective Communication2026-03-16T03:24:54ZDistributed deep neural network training necessitates efficient GPU collective communications, which are inherently susceptible to deadlocks. GPU collective deadlocks arise easily in distributed deep learning applications when multiple collectives circularly wait for each other. GPU collective deadlocks pose a significant challenge to the correct functioning and efficiency of distributed deep learning, and no general effective solutions are currently available. Only in specific scenarios, ad-hoc methods, making an application invoke collectives in a consistent order across GPUs, can be used to prevent circular collective dependency and deadlocks.
This paper presents DFCCL, a novel GPU collective communication library that provides a comprehensive approach for GPU collective deadlock prevention while maintaining high performance. DFCCL achieves preemption for GPU collectives at the bottom library level, effectively preventing deadlocks even if applications cause circular collective dependency. DFCCL ensures high performance with its execution and scheduling methods for collectives. Experiments show that DFCCL effectively prevents GPU collective deadlocks in various situations. Moreover, extensive evaluations demonstrate that DFCCL delivers performance comparable to or superior to NCCL, the state-of-the-art collective communication library highly optimized for NVIDIA GPUs.2023-03-11T06:45:47ZLichen PanJuncheng LiuYongquan FuJinhui YuanRongkai ZhangPengze LiZhen Xiao10.1145/3689031.3717466http://arxiv.org/abs/2602.21566v2Epoch-based Optimistic Concurrency Control in Geo-replicated Databases2026-03-16T02:58:29ZGeo-distribution is essential for modern online applications to ensure service reliability and high availability. However, supporting high-performance serializable transactions in geo-replicated databases remains a significant challenge. This difficulty stems from the extensive over-coordination inherent in distributed atomic commitment, concurrency control, and fault-tolerance replication protocols under high network latency.
To address these challenges, we introduce Minerva, a unified distributed concurrency control designed for highly scalable multi-leader replication. Minerva employs a novel epoch-based asynchronous replication protocol that decouples data propagation from the commitment process, enabling continuous transaction replication. Optimistic concurrency control is used to allow any replicas to execute transactions concurrently and commit without coordination. In stead of aborting transactions when conflicts are detected, Minerva uses deterministic re-execution to resolve conflicts, ensuring serializability without sacrificing performance. To further enhance concurrency, we construct a conflict graph and use a maximum weight independent set algorithm to select the optimal subset of transactions for commitment, minimizing the number of re-executed transactions. Our evaluation demonstrates that Minerva significantly outperforms state-of-the-art replicated databases, achieving over $3\times$ higher throughput in scalability experiments and $2.8\times$ higher throughput during a high network latency simulation with the TPC-C benchmark.2026-02-25T04:44:50ZYunhao MaoHarunari TakataMichail BachrasYuqiu ZhangShiquan ZhangGengrui ZhangHans-Arno Jacobsen10.1145/3802052http://arxiv.org/abs/2603.14729v1DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning2026-03-16T02:02:38ZNext-generation IoT applications increasingly span across autonomous administrative entities, necessitating silo-cooperative scheduling to leverage diverse computational resources while preserving data privacy. However, realizing efficient cooperation faces significant challenges arising from infrastructure heterogeneity, Non-IID workload shifts, and the inherent risks of adversarial environments. Existing approaches, relying predominantly on centralized coordination or independent learning, fail to address the incompatibility of state-action spaces across heterogeneous silos and lack robustness against malicious attacks. This paper proposes DeFRiS, a Decentralized Federated Reinforcement Learning framework for robust and scalable Silo-cooperative IoT application scheduling. DeFRiS integrates three synergistic innovations: (i) an action-space-agnostic policy utilizing candidate resource scoring to enable seamless knowledge transfer across heterogeneous silos; (ii) a silo-optimized local learning mechanism combining Generalized Advantage Estimation (GAE) with clipped policy updates to resolve sparse delayed reward challenges; and (iii) a Dual-Track Non-IID robust decentralized aggregation protocol leveraging gradient fingerprints for similarity-aware knowledge transfer and anomaly detection, and gradient tracking for optimization momentum. Extensive experiments on a distributed testbed with 20 heterogeneous silos and realistic IoT workloads demonstrate that DeFRiS significantly outperforms state-of-the-art baselines, reducing average response time by 6.4% and energy consumption by 7.2%, while lowering tail latency risk (CVaR$_{0.95}$) by 10.4% and achieving near-zero deadline violations. Furthermore, DeFRiS achieves over 3 times better performance retention as the system scales and over 8 times better stability in adversarial environments compared to the best-performing baseline.2026-03-16T02:02:38ZZhiyu WangMohammad GoudarziMingming GongRajkumar Buyyahttp://arxiv.org/abs/2603.14690v1Can you keep a secret? A new protocol for sender-side enforcement of causal message delivery2026-03-16T00:48:21ZProtocols for causal message delivery are widely used in distributed systems. Traditionally, causal delivery can be enforced either on the message sender's side or on the receiver's side. The traditional sender-side approach avoids the message metadata overhead of the receiver-side approach, but is more conservative than necessary. We present Cykas ("Can you keep a secret?"), a new protocol for sender-side enforcement of causal delivery that sidesteps the conservativeness of the traditional sender-side approach by allowing eager sending of messages and constraining the behavior of their recipients. We implemented the Cykas protocol in Rust and checked the safety and liveness of our implementation using the Stateright implementation-level model checker. Our experiments show that for applications involving long-running jobs, Cykas has a performance advantage: Cykas lets long-running jobs start (and end) earlier, leading to shorter overall execution time compared to the traditional sender-side approach.2026-03-16T00:48:21ZTo be presented at PaPoC 2026Yan TongNathan LiittschwagerLindsey Kuper