https://arxiv.org/api/nm9PwbT2msKAVq0bfxrHAttvcCU 2026-04-07T19:23:17Z 27913 300 15 http://arxiv.org/abs/2512.16455v2 AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research 2026-03-17T09:16:12Z In this paper, we describe a federated compute platform dedicated to supporting Artificial Intelligence in scientific workloads. With a focus on reproducible deployments, it delivers consistent, transparent access to a federation of physically distributed e-Infrastructures. Through a comprehensive service catalogue, the platform offers an integrated user experience covering the full Machine Learning lifecycle, including model development (with dedicated interactive development environments), training (with GPU resources, annotation tools, experiment tracking, and federated learning support) and deployment (covering a wide range of deployment options all along the Cloud Continuum). The platform also provides tools for traceability and reproducibility of AI models, integrates with different Artificial Intelligence model providers, datasets and storage resources, allowing users to interact with the broader Machine Learning ecosystem. Finally, it is easily customizable to lower the adoption barrier for external communities. 
2025-12-18T12:20:31Z Ignacio Heredia Álvaro López García Fernando Aguilar Gómez Diego Aguirre Caterina Alarcón Marín Khadijeh Alibabaei Lisana Berberi Miguel Caballer Amanda Calatrava Pedro Castro Alessandro Costantini Mario David Jaime Díez Stefan Dlugolinsky Borja Esteban Sanchis Giacinto Donvito Leonhard Duda Saúl Fernandez Andrés Heredia Canales Valentin Kozlov Sergio Langarita João Machado Germán Moltó Daniel San Martín Martin Šeleng Giang Nguyen Marcin Płóciennik Marta Obregón Ruiz Susana Rebolledo Ruiz Vicente Rodriguez Judith Sáinz-Pardo Díaz Viet Tran http://arxiv.org/abs/2603.12831v2 Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking 2026-03-17T03:24:14Z Nowadays, service providers often deploy multiple types of LLM services within shared clusters. While service colocation improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services, which have strict SLO requirements for inference latency, and severely constrains the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically rely on reserving headroom to constrain BE resource usage. However, this approach's coarse granularity compromises the SLO compliance of the latency-sensitive service and unnecessarily restricts the generation potential of the best-effort service. In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which effectively offloads the Attention computation of BE services to CPUs on the fly. This mechanism also facilitates asynchronous communication between CPU and GPU streams, preventing GPUs from being blocked while aggregating Attention results. 
Additionally, OmniServe incorporates a dynamic batching control policy to adapt to fluctuating request arrivals, facilitating Dense module computation using layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate for LS services by up to $1.48\times$ while enhancing BE serving throughput by up to $9.85\times$ compared to state-of-the-art systems. 2026-03-13T09:32:56Z Zizhao Mo Junlin Chen Huanle Xu Chengzhong Xu 10.1145/3802107 http://arxiv.org/abs/2603.16054v1 inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference 2026-03-17T01:44:04Z Sizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, and queueing dynamics that turn ugly under heavy-tailed workloads. Existing tools optimize per-engine configuration for a fixed GPU count; none of them address the upstream question of how many GPUs to buy and how to arrange them. inference-fleet-sim fills that gap. It combines analytical M/G/c queueing with discrete-event simulation (DES) to find the minimum-cost fleet configuration that empirically meets a P99 TTFT SLO. It includes a physics-informed GPU performance model covering A10G, A100, and H100 across monolithic, two-pool-routed, and disaggregated topologies, all without requiring access to real hardware. We run the tool on seven fleet-planning scenarios drawn from two public workload traces (LMSYS, Azure) and one synthetic agent-heavy trace. Each one surfaces a result that simple analysis gets wrong -- the right split threshold, the cheapest GPU type, whether an apparently idle fleet is actually broken -- and shows why joint simulation of queueing, routing, and hardware is necessary to find it. 
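The analytical half of the approach described in the inference-fleet-sim abstract above can be illustrated with a textbook calculation. The following is a minimal sketch, not the tool's actual model: it assumes Poisson arrivals and uses the Allen-Cunneen approximation, which scales the M/M/c (Erlang C) mean wait by (1 + SCV)/2, where SCV is the squared coefficient of variation of service time. The function names and the use of mean wait rather than P99 TTFT are illustrative assumptions.

```python
import math


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arrival has to queue in an M/M/c system."""
    if offered_load >= servers:
        return 1.0  # unstable regime: every arrival waits
    # Erlang B via the standard recursion, then convert to Erlang C.
    blocking = 1.0
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    utilization = offered_load / servers
    return blocking / (1.0 - utilization * (1.0 - blocking))


def mgc_mean_wait(arrival_rate: float, mean_service: float,
                  scv_service: float, servers: int) -> float:
    """Approximate mean queueing delay for M/G/c (Allen-Cunneen)."""
    offered_load = arrival_rate * mean_service
    if offered_load >= servers:
        return math.inf
    mmc_wait = erlang_c(servers, offered_load) * mean_service / (servers - offered_load)
    # Heavier-tailed service times (larger SCV) inflate the wait.
    return mmc_wait * (1.0 + scv_service) / 2.0


def min_servers(arrival_rate: float, mean_service: float,
                scv_service: float, max_mean_wait: float) -> int:
    """Smallest server count whose approximate mean wait meets the target."""
    servers = 1
    while mgc_mean_wait(arrival_rate, mean_service, scv_service, servers) > max_mean_wait:
        servers += 1
    return servers


if __name__ == "__main__":
    # Hypothetical workload: 0.5 requests/s, 1 s mean service time, SCV = 1.
    print(min_servers(0.5, 1.0, 1.0, max_mean_wait=0.1))
```

A mean-wait approximation like this says nothing about tail latency under heavy-tailed token lengths, which is consistent with the abstract's argument that simulation is needed on top of the analytical model.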
2026-03-17T01:44:04Z Work in progress Huamin Chen Xunzhuo Liu Yuhan Liu Junchen Jiang Bowei He Xue Liu http://arxiv.org/abs/2603.15530v1 DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages 2026-03-16T16:56:01Z Large language models operate in distinct compute-bound prefill followed by memory bandwidth-bound decode phases. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns prefill and decode phases to specialized packages. The Prefill package utilizes systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package utilizes vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens over the B200 GPU. 2026-03-16T16:56:01Z Paper accepted for publication at the Design Automation Conference (DAC) 2026 conference Alish Kanani Sangwan Lee Han Lyu Jiahao Lin Jaehyun Park Umit Y. Ogras http://arxiv.org/abs/2603.15486v1 Cuckoo-GPU: Accelerating Cuckoo Filters on Modern GPUs 2026-03-16T16:15:13Z Approximate Membership Query (AMQ) structures are essential for high-throughput systems in databases, networking, and bioinformatics. While Bloom filters offer speed, they lack support for deletions. 
Existing GPU-based dynamic alternatives, such as the Two-Choice Filter (TCF) and GPU Quotient Filter (GQF), enable deletions but incur severe performance penalties. We present Cuckoo-GPU, an open-source, high-performance Cuckoo filter library for GPUs. Instead of prioritizing cache locality, Cuckoo-GPU embraces the inherently random access pattern of Cuckoo hashing to fully saturate global memory bandwidth. Our design features a lock-free architecture built on atomic compare-and-swap operations, paired with a novel breadth-first search-based eviction heuristic that minimizes thread divergence and bounds sequential memory accesses during high-load insertions. Evaluated on NVIDIA GH200 (HBM3) and RTX PRO 6000 Blackwell (GDDR7) systems, Cuckoo-GPU closes the performance gap between append-only and dynamic AMQ structures. It achieves insertion, query, and deletion throughputs up to 378x (4.1x), 6x (34.7x), and 258x (107x) higher than GQF (TCF) on the same hardware, respectively, and delivers up to a 350x speedup over the fastest available multi-threaded CPU-based Cuckoo filter implementation. Moreover, its query throughput rivals that of the append-only GPU-based Blocked Bloom filter - demonstrating that dynamic AMQ structures can be deployed on modern accelerators without sacrificing performance. 2026-03-16T16:15:13Z Tim Dortmann Markus Vieth Bertil Schmidt http://arxiv.org/abs/2603.15400v1 Multi-Objective Load Balancing for Heterogeneous Edge-Based Object Detection Systems 2026-03-16T15:15:56Z The rapid proliferation of the Internet of Things (IoT) and smart applications has led to a surge in data generated by distributed sensing devices. Edge computing is a mainstream approach to managing this data by pushing computation closer to the data source, typically onto resource-constrained devices such as single-board computers (SBCs). In such environments, the unavoidable heterogeneity of hardware and software makes effective load balancing particularly challenging. 
In this paper, we propose a multi-objective load balancing method tailored to heterogeneous, edge-based object detection systems. We study a setting in which multiple device-model pairs expose distinct accuracy, latency, and energy profiles, while both request intensity and scene complexity fluctuate over time. To handle this dynamically varying environment, our approach uses a two-stage decision mechanism: it first performs accuracy-aware filtering to identify suitable device-model candidates that provide accuracy within the acceptable range, and then applies a weighted-sum scoring function over expected latency and energy consumption to select the final execution target. We evaluate the proposed load balancer through extensive experiments on real-world datasets, comparing against widely used baseline strategies. The results indicate that the proposed multi-objective load balancing method halves energy consumption and achieves an 80% reduction in end-to-end latency, while incurring only a modest, up to 10%, decrease in detection accuracy relative to an accuracy-centric baseline. 2026-03-16T15:15:56Z Daghash K. Alqahtani Maria A. Rodriguez Muhammad Aamir Cheema Adel N. Toosi http://arxiv.org/abs/2603.15183v1 Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems 2026-03-16T12:20:06Z Multi-agent LLM orchestration incurs synchronization costs scaling as O(n x S x |D|) in agents, steps, and artifact size under naive broadcast -- a regime I term broadcast-induced triply-multiplicative overhead. I argue this pathology is a structural residue of full-state rebroadcast, not an inherent property of multi-agent coordination. The central claim: synchronization cost explosion in LLM multi-agent systems maps with formal precision onto the cache coherence problem in shared-memory multiprocessors, and MESI-protocol invalidation transfers to artifact synchronization under minimal structural modification. 
I construct the Artifact Coherence System (ACS) and prove the Token Coherence Theorem: lazy invalidation attenuates cost by at least S/(n + W(d_i)) when S > n + W(d_i), converting O(n x S x |D|) to O((n + W) x |D|). A TLA+-verified protocol enforces single-writer safety, monotonic versioning, and bounded staleness across ~2,400 explored states. Simulation across four workload configurations yields token savings of 95.0% +/- 1.3% at V=0.05, 92.3% +/- 1.4% at V=0.10, 88.3% +/- 1.5% at V=0.25, and 84.2% +/- 1.3% at V=0.50 -- each exceeding the theorem's conservative lower bounds. Savings of ~81% persist at V=0.9, contrary to the predicted collapse threshold. Contributions: (1) formal MESI-to-artifact state mapping; (2) Token Coherence Theorem as savings lower bound; (3) TLA+-verified protocol with three proven invariants; (4) characterization of conditional artifact access semantics resolving the always-read objection; (5) reference Python implementation integrating with LangGraph, CrewAI, and AutoGen via thin adapter layers. 2026-03-16T12:20:06Z 25 pages. Code and reproduction scripts at https://github.com/hipvlady/agent-coherence Vladyslav Parakhin http://arxiv.org/abs/2511.22333v3 PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel 2026-03-16T10:24:27Z LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit prefix sharing: one-query-per-CTA execution repeatedly loads shared prefix KV cache, while one-size-fits-all tiling leaves on-chip resources idle and exacerbates bubbles for uneven KV lengths. These choices amplify memory bandwidth pressure and stall memory-bound decode attention. 
This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses and runs a customized multi-tile kernel to achieve high resource efficiency. It further applies practical multi-stream forwarding and KV splitting to reduce resource bubbles. The final merge performs online softmax with negligible overhead. We implement PAT as an off-the-shelf plugin for vLLM. Evaluation on both real-world and synthetic workloads shows that PAT reduces attention latency by 53.5% on average and TPOT by 17.0-93.1% under the same configurations against state-of-the-art attention kernels. PAT's source code is publicly available at https://github.com/flashserve/PAT. 2025-11-27T11:10:30Z Accepted by ASPLOS'26, code available at https://github.com/flashserve/PAT Jinjun Yi Zhixin Zhao Yitao Hu Ke Yan Weiwei Sun Hao Wang Laiping Zhao Yuhao Zhang Wenxin Li Keqiu Li 10.1145/3779212.3790200 http://arxiv.org/abs/2603.26691v1 SCALE-TRACK: Asynchronous Euler-Lagrange particle tracking on heterogeneous computing architecture 2026-03-16T08:44:56Z Euler-Lagrange (EL) simulations provide a direct and robust framework for modeling disperse multiphase flows. However, they are computationally expensive. While various approaches have attempted to leverage heterogeneous computing architectures, they have encountered scalability limitations. We present SCALE-TRACK, a scalable two-way coupled EL particle tracking algorithm, designed to exploit heterogeneous exascale computing environments. With asynchronous coupling, cache-friendly data structures, and chunk-based partitioning, we address key limitations of existing EL implementations. Validations against an analytical solution and a conventional EL implementation demonstrate the accuracy of the proposed algorithms. 
On a local workstation, we simulated 1.4 billion particles in a test case featuring a single graphics processing unit (GPU). Scaling runs on an HPC (high-performance computing) cluster show excellent strong and weak scaling, with up to 256 billion particles being tracked on up to 256 GPUs. This represents a significant advancement for EL simulations, enabling high-fidelity simulations on local workstations and pushing the limits on HPC systems. The software is released as open source and is publicly available. 2026-03-16T08:44:56Z 19 pages, 8 figures. Submitted to Journal of Computational Physics Silvio Schmalfuß Sergey Lesnik Henrik Rusche Dennis Niedermeier http://arxiv.org/abs/2603.14826v1 Protecting Distributed Blockchain with Twin-Field Quantum Key Distribution: A Quantum Resistant Approach 2026-03-16T05:07:31Z Quantum computing poses feasible, multi-layered security challenges to classical blockchain systems. Quantum-secured blockchains, which rely on quantum key distribution (QKD) to establish secure channels, can address this potential threat. This paper presents a scalable quantum-resistant blockchain architecture designed to address the connectivity and distance limitations of QKD-integrated quantum networks. By leveraging the twin-field (TF) QKD protocol within a measurement-device-independent (MDI) topology, the proposed framework reduces the infrastructure complexity from quadratic to linear scaling. This architecture effectively integrates information-theoretic security with distributed consensus mechanisms, allowing the system to overcome the fundamental rate-loss limits inherent in traditional point-to-point links. The proposed scheme offers a theoretically sound and feasible solution for deploying large-scale and long-distance consortium. 
2026-03-16T05:07:31Z Xuan Li Ying Guo http://arxiv.org/abs/2603.14806v1 Fold-CP: A Context Parallelism Framework for Biomolecular Modeling 2026-03-16T04:20:01Z Understanding cellular machinery requires atomic-scale reconstruction of large biomolecular assemblies. However, predicting the structures of these systems has been constrained by the hardware memory requirements of models like AlphaFold 3, imposing a practical ceiling of a few thousand residues that can be processed on a single GPU. Here we present NVIDIA BioNeMo Fold-CP, a context parallelism framework that overcomes this barrier by distributing the inference and training pipelines of co-folding models across multiple GPUs. We use the Boltz models as open source reference architectures and implement custom multidimensional primitives that efficiently parallelize both the dense triangular updates and the irregular, data-dependent pattern of window-batched local attention. Our approach achieves efficient memory scaling; for an N-token input distributed across P GPUs, per-device memory scales as $O(N^2/P)$, enabling the structure prediction of assemblies exceeding 30,000 residues on 64 NVIDIA B300 GPUs. We demonstrate the scientific utility of this approach through successful developer use cases: Fold-CP enabled the scoring of over 90% of the Comprehensive Resource of Mammalian protein complexes (CORUM) database, as well as folding of the disease-relevant PI4KA lipid kinase complex bound to an intrinsically disordered region without cropping. By providing a scalable pathway for modeling massive systems with full global context, Fold-CP represents a significant step toward the realization of a virtual cell. 
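To make the $O(N^2/P)$ scaling in the Fold-CP abstract above concrete, here is a back-of-the-envelope sketch. It counts a single hypothetical N x N pairwise tensor only. The channel width (128) and 2-byte values are assumptions chosen for illustration, not figures from the paper, and a real model holds many more activations than this.

```python
def pair_tensor_gib(n_tokens: int, n_gpus: int,
                    channels: int = 128, bytes_per_value: int = 2) -> float:
    """Per-device size in GiB of one N x N x channels tensor split evenly over P GPUs.

    `channels` and `bytes_per_value` are illustrative assumptions.
    """
    total_bytes = n_tokens * n_tokens * channels * bytes_per_value
    return total_bytes / n_gpus / 2**30


if __name__ == "__main__":
    n = 30_000  # residue count mentioned in the abstract
    for gpus in (1, 8, 64):
        print(f"P={gpus:>2}: {pair_tensor_gib(n, gpus):8.2f} GiB per device")
```

Under these assumptions the single tensor alone is over 200 GiB unsharded and drops to a few GiB per device at P = 64, which shows why the quadratic term dominates and why sharding it across devices matters.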
2026-03-16T04:20:01Z 23 pages, 10 figures Dejun Lin Simon Chu Vishanth Iyer Youhan Lee John St John Kevin Boyd Brian Roland Xiaowei Ren Guoqing Zhou Zhonglin Cao Polina Binder Yuliya Zhautouskaya Jakub Zakrzewski Maximilian Stadler Kyle Gion Yuxing Peng Xi Chen Tianjing Zhang Philipp Junk Michelle Dimon Paweł Gniewek Fabian Ortega McKinley Polen Ivan Grubisic Ali Bashir Graham Holt Danny Kovtun Matthias Grass Luca Naef Rui Wang Jian Peng Anthony Costa Saee Paliwal Eddie Calleja Timur Rvachov Neha Tadimeti Roy Tal Emine Kucukbenli http://arxiv.org/abs/2303.06324v2 Comprehensive Deadlock Prevention for GPU Collective Communication 2026-03-16T03:24:54Z Distributed deep neural network training necessitates efficient GPU collective communications, which are inherently susceptible to deadlocks. GPU collective deadlocks arise easily in distributed deep learning applications when multiple collectives circularly wait for each other. GPU collective deadlocks pose a significant challenge to the correct functioning and efficiency of distributed deep learning, and no general effective solutions are currently available. Only in specific scenarios, ad-hoc methods, making an application invoke collectives in a consistent order across GPUs, can be used to prevent circular collective dependency and deadlocks. This paper presents DFCCL, a novel GPU collective communication library that provides a comprehensive approach for GPU collective deadlock prevention while maintaining high performance. DFCCL achieves preemption for GPU collectives at the bottom library level, effectively preventing deadlocks even if applications cause circular collective dependency. DFCCL ensures high performance with its execution and scheduling methods for collectives. Experiments show that DFCCL effectively prevents GPU collective deadlocks in various situations. 
Moreover, extensive evaluations demonstrate that DFCCL delivers performance comparable to or superior to NCCL, the state-of-the-art collective communication library highly optimized for NVIDIA GPUs. 2023-03-11T06:45:47Z Lichen Pan Juncheng Liu Yongquan Fu Jinhui Yuan Rongkai Zhang Pengze Li Zhen Xiao 10.1145/3689031.3717466 http://arxiv.org/abs/2602.21566v2 Epoch-based Optimistic Concurrency Control in Geo-replicated Databases 2026-03-16T02:58:29Z Geo-distribution is essential for modern online applications to ensure service reliability and high availability. However, supporting high-performance serializable transactions in geo-replicated databases remains a significant challenge. This difficulty stems from the extensive over-coordination inherent in distributed atomic commitment, concurrency control, and fault-tolerance replication protocols under high network latency. To address these challenges, we introduce Minerva, a unified distributed concurrency control designed for highly scalable multi-leader replication. Minerva employs a novel epoch-based asynchronous replication protocol that decouples data propagation from the commitment process, enabling continuous transaction replication. Optimistic concurrency control is used to allow any replica to execute transactions concurrently and commit without coordination. Instead of aborting transactions when conflicts are detected, Minerva uses deterministic re-execution to resolve conflicts, ensuring serializability without sacrificing performance. To further enhance concurrency, we construct a conflict graph and use a maximum weight independent set algorithm to select the optimal subset of transactions for commitment, minimizing the number of re-executed transactions. 
Our evaluation demonstrates that Minerva significantly outperforms state-of-the-art replicated databases, achieving over $3\times$ higher throughput in scalability experiments and $2.8\times$ higher throughput during a high network latency simulation with the TPC-C benchmark. 2026-02-25T04:44:50Z Yunhao Mao Harunari Takata Michail Bachras Yuqiu Zhang Shiquan Zhang Gengrui Zhang Hans-Arno Jacobsen 10.1145/3802052 http://arxiv.org/abs/2603.14729v1 DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning 2026-03-16T02:02:38Z Next-generation IoT applications increasingly span across autonomous administrative entities, necessitating silo-cooperative scheduling to leverage diverse computational resources while preserving data privacy. However, realizing efficient cooperation faces significant challenges arising from infrastructure heterogeneity, Non-IID workload shifts, and the inherent risks of adversarial environments. Existing approaches, relying predominantly on centralized coordination or independent learning, fail to address the incompatibility of state-action spaces across heterogeneous silos and lack robustness against malicious attacks. This paper proposes DeFRiS, a Decentralized Federated Reinforcement Learning framework for robust and scalable Silo-cooperative IoT application scheduling. DeFRiS integrates three synergistic innovations: (i) an action-space-agnostic policy utilizing candidate resource scoring to enable seamless knowledge transfer across heterogeneous silos; (ii) a silo-optimized local learning mechanism combining Generalized Advantage Estimation (GAE) with clipped policy updates to resolve sparse delayed reward challenges; and (iii) a Dual-Track Non-IID robust decentralized aggregation protocol leveraging gradient fingerprints for similarity-aware knowledge transfer and anomaly detection, and gradient tracking for optimization momentum. 
Extensive experiments on a distributed testbed with 20 heterogeneous silos and realistic IoT workloads demonstrate that DeFRiS significantly outperforms state-of-the-art baselines, reducing average response time by 6.4% and energy consumption by 7.2%, while lowering tail latency risk (CVaR$_{0.95}$) by 10.4% and achieving near-zero deadline violations. Furthermore, DeFRiS achieves over 3 times better performance retention as the system scales and over 8 times better stability in adversarial environments compared to the best-performing baseline. 2026-03-16T02:02:38Z Zhiyu Wang Mohammad Goudarzi Mingming Gong Rajkumar Buyya http://arxiv.org/abs/2603.14690v1 Can you keep a secret? A new protocol for sender-side enforcement of causal message delivery 2026-03-16T00:48:21Z Protocols for causal message delivery are widely used in distributed systems. Traditionally, causal delivery can be enforced either on the message sender's side or on the receiver's side. The traditional sender-side approach avoids the message metadata overhead of the receiver-side approach, but is more conservative than necessary. We present Cykas ("Can you keep a secret?"), a new protocol for sender-side enforcement of causal delivery that sidesteps the conservativeness of the traditional sender-side approach by allowing eager sending of messages and constraining the behavior of their recipients. We implemented the Cykas protocol in Rust and checked the safety and liveness of our implementation using the Stateright implementation-level model checker. Our experiments show that for applications involving long-running jobs, Cykas has a performance advantage: Cykas lets long-running jobs start (and end) earlier, leading to shorter overall execution time compared to the traditional sender-side approach. 2026-03-16T00:48:21Z To be presented at PaPoC 2026 Yan Tong Nathan Liittschwager Lindsey Kuper