https://arxiv.org/api/zURsaTgTIE7qLX2n4h7Z1piw1oo 2026-03-22T13:00:42Z 27724 120 15 http://arxiv.org/abs/2401.16685v2 Communication-Efficient Multimodal Federated Learning: Joint Modality and Client Selection 2026-03-11T03:47:33Z Multimodal federated learning (MFL) aims to enrich model training in FL settings where clients are collecting measurements across multiple modalities. However, key challenges to MFL remain unaddressed, particularly in heterogeneous network settings where: (i) the set of modalities collected by each client is diverse, and (ii) communication limitations prevent clients from uploading all their locally trained modality encoders to the server. In this paper, we propose Multimodal Federated learning with joint Modality and Client selection (MFedMC), a communication-efficient MFL framework that tackles these challenges through a decoupled architecture and selective uploading. Unlike traditional holistic fusion approaches, MFedMC separates modality encoders and fusion modules: modality encoders are aggregated at the server for generalization across diverse client distributions, while fusion modules remain local to each client for personalized adaptation to individual modality configurations and data characteristics. Building on this decoupled design, our joint selection algorithm incorporates two main components: (a) A modality selection methodology for each client, which weighs (i) the impact of the modality, gauged by Shapley value analysis, (ii) the modality encoder size as a gauge of communication overhead, and (iii) the frequency of modality encoder updates, denoted recency, to enhance generalizability. (b) A client selection strategy for the server based on the local loss of modality encoders at each client. Experiments on five real-world datasets demonstrate that MFedMC achieves comparable accuracy to several baselines while reducing communication overhead by over 20$\times$. 
A demo video and our code are available at https://liangqiy.com/mfedmc/. 2024-01-30T02:16:19Z arXiv admin note: text overlap with arXiv:2310.07048 Liangqi Yuan Dong-Jun Han Su Wang Devesh Upadhyay Christopher G. Brinton http://arxiv.org/abs/2603.10353v1 S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance 2026-03-11T03:03:58Z With the increasing volumes of Large Language Models (LLMs) and the expanding context lengths, attention computation has become a key performance bottleneck in LLM serving. For fast attention computation, recent practices often parallelize the attention heads on multiple GPUs, and also widely adopt attention sparsification to reduce the computation amount -- which selectively computes a subset of attention pairs under a preset sparsity budget. In this paper, we notice that attention heads of an LLM often exhibit heterogeneous-yet-stable sparsity elasticities, which motivates us to enforce head-adaptive sparsity budgets to attain better efficiency while preserving high inference quality. Yet, from the system aspect, with heterogeneous sparsity levels, attention computation time on different heads would be inconsistent, yielding cross-GPU resource bubbles under head-parallel deployment. To further minimize such bubbles, we propose a novel attention deployment strategy called Sparsity-aware Head-Parallel Load Balance (S-HPLB). Experiments on long-context benchmarks show that S-HPLB can achieve a $2.88\times$ improvement in average attention computation latency without quality degradation. 2026-03-11T03:03:58Z Di Liu Yifei Liu Chen Chen Zhibin Yu Xiaoyi Fan Quan Chen Minyi Guo http://arxiv.org/abs/2603.10342v1 AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU 2026-03-11T02:23:04Z Large language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning-action loops, interleaving model computation with external calls. 
Unlike traditional chat applications, these agentic workloads require inference serving systems to balance low latency, stable token emission, and throughput under multiple request arrivals from different AI agents. Recent deployments highlight a shift toward running small language models (SLMs) locally on consumer-grade GPUs, driven by privacy, compliance, and cost constraints. When heterogeneous requests overlap on a single GPU, long prefills and short decodes contend for resources, creating head-of-line blocking that destabilizes interactive performance. By analyzing agent workloads, we observe that their execution naturally separates into cold prefills, which process long system prompts, resume prefills, which append tool outputs to cached contexts, and short decodes, which are latency-critical. This mix intensifies contention compared to conventional chatbot serving. We present AgentServe, a single-GPU serving system that ensures stable multi-agent execution under such conditions by isolating prefills from decodes, applying dynamic budgeting to resume prefills, and allocating GPU resources through pre-established CUDA Green Context slots with adaptive control. Evaluation results show that AgentServe significantly improves latency stability while sustaining competitive throughput, achieving up to 2.8x TTFT improvement and 2.7x TPOT improvement over state-of-the-art baselines across different settings. 2026-03-11T02:23:04Z Yuning Zhang Yan Yan Nan Yang Dong Yuan http://arxiv.org/abs/2603.10242v1 ACE Runtime - A ZKP-Native Blockchain Runtime with Sub-Second Cryptographic Finality 2026-03-10T21:39:36Z Existing high performance blockchains verify one signature per transaction on the critical path, which creates O(N) verification cost, high hardware pressure, and difficult post quantum migration. This paper presents ACE Runtime, a ZKP native execution layer built on identity authorization separation. 
We replace per transaction signature checks with lightweight HMAC attestations in the hot path, then generate one aggregated zero knowledge finality certificate per block in an asynchronous prove stage. The system is organized as an Attest Execute Prove pipeline with two tier finality: soft finality from BFT voting and hard finality from proof verification. Under standard cryptographic assumptions, we provide formal arguments for attestation unforgeability and hard finality irreversibility. We also define a two phase timeout and backup proving path with witness availability gossip for liveness under builder failure. Quantitative results combine analytical modeling with reference implementation measurements. The prototype shows low CPU orchestration overhead, while model driven analysis projects constant per block verification cost, lower validator hardware requirements for non builders, and better bandwidth efficiency than per transaction signature designs. These results indicate that identity authorization separation is a practical architecture for sub second cryptographic finality with a clear path toward stronger post quantum components. 2026-03-10T21:39:36Z 23 pages, 3 figures, 14 tables Jian Sheng Wang http://arxiv.org/abs/2603.09875v1 The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation 2026-03-10T16:37:02Z The temporal assumptions underpinning conventional Identity and Access Management collapse under agentic execution regimes. A sixty-second revocation window permits on the order of $6 \times 10^3$ unauthorized API calls at 100 ops/tick; at AWS Lambda scale, the figure approaches $6 \times 10^5$. This is a coherence problem, not merely a latency problem. We define a Capability Coherence System (CCS) and construct a state-mapping $\varphi : \Sigma_{\rm MESI} \to \Sigma_{\rm auth}$ preserving transition structure under bounded-staleness semantics. 
A safety theorem bounds unauthorized operations for the execution-count Release Consistency-directed Coherence (RCC) strategy at $D_{\rm rcc} \leq n$, independent of agent velocity $v$ -- a qualitative departure from the $O(v \cdot \mathrm{TTL})$ scaling of time-bounded strategies. Tick-based discrete event simulation across three business-contextualised scenarios (four strategies, ten deterministic seeds each) confirms: RCC achieves a $120\times$ reduction versus TTL-based lease in the high-velocity scenario (50 vs. 6,000 unauthorized operations), and $184\times$ under anomaly-triggered revocation. Zero bound violations across all 120 runs confirm the per-capability safety guarantee. Simulation code: https://github.com/hipvlady/prizm 2026-03-10T16:37:02Z 18 pages, 3 figures. Simulation code at https://github.com/hipvlady/prizm Vladyslav Parakhin http://arxiv.org/abs/2603.09833v1 Rate-Distortion Bounds for Heterogeneous Random Fields on Finite Lattices 2026-03-10T15:55:52Z Since Shannon's foundational work, rate-distortion theory has defined the fundamental limits of lossy compression. Classical results, derived for memoryless and stationary ergodic sources in the asymptotic regime, have shaped both transform and predictive coding architectures, as well as practical standards such as JPEG. Finite-blocklength refinements, initiated by the non-asymptotic achievability and converse bounds of Kostina and Verdu, provide precise characterizations under excess-distortion probability constraints, but primarily for memoryless or statistically homogeneous models. In contrast, error-bounded practical lossy compressors for scientific computing, such as SZ, ZFP, MGARD, and SPERR, are designed for finite, high-dimensional, spatially correlated, and statistically heterogeneous random fields. These compressors partition data into fixed-size tiles that are processed independently, making tile size a central architectural constraint. 
Structural heterogeneity, finite lattice effects, and tiling constraints are not addressed by existing finite-blocklength analyses. This paper introduces a finite-blocklength rate-distortion framework for heterogeneous random fields on finite lattices, explicitly accounting for the tile-based architectures used in high-performance scientific compressors. The field is modeled as piecewise homogeneous with regionwise stationary second-order statistics, and tiling constraints are incorporated directly into the source model. Under an excess-distortion probability criterion, we establish non-asymptotic achievability and converse bounds and derive a second-order expansion that quantifies the impact of spatial correlation, region geometry, heterogeneity, and tile size on the rate and dispersion. 2026-03-10T15:55:52Z Sujata Sinha Vishwas Rao Robert Underwood David Lenz Sheng Di Franck Cappello Lingjia Liu http://arxiv.org/abs/2603.09738v1 Ensuring Data Freshness in Multi-Rate Task Chains Scheduling 2026-03-10T14:45:16Z In safety-critical autonomous systems, data freshness presents a fundamental design challenge. While the Logical Execution Time (LET) paradigm ensures compositional determinism, it often does so at the cost of injected latency, degrading the phase margin of high-frequency control loops. Furthermore, mapping heterogeneous, multi-rate sensor fusion requirements onto rigid task-centric schedules typically results in resource-inefficient oversampling. This paper proposes a Task-based scheduling framework extended with data freshness constraints. Unlike traditional models, scheduling decisions are driven by the lifespan of data. We introduce task offsets based on the data freshness constraint to order data production in a Just-in-Time (JIT) fashion: the completion of the production of data with the strictest data freshness constraint is delayed to the instant its consumers will be ready to use it. This allows for flexible task release offsets. 
We introduce a formal methodology to decompose Data Dependency Graphs into Dominant Paths by tracing the strictest data freshness constraints backward from the actuators. Based on this decomposition, we propose a Consensus Offset Search algorithm that synchronizes shared producers and private predecessors. This approach enforces end-to-end data freshness without the artificial latency of LET buffering. We formally prove that this offset-based alignment preserves the 100\% schedulability capacity of Global EDF, ensuring data freshness while eliminating the computational overhead of redundant sampling. 2026-03-10T14:45:16Z José Luis Conradi Hoffmann Antônio Augusto Fröhlich http://arxiv.org/abs/2603.10087v1 Pooling Engram Conditional Memory in Large Language Models using CXL 2026-03-10T14:13:02Z Engram conditional memory has emerged as a promising component for LLMs by decoupling static knowledge lookup from dynamic computation. Since Engram exhibits sparse access patterns and supports prefetching, its massive embedding tables are well-suited for offloading to lower-tier memory. In this paper, we propose using Compute Express Link (CXL) memory pool for Engram storage. Compared to RDMA, CXL provides fine-grained and low-latency access required by minimal and discrete retrieval patterns of Engram. We integrate the CXL-based Engram pool into SGLang, achieving near-DRAM end-to-end performance. This provides a scalable and cost-efficient storage solution for future Engram-integrated LLMs without compromising inference performance. 2026-03-10T14:13:02Z Submitted to EuroMLSys'26 Ruiyang Ma Teng Ma Zhiyuan Su Hantian Zha Xinpeng Zhao Xuchun Shang Xingrui Yi Zheng Liu Zhu Cao An Wu Zhichong Dou Ziqian Liu Daikang Kuang Guojie Luo http://arxiv.org/abs/2504.20067v2 Scalable and Performant Data Loading 2026-03-10T14:08:15Z We present SPDL (Scalable and Performant Data Loading), an open-source, framework-agnostic library designed for efficiently loading array data to GPU. 
Data loading is often a bottleneck in AI applications, and is challenging to optimize because it requires coordination of network calls, CPU-bound tasks, and GPU device transfer. On top of that, Python's GIL (Global Interpreter Lock) makes it difficult to gain performance improvement from multi-threading. We found that when data preprocessing functions release the GIL entirely, it is possible to execute them concurrently in a thread pool, thereby improving the workflow performance. Our benchmark shows that compared to the PyTorch DataLoader, SPDL can iterate through the ImageNet dataset 74% faster while using 38% less CPU and 50GB less memory. When training a ViT-B/16 model, SPDL can send data to the GPU at a speed that does not starve the training. Additionally, when using SPDL on Python 3.13t, without changing any code, the throughput is further improved by 33%, thanks to the disabled GIL. SPDL can improve the performance of current AI model training, and receives further performance improvements when Free-Threaded Python is adopted in production systems. SPDL is available at https://github.com/facebookresearch/spdl. 2025-04-23T19:59:43Z For the latest version of the software please visit https://facebookresearch.github.io/spdl/main/ Moto Hira Christian Puhrsch Valentin Andrei Roman Malinovskyy Gael Le Lan Abhinandan Krishnan Joseph Cummings Victor Bourgin Olga Gerasimova Miguel Martin Gokul Gunasekaran Yuta Inoue Alex J Turner Raghuraman Krishnamoorthi http://arxiv.org/abs/2603.09642v1 Multi-DNN Inference of Sparse Models on Edge SoCs 2026-03-10T13:16:59Z Modern edge applications increasingly require multi-DNN inference systems to execute tasks on heterogeneous processors, gaining performance from both concurrent execution and from matching each model to the most suited accelerator. 
However, existing systems support only a single model (or a few sparse variants) per task, which impedes the efficiency of this matching and results in high Service Level Objective violation rates. We introduce model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training. We present a demonstrator system, SparseLoom, that shows model stitching can be deployed to SoCs. We show experimentally that SparseLoom reduces SLO violation rates by up to 74%, improves throughput by up to 2.31x, and lowers memory overhead by an average of 28% compared to state-of-the-art multi-DNN inference systems. 2026-03-10T13:16:59Z Jiawei Luo Di Wu Simon Dobson Blesson Varghese http://arxiv.org/abs/2603.09577v1 Randomized Distributed Function Computation (RDFC): Ultra-Efficient Semantic Communication Applications to Privacy 2026-03-10T12:23:50Z We establish the randomized distributed function computation (RDFC) framework, in which a sender transmits just enough information for a receiver to generate a randomized function of the input data. Describing RDFC as a form of semantic communication, which can be essentially seen as a generalized remote-source-coding problem, we show that security and privacy constraints naturally fit this model, as they generally require a randomization step. Using strong coordination metrics, we ensure (local differential) privacy for every input sequence and prove that such guarantees can be met even when no common randomness is shared between the transmitter and receiver. This work provides lower bounds on Wyner's common information (WCI), which is the communication cost when common randomness is absent, and proposes numerical techniques to evaluate the other corner point of the RDFC rate region for continuous-alphabet random variables with unlimited shared randomness. 
Experiments illustrate that a sufficient amount of common randomness can reduce the semantic communication rate by up to two orders of magnitude compared to the WCI point, while RDFC without any shared randomness still outperforms lossless transmission by a large margin. A finite blocklength analysis further confirms that the privacy parameter gap between the asymptotic and non-asymptotic RDFC methods closes exponentially fast with input length. Our results position RDFC as an energy-efficient semantic communication strategy for privacy-aware distributed computation systems. 2026-03-10T12:23:50Z Onur Günlü 10.1186/s13635-026-00223-z http://arxiv.org/abs/2603.09568v1 Case Study: Performance Analysis of a Virtualized XRootD Frontend in Large-Scale WAN Transfers 2026-03-10T12:13:01Z This paper presents a detailed case study of the T2_BR_SPRACE storage frontend architecture and its observed performance in high-intensity data transfers. The architecture is composed of a heterogeneous cluster of XRootD [1] Virtual Machines (VMs) with 10 Gb/s and 40 Gb/s links, which aggregate data from a 77 Gb/s dCache [2] backend via pNFS to an external 100 Gb/s WAN link. We describe the system configuration, including the use of the BBR [3] congestion control algorithm and TCP extensions [4]. Under peak production conditions, we observed the system sustaining an aggregate throughput of 51.3 Gb/s. An analysis of a specific data flow to Fermilab (FNAL) showed peaks of 41.5 Gb/s, validated by external monitoring tools (CERN). This study documents the performance of a complex virtualized architecture under real load. 2026-03-10T12:13:01Z J M da Silva M A Costa R L Iope http://arxiv.org/abs/2603.09555v1 Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference 2026-03-10T12:03:00Z State-space model releases are typically coupled to fused CUDA and Triton kernels, inheriting a hard dependency on NVIDIA hardware. 
We show that Mamba-2's state space duality algorithm -- diagonal state structure, chunkable recurrence, and einsum-dominated compute with static control flow -- maps cleanly onto what XLA's fusion and tiling passes actually optimise, making custom kernels optional rather than required. We implement the full inference path (prefill, cached autoregressive decoding) as shaped standard primitives under XLA, without hand-written kernels, and realise the architecture's theoretical $O(1)$ state management as a compiled on-device cache requiring no host synchronisation during generation. The implementation runs unmodified on CPU, NVIDIA GPU, and Google Cloud TPU from a single JAX source. On TPU v6e across five model scales (130M--2.7B parameters), XLA-generated code reaches approximately 140 TFLOPS on single-stream prefill ($15\%$ MFU) and up to $64\%$ bandwidth utilisation on decode. Greedy decoding matches the PyTorch/CUDA reference token-for-token across 64 steps, with hidden-state agreement within float32 rounding tolerance. The pattern transfers to any SSM recurrence satisfying the same structural conditions, on any platform with a mature XLA backend. The implementation is publicly available at https://github.com/CosmoNaught/mamba2-jax and merged into the Bonsai JAX model library. 2026-03-10T12:03:00Z 18 pages, 6 figures. Code available at: https://github.com/CosmoNaught/mamba2-jax Cosmo Santoni http://arxiv.org/abs/2308.04604v2 A Survey on Decentralized Federated Learning 2026-03-10T10:59:15Z Federated learning (FL) enables collaborative training without pooling raw data, but standard FL relies on a central coordinator, which introduces a single point of failure and concentrates trust in the orchestration infrastructure. Decentralized federated learning (DFL) removes the coordinator and replaces client-server orchestration with peer-to-peer coordination, making learning dynamics topology-dependent and reshaping the associated security, privacy, and systems trade-offs. 
This survey systematically reviews DFL methods from 2018 through early 2026 and organizes them into two architectural families: traditional distributed FL and blockchain-based FL. We then propose a unified, challenge-driven taxonomy that maps both families to the core bottlenecks they primarily address, and we summarize prevailing evaluation practices and their limitations, exposing gaps in the literature. Finally, we distill lessons learned and outline research directions, emphasizing topology-aware threat models, privacy notions that reflect decentralized exposure, incentive mechanisms robust to manipulation, and the need to explicitly define whether the objective is a single global model or personalized solutions in decentralized settings. 2023-08-08T22:07:15Z Edoardo Gabrielli Anthony Di Pietro Dario Fenoglio Giovanni Pica Gabriele Tolomei http://arxiv.org/abs/2603.06170v2 Provuse: Platform-Side Function Fusion for Performance and Efficiency in FaaS Environments 2026-03-10T10:50:43Z Function-as-a-Service (FaaS) platforms provide scalable and cost-efficient execution but suffer from increased latency and resource overheads in complex applications comprising multiple functions, particularly due to double billing when functions call each other. This paper presents Provuse, a transparent, platform-side optimization that automatically performs function fusion at runtime for independently deployed functions, thereby eliminating redundant function instances. This approach reduces both cost and latency without requiring users to change any code. Provuse targets provider-managed FaaS platforms that retain control over function entry points and deployment artifacts, enabling transparent, runtime execution consolidation without developer intervention. We provide two implementations for this approach using the tinyFaaS platform as well as Kubernetes, demonstrating compatibility with container orchestration frameworks. 
An evaluation shows consistent improvements, achieving an average end-to-end latency reduction of 26.33% and a mean RAM usage reduction of 53.57%. These results indicate that automatic function fusion is an effective platform-side strategy for reducing latency and RAM consumption in composed FaaS applications, highlighting the potential of transparent infrastructure-level optimizations in serverless systems. 2026-03-06T11:28:40Z Accepted for publication at the 4th Workshop on SErverless Systems, Applications and MEthodologies (SESAME '26) Niklas Kowallik Natalie Carl Leon Pöllinger Wei Wang Sharan Santhanam David Bermbach