https://arxiv.org/api/zURsaTgTIE7qLX2n4h7Z1piw1oo2026-03-22T13:00:42Z2772412015http://arxiv.org/abs/2401.16685v2Communication-Efficient Multimodal Federated Learning: Joint Modality and Client Selection2026-03-11T03:47:33ZMultimodal federated learning (MFL) aims to enrich model training in FL settings where clients are collecting measurements across multiple modalities. However, key challenges to MFL remain unaddressed, particularly in heterogeneous network settings where: (i) the set of modalities collected by each client is diverse, and (ii) communication limitations prevent clients from uploading all their locally trained modality encoders to the server. In this paper, we propose Multimodal Federated learning with joint Modality and Client selection (MFedMC), a communication-efficient MFL framework that tackles these challenges through a decoupled architecture and selective uploading. Unlike traditional holistic fusion approaches, MFedMC separates modality encoders and fusion modules: modality encoders are aggregated at the server for generalization across diverse client distributions, while fusion modules remain local to each client for personalized adaptation to individual modality configurations and data characteristics. Building on this decoupled design, our joint selection algorithm incorporates two main components: (a) A modality selection methodology for each client, which weighs (i) the impact of the modality, gauged by Shapley value analysis, (ii) the modality encoder size as a gauge of communication overhead, and (iii) the frequency of modality encoder updates, denoted recency, to enhance generalizability. (b) A client selection strategy for the server based on the local loss of modality encoders at each client. Experiments on five real-world datasets demonstrate that MFedMC achieves comparable accuracy to several baselines while reducing communication overhead by over 20$\times$. 
A demo video and our code are available at https://liangqiy.com/mfedmc/.2024-01-30T02:16:19ZarXiv admin note: text overlap with arXiv:2310.07048Liangqi YuanDong-Jun HanSu WangDevesh UpadhyayChristopher G. Brintonhttp://arxiv.org/abs/2603.10353v1S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance2026-03-11T03:03:58ZWith the increasing volumes of Large Language Models (LLMs) and expanding context lengths, attention computation has become a key performance bottleneck in LLM serving. For fast attention computation, recent practices often parallelize the attention heads on multiple GPUs, and also widely adopt attention sparsification to reduce the computation amount -- which selectively computes a subset of attention pairs under a preset sparsity budget. In this paper, we notice that attention heads of an LLM often exhibit heterogeneous-yet-stable sparsity elasticities, which motivates us to enforce head-adaptive sparsity budgets to attain better efficiency while preserving high inference quality. Yet, from the system aspect, with heterogeneous sparsity levels, attention computation time on different heads would be inconsistent, yielding cross-GPU resource bubbles under head-parallel deployment. To further minimize such bubbles, we propose a novel attention deployment strategy called Sparsity-aware Head-Parallel Load Balance (S-HPLB). Long-context benchmark experiments show that S-HPLB can achieve a $2.88\times$ improvement in average attention computation latency without quality degradation.2026-03-11T03:03:58ZDi LiuYifei LiuChen ChenZhibin YuXiaoyi FanQuan ChenMinyi Guohttp://arxiv.org/abs/2603.10342v1AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU2026-03-11T02:23:04ZLarge language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning-action loops, interleaving model computation with external calls. 
Unlike traditional chat applications, these agentic workloads require inference serving systems to balance low latency, stable token emission, and throughput under multiple request arrivals from different AI agents. Recent deployments highlight a shift toward running small language models (SLMs) locally on consumer-grade GPUs, driven by privacy, compliance, and cost constraints. When heterogeneous requests overlap on a single GPU, long prefills and short decodes contend for resources, creating head-of-line blocking that destabilizes interactive performance. By analyzing agent workloads, we observe that their execution naturally separates into cold prefills, which process long system prompts, resume prefills, which append tool outputs to cached contexts, and short decodes, which are latency-critical. This mix intensifies contention compared to conventional chatbot serving. We present AgentServe, a single-GPU serving system that ensures stable multi-agent execution under such conditions by isolating prefills from decodes, applying dynamic budgeting to resume prefills, and allocating GPU resources through pre-established CUDA Green Context slots with adaptive control. Evaluation results show that AgentServe significantly improves latency stability while sustaining competitive throughput, achieving up to 2.8x TTFT improvement and 2.7x TPOT improvement over state-of-the-art baselines across different settings.2026-03-11T02:23:04ZYuning ZhangYan YanNan YangDong Yuanhttp://arxiv.org/abs/2603.10242v1ACE Runtime - A ZKP-Native Blockchain Runtime with Sub-Second Cryptographic Finality2026-03-10T21:39:36ZExisting high performance blockchains verify one signature per transaction on the critical path, which creates O(N) verification cost, high hardware pressure, and difficult post quantum migration. This paper presents ACE Runtime, a ZKP native execution layer built on identity authorization separation. 
We replace per transaction signature checks with lightweight HMAC attestations in the hot path, then generate one aggregated zero knowledge finality certificate per block in an asynchronous prove stage. The system is organized as an Attest Execute Prove pipeline with two tier finality: soft finality from BFT voting and hard finality from proof verification. Under standard cryptographic assumptions, we provide formal arguments for attestation unforgeability and hard finality irreversibility. We also define a two phase timeout and backup proving path with witness availability gossip for liveness under builder failure. Quantitative results combine analytical modeling with reference implementation measurements. The prototype shows low CPU orchestration overhead, while model driven analysis projects constant per block verification cost, lower validator hardware requirements for non builders, and better bandwidth efficiency than per transaction signature designs. These results indicate that identity authorization separation is a practical architecture for sub second cryptographic finality with a clear path toward stronger post quantum components.2026-03-10T21:39:36Z23 pages, 3 figures, 14 tablesJian Sheng Wanghttp://arxiv.org/abs/2603.09875v1The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation2026-03-10T16:37:02ZThe temporal assumptions underpinning conventional Identity and Access Management collapse under agentic execution regimes. A sixty-second revocation window permits on the order of $6 \times 10^3$ unauthorized API calls at 100 ops/tick; at AWS Lambda scale, the figure approaches $6 \times 10^5$. This is a coherence problem, not merely a latency problem. We define a Capability Coherence System (CCS) and construct a state-mapping $\varphi : Σ_{\rm MESI} \to Σ_{\rm auth}$ preserving transition structure under bounded-staleness semantics. 
A safety theorem bounds unauthorized operations for the execution-count Release Consistency-directed Coherence (RCC) strategy at $D_{\rm rcc} \leq n$, independent of agent velocity $v$ -- a qualitative departure from the $O(v \cdot \mathrm{TTL})$ scaling of time-bounded strategies. Tick-based discrete event simulation across three business-contextualised scenarios (four strategies, ten deterministic seeds each) confirms: RCC achieves a $120\times$ reduction versus TTL-based lease in the high-velocity scenario (50 vs. 6,000 unauthorized operations), and $184\times$ under anomaly-triggered revocation. Zero bound violations across all 120 runs confirm the per-capability safety guarantee. Simulation code: https://github.com/hipvlady/prizm2026-03-10T16:37:02Z18 pages, 3 figures. Simulation code at https://github.com/hipvlady/prizmVladyslav Parakhinhttp://arxiv.org/abs/2603.09833v1Rate-Distortion Bounds for Heterogeneous Random Fields on Finite Lattices2026-03-10T15:55:52ZSince Shannon's foundational work, rate-distortion theory has defined the fundamental limits of lossy compression. Classical results, derived for memoryless and stationary ergodic sources in the asymptotic regime, have shaped both transform and predictive coding architectures, as well as practical standards such as JPEG. Finite-blocklength refinements, initiated by the non-asymptotic achievability and converse bounds of Kostina and Verdu, provide precise characterizations under excess-distortion probability constraints, but primarily for memoryless or statistically homogeneous models. In contrast, error-bounded practical lossy compressors for scientific computing, such as SZ, ZFP, MGARD, and SPERR, are designed for finite, high-dimensional, spatially correlated, and statistically heterogeneous random fields. These compressors partition data into fixed-size tiles that are processed independently, making tile size a central architectural constraint. 
Structural heterogeneity, finite lattice effects, and tiling constraints are not addressed by existing finite-blocklength analyses. This paper introduces a finite-blocklength rate-distortion framework for heterogeneous random fields on finite lattices, explicitly accounting for the tile-based architectures used in high-performance scientific compressors. The field is modeled as piecewise homogeneous with regionwise stationary second-order statistics, and tiling constraints are incorporated directly into the source model. Under an excess-distortion probability criterion, we establish non-asymptotic achievability and converse bounds and derive a second-order expansion that quantifies the impact of spatial correlation, region geometry, heterogeneity, and tile size on the rate and dispersion.2026-03-10T15:55:52ZSujata SinhaVishwas RaoRobert UnderwoodDavid LenzSheng DiFranck CappelloLingjia Liuhttp://arxiv.org/abs/2603.09738v1Ensuring Data Freshness in Multi-Rate Task Chains Scheduling2026-03-10T14:45:16ZIn safety-critical autonomous systems, data freshness presents a fundamental design challenge. While the Logical Execution Time (LET) paradigm ensures compositional determinism, it often does so at the cost of injected latency, degrading the phase margin of high-frequency control loops. Furthermore, mapping heterogeneous, multi-rate sensor fusion requirements onto rigid task-centric schedules typically results in resource-inefficient oversampling. This paper proposes a Task-based scheduling framework extended with data freshness constraints. Unlike traditional models, scheduling decisions are driven by the lifespan of data. We introduce task offsets based on the data freshness constraint to order data production in a Just-in-Time (JIT) fashion: the completion of the production of data with the strictest data freshness constraint is delayed to the instant its consumers will be ready to use it. This allows for flexible task release offsets. 
We introduce a formal methodology to decompose Data Dependency Graphs into Dominant Paths by tracing the strictest data freshness constraints backward from the actuators. Based on this decomposition, we propose a Consensus Offset Search algorithm that synchronizes shared producers and private predecessors. This approach enforces end-to-end data freshness without the artificial latency of LET buffering. We formally prove that this offset-based alignment preserves the 100\% schedulability capacity of Global EDF, ensuring data freshness while eliminating the computational overhead of redundant sampling.2026-03-10T14:45:16ZJosé Luis Conradi HoffmannAntônio Augusto Fröhlichhttp://arxiv.org/abs/2603.10087v1Pooling Engram Conditional Memory in Large Language Models using CXL2026-03-10T14:13:02ZEngram conditional memory has emerged as a promising component for LLMs by decoupling static knowledge lookup from dynamic computation. Since Engram exhibits sparse access patterns and supports prefetching, its massive embedding tables are well-suited for offloading to lower-tier memory. In this paper, we propose using Compute Express Link (CXL) memory pool for Engram storage. Compared to RDMA, CXL provides fine-grained and low-latency access required by minimal and discrete retrieval patterns of Engram. We integrate the CXL-based Engram pool into SGLang, achieving near-DRAM end-to-end performance. This provides a scalable and cost-efficient storage solution for future Engram-integrated LLMs without compromising inference performance.2026-03-10T14:13:02ZSubmitted to EuroMLSys'26Ruiyang MaTeng MaZhiyuan SuHantian ZhaXinpeng ZhaoXuchun ShangXingrui YiZheng LiuZhu CaoAn WuZhichong DouZiqian LiuDaikang KuangGuojie Luohttp://arxiv.org/abs/2504.20067v2Scalable and Performant Data Loading2026-03-10T14:08:15ZWe present SPDL (Scalable and Performant Data Loading), an open-source, framework-agnostic library designed for efficiently loading array data to GPU. 
Data loading is often a bottleneck in AI applications, and is challenging to optimize because it requires coordination of network calls, CPU-bound tasks, and GPU device transfer. On top of that, Python's GIL (Global Interpreter Lock) makes it difficult to gain performance improvement from multi-threading. We found that when data preprocessing functions release the GIL entirely, it is possible to execute them concurrently in a thread pool, thereby improving the workflow performance. Our benchmark shows that compared to the PyTorch DataLoader, SPDL can iterate through the ImageNet dataset 74% faster while using 38% less CPU and 50GB less memory. When training a ViT-B/16 model, SPDL can send data to the GPU at a speed that does not starve the training. Additionally, when using SPDL on Python 3.13t, without changing any code, the throughput is further improved by 33%, thanks to the disabled GIL. SPDL can improve the performance of current AI model training, and receives further performance improvements when Free-Threaded Python is adopted in production systems. SPDL is available at https://github.com/facebookresearch/spdl.2025-04-23T19:59:43ZFor the latest version of the software please visit https://facebookresearch.github.io/spdl/main/Moto HiraChristian PuhrschValentin AndreiRoman MalinovskyyGael Le LanAbhinandan KrishnanJoseph CummingsVictor BourginOlga GerasimovaMiguel MartinGokul GunasekaranYuta InoueAlex J TurnerRaghuraman Krishnamoorthihttp://arxiv.org/abs/2603.09642v1Multi-DNN Inference of Sparse Models on Edge SoCs2026-03-10T13:16:59ZModern edge applications increasingly require multi-DNN inference systems to execute tasks on heterogeneous processors, gaining performance from both concurrent execution and from matching each model to the most suited accelerator. However, existing systems support only a single model (or a few sparse variants) per task, which impedes the efficiency of this matching and results in high Service Level Objective violation rates. 
We introduce model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training. We present a demonstrator system, SparseLoom, that shows model stitching can be deployed to SoCs. We show experimentally that SparseLoom reduces SLO violation rates by up to 74%, improves throughput by up to 2.31x, and lowers memory overhead by an average of 28% compared to state-of-the-art multi-DNN inference systems.2026-03-10T13:16:59ZJiawei LuoDi WuSimon DobsonBlesson Varghesehttp://arxiv.org/abs/2603.09577v1Randomized Distributed Function Computation (RDFC): Ultra-Efficient Semantic Communication Applications to Privacy2026-03-10T12:23:50ZWe establish the randomized distributed function computation (RDFC) framework, in which a sender transmits just enough information for a receiver to generate a randomized function of the input data. Describing RDFC as a form of semantic communication, which can be essentially seen as a generalized remote-source-coding problem, we show that security and privacy constraints naturally fit this model, as they generally require a randomization step. Using strong coordination metrics, we ensure (local differential) privacy for every input sequence and prove that such guarantees can be met even when no common randomness is shared between the transmitter and receiver.
This work provides lower bounds on Wyner's common information (WCI), which is the communication cost when common randomness is absent, and proposes numerical techniques to evaluate the other corner point of the RDFC rate region for continuous-alphabet random variables with unlimited shared randomness. Experiments illustrate that a sufficient amount of common randomness can reduce the semantic communication rate by up to two orders of magnitude compared to the WCI point, while RDFC without any shared randomness still outperforms lossless transmission by a large margin. A finite blocklength analysis further confirms that the privacy parameter gap between the asymptotic and non-asymptotic RDFC methods closes exponentially fast with input length. Our results position RDFC as an energy-efficient semantic communication strategy for privacy-aware distributed computation systems.2026-03-10T12:23:50ZOnur Günlü10.1186/s13635-026-00223-zhttp://arxiv.org/abs/2603.09568v1Case Study: Performance Analysis of a Virtualized XRootD Frontend in Large-Scale WAN Transfers2026-03-10T12:13:01ZThis paper presents a detailed case study of the T2_BR_SPRACE storage frontend architecture and its observed performance in high-intensity data transfers. The architecture is composed of a heterogeneous cluster of XRootD [1] Virtual Machines (VMs) with 10 Gb/s and 40 Gb/s links, which aggregate data from a 77 Gb/s dCache [2] backend via pNFS to an external 100 Gb/s WAN link. We describe the system configuration, including the use of the BBR [3] congestion control algorithm and TCP extensions [4]. Under peak production conditions, we observed the system sustaining an aggregate throughput of 51.3 Gb/s. An analysis of a specific data flow to Fermilab (FNAL) showed peaks of 41.5 Gb/s, validated by external monitoring tools (CERN). 
This study documents the performance of a complex virtualized architecture under real load.2026-03-10T12:13:01ZJ M da SilvaM A CostaR L Iopehttp://arxiv.org/abs/2603.09555v1Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference2026-03-10T12:03:00ZState-space model releases are typically coupled to fused CUDA and Triton kernels, inheriting a hard dependency on NVIDIA hardware. We show that Mamba-2's state space duality algorithm -- diagonal state structure, chunkable recurrence, and einsum-dominated compute with static control flow -- maps cleanly onto what XLA's fusion and tiling passes actually optimise, making custom kernels optional rather than required. We implement the full inference path (prefill, cached autoregressive decoding) as shaped standard primitives under XLA, without hand-written kernels, and realise the architecture's theoretical $O(1)$ state management as a compiled on-device cache requiring no host synchronisation during generation. The implementation runs unmodified on CPU, NVIDIA GPU, and Google Cloud TPU from a single JAX source. On TPU v6e across five model scales (130M--2.7B parameters), XLA-generated code reaches approximately 140 TFLOPS on single-stream prefill ($15\%$ MFU) and up to $64\%$ bandwidth utilisation on decode. Greedy decoding matches the PyTorch/CUDA reference token-for-token across 64 steps, with hidden-state agreement within float32 rounding tolerance. The pattern transfers to any SSM recurrence satisfying the same structural conditions, on any platform with a mature XLA backend. The implementation is publicly available at https://github.com/CosmoNaught/mamba2-jax and merged into the Bonsai JAX model library.2026-03-10T12:03:00Z18 pages, 6 figures. 
Code available at: https://github.com/CosmoNaught/mamba2-jaxCosmo Santonihttp://arxiv.org/abs/2308.04604v2A Survey on Decentralized Federated Learning2026-03-10T10:59:15ZFederated learning (FL) enables collaborative training without pooling raw data, but standard FL relies on a central coordinator, which introduces a single point of failure and concentrates trust in the orchestration infrastructure. Decentralized federated learning (DFL) removes the coordinator and replaces client-server orchestration with peer-to-peer coordination, making learning dynamics topology-dependent and reshaping the associated security, privacy, and systems trade-offs. This survey systematically reviews DFL methods from 2018 through early 2026 and organizes them into two architectural families: traditional distributed FL and blockchain-based FL. We then propose a unified, challenge-driven taxonomy that maps both families to the core bottlenecks they primarily address, and we summarize prevailing evaluation practices and their limitations, exposing gaps in the literature. Finally, we distill lessons learned and outline research directions, emphasizing topology-aware threat models, privacy notions that reflect decentralized exposure, incentive mechanisms robust to manipulation, and the need to explicitly define whether the objective is a single global model or personalized solutions in decentralized settings.2023-08-08T22:07:15ZEdoardo GabrielliAnthony Di PietroDario FenoglioGiovanni PicaGabriele Tolomeihttp://arxiv.org/abs/2603.06170v2Provuse: Platform-Side Function Fusion for Performance and Efficiency in FaaS Environments2026-03-10T10:50:43ZFunction-as-a-Service (FaaS) platforms provide scalable and cost-efficient execution but suffer from increased latency and resource overheads in complex applications comprising multiple functions, particularly due to double billing when functions call each other. 
This paper presents Provuse, a transparent, platform-side optimization that automatically performs function fusion at runtime for independently deployed functions, thereby eliminating redundant function instances. This approach reduces both cost and latency without requiring users to change any code. Provuse targets provider-managed FaaS platforms that retain control over function entry points and deployment artifacts, enabling transparent, runtime execution consolidation without developer intervention. We provide two implementations for this approach using the tinyFaaS platform as well as Kubernetes, demonstrating compatibility with container orchestration frameworks. An evaluation shows consistent improvements, achieving an average end-to-end latency reduction of 26.33% and a mean RAM usage reduction of 53.57%. These results indicate that automatic function fusion is an effective platform-side strategy for reducing latency and RAM consumption in composed FaaS applications, highlighting the potential of transparent infrastructure-level optimizations in serverless systems.2026-03-06T11:28:40ZAccepted for publication at the 4th Workshop on SErverless Systems, Applications and MEthodologies (SESAME '26)Niklas KowallikNatalie CarlLeon PöllingerWei WangSharan SanthanamDavid Bermbach