Idiosyncrasies of Programmable Caching Engines

2026-03-15T12:47:06Z

Programmable caching engines like CacheLib are widely used in production systems to support diverse workloads in multi-tenant environments. CacheLib's design focuses on performance, portability, and configurability, allowing applications to inherit caching improvements with minimal implementation effort. However, its behavior under dynamic and evolving workloads remains largely unexplored. This paper presents an empirical study of CacheLib with multi-tenant settings under dynamic and volatile environments. Our evaluation across multiple CacheLib configurations reveals several limitations that hinder its effectiveness under such environments, including rigid configurations, limited runtime adaptability, lack of quality-of-service support and coordination, which lead to suboptimal performance, inefficient memory usage, and tenant starvation. Based on these findings, we outline future research directions to improve the adaptability, fairness, and programmability of future caching engines.

AeroGen: Agentic Drone Autonomy through Single-Shot Structured Prompting & Drone SDK

2026-03-15T06:16:02Z

Designing correct UAV autonomy programs is challenging due to joint navigation, sensing and analytics requirements. While LLMs can generate code, their reliability for safety-critical UAVs remains uncertain. This paper presents AeroGen, an open-loop framework that enables consistently correct single-shot AI-generated drone control programs through structured guardrail prompting and integration with the AeroDaaS drone SDK. AeroGen encodes API descriptions, flight constraints and operational world rules directly into the system context prompt, enabling generic LLMs to produce constraint-aware code from user prompts, with minimal example code. We evaluate AeroGen across a diverse benchmark of 20 navigation tasks and 5 drone missions on urban, farm and inspection environments, using both imperative and declarative user prompts. AeroGen generates about 40 lines of AeroDaaS Python code in about 20s per mission, in both real-world and simulations, showing that structured prompting with a well-defined SDK improves robustness, correctness and deployability of LLM-generated drone autonomy programs.

The Big Send-off: Scalable and Performant Collectives for Deep Learning

2026-03-15T03:52:28Z

Collective communication is becoming increasingly important in data center and supercomputer workloads with an increase in distributed AI related jobs. However, existing libraries that provide collective support such as NCCL, RCCL, and Cray-MPICH exhibit several performance and scalability limitations on modern GPU supercomputers. To address these challenges, we introduce the Performant Collective Communication Library (PCCL), specifically targeted for distributed deep learning (DL) workloads. PCCL provides highly optimized implementations of key collectives used in distributed DL: all-gather, reduce-scatter, and all-reduce. PCCL uses a hierarchical design with learning-based adaptive selection of the best performing algorithms to scale efficiently to thousands of GPUs. It achieves substantial performance speedups over RCCL on 2048 GCDs of Frontier -- up to 168x for reduce-scatter, 33x for all-gather and 10x for all-reduce. More modest but still significant gains up to 5.7x over NCCL are observed on Perlmutter. These gains translate directly to performance improvement of production DL workloads: up to 4.9x speedup over RCCL in DeepSpeed ZeRO-3 training, and up to 2.4x speedup in DDP training.

A Case for CATS: A Conductor-driven Asymmetric Transport Scheme for Semantic Prioritization

2026-03-14T13:36:15Z

Standard transport protocols like TCP operate as a blind, FIFO conveyor belt for data, a model that is increasingly suboptimal for latency-sensitive and interactive applications. This paper challenges this model by introducing CATS (Conductor-driven Asymmetric Transport Scheme), a framework that provides TCP with the semantic awareness necessary to prioritize critical content. By centralizing scheduling intelligence in a transport-native "Conductor", CATS significantly improves user-perceived performance by delivering essential data first. This architecture directly confronts a cascade of historical performance workarounds and their limitations, including the high overhead of parallel connections in HTTP/1.1, the transport-layer Head-of-Line blocking in HTTP/2, and the observed implementation heterogeneity of prioritization in HTTP/3 over QUIC. Built upon TCP BBR, our ns-3 implementation demonstrates this principle by reducing the First Contentful Paint by over 78% in a representative webpage download configured as a deliberate worst-case scenario, with no penalty to total page load time compared to the baseline.

FedPBS: Proximal-Balanced Scaling Federated Learning Model for Robust Personalized Training for Non-IID Data

2026-03-14T11:44:27Z

Federated learning (FL) enables a set of distributed clients to jointly train machine learning models while preserving their local data privacy, making it attractive for applications in healthcare, finance, mobility, and smart-city systems. However, FL faces several challenges, including statistical heterogeneity and uneven client participation, which can degrade convergence and model quality. In this work, we propose FedPBS, an FL algorithm that couples complementary ideas from FedBS and FedProx to address these challenges. FedPBS dynamically adapts batch sizes to client resources to support balanced and scalable participation, and selectively applies a proximal correction to small-batch clients to stabilize local updates and reduce divergence from the global model. Experiments on benchmarking datasets such as CIFAR-10 and UCI-HAR under highly non-IID settings demonstrate that FedPBS consistently outperforms state-of-the-art methods, including FedBS, FedGA, MOON, and FedProx. The results demonstrate robust performance gains under extreme data heterogeneity, with smooth loss curves indicating stable convergence across diverse federated environments. FedPBS consistently outperforms state-of-the-art federated learning baselines on UCI-HAR and CIFAR-10 under severe non-IID conditions while maintaining stable and reliable convergence.

Justitia: Fair and Efficient Scheduling of Task-parallel LLM Agents with Selective Pampering

2026-03-14T10:58:56Z

LLM agents, which often comprise parallel inference tasks, are commonly adopted to solve real-world problems. When serving such task-parallel LLM agents in shared GPU servers, the scheduler is expected to attain fast agent completion with guaranteed worst-case performance. For that objective, our insight is to selectively pampering agents based on their completion order under idealized fair-sharing. We design Justitia, a fair and also efficient scheduler for task-parallel LLM agents. Noticing that memory is prevalently a bottleneck in LLM serving, Justitia quantifies the true agent cost in a memory-centric manner. It also adopts a light-weight yet accurate method to predict agent costs. Finally, Justitia adopts a virtual-time based fair queuing algorithm to reduce the overall performance with guaranteed worst-case delay. We have implemented Justitia atop vLLM, and experimental results involving diverse agents show that it can substantially enhance the scheduling efficiency with fairness preserved.

Population Protocols Revisited: Parity and Beyond

2026-03-14T08:20:32Z

For nearly two decades, population protocols have been extensively studied, yielding efficient solutions for central problems in distributed computing, including leader election, and majority computation, a predicate type in Presburger Arithmetic closely tied to population protocols. Surprisingly, no protocols have achieved both time- and space-efficiency for congruency predicates, such as parity computation, which are complementary in this arithmetic framework. This gap highlights a significant challenge in the field. To address this gap, we explore the parity problem, where agents are tasked with computing the parity of the given sub-population size. Then we extend the solution for parity to compute congruences modulo an arbitrary $m$. Previous research on efficient population protocols has focused on protocols that minimise both stabilisation time and state utilisation for specific problems. In contrast, this work slightly relaxes this expectation, permitting protocols to place less emphasis on full optimisation and more on universality, robustness, and probabilistic guarantees. This allows us to propose a novel computing paradigm that integrates population weights (or simply weights), a robust clocking mechanism, and efficient anomaly detection coupled with a switching mechanism (which ensures slow but always correct solutions). This paradigm facilitates universal design of efficient multistage stable population protocols. Specifically, the first efficient parity and congruence protocols introduced here use both $O(\log^3 n)$ states and achieve silent stabilisation in $O(\log^3 n)$ time. We conclude by discussing the impact of implicit conversion between unary and binary representations enabled by the weight system, with applications to other problems, including the computation and representation of (sub-)population sizes.

The Forward-In-Time-Only Assumption in SmartNIC Resource Management: A Critique of Wave and the Case for Bilateral Interaction

2026-03-14T04:51:06Z

The datacenter industry is converging on SmartNIC-based resource management. Wave (Humphries et al., ASPLOS '25) demonstrates the practical feasibility of offloading kernel thread scheduling, memory management, and RPC stacks to the ARM cores of Intel's Mount Evans Infrastructure Processing Unit (IPU). The engineering is careful and the results are honest: without Wave's PCIe latency mitigations, offloaded workloads degrade by 350%. We argue that this 350% degradation is not an engineering problem to be optimized away but a diagnostic symptom of a deeper architectural issue: Wave's communication model is Forward-In-Time-Only (FITO). Every interaction between host and SmartNIC is a unidirectional message -- event forward, decision back -- creating a temporal vulnerability window in which decisions can become stale before they are enforced. Wave's entire optimization stack (write-combining page table entries, prestaging, prefetching, atomic transaction abort) exists to hide or tolerate this window. We apply the FITO diagnostic to Wave's architecture systematically, identify the category mistake it inherits from Lamport's happened-before and Shannon's channel model, and show how Open Atomic Ethernet's bilateral swap primitive -- implemented on the same Intel IPU hardware -- dissolves the latency, atomicity, and timeout problems without engineering around them. The SmartNIC is the right location for resource management; what is missing is the right communication primitive at that location.

The Markovianity of Time: The Category Mistake in Open Quantum Systems

2026-03-14T03:50:26Z

The Markov approximation is arguably the most ubiquitous tool in physics, underpinning quantum master equations, stochastic processes, and -- via Shannon's channel model and Lamport's logical clocks -- the foundational assumptions of distributed computing. It is widely assumed that Markovianity inherently implies temporal asymmetry: that the Markov property is a forward-in-time-only (FITO) construct. We show that this assumption is a category mistake in the sense of Ryle (1949). Guff, Shastry, and Rocco (2025) have recently demonstrated that the Markov approximation applied to the Caldeira-Leggett model -- a paradigmatic open quantum system -- maintains time-reversal symmetry in the derived equations of motion. The resulting time-symmetric formulations of quantum Brownian motion, Lindblad master equations, and Pauli master equations describe thermalisation that can occur in two opposing temporal directions. Asymmetry arises not from the dynamics but from boundary conditions. We trace how Markovianity's assumed directionality propagated from physics through Shannon's information theory to Lamport's happens-before relation and the impossibility theorems of distributed computing (FLP, CAP, Two Generals). Each step encodes FITO as convention, then treats it as physical law -- the same category mistake repeated across domains. The Surrey result establishes that this conflation is not merely philosophically suspect but mathematically unnecessary: the most fundamental approximation used to derive irreversibility is itself time-symmetric.

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

2026-03-14T02:46:46Z

Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of training data. When preparing data for LFM training that originates from multiple, distinct sources, two fundamental challenges arise. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to significant workload imbalance among dataloaders, degrading the training efficiency. Second, supporting diverse data sources requires per-dataset file access states that are redundantly replicated across parallel loaders, consuming excessive memory. This also hinders dynamic data mixing (e.g., curriculum learning) and causes redundant access/memory overhead in hybrid parallelism. We present MegaScale-Data, an industrial-grade distributed data loading architecture for multisource LFMs training, with three key innovations: (1) Disaggregated data preprocessing via role-specific actors (Source Loaders/Data Constructors) to eliminate source and parallelism redundant data access and ensure multisource scalability. (2) Centralized and declarative data plane for load-time multisource orchestration, such as long-short context, multimodality, and curriculum learning. (3) Multi-level auto-partitioning and scaling mechanism for source loaders under heterogeneous preprocessing costs. We also contribute our designs and operational experience in deployment and fault tolerance. MegaScale-Data achieves up to: (1) 4.5x end-to-end training throughput improvement, and (2) 13.5x reduction in CPU memory usage.

Audo-Sight: AI-driven Ambient Perception Across Edge-Cloud for Blind and Low Vision Users

2026-03-14T00:30:04Z

Despite advances in assistive technologies, Blind and Low-Vision (BLV) individuals continue to face challenges in understanding their surroundings. Delivering concise, useful, and timely scene descriptions for ambient perception remains a long-standing accessibility problem. To address this, we introduce Audo-Sight, an AI-driven assistive system across Edge-Cloud that enables BLV individuals to perceive their surroundings through voice-based conversational interaction. Audo-Sight employs a set of expert and generic AI agents, each supported by dedicated processing pipelines distributed across edge and cloud. It analyzes user queries by considering urgency and contextual information to infer the user intent and dynamically route each query, along with a scene frame, to the most suitable pipeline. In cases where users require fast responses, the system simultaneously leverages edge and cloud processing pipelines. The edge generates an initial response quickly, while the cloud provides more detailed and accurate information. To overcome the challenge of seamlessly combining these outputs, we introduce the Response Fusion Engine, which fuses the fast edge response with the more accurate cloud output, ensuring timely and high-accuracy response for the BLV users. Systematic evaluation shows that Audo-Sight delivers speech output around 80% faster for urgent tasks and generates complete responses approximately 50% faster across all tasks compared to a commercial cloud-based solution -- highlighting the effectiveness of our system across edge-cloud. Human evaluation of Audo-Sight shows that it is the preferred choice over GPT-5 for 62% of BLV participants with another 23% stating both perform comparably.

NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

2026-03-13T21:28:22Z

Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP, Hybrid-EP, and others. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting Low-Latency (LL) mode for inference decoding and High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity with double-buffered communication for overlapping dispatch and combine phases. HT targets large batches (4096+ tokens) using hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes leverage Device API for both intra- and inter-node communications, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.

Neuromorphic Computing: A Theoretical Framework for Time, Space, and Energy Scaling

2026-03-13T19:42:48Z

Neuromorphic computing (NMC) is increasingly viewed as a low-power alternative to conventional von Neumann architectures such as central processing units (CPUs) and graphics processing units (GPUs), however the computational value proposition has been difficult to define precisely. Here, we propose a computational framework for analyzing NMC algorithms and architectures. Using this framework, we demonstrate that NMC can be analyzed as general-purpose and programmable even though it differs considerably from a conventional stored-program architecture. We show that the time and space scaling of idealized NMC has comparable time and footprint tradeoffs that align with that of a theoretically infinite processor conventional system. In contrast, energy scaling for NMC is significantly different than conventional systems, as NMC energy costs are event-driven. Using this framework, we show that while energy in conventional systems is largely determined by the scheduled operations determined by the structural algorithm graph, the energy of neuromorphic systems scales with the activity of the algorithm, that is the activity trace of the algorithm graph. Without making strong assumptions on NMC or conventional costs, we demonstrate which neuromorphic algorithm formulations can exhibit asymptotically improved energy scaling when activity is sparse and decaying over time. We further use these results to identify which broad algorithm families are more or less suitable for NMC approaches.

A common parallel framework for LLP combinatorial problems

2026-03-13T16:34:57Z

Traditional lock-free parallel algorithms for combinatorial optimization problems, such as shortest paths, stable matching, and job scheduling require programmers to write problem-specific routines and synchronization code. We propose a general-purpose lock-free runtime, LLP-FW that can solve all combinatorial optimization problems that can be formulated as a Lattice-Linear Predicate by advancing all forbidden local states in parallel until a solution emerges. The only problem-specific code is a definition of the forbiddenness check and a definition of the advancement. We show that LLP-FW can solve several different combinatorial optimization problems, such as Single Source Shortest Paths (SSSP), Breadth-First Search (BFS), Stable Marriage, Job Scheduling, Transitive Closure, Parallel Reduction, and 0-1 Knapsack. We compare LLP-FW against hand-tuned, custom solutions for these seven problems and show that it compares favorably in the majority of cases.

Federated Few-Shot Learning on Neuromorphic Hardware: An Empirical Study Across Physical Edge Nodes

2026-03-13T14:45:52Z

Federated learning on neuromorphic hardware remains unexplored because on-chip spike-timing-dependent plasticity (STDP) produces binary weight updates rather than the floating-point gradients assumed by standard algorithms. We build a two-node federated system with BrainChip Akida AKD1000 processors and run approximately 1,580 experimental trials across seven analysis phases. Of four weight-exchange strategies tested, neuron-level concatenation (FedUnion) consistently preserves accuracy while element-wise weight averaging (FedAvg) destroys it (p = 0.002). Domain-adaptive fine-tuning of the upstream feature extractor accounts for most of the accuracy gains, confirming feature quality as the dominant factor. Scaling feature dimensionality from 64 to 256 yields 77.0% best-strategy federated accuracy (n=30, p < 0.001). Two independent asymmetries (wider features help federation more than individual learning, while binarization hurts federation more) point to a shared prototype complementarity mechanism: cross-node transfer scales with the distinctiveness of neuron prototypes.