Feed ID: https://arxiv.org/api/j3fbpvejwANkYpayk8QmlbQJCw4
Feed updated: 2026-04-03T08:30:43Z

http://arxiv.org/abs/2602.15510v3
On the Geometric Coherence of Global Aggregation in Federated Graph Neural Networks
Updated: 2026-03-23T08:53:04Z
Federated learning over graph-structured data exposes a fundamental mismatch between standard aggregation mechanisms and the operator nature of graph neural networks (GNNs). While federated optimization treats model parameters as elements of a shared Euclidean space, GNN parameters induce graph-dependent message-passing operators whose semantics depend on underlying topology. Under structurally and distributionally heterogeneous client graph distributions, local updates correspond to perturbations of distinct operator manifolds. Linear aggregation of such updates mixes geometrically incompatible directions, producing global models that converge numerically yet exhibit degraded relational behavior. We formalize this phenomenon as a geometric failure of global aggregation in cross-domain federated GNNs, characterized by destructive interference between operator perturbations and loss of coherence in message-passing dynamics. This degradation is not captured by conventional metrics such as loss or accuracy, as models may retain predictive performance while losing structural sensitivity. To address this, we propose GGRS (Global Geometric Reference Structure), a server-side aggregation framework operating on a data-free proxy of operator perturbations.
GGRS enforces geometric admissibility via directional alignment, subspace compatibility, and sensitivity control, preserving the structure of the induced message-passing operator.
Published: 2026-02-17T11:34:04Z
Comments: This is a developing preprint of an 18-page journal manuscript (6 figures), currently being prepared for formal peer-review submission
Authors: Chethana Prasad Kabgere, Shylaja SS

http://arxiv.org/abs/2603.21692v1
Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces
Updated: 2026-03-23T08:27:54Z
As AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing infrastructure requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance; observability platforms provide execution traces for debugging; telemetry standards ensure interoperability. What current systems do not natively provide as a first-class, schema-level primitive is structured reasoning provenance -- normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains.
We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and outline an evaluation methodology informed by preliminary deployment on a production platformized root cause analysis agent.
Published: 2026-03-23T08:27:54Z
Comments: 8 pages, 2 tables, preprint
Authors: Neelmani Vispute

http://arxiv.org/abs/2602.22350v3
Engineered Simultaneity: The Physical Impossibility of Consolidated Price Discovery Across Spacelike-Separated Exchanges
Updated: 2026-03-23T07:24:51Z
We define \emph{engineered simultaneity}: the construction of a system that requires temporal comparison of events at spacelike-separated locations, implements this comparison via an implicit simultaneity convention, and represents the result as an objective measurement rather than a conventional choice. We show that the National Best Bid and Offer (NBBO) -- the regulatory cornerstone of U.S. equity markets -- is an instance of engineered simultaneity. The NBBO requires determining ``current'' prices across exchanges whose spatial separation places their price events outside each other's light cones. Special relativity proves that the temporal ordering of such events is frame-dependent: there exist inertial reference frames in which the NBBO differs from the value reported by the Securities Information Processor. The impossibility is not approximate; it is exact and unavoidable within the causal structure of Minkowski spacetime.
General relativity compounds the impossibility: gravitational time dilation introduces frame-rate discrepancies between exchanges at different altitudes, and recent work on indefinite causal order in quantum information theory undermines the premise of a fixed causal structure altogether. We formalize the special-relativistic argument using the causal precedence relation, connect it to Lamport's theorem on distributed ordering, and note that approximately \$5~billion per year in latency arbitrage profits are extracted from the gap between the NBBO's implicit simultaneity convention and physical reality.
Published: 2026-02-25T19:14:52Z
Comments: 9 pages, 2 figures, 2 tables
Authors: Paul Borrill

http://arxiv.org/abs/2504.14145v2
DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline
Updated: 2026-03-23T07:10:52Z
Large multimodal models (LMMs) have demonstrated excellent capabilities in both understanding and generation tasks with various modalities. While these models can accept flexible combinations of input data, their training efficiency suffers from two major issues: pipeline stage imbalance caused by heterogeneous model architectures, and training data dynamicity stemming from the diversity of multimodal data.
In this paper, we present DIP, a dynamic and modality-aware pipeline scheduling framework designed for LMM training. DIP tackles the challenge of dynamic imbalance via two key techniques: (1) separating computations of different modalities into dedicated pipeline segments to balance workloads within a continuous set of stages; (2) dynamically splitting input data into finer-grained, modality-specific sub-microbatches to balance workloads across these segments. By asynchronously generating pipeline schedules on idle CPU resources during training, DIP dynamically tailors stage executions to each input batch without stalling the training process. We validate DIP on a diverse set of five LMMs, ranging from 12B to 94B parameters and including vision-language and diffusion models. Experimental results show that our system achieves up to 97.3% higher throughput compared to state-of-the-art systems, demonstrating strong adaptability to fluctuating multimodal training workloads.
Published: 2025-04-19T02:30:11Z
Comments: To be published in ASPLOS'26
Authors: Zhenliang Xue, Hanpeng Hu, Xing Chen, Yimin Jiang, Yixin Song, Zeyu Mi, Yibo Zhu, Daxin Jiang, Yubin Xia, Haibo Chen
DOI: 10.1145/3779212.3790154

http://arxiv.org/abs/2603.21600v1
Benchmarking Message Brokers for IoT Edge Computing: A Comprehensive Performance Study
Updated: 2026-03-23T05:49:19Z
Asynchronous messaging is a cornerstone of modern distributed systems, enabling decoupled communication for scalable and resilient applications. Today's message queue (MQ) ecosystem spans a wide range of designs, from high-throughput streaming platforms to lightweight protocols tailored for edge and IoT environments. Despite this diversity, choosing an appropriate MQ system remains difficult. Existing evaluations largely focus on throughput and latency on fixed hardware, while overlooking CPU and memory footprint and the effects of resource constraints, factors that are critical for edge and IoT deployments.
In this paper, we present a systematic performance study of eight prominent message brokers: Mosquitto, EMQX, HiveMQ, RabbitMQ, ActiveMQ Artemis, NATS Server, Redis (Pub/Sub), and Zenoh Router. We introduce mq-bench, a unified benchmarking framework to evaluate these systems under identical conditions, scaling up to 10,000 concurrent client pairs across three VM configurations representative of edge hardware. This study reveals several interesting and sometimes counter-intuitive insights. Lightweight native brokers achieve sub-millisecond latency, while feature-rich enterprise platforms incur 2-3X higher overhead. Under high connection loads, multi-threaded brokers like NATS and Zenoh scale efficiently, whereas the widely-deployed Mosquitto saturates earlier due to its single-threaded architecture. We also find that Java-based brokers consume significantly more memory than native implementations, which has important implications for memory-constrained edge deployments. Based on these findings, we provide practical deployment guidelines that map workload requirements and resource constraints to appropriate broker choices for telemetry, streaming analytics, and IoT use cases.
Published: 2026-03-23T05:49:19Z
Comments: Accepted at IEEE/ACM CCGrid 2026
Authors: Tapajit Chandra Paul, Pawissanutt Lertpongrujikorn, Hai Duc Nguyen, Mohsen Amini Salehi

http://arxiv.org/abs/2602.23036v2
LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
Updated: 2026-03-23T05:25:20Z
Large language model (LLM) serving infrastructures are undergoing a shift toward heterogeneity and disaggregation. Modern deployments increasingly integrate diverse accelerators and near-memory processing technologies, introducing significant hardware heterogeneity, while system software increasingly separates computation, memory, and model components across distributed resources to improve scalability and efficiency.
As a result, LLM serving performance is no longer determined by hardware or software choices in isolation, but by their runtime interaction through scheduling, data movement, and interconnect behavior. However, understanding these interactions remains challenging, as existing simulators lack the ability to jointly model heterogeneous hardware and disaggregated serving techniques within a unified, runtime-driven framework.
This paper presents LLMServingSim 2.0, a unified system-level simulator designed to make runtime-driven hardware-software interactions in heterogeneous and disaggregated LLM serving infrastructures explicit and analyzable. LLMServingSim 2.0 embeds serving decisions and hardware behavior into a single runtime loop, enabling interaction-aware modeling of batching, routing, offloading, memory, and power. The simulator supports extensible integration of emerging accelerators and memory systems through profile-based modeling, while capturing dynamic serving behavior and system-level effects. We validate LLMServingSim 2.0 against real deployments, showing that it reproduces key performance, memory, and power metrics with an average error of 0.95%, while maintaining simulation times of around 10 minutes even for complex configurations. These results demonstrate that LLMServingSim 2.0 provides a practical bridge between hardware innovation and serving-system design, enabling systematic exploration and co-design for next-generation LLM serving infrastructures.
Published: 2026-02-26T14:22:17Z
Comments: 14 pages, 11 figures
Authors: Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park

http://arxiv.org/abs/2508.00596v4
Information-Theoretic Decentralized Secure Aggregation with Passive Collusion Resilience
Updated: 2026-03-22T20:40:44Z
In decentralized federated learning (FL), multiple clients collaboratively learn a shared machine learning (ML) model by leveraging their privately held datasets distributed across the network, through interactive exchange of the intermediate model updates. To ensure data security, cryptographic techniques are commonly employed to protect model updates during aggregation. Despite growing interest in secure aggregation, existing works predominantly focus on protocol design and computational guarantees, with limited understanding of the fundamental information-theoretic limits of such systems.
Moreover, optimal bounds on communication and key usage remain unknown in decentralized settings, where no central aggregator is available. Motivated by these gaps, we study the problem of decentralized secure aggregation (DSA) from an information-theoretic perspective. Specifically, we consider a network of $K$ fully-connected users, each holding a private input -- an abstraction of local training data -- who aim to securely compute the sum of all inputs. The security constraint requires that no user learns anything beyond the input sum, even when colluding with up to $T$ other users. We characterize the optimal rate region, which specifies the minimum achievable communication and secret key rates for DSA. In particular, we show that to securely compute one symbol of the desired input sum, each user must (i) transmit at least one symbol to others, (ii) hold at least one symbol of secret key, and (iii) all users must collectively hold no fewer than $K - 1$ independent key symbols. Our results establish the fundamental performance limits of DSA, providing insights for the design of provably secure and communication-efficient protocols in decentralized learning.
Published: 2025-08-01T12:51:37Z
Comments: Accepted by IEEE JSAC
Authors: Xiang Zhang, Zhou Li, Shuangyang Li, Kai Wan, Derrick Wing Kwan Ng, Giuseppe Caire

http://arxiv.org/abs/2603.21354v1
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
Updated: 2026-03-22T18:30:11Z
Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3)
agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the routing policy, which depends on the workload mix, shifting as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements, tiered by maturity from engineering-ready to open research.
Published: 2026-03-22T18:30:11Z
Comments: Vision Paper
Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

http://arxiv.org/abs/2603.21340v1
ARYA: A Physics-Constrained Composable & Deterministic World Model Architecture
Updated: 2026-03-22T17:46:04Z
This paper presents ARYA, a composable, physics-constrained, deterministic world model architecture built on five foundational principles: nano models, composability, causal reasoning, determinism, and architectural AI safety.
We demonstrate that ARYA satisfies all canonical world model requirements, including state representation, dynamic prediction, causal and physical awareness, temporal consistency, generalization, learnability, and planning and control. Unlike monolithic foundation models, the ARYA foundation model implements these capabilities through a hierarchical system-of-system-of-systems of specialized nano models, orchestrated by AARA (ARYA Autonomous Research Agent), an always-on cognitive daemon that executes a continuous sense-decide-act-learn loop. The nano model architecture provides linear scaling, sparse activation, selective untraining, and sub-20-second training cycles, resolving the traditional tension between capability and computational efficiency. A central contribution is the Unfireable Safety Kernel: an architecturally immutable safety boundary that cannot be disabled or circumvented by any system component, including its own self-improvement engine. This is not a social or ethical alignment statement; it is a technical framework ensuring human control persists as autonomy increases. Safety is an architectural constraint governing every operation, not a policy layer applied after the fact. We present formal alignment between ARYA's architecture and canonical world model requirements, and report summarizing its state-of-the-art performance across 6 of 9 competitive benchmarks head-to-head with GPT-5.2, Opus 4.6, and V-JEPA-2. 
All with zero neural network parameters, across seven active industry domain nodes spanning aerospace, pharma manufacturing, oil and gas, smart cities, biotech, defense, and medical devices.
Published: 2026-03-22T17:46:04Z
Authors: Seth Dobrin, Lukasz Chmiel

http://arxiv.org/abs/2603.28793v1
Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor Analysis of Hardware-Invariant Computational Primitives in Parallel Processors
Updated: 2026-03-22T16:19:12Z
We present the first systematic cross-vendor analysis of GPU instruction set architectures spanning all four major GPU vendors: NVIDIA (PTX ISA v1.0 through v9.2, Fermi through Blackwell), AMD (RDNA 1 to 4 and CDNA 1 to 4), Intel (Gen11, Xe-LP, Xe-HPG, Xe-HPC), and Apple (G13, reverse-engineered). Drawing on official ISA reference manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts totaling over 5,000 pages of primary sources across 16 distinct microarchitectures, we identify ten hardware-invariant computational primitives that appear across all four architectures, six parameterizable dialects where vendors implement identical concepts with different parameters, and six true architectural divergences representing fundamental design disagreements. Based on this analysis, we propose an abstract execution model for a vendor-neutral GPU ISA grounded in the physical constraints of parallel computation. We validate our model with benchmark results on NVIDIA T4 and Apple M1 hardware, the two most architecturally distant platforms in our study. On five of six benchmark-platform pairs, the abstract model matches or exceeds native vendor-optimized performance.
The single outlier (parallel reduction on NVIDIA, 62.5% of native) reveals that intra-wave shuffle must be a mandatory primitive, a finding that refines our proposed model.
Published: 2026-03-22T16:19:12Z
Comments: 7 pages, 3 figures, 5 tables, 26 references
Authors: Ojima Abraham, Onyinye Okoli

http://arxiv.org/abs/2603.21257v1
CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
Updated: 2026-03-22T14:27:47Z
Distributed prefix caching has become a core technique for efficient LLM serving. However, for long-context requests with high cache hit ratios, retrieving reusable KVCache blocks from remote servers has emerged as a new performance bottleneck. Such network-intensive LLM inference is expected to become increasingly common as agentic AI workloads continue to grow. However, existing LLM inference engines remain largely compute-centric: they treat KVCache loading as a subordinate phase to GPU execution and often fail to account for its delay explicitly during scheduling.
We present CALVO, an LLM serving engine that treats KVCache loading as a first-class concern. CALVO decouples KVCache loading and GPU computation into independently managed, asynchronously progressing stages, enabling better utilization of network, PCIe, and computation resources. In addition, CALVO incorporates KVCache loading delay as an explicit component of per-request service cost, leading to more accurate scheduling decisions. Experiments on a real testbed with diverse long-context workloads show that CALVO substantially improves the efficiency of network-intensive LLM inference, achieving up to 61.67% higher SLO attainment than the baseline.
Published: 2026-03-22T14:27:47Z
Comments: 8 pages, 11 figures
Authors: Weiye Wang (Shanghai Jiao Tong University), Chen Chen (Shanghai Jiao Tong University), Junxue Zhang (University of Science and Technology of China), Zhusheng Wang (Huawei), Hui Yuan (Huawei), Zixuan Guan (Huawei), Xiaolong Zheng (Huawei), Qizhen Weng (Institute of Artificial Intelligence), Yin Chen (Institute of Artificial Intelligence), Minyi Guo (Shanghai Jiao Tong University)

http://arxiv.org/abs/2507.19667v2
Quantifying the Performance Gap for Simple Versus Optimal Dynamic Server Allocation Policies
Updated: 2026-03-22T13:45:41Z
Cloud computing enables the dynamic provisioning of server resources. To exploit this opportunity, a policy is needed for dynamically allocating (and deallocating) servers in response to the current load conditions. In this paper we describe several simple policies for dynamic server allocation and develop analytic models for their analysis. We also design semi-Markov decision models that enable determination of the performance achieved with optimal policies, allowing us to quantify the performance gap between simple, easily implemented policies, and optimal policies. Finally, we apply our models to study the potential performance benefits of state-dependent routing in multi-site systems when using dynamic server allocation at each site.
Insights from our results are valuable to service providers wanting to balance cloud service costs and delays.
Published: 2025-07-25T20:45:25Z
Comments: Accepted to IEEE Transactions on Cloud Computing (TCC); 15 + 7 = 22 pages
Journal reference: IEEE Transactions on Cloud Computing (TCC), 2026
Authors: Niklas Carlsson, Derek Eager
DOI: 10.1109/TCC.2026.3672446

http://arxiv.org/abs/2603.28792v1
Parallel Gauss-Jordan Elimination and System Reduction for Efficient Circuit Simulation
Updated: 2026-03-22T10:22:08Z
For the purposes of electric circuit simulation, we consider an iterative simulation model based on solving systems of linear equations by Gauss-Jordan elimination (GJE) for individual moments in time. To accelerate the simulation, we propose two independent novel approaches: a parallel GJE algorithm and partial system reduction prior to the start of iterations. The former is based on a well-known strategy applied for the first time in this context, whereas the latter, to the best of our knowledge, proposes an entirely new system reduction approach. To evaluate performance, we implement these algorithms in C++ using OpenMP and run them on various input matrices. Our analyses of the individual methods show improved performance, whilst combining them maintains parallel efficiency after partial reduction on medium-sized matrices and even improves efficiency on the largest matrices on the tested machine.
Published: 2026-03-22T10:22:08Z
Comments: 19 pages, 1 figure, 6 tables
Authors: Filip Noveski, Elena Hadzieva

http://arxiv.org/abs/2603.21145v1
NeSy-Edge: Neuro-Symbolic Trustworthy Self-Healing in the Computing Continuum
Updated: 2026-03-22T09:42:13Z
The computational demands of modern AI services are increasingly shifting execution beyond centralized clouds toward a computing continuum spanning edge and end devices. However, the scale, heterogeneity, and cross-layer dependencies of these environments make resilience difficult to maintain.
Existing fault-management methods are often too static, fragmented, or heavy to support timely self-healing, especially under noisy logs and edge resource constraints. To address these limitations, this paper presents NeSy-Edge, a neuro-symbolic framework for trustworthy self-healing in the computing continuum. The framework follows an edge-first design, where a resource-constrained edge node performs local perception and reasoning, while a cloud model is invoked only at the final diagnosis stage. Specifically, NeSy-Edge converts raw runtime logs into structured event representations, builds a prior-constrained sparse symbolic causal graph, and integrates causal evidence with historical troubleshooting knowledge for root-cause analysis and recovery recommendation. We evaluate our work on representative Loghub datasets under multiple levels of semantic noise, considering parsing quality, causal reasoning, end-to-end diagnosis, and edge-side resource usage. The results show that NeSy-Edge remains robust even at the highest noise level, achieving up to 75% root-cause analysis accuracy and 65% end-to-end accuracy while operating within about 1500 MB of local memory.
Published: 2026-03-22T09:42:13Z
Authors: Peihan Ye, Alfreds Lapkovskis, Alaa Saleh, Qiyang Zhang, Praveen Kumar Donta

http://arxiv.org/abs/2603.20966v1
Communication Lower Bounds and Algorithms for Sketching with Random Dense Matrices
Updated: 2026-03-21T22:33:09Z
Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large matrices often necessitates distributed memory algorithms, where communication overhead becomes a critical bottleneck on modern supercomputing clusters. Despite its growing relevance, distributed-memory parallel strategies for sketching remain largely unexplored.
In this work, we establish communication lower bounds for sketching using dense matrices that determine how much data movement is required to perform it in parallel. One important observation of our lower bounds is that no communication is required for a small number of processors. We show that our lower bounds are tight by presenting communication optimal algorithms. Furthermore, we extend our approach to determine communication lower bounds for computations of Nyström approximation where sketching is applied twice. We also introduce novel parallel algorithms whose communication costs are close to the lower bounds. Finally, we implement our algorithms on modern state-of-the-art supercomputing infrastructures which have both CPU- and GPU-equipped systems and demonstrate their parallel scalability.
Published: 2026-03-21T22:33:09Z
Authors: Hussam Al Daas, Grey Ballard, Laura Grigori, Md Taufique Hussain, Suraj Kumar, Mohammad Marufur Rahman, Kathryn Rouse
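
For readers unfamiliar with the operation discussed in the last entry above, here is a minimal sketch of dense random sketching, Y = A Ω, with Ω a dense Gaussian matrix. It is an illustration only and not the paper's algorithm. The function names, the matrix sizes, and the row-blocked layout are all assumptions made for this example. In the row-blocked variant, each hypothetical worker rebuilds Ω from a shared seed, so no entries of Ω are exchanged between workers.

```python
import random


def gaussian_matrix(rows, cols, seed):
    """Dense matrix of i.i.d. standard normal entries, reproducible from a seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def sketch(a, sketch_size, seed):
    """Serial sketch Y = A @ Omega, with Omega of shape (n_cols, sketch_size)."""
    omega = gaussian_matrix(len(a[0]), sketch_size, seed)
    return matmul(a, omega)


def sketch_row_blocked(a, sketch_size, seed, workers):
    """Hypothetical layout: each worker owns a contiguous block of rows of A."""
    n = len(a)
    bounds = [n * w // workers for w in range(workers + 1)]
    out = []
    for w in range(workers):
        local_rows = a[bounds[w]:bounds[w + 1]]
        # Omega is regenerated locally from the shared seed, not received.
        omega = gaussian_matrix(len(a[0]), sketch_size, seed)
        out.extend(matmul(local_rows, omega))
    return out


# Illustrative sizes only: a 12 x 8 input sketched down to 3 columns.
a = gaussian_matrix(12, 8, seed=1)
y_serial = sketch(a, sketch_size=3, seed=42)
y_blocked = sketch_row_blocked(a, sketch_size=3, seed=42, workers=4)
```

Each row of Y depends on only one row of A, so the row-blocked result matches the serial one entry for entry. The trade-off in this layout is that every worker repeats the generation of Ω instead of receiving it.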