https://arxiv.org/api/naQJXrQH6vAIqkkWgqRjkRbBRy4
Updated: 2026-04-07T11:27:28Z
2791322515

http://arxiv.org/abs/2603.21354v1
Title: The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
Updated: 2026-03-22T18:30:11Z
Abstract: Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the routing policy, which depends on the workload mix, shifting as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology).
We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements, tiered by maturity from engineering-ready to open research.
Published: 2026-03-22T18:30:11Z
Comment: Vision Paper
Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

http://arxiv.org/abs/2603.21340v1
Title: ARYA: A Physics-Constrained Composable & Deterministic World Model Architecture
Updated: 2026-03-22T17:46:04Z
Abstract: This paper presents ARYA, a composable, physics-constrained, deterministic world model architecture built on five foundational principles: nano models, composability, causal reasoning, determinism, and architectural AI safety. We demonstrate that ARYA satisfies all canonical world model requirements, including state representation, dynamic prediction, causal and physical awareness, temporal consistency, generalization, learnability, and planning and control. Unlike monolithic foundation models, the ARYA foundation model implements these capabilities through a hierarchical system-of-system-of-systems of specialized nano models, orchestrated by AARA (ARYA Autonomous Research Agent), an always-on cognitive daemon that executes a continuous sense-decide-act-learn loop. The nano model architecture provides linear scaling, sparse activation, selective untraining, and sub-20-second training cycles, resolving the traditional tension between capability and computational efficiency. A central contribution is the Unfireable Safety Kernel: an architecturally immutable safety boundary that cannot be disabled or circumvented by any system component, including its own self-improvement engine. This is not a social or ethical alignment statement; it is a technical framework ensuring human control persists as autonomy increases. Safety is an architectural constraint governing every operation, not a policy layer applied after the fact.
We present formal alignment between ARYA's architecture and canonical world model requirements, and report summarizing its state-of-the-art performance across 6 of 9 competitive benchmarks head-to-head with GPT-5.2, Opus 4.6, and V-JEPA-2. All with zero neural network parameters, across seven active industry domain nodes spanning aerospace, pharma manufacturing, oil and gas, smart cities, biotech, defense, and medical devices.
Published: 2026-03-22T17:46:04Z
Authors: Seth Dobrin, Lukasz Chmiel

http://arxiv.org/abs/2603.28793v1
Title: Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor Analysis of Hardware-Invariant Computational Primitives in Parallel Processors
Updated: 2026-03-22T16:19:12Z
Abstract: We present the first systematic cross-vendor analysis of GPU instruction set architectures spanning all four major GPU vendors: NVIDIA (PTX ISA v1.0 through v9.2, Fermi through Blackwell), AMD (RDNA 1 to 4 and CDNA 1 to 4), Intel (Gen11, Xe-LP, Xe-HPG, Xe-HPC), and Apple (G13, reverse-engineered). Drawing on official ISA reference manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts totaling over 5,000 pages of primary sources across 16 distinct microarchitectures, we identify ten hardware-invariant computational primitives that appear across all four architectures, six parameterizable dialects where vendors implement identical concepts with different parameters, and six true architectural divergences representing fundamental design disagreements. Based on this analysis, we propose an abstract execution model for a vendor-neutral GPU ISA grounded in the physical constraints of parallel computation. We validate our model with benchmark results on NVIDIA T4 and Apple M1 hardware, the two most architecturally distant platforms in our study. On five of six benchmark-platform pairs, the abstract model matches or exceeds native vendor-optimized performance.
The single outlier (parallel reduction on NVIDIA, 62.5% of native) reveals that intra-wave shuffle must be a mandatory primitive, a finding that refines our proposed model.
Published: 2026-03-22T16:19:12Z
Comment: 7 pages, 3 figures, 5 tables, 26 references
Authors: Ojima Abraham, Onyinye Okoli

http://arxiv.org/abs/2603.21257v1
Title: CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
Updated: 2026-03-22T14:27:47Z
Abstract: Distributed prefix caching has become a core technique for efficient LLM serving. However, for long-context requests with high cache hit ratios, retrieving reusable KVCache blocks from remote servers has emerged as a new performance bottleneck. Such network-intensive LLM inference is expected to become increasingly common as agentic AI workloads continue to grow. However, existing LLM inference engines remain largely compute-centric: they treat KVCache loading as a subordinate phase to GPU execution and often fail to account for its delay explicitly during scheduling.
We present CALVO, an LLM serving engine that treats KVCache loading as a first-class concern. CALVO decouples KVCache loading and GPU computation into independently managed, asynchronously progressing stages, enabling better utilization of network, PCIe, and computation resources. In addition, CALVO incorporates KVCache loading delay as an explicit component of per-request service cost, leading to more accurate scheduling decisions. Experiments on a real testbed with diverse long-context workloads show that CALVO substantially improves the efficiency of network-intensive LLM inference, achieving up to 61.67% higher SLO attainment than the baseline.
Published: 2026-03-22T14:27:47Z
Comment: 8 pages, 11 figures
Authors: Weiye Wang (Shanghai Jiao Tong University), Chen Chen (Shanghai Jiao Tong University), Junxue Zhang (University of Science and Technology of China), Zhusheng Wang (Huawei), Hui Yuan (Huawei), Zixuan Guan (Huawei), Xiaolong Zheng (Huawei), Qizhen Weng (Institute of Artificial Intelligence), Yin Chen (Institute of Artificial Intelligence), Minyi Guo (Shanghai Jiao Tong University)

http://arxiv.org/abs/2507.19667v2
Title: Quantifying the Performance Gap for Simple Versus Optimal Dynamic Server Allocation Policies
Updated: 2026-03-22T13:45:41Z
Abstract: Cloud computing enables the dynamic provisioning of server resources. To exploit this opportunity, a policy is needed for dynamically allocating (and deallocating) servers in response to the current load conditions. In this paper we describe several simple policies for dynamic server allocation and develop analytic models for their analysis. We also design semi-Markov decision models that enable determination of the performance achieved with optimal policies, allowing us to quantify the performance gap between simple, easily implemented policies, and optimal policies. Finally, we apply our models to study the potential performance benefits of state-dependent routing in multi-site systems when using dynamic server allocation at each site.
Insights from our results are valuable to service providers wanting to balance cloud service costs and delays.
Published: 2025-07-25T20:45:25Z
Comment: Accepted to IEEE Transactions on Cloud Computing (TCC); 15 + 7 = 22 pages
Journal reference: IEEE Transactions on Cloud Computing (TCC), 2026
Authors: Niklas Carlsson, Derek Eager
DOI: 10.1109/TCC.2026.3672446

http://arxiv.org/abs/2603.28792v1
Title: Parallel Gauss-Jordan Elimination and System Reduction for Efficient Circuit Simulation
Updated: 2026-03-22T10:22:08Z
Abstract: For the purposes of electric circuit simulation, we consider an iterative simulation model based on solving systems of linear equations by Gauss-Jordan elimination (GJE) for individual moments in time. To accelerate the simulation, we propose two independent novel approaches: a parallel GJE algorithm and partial system reduction prior to the start of iterations. The former is based on a well-known strategy applied for the first time in this context, whereas the latter, to the best of our knowledge, proposes an entirely new system reduction approach. To evaluate performance, we implement these algorithms in C++ using OpenMP and run them on various input matrices. Our analyses of the individual methods show improved performance, whilst combining them maintains parallel efficiency after partial reduction on medium-sized matrices and even improves efficiency on the largest matrices on the tested machine.
Published: 2026-03-22T10:22:08Z
Comment: 19 pages, 1 figure, 6 tables
Authors: Filip Noveski, Elena Hadzieva

http://arxiv.org/abs/2603.21145v1
Title: NeSy-Edge: Neuro-Symbolic Trustworthy Self-Healing in the Computing Continuum
Updated: 2026-03-22T09:42:13Z
Abstract: The computational demands of modern AI services are increasingly shifting execution beyond centralized clouds toward a computing continuum spanning edge and end devices. However, the scale, heterogeneity, and cross-layer dependencies of these environments make resilience difficult to maintain.
Existing fault-management methods are often too static, fragmented, or heavy to support timely self-healing, especially under noisy logs and edge resource constraints. To address these limitations, this paper presents NeSy-Edge, a neuro-symbolic framework for trustworthy self-healing in the computing continuum. The framework follows an edge-first design, where a resource-constrained edge node performs local perception and reasoning, while a cloud model is invoked only at the final diagnosis stage. Specifically, NeSy-Edge converts raw runtime logs into structured event representations, builds a prior-constrained sparse symbolic causal graph, and integrates causal evidence with historical troubleshooting knowledge for root-cause analysis and recovery recommendation. We evaluate our work on representative Loghub datasets under multiple levels of semantic noise, considering parsing quality, causal reasoning, end-to-end diagnosis, and edge-side resource usage. The results show that NeSy-Edge remains robust even at the highest noise level, achieving up to 75% root-cause analysis accuracy and 65% end-to-end accuracy while operating within about 1500 MB of local memory.
Published: 2026-03-22T09:42:13Z
Authors: Peihan Ye, Alfreds Lapkovskis, Alaa Saleh, Qiyang Zhang, Praveen Kumar Donta

http://arxiv.org/abs/2603.20966v1
Title: Communication Lower Bounds and Algorithms for Sketching with Random Dense Matrices
Updated: 2026-03-21T22:33:09Z
Abstract: Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large matrices often necessitates distributed memory algorithms, where communication overhead becomes a critical bottleneck on modern supercomputing clusters. Despite its growing relevance, distributed-memory parallel strategies for sketching remain largely unexplored.
In this work, we establish communication lower bounds for sketching using dense matrices that determine how much data movement is required to perform it in parallel. One important observation of our lower bounds is that no communication is required for a small number of processors. We show that our lower bounds are tight by presenting communication optimal algorithms. Furthermore, we extend our approach to determine communication lower bounds for computations of Nyström approximation where sketching is applied twice. We also introduce novel parallel algorithms whose communication costs are close to the lower bounds. Finally, we implement our algorithms on modern state-of-the-art supercomputing infrastructures which have both CPU- and GPU-equipped systems and demonstrate their parallel scalability.
Published: 2026-03-21T22:33:09Z
Authors: Hussam Al Daas, Grey Ballard, Laura Grigori, Md Taufique Hussain, Suraj Kumar, Mohammad Marufur Rahman, Kathryn Rouse

http://arxiv.org/abs/2603.20941v1
Title: Adviser: An Intuitive Multi-Cloud Platform for Scientific and ML Workflows
Updated: 2026-03-21T20:44:54Z
Abstract: Effectively leveraging the vast computational resources of modern cloud environments requires expertise spanning multiple technical domains: configuring scientific software with correct parameters and dependencies, navigating thousands of provider-specific instance types and pricing options, and managing parallel or distributed execution. We conduct a study indicating that the absence of these categories of expertise poses an ongoing challenge to unlocking the potential of cloud-enabled computational science. To address this challenge, we introduce Adviser, an intuitive multi-cloud platform centered on a workflow abstraction. Workflows are reusable, expert-crafted artifacts encapsulating environment setup, data processing, simulation, result capture, and visualization steps needed to execute scientific and ML applications.
This approach allows users to specify high-level intent, while Adviser handles resource provisioning, runtime configuration, and data movement. Using two computational glaciology codes, Icepack and PISM, we show how to use Adviser to gain scientific insight and perform rapid exploration of cost-performance tradeoffs and scaling behavior without specialized expertise in cloud or high-performance computing.
Published: 2026-03-21T20:44:54Z
Comment: 13 pages, 6 figures, 2 tables
Authors: Shihan Cheng, Michael A. Laurenzano, Brian Strauch, Timothy A. Ellis, Krish Wadhwani, David A. B. Hyde

http://arxiv.org/abs/2412.07971v2
Title: Effectiveness of Distributed Gradient Descent with Local Steps for Overparameterized Models
Updated: 2026-03-21T17:55:46Z
Abstract: In distributed training of machine learning models, gradient descent with local iterative steps, commonly known as Local (Stochastic) Gradient Descent (Local-(S)GD) or Federated averaging (FedAvg), is a very popular method to mitigate communication burden. In this method, gradient steps based on local datasets are taken independently in distributed compute nodes to update the local models, which are then aggregated intermittently. In the interpolation regime, Local-GD can converge to zero training loss. However, with many potential solutions corresponding to zero training loss, it is not known which solution Local-GD converges to. In this work we answer this question by analyzing implicit bias of Local-GD for classification tasks with linearly separable data. For the interpolation regime, our analysis shows that the aggregated global model obtained from Local-GD, with arbitrary number of local steps, converges exactly to the model that would be obtained if all data were in one place (centralized model) ''in direction''. Our result gives the exact rate of convergence to the centralized model with respect to the number of local steps. We also obtain the same implicit bias with a learning rate independent of number of local steps with a modified version of the Local-GD algorithm.
Our analysis provides a new view to understand why Local-GD can still perform well with a very large number of local steps even for heterogeneous data. Lastly, we also discuss the extension of our results to Local-SGD and non-separable data.
Published: 2024-12-10T23:19:40Z
Authors: Heng Zhu, Harsh Vardhan, Arya Mazumdar

http://arxiv.org/abs/2603.20831v1
Title: Error-resilient Distributed Local Verification
Updated: 2026-03-21T14:25:35Z
Abstract: We study verification (decision) problems for graph properties in distributed networks under the locally checkable labeling framework, where nodes use labels (proofs) and local neighborhoods to decide acceptance or rejection.
Our focus is twofold. First, we study cycle detection. While it is known that this can be verified using 3 labels with access to the 1-hop neighborhood, we introduce a novel gadget that encodes direction along a path using only 2 labels and access to a 3-hop neighborhood. This yields a cycle-detection labeling scheme with just 2 labels and may be of independent interest.
Second, we consider adversarially corrupted labelings, where each node has access to a local neighborhood within which a fraction of nodes may receive erroneous labels. We introduce a general algorithmic framework, called refix, that transforms a base verification algorithm for a property P operating on labels within a d-hop neighborhood into one that tolerates up to i erroneous labels within a radius d+2i, by accessing a d+2i-hop neighborhood. We demonstrate applications to cycle detection, cycle absence, and bipartiteness, and provide lower bounds relating the number of errors to the required neighborhood size.
Published: 2026-03-21T14:25:35Z
Authors: Paweł Garncarek, Tomasz Jurdzinski, Dariusz Kowalski, Subhajit Pramanick

http://arxiv.org/abs/2603.20821v1
Title: Compass: Optimizing Compound AI Workflows for Dynamic Adaptation
Updated: 2026-03-21T13:40:48Z
Abstract: Compound AI is a distributed intelligence approach that represents a unified system orchestrating specialized AI/ML models with engineered software components into AI workflows. Compound AI production deployments must satisfy accuracy, latency, and cost objectives under varying loads. However, many deployments operate on fixed infrastructure where horizontal scaling is not viable. Existing approaches optimize solely for accuracy and do not consider changes in workload conditions. We observe that compound AI systems can switch between configurations to fit infrastructure capacity, trading accuracy for latency based on current load. This requires discovering multiple Pareto-optimal configurations from a combinatorial search space and determining when to switch between them at runtime. We present Compass, a novel framework that enables dynamic configuration switching through offline optimization and online adaptation. Compass consists of three components: COMPASS-V algorithm for configuration discovery, Planner for switching policy derivation, and Elastico Controller for runtime adaptation.
COMPASS-V discovers accuracy-feasible configurations using finite-difference guided search and a combination of hill-climbing and lateral expansion. Planner profiles these configurations on target hardware and derives switching policies using a queuing theory based model. Elastico monitors queue depth and switches configurations based on derived thresholds. Across two compound AI workflows, COMPASS-V achieves 100% recall while reducing configuration evaluations by 57.5% on average compared to exhaustive search, with efficiency gains reaching 95.3% at tight accuracy thresholds. Runtime adaptation achieves 90-98% SLO compliance under dynamic load patterns, improving SLO compliance by 71.6% over static high-accuracy baselines, while simultaneously improving accuracy by 3-5% over static fast baselines.
Published: 2026-03-21T13:40:48Z
Comment: 10 pages, 7 figures; accepted at the 26th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2026)
Authors: Milos Gravara, Juan Luis Herrera, Stefan Nastic

http://arxiv.org/abs/2509.09525v2
Title: TrEnv-X: Transparently Share Serverless Execution Environments Across Different Functions and Nodes
Updated: 2026-03-21T12:02:04Z
Abstract: Serverless computing is renowned for its computation elasticity, yet its full potential is often constrained by the requirement for functions to operate within local and dedicated background environments, resulting in limited memory elasticity. To address this limitation, this paper introduces TrEnv-X, a co-designed integration of the serverless platform with the operating system and CXL/RDMA-based remote memory pools. TrEnv-X's core innovations are repurposable sandboxes, which can be shared across different functions to decrease the associated creation overhead, and OS-level memory templates, which enable rapid state restoration from CXL/RDMA-based remote memory pools.
To further demonstrate TrEnv-X's versatility, we generalize its design from traditional containers for microVM-based agent workloads and introduce new optimizations, including browser sharing and a page cache bypassing mechanism. Our evaluation shows that TrEnv-X achieves up to 7x reduction in P99 latency and 48% memory savings for container-based functions. When applied to LLM agents, it reduces the P99 latency by up to 58% and memory usage by 61% compared to state-of-the-art systems like E2B.
Published: 2025-09-11T15:06:03Z
Comment: Accepted by ACM Transactions on Computer Systems (TOCS)
Authors: Jialiang Huang, Teng Ma, Zheng Liu, Sixing Lin, Kang Chen, Jinlei Jiang, Xia Liao, Yingdi Shan, Yongwei Wu, Ning Zhang, Mengting Lu, Tao Ma, Haifeng Gong, Mingxing Zhang

http://arxiv.org/abs/2603.28790v1
Title: Mitigating Temporal Blindness in Kubernetes Autoscaling: An Attention-Double-LSTM Framework
Updated: 2026-03-21T10:03:53Z
Abstract: In the emerging landscape of edge computing, the stochastic and bursty nature of serverless workloads presents a critical challenge for autonomous resource orchestration. Traditional reactive controllers, such as the Kubernetes Horizontal Pod Autoscaler (HPA), suffer from inherent reaction latency, leading to Service Level Objective (SLO) violations during traffic spikes and resource flapping during ramp-downs. While Deep Reinforcement Learning (DRL) offers a pathway toward proactive management, standard agents suffer from temporal blindness, an inability to effectively capture long-term dependencies in non-Markovian edge environments. To bridge this gap, we propose a novel stability-aware autoscaling framework unifying workload forecasting and control via an Attention-Enhanced Double-Stacked LSTM architecture integrated within a Proximal Policy Optimization (PPO) agent. Unlike shallow recurrent models, our approach employs a deep temporal attention mechanism to selectively weight historical states, effectively filtering high-frequency noise while retaining critical precursors of demand shifts.
We validate the framework on a heterogeneous cluster using real-world Azure Functions traces. Comparative analysis against industry-standard HPA, stateless Double DQN, and a single-layer LSTM ablation demonstrates that our approach reduces 90th percentile latency by approximately 29% while simultaneously decreasing replica churn by 39%, relative to the single-layer LSTM baseline. These results confirm that mitigating temporal blindness through deep attentive memory is a prerequisite for reliable, low-jitter autoscaling in production edge environments.
Published: 2026-03-21T10:03:53Z
Comment: Submitted for journal publication
Authors: Faraz Shaikh, Gianluca Reali, Mauro Femminella

http://arxiv.org/abs/2603.20735v1
Title: Optimality in Decentralized Optimization under Bandwidth Constraints
Updated: 2026-03-21T09:49:42Z
Abstract: We consider a realistic decentralized setup with bandwidth-constrained communication and derive optimal time complexities for non-convex stochastic parallel and asynchronous optimization (up to logarithmic factors). We develop the corresponding methods, Grace SGD and Leon SGD, for both homogeneous and heterogeneous settings. Unlike previous work, our optimal bounds are characterized in terms of min-cut/max-flow quantities and rely on tools from Gomory-Hu trees and Steiner Tree Packing problems, providing tighter and more practical complexities.
Published: 2026-03-21T09:49:42Z
Authors: Alexander Tyurin