https://arxiv.org/api/Z1+1W0GSxEWgIm1Ym9TsiQ/Xy68 2026-04-05T12:56:45Z 27874 180 15 http://arxiv.org/abs/2603.20941v1 Adviser: An Intuitive Multi-Cloud Platform for Scientific and ML Workflows 2026-03-21T20:44:54Z Effectively leveraging the vast computational resources of modern cloud environments requires expertise spanning multiple technical domains: configuring scientific software with correct parameters and dependencies, navigating thousands of provider-specific instance types and pricing options, and managing parallel or distributed execution. We conduct a study indicating that the absence of these categories of expertise poses an ongoing challenge to unlocking the potential of cloud-enabled computational science. To address this challenge, we introduce Adviser, an intuitive multi-cloud platform centered on a workflow abstraction. Workflows are reusable, expert-crafted artifacts encapsulating environment setup, data processing, simulation, result capture, and visualization steps needed to execute scientific and ML applications. This approach allows users to specify high-level intent, while Adviser handles resource provisioning, runtime configuration, and data movement. Using two computational glaciology codes, Icepack and PISM, we show how to use Adviser to gain scientific insight and perform rapid exploration of cost-performance tradeoffs and scaling behavior without specialized expertise in cloud or high-performance computing. 2026-03-21T20:44:54Z 13 pages, 6 figures, 2 tables Shihan Cheng Michael A. Laurenzano Brian Strauch Timothy A. Ellis Krish Wadhwani David A. B. 
Hyde http://arxiv.org/abs/2412.07971v2 Effectiveness of Distributed Gradient Descent with Local Steps for Overparameterized Models 2026-03-21T17:55:46Z In distributed training of machine learning models, gradient descent with local iterative steps, commonly known as Local (Stochastic) Gradient Descent (Local-(S)GD) or Federated averaging (FedAvg), is a very popular method to mitigate the communication burden. In this method, gradient steps based on local datasets are taken independently in distributed compute nodes to update the local models, which are then aggregated intermittently. In the interpolation regime, Local-GD can converge to zero training loss. However, with many potential solutions corresponding to zero training loss, it is not known which solution Local-GD converges to. In this work we answer this question by analyzing the implicit bias of Local-GD for classification tasks with linearly separable data. For the interpolation regime, our analysis shows that the aggregated global model obtained from Local-GD, with an arbitrary number of local steps, converges exactly to the model that would be obtained if all data were in one place (centralized model) "in direction". Our result gives the exact rate of convergence to the centralized model with respect to the number of local steps. We also obtain the same implicit bias with a learning rate independent of the number of local steps with a modified version of the Local-GD algorithm. Our analysis provides a new view to understand why Local-GD can still perform well with a very large number of local steps even for heterogeneous data. Lastly, we also discuss the extension of our results to Local-SGD and non-separable data. 
2024-12-10T23:19:40Z Heng Zhu Harsh Vardhan Arya Mazumdar http://arxiv.org/abs/2603.20831v1 Error-resilient Distributed Local Verification 2026-03-21T14:25:35Z We study verification (decision) problems for graph properties in distributed networks under the locally checkable labeling framework, where nodes use labels (proofs) and local neighborhoods to decide acceptance or rejection. Our focus is twofold. First, we study cycle detection. While it is known that this can be verified using 3 labels with access to the 1-hop neighborhood, we introduce a novel gadget that encodes direction along a path using only 2 labels and access to a 3-hop neighborhood. This yields a cycle-detection labeling scheme with just 2 labels and may be of independent interest. Second, we consider adversarially corrupted labelings, where each node has access to a local neighborhood within which a fraction of nodes may receive erroneous labels. We introduce a general algorithmic framework, called refix, that transforms a base verification algorithm for a property P operating on labels within a d-hop neighborhood into one that tolerates up to i erroneous labels within a radius d+2i, by accessing a d+2i-hop neighborhood. We demonstrate applications to cycle detection, cycle absence, and bipartiteness, and provide lower bounds relating the number of errors to the required neighborhood size. 2026-03-21T14:25:35Z Paweł Garncarek Tomasz Jurdzinski Dariusz Kowalski Subhajit Pramanick http://arxiv.org/abs/2603.20821v1 Compass: Optimizing Compound AI Workflows for Dynamic Adaptation 2026-03-21T13:40:48Z Compound AI is a distributed intelligence approach that represents a unified system orchestrating specialized AI/ML models with engineered software components into AI workflows. Compound AI production deployments must satisfy accuracy, latency, and cost objectives under varying loads. However, many deployments operate on fixed infrastructure where horizontal scaling is not viable. 
Existing approaches optimize solely for accuracy and do not consider changes in workload conditions. We observe that compound AI systems can switch between configurations to fit infrastructure capacity, trading accuracy for latency based on current load. This requires discovering multiple Pareto-optimal configurations from a combinatorial search space and determining when to switch between them at runtime. We present Compass, a novel framework that enables dynamic configuration switching through offline optimization and online adaptation. Compass consists of three components: COMPASS-V algorithm for configuration discovery, Planner for switching policy derivation, and Elastico Controller for runtime adaptation. COMPASS-V discovers accuracy-feasible configurations using finite-difference guided search and a combination of hill-climbing and lateral expansion. Planner profiles these configurations on target hardware and derives switching policies using a queuing theory based model. Elastico monitors queue depth and switches configurations based on derived thresholds. Across two compound AI workflows, COMPASS-V achieves 100% recall while reducing configuration evaluations by 57.5% on average compared to exhaustive search, with efficiency gains reaching 95.3% at tight accuracy thresholds. Runtime adaptation achieves 90-98% SLO compliance under dynamic load patterns, improving SLO compliance by 71.6% over static high-accuracy baselines, while simultaneously improving accuracy by 3-5% over static fast baselines. 
2026-03-21T13:40:48Z 10 pages, 7 figures; accepted at the 26th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2026) Milos Gravara Juan Luis Herrera Stefan Nastic http://arxiv.org/abs/2509.09525v2 TrEnv-X: Transparently Share Serverless Execution Environments Across Different Functions and Nodes 2026-03-21T12:02:04Z Serverless computing is renowned for its computation elasticity, yet its full potential is often constrained by the requirement for functions to operate within local and dedicated background environments, resulting in limited memory elasticity. To address this limitation, this paper introduces TrEnv-X, a co-designed integration of the serverless platform with the operating system and CXL/RDMA-based remote memory pools. TrEnv-X's core innovations are repurposable sandboxes, which can be shared across different functions to decrease the associated creation overhead, and OS-level memory templates, which enable rapid state restoration from CXL/RDMA-based remote memory pools. To further demonstrate TrEnv-X's versatility, we generalize its design from traditional containers to microVM-based agent workloads and introduce new optimizations, including browser sharing and a page cache bypassing mechanism. Our evaluation shows that TrEnv-X achieves up to 7x reduction in P99 latency and 48% memory savings for container-based functions. When applied to LLM agents, it reduces the P99 latency by up to 58% and memory usage by 61% compared to state-of-the-art systems like E2B. 
2025-09-11T15:06:03Z Accepted by ACM Transactions on Computer Systems (TOCS) Jialiang Huang Teng Ma Zheng Liu Sixing Lin Kang Chen Jinlei Jiang Xia Liao Yingdi Shan Yongwei Wu Ning Zhang Mengting Lu Tao Ma Haifeng Gong Mingxing Zhang http://arxiv.org/abs/2603.28790v1 Mitigating Temporal Blindness in Kubernetes Autoscaling: An Attention-Double-LSTM Framework 2026-03-21T10:03:53Z In the emerging landscape of edge computing, the stochastic and bursty nature of serverless workloads presents a critical challenge for autonomous resource orchestration. Traditional reactive controllers, such as the Kubernetes Horizontal Pod Autoscaler (HPA), suffer from inherent reaction latency, leading to Service Level Objective (SLO) violations during traffic spikes and resource flapping during ramp-downs. While Deep Reinforcement Learning (DRL) offers a pathway toward proactive management, standard agents suffer from temporal blindness, an inability to effectively capture long-term dependencies in non-Markovian edge environments. To bridge this gap, we propose a novel stability-aware autoscaling framework unifying workload forecasting and control via an Attention-Enhanced Double-Stacked LSTM architecture integrated within a Proximal Policy Optimization (PPO) agent. Unlike shallow recurrent models, our approach employs a deep temporal attention mechanism to selectively weight historical states, effectively filtering high-frequency noise while retaining critical precursors of demand shifts. We validate the framework on a heterogeneous cluster using real-world Azure Functions traces. Comparative analysis against industry-standard HPA, stateless Double DQN, and a single-layer LSTM ablation demonstrates that our approach reduces 90th percentile latency by approximately 29% while simultaneously decreasing replica churn by 39%, relative to the single-layer LSTM baseline. 
These results confirm that mitigating temporal blindness through deep attentive memory is a prerequisite for reliable, low-jitter autoscaling in production edge environments. 2026-03-21T10:03:53Z Submitted for journal publication Faraz Shaikh Gianluca Reali Mauro Femminella http://arxiv.org/abs/2603.20735v1 Optimality in Decentralized Optimization under Bandwidth Constraints 2026-03-21T09:49:42Z We consider a realistic decentralized setup with bandwidth-constrained communication and derive optimal time complexities for non-convex stochastic parallel and asynchronous optimization (up to logarithmic factors). We develop the corresponding methods, Grace SGD and Leon SGD, for both homogeneous and heterogeneous settings. Unlike previous work, our optimal bounds are characterized in terms of min-cut/max-flow quantities and rely on tools from Gomory-Hu trees and Steiner Tree Packing problems, providing tighter and more practical complexities. 2026-03-21T09:49:42Z Alexander Tyurin http://arxiv.org/abs/2603.20711v1 RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models 2026-03-21T08:16:10Z Vision-Language-Action (VLA) models are mainstream in embodied intelligence but face high inference costs. Edge-Cloud Collaborative (ECC) deployment offers an effective fix by easing edge-device computing pressure to meet real-time needs. However, existing ECC frameworks are suboptimal for VLA models due to two challenges: (1) Diverse model structures hinder optimal ECC segmentation point identification; (2) Even if the optimal split point is determined, changes in network bandwidth can cause performance drift. To address these issues, we propose a novel ECC deployment framework for various VLA models, termed RoboECC. Specifically, we propose a model-hardware co-aware segmentation strategy to help find the optimal segmentation point for various VLA models. 
Moreover, we propose a network-aware deployment adjustment approach that adapts to network fluctuations to maintain optimal performance. Experiments demonstrate that RoboECC achieves a speedup of up to 3.28x with only 2.55x~2.62x overhead. 2026-03-21T08:16:10Z This paper has been accepted by IJCNN 2026 Zihao Zheng Hangyu Cao Jiayu Chen Sicheng Tian Chenyue Li Maoliang Li Xinhao Sun Guojie Luo Xiang Chen http://arxiv.org/abs/2603.20622v1 Incremental GNN Embedding Computation on Streaming Graphs 2026-03-21T03:35:03Z Graph Neural Network (GNN) on streaming graphs has gained increasing popularity. However, its practical deployment remains challenging, as the inference process relies on Runtime Embedding Computation (RTEC) to capture recent graph changes. This process incurs heavyweight multi-hop graph traversal overhead, which significantly undermines computation efficiency. We observe that the intermediate results for large portions of the graph remain unchanged during graph evolution, and thus redundant computations can be effectively eliminated through carefully designed incremental methods. In this work, we propose an efficient framework for incrementalizing RTEC on streaming graphs. The key idea is to decouple GNN computation into a set of generalized, fine-grained operators and safely reorder them, transforming the expensive full-neighbor GNN computation into a more efficient form over the affected subgraph. With this design, our framework preserves the semantics and accuracy of the original full-neighbor computation while supporting a wide range of GNN models with complex message-passing patterns. To further scale to graphs with massive historical results, we develop a GPU-CPU co-processing system that offloads embeddings to CPU memory with communication-optimized scheduling. Experiments across diverse graph sizes and GNN models show that our method reduces computation by 64%-99% and achieves 1.7x-145.8x speedups over existing solutions. 
2026-03-21T03:35:03Z 14 pages; 12 figures; accepted for ICDE 2026 Qiange Wang Haoran Lv Yanfeng Zhang Weng-Fai Wong Bingsheng He http://arxiv.org/abs/2507.01113v2 Stannic: Systolic STochAstic ONliNe SchedulIng AcCelerator 2026-03-21T00:25:21Z Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to efficiently balance workloads due to scheduling overhead, lack of adaptability to stochastic workloads, and suboptimal resource utilization. The scheduling problem further compounds in the context of shared HPC clusters, where job arrivals and processing times are inherently stochastic. Prediction of these elements is possible, but it introduces additional overhead. To perform this complex scheduling, we developed two FPGA-assisted hardware accelerator microarchitectures, Hercules and Stannic. Hercules adopts a task-centric abstraction of stochastic scheduling, whereas Stannic inherits a schedule-centric abstraction. These hardware-assisted solutions leverage parallelism, pre-calculation, and spatial memory access to significantly accelerate scheduling. We accelerate a non-preemptive stochastic online scheduling algorithm to produce heterogeneity-aware schedules in near real time. With Hercules, we achieved a speedup of up to 1060x over a baseline C/C++ implementation, demonstrating the efficacy of a hardware-assisted acceleration for heterogeneity-aware stochastic scheduling. With Stannic, we further improved efficiency, achieving a 7.5x reduction in latency per computation iteration and a 14x increase in the target heterogeneous system size. Experimental results show that the resulting schedules demonstrate efficient machine utilization and low average job latency in stochastic contexts. 2025-07-01T18:18:00Z 30 pages, 18 figures, Conference version published in Int'l Conference on Computer Aided Design (ICCAD) 2025. 
Journal version (current version) is under revision with ACM TRETS Adam H. Ross Vairavan Palaniappan Debjit Pal 10.1109/ICCAD66269.2025.11240693 http://arxiv.org/abs/2603.20531v1 Epistemic Observability in Language Models 2026-03-20T21:59:34Z We find that models report highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence inversely correlates with accuracy, with AUC ranging from 0.28 to 0.36 where 0.5 is random guessing. We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model's output text, no monitoring system can reliably distinguish honest model outputs from plausible fabrications. We prove two results: first, that any policy conditioning only on the query cannot satisfy epistemic honesty across ambiguous world states; second, that no learning algorithm optimizing reward from a text-only supervisor can converge to honest behavior when the supervisor's observations are identical for both grounded and fabricated responses. Within our formal model, these impossibilities hold regardless of model scale or training procedure, including RLHF and instruction tuning. We construct a tensor interface that escapes the impossibility by exporting computational byproducts (per-token entropy and log-probability distributions) that are structurally coupled to correctness under standard training. Per-token entropy achieves pooled AUC 0.757, outperforming all text baselines by 2.5--3.9 percentage points at every budget level tested (10\%, 20\%, 30\%). The entropy signal generalizes across architectures (Spearman $ρ= 0.762$). 
The core contribution is a cost surface where the empirical mapping from verification budget (fraction of queries receiving expensive checks) to detection accuracy for each judge strategy is a practical lookup for system builders deciding how to allocate verification resources. The contribution is the map. The territory is the system you are building. 2026-03-20T21:59:34Z Tony Mason http://arxiv.org/abs/2603.20512v1 SkyHOST: A Unified Architecture for Cross-Cloud Hybrid Object and Stream Transfer 2026-03-20T21:26:53Z Cloud and big data workloads are increasingly distributing data across multiple cloud providers and regions for rapid decision-making and analytics. Traditional transfer tools are typically specialized for a single paradigm, either stream replication or bulk transfer. This specialization forces users to deploy and manage separate systems with different configurations for each transfer pattern. This paper presents SkyHOST (Hybrid Object and Stream Transfer), a unified data movement architecture built upon the Skyplane framework to bridge the gap between bulk object transfer and streaming workloads through a single control plane and CLI. SkyHOST manages URI-based routing to automatically select the appropriate transfer mechanism, supporting both structured data for record-level ingestion and chunk-based transfer for large binary objects. We demonstrate, through an environmental monitoring use case and empirical evaluation, that SkyHOST provides operational simplicity by consolidating heterogeneous data movement patterns under a single control plane while achieving competitive throughput for cross-region transfers. 2026-03-20T21:26:53Z Submitted to IEEE Open Journal of the Computer Society. 
11 pages, 6 figures, 4 tables Muhammad Arslan Tariq Grégoire Danoy Pascal Bouvry http://arxiv.org/abs/2511.16665v3 Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter 2026-03-20T17:56:05Z The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl. 
2025-11-20T18:59:25Z Qinghao Hu Shang Yang Junxian Guo Xiaozhe Yao Yujun Lin Yuxian Gu Han Cai Chuang Gan Ana Klimovic Song Han http://arxiv.org/abs/2603.20364v1 DGNNFlow: A Streaming Dataflow Architecture for Real-Time Edge-based Dynamic GNN Inference in HL-LHC Trigger Systems 2026-03-20T17:22:21Z Dynamic GNN inference has proven effective in High Energy Physics (HEP) experiments at the High Luminosity Large Hadron Collider (HL-LHC) due to its strong capability to model complex particle interactions in collision events. Future HEP experiments will involve detectors that produce 10x more collision data to help unlock physics discoveries. Due to limitations in offline compute capacity and storage, revamped trigger systems require FPGAs to run ultra-low-latency Machine Learning models for online filtering of useful events with low power consumption. State-of-the-art GNN accelerators rely on static graph structures, but this assumption breaks down in real-time HL-LHC trigger systems and edge-based dynamic GNN models where edge embeddings change in-place based on neighbor node embeddings at runtime. We propose DGNNFlow, a novel dataflow architecture for real-time edge-based dynamic GNN inference applications, especially HL-LHC trigger systems, with three key contributions. First, we introduce hardware support for dynamic computation of edge embeddings. Second, we resolve data dependencies in edge-based dynamic GNN dataflow, where edge embedding is formulated using its source and target nodes. Third, we perform input dynamic graph construction auxiliary setup for complete support of models without pre-defined edge embeddings. We deployed DGNNFlow using AMD Alveo U50 FPGA to evaluate end-to-end latency on-board at 200 MHz clock frequency. DGNNFlow achieved 1.6x-6.3x speedup and 0.22x power consumption compared to GPU (NVIDIA RTX A6000) with batch sizes from 1 to 4, 3.2x-5.1x speedup and 0.25x power consumption compared to CPU (Intel Xeon Gold 6226R). 
Our complete implementation is publicly available on GitHub. 2026-03-20T17:22:21Z Davendra Maharaj Tu Pham Peter Meiring Kyungmin Park Sena Durgut Cong Hao Matteo Cremonesi http://arxiv.org/abs/2504.09775v5 Understanding and Optimizing Multi-Stage AI Inference Pipelines 2026-03-20T16:55:01Z The rapid evolution of Large Language Models (LLMs) has driven the need for increasingly sophisticated inference pipelines and hardware platforms. Modern LLM serving extends beyond traditional prefill-decode workflows, incorporating multi-stage processes such as Retrieval Augmented Generation (RAG), key-value (KV) cache retrieval, dynamic model routing, and multi-step reasoning. These stages exhibit diverse computational demands, requiring distributed systems that integrate GPUs, ASICs, CPUs, and memory-centric architectures. However, existing simulators lack the fidelity to model these heterogeneous, multi-engine workflows, limiting their ability to inform architectural decisions. To address this gap, we introduce MIST, a Heterogeneous Multi-stage LLM inference Execution Simulator. MIST models diverse request stages, including RAG, KV retrieval, reasoning, prefill, and decode, across complex hardware hierarchies. Unlike prior frameworks, MIST supports heterogeneous clients executing multiple models concurrently while incorporating advanced batching strategies and multi-level memory hierarchies. By integrating real hardware traces with analytical modeling, MIST captures critical trade-offs such as memory bandwidth contention, inter-cluster communication latency, and batching efficiency in hybrid CPU-accelerator deployments. Through case studies, we explore the impact of reasoning stages on end-to-end latency, optimal batching strategies for hybrid pipelines, and the architectural implications of remote KV cache retrieval. 
MIST empowers system designers to navigate the evolving landscape of LLM inference, providing actionable insights into optimizing hardware-software co-design for next-generation AI workloads. 2025-04-14T00:29:49Z Inference System Design for Multi-Stage AI Inference Pipelines. 13 Pages, 15 Figures, 3 Tables Abhimanyu Rajeshkumar Bambhaniya Hanjiang Wu Suvinay Subramanian Sudarshan Srinivasan Souvik Kundu Amir Yazdanbakhsh Midhilesh Elavazhagan Madhu Kumar Tushar Krishna