https://arxiv.org/api/Vbs99gOcBivRyvgVE6m3pIU58oI 2026-04-07T12:59:41Z 27913 240 15 http://arxiv.org/abs/2603.20711v1 RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models 2026-03-21T08:16:10Z Vision-Language-Action (VLA) models are mainstream in embodied intelligence but face high inference costs. Edge-Cloud Collaborative (ECC) deployment offers an effective fix by easing edge-device computing pressure to meet real-time needs. However, existing ECC frameworks are suboptimal for VLA models due to two challenges: (1) Diverse model structures hinder optimal ECC segmentation point identification; (2) Even if the optimal split point is determined, changes in network bandwidth can cause performance drift. To address these issues, we propose a novel ECC deployment framework for various VLA models, termed RoboECC. Specifically, we propose a model-hardware co-aware segmentation strategy to help find the optimal segmentation point for various VLA models. Moreover, we propose a network-aware deployment adjustment approach to adapt to the network fluctuations for maintaining optimal performance. Experiments demonstrate that RoboECC achieves a speedup of up to 3.28x with only 2.55x~2.62x overhead. 2026-03-21T08:16:10Z This paper has been accepted by IJCNN 2026 Zihao Zheng Hangyu Cao Jiayu Chen Sicheng Tian Chenyue Li Maoliang Li Xinhao Sun Guojie Luo Xiang Chen http://arxiv.org/abs/2603.20622v1 Incremental GNN Embedding Computation on Streaming Graphs 2026-03-21T03:35:03Z Graph Neural Network (GNN) on streaming graphs has gained increasing popularity. However, its practical deployment remains challenging, as the inference process relies on Runtime Embedding Computation (RTEC) to capture recent graph changes. This process incurs heavyweight multi-hop graph traversal overhead, which significantly undermines computation efficiency. 
We observe that the intermediate results for large portions of the graph remain unchanged during graph evolution, and thus redundant computations can be effectively eliminated through carefully designed incremental methods. In this work, we propose an efficient framework for incrementalizing RTEC on streaming graphs. The key idea is to decouple GNN computation into a set of generalized, fine-grained operators and safely reorder them, transforming the expensive full-neighbor GNN computation into a more efficient form over the affected subgraph. With this design, our framework preserves the semantics and accuracy of the original full-neighbor computation while supporting a wide range of GNN models with complex message-passing patterns. To further scale to graphs with massive historical results, we develop a GPU-CPU co-processing system that offloads embeddings to CPU memory with communication-optimized scheduling. Experiments across diverse graph sizes and GNN models show that our method reduces computation by 64%-99% and achieves 1.7x-145.8x speedups over existing solutions. 2026-03-21T03:35:03Z 14 pages; 12 figures; accepted for ICDE 2026 Qiange Wang Haoran Lv Yanfeng Zhang Weng-Fai Wong Bingsheng He http://arxiv.org/abs/2507.01113v2 Stannic: Systolic STochAstic ONliNe SchedulIng AcCelerator 2026-03-21T00:25:21Z Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to efficiently balance workloads due to scheduling overhead, lack of adaptability to stochastic workloads, and suboptimal resource utilization. The scheduling problem further compounds in the context of shared HPC clusters, where job arrivals and processing times are inherently stochastic. Prediction of these elements is possible, but it introduces additional overhead. 
To perform this complex scheduling, we developed two FPGA-assisted hardware accelerator microarchitectures, Hercules and Stannic. Hercules adopts a task-centric abstraction of stochastic scheduling, whereas Stannic inherits a schedule-centric abstraction. These hardware-assisted solutions leverage parallelism, pre-calculation, and spatial memory access to significantly accelerate scheduling. We accelerate a non-preemptive stochastic online scheduling algorithm to produce heterogeneity-aware schedules in near real time. With Hercules, we achieved a speedup of up to 1060x over a baseline C/C++ implementation, demonstrating the efficacy of a hardware-assisted acceleration for heterogeneity-aware stochastic scheduling. With Stannic, we further improved efficiency, achieving a 7.5x reduction in latency per computation iteration and a 14x increase in the target heterogeneous system size. Experimental results show that the resulting schedules demonstrate efficient machine utilization and low average job latency in stochastic contexts. 2025-07-01T18:18:00Z 30 pages, 18 figures, Conference version published in Int'l Conference on Computer Aided Design (ICCAD) 2025. Journal version (current version) is under revision with ACM TRETS Adam H. Ross Vairavan Palaniappan Debjit Pal 10.1109/ICCAD66269.2025.11240693 http://arxiv.org/abs/2603.20531v1 Epistemic Observability in Language Models 2026-03-20T21:59:34Z We find that models report highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence inversely correlates with accuracy, with AUC ranging from 0.28 to 0.36 where 0.5 is random guessing. We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model's output text, no monitoring system can reliably distinguish honest model outputs from plausible fabrications. 
We prove two results: first, that any policy conditioning only on the query cannot satisfy epistemic honesty across ambiguous world states; second, that no learning algorithm optimizing reward from a text-only supervisor can converge to honest behavior when the supervisor's observations are identical for both grounded and fabricated responses. Within our formal model, these impossibilities hold regardless of model scale or training procedure, including RLHF and instruction tuning. We construct a tensor interface that escapes the impossibility by exporting computational byproducts (per-token entropy and log-probability distributions) that are structurally coupled to correctness under standard training. Per-token entropy achieves pooled AUC 0.757, outperforming all text baselines by 2.5--3.9 percentage points at every budget level tested (10\%, 20\%, 30\%). The entropy signal generalizes across architectures (Spearman $ρ = 0.762$). The core contribution is a cost surface: an empirical mapping from verification budget (fraction of queries receiving expensive checks) to detection accuracy for each judge strategy, intended as a practical lookup for system builders deciding how to allocate verification resources. The contribution is the map. The territory is the system you are building. 2026-03-20T21:59:34Z Tony Mason http://arxiv.org/abs/2603.20512v1 SkyHOST: A Unified Architecture for Cross-Cloud Hybrid Object and Stream Transfer 2026-03-20T21:26:53Z Cloud and big data workloads are increasingly distributing data across multiple cloud providers and regions for rapid decision-making and analytics. Traditional transfer tools are typically specialized for a single paradigm, either stream replication or bulk transfer. This specialization forces users to deploy and manage separate systems with different configurations for each transfer pattern. 
This paper presents SkyHOST (Hybrid Object and Stream Transfer), a unified data movement architecture built upon the Skyplane framework to bridge the gap between bulk object transfer and streaming workloads through a single control plane and CLI. SkyHOST manages URI-based routing to automatically select the appropriate transfer mechanism, supporting both structured data for record-level ingestion and chunk-based transfer for large binary objects. We demonstrate, through an environmental monitoring use case and empirical evaluation, that SkyHOST provides operational simplicity by consolidating heterogeneous data movement patterns under a single control plane while achieving competitive throughput for cross-region transfers. 2026-03-20T21:26:53Z Submitted to IEEE Open Journal of the Computer Society. 11 pages, 6 figures, 4 tables Muhammad Arslan Tariq Grégoire Danoy Pascal Bouvry http://arxiv.org/abs/2511.16665v3 Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter 2026-03-20T17:56:05Z The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. 
TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl. 2025-11-20T18:59:25Z Qinghao Hu Shang Yang Junxian Guo Xiaozhe Yao Yujun Lin Yuxian Gu Han Cai Chuang Gan Ana Klimovic Song Han http://arxiv.org/abs/2603.20364v1 DGNNFlow: A Streaming Dataflow Architecture for Real-Time Edge-based Dynamic GNN Inference in HL-LHC Trigger Systems 2026-03-20T17:22:21Z Dynamic GNN inference has exhibited effectiveness in High Energy Physics (HEP) experiments at the High Luminosity Large Hadron Collider (HL-LHC) due to its strong capability to model complex particle interactions in collision events. Future HEP experiments will involve detectors that produce 10x more collision data to help unlock physics discoveries. Due to limitations in offline compute capacity and storage, revamped trigger systems require FPGAs to run ultra-low-latency Machine Learning models for online filtering of useful events with low power consumption. State-of-the-art GNN accelerators relied on static graph structures, but this assumption breaks down in real-time HL-LHC trigger systems and edge-based dynamic GNN models where edge embeddings change in-place based on neighbor node embeddings at runtime. We propose DGNNFlow, a novel dataflow architecture for real-time edge-based dynamic GNN inference applications, especially HL-LHC trigger systems, with three key contributions. 
First, we introduce hardware support for dynamic computation of edge embeddings. Second, we resolve data dependencies in edge-based dynamic GNN dataflow, where each edge embedding is formulated using its source and target nodes. Third, we add an auxiliary setup stage that constructs the input dynamic graph, providing complete support for models without pre-defined edge embeddings. We deployed DGNNFlow on an AMD Alveo U50 FPGA to evaluate end-to-end on-board latency at a 200 MHz clock frequency. DGNNFlow achieved 1.6x-6.3x speedup and 0.22x power consumption compared to GPU (NVIDIA RTX A6000) with batch sizes from 1 to 4, and 3.2x-5.1x speedup and 0.25x power consumption compared to CPU (Intel Xeon Gold 6226R). Our complete implementation is publicly available on GitHub. 2026-03-20T17:22:21Z Davendra Maharaj Tu Pham Peter Meiring Kyungmin Park Sena Durgut Cong Hao Matteo Cremonesi http://arxiv.org/abs/2504.09775v5 Understanding and Optimizing Multi-Stage AI Inference Pipelines 2026-03-20T16:55:01Z The rapid evolution of Large Language Models (LLMs) has driven the need for increasingly sophisticated inference pipelines and hardware platforms. Modern LLM serving extends beyond traditional prefill-decode workflows, incorporating multi-stage processes such as Retrieval Augmented Generation (RAG), key-value (KV) cache retrieval, dynamic model routing, and multi-step reasoning. These stages exhibit diverse computational demands, requiring distributed systems that integrate GPUs, ASICs, CPUs, and memory-centric architectures. However, existing simulators lack the fidelity to model these heterogeneous, multi-engine workflows, limiting their ability to inform architectural decisions. To address this gap, we introduce MIST, a Heterogeneous Multi-stage LLM inference Execution Simulator. MIST models diverse request stages, including RAG, KV retrieval, reasoning, prefill, and decode, across complex hardware hierarchies. 
MIST supports heterogeneous clients executing multiple models concurrently, unlike prior frameworks, while incorporating advanced batching strategies and multi-level memory hierarchies. By integrating real hardware traces with analytical modeling, MIST captures critical trade-offs such as memory bandwidth contention, inter-cluster communication latency, and batching efficiency in hybrid CPU-accelerator deployments. Through case studies, we explore the impact of reasoning stages on end-to-end latency, optimal batching strategies for hybrid pipelines, and the architectural implications of remote KV cache retrieval. MIST empowers system designers to navigate the evolving landscape of LLM inference, providing actionable insights into optimizing hardware-software co-design for next-generation AI workloads. 2025-04-14T00:29:49Z Inference System Design for Multi-Stage AI Inference Pipelines. 13 Pages, 15 Figures, 3 Tables Abhimanyu Rajeshkumar Bambhaniya Hanjiang Wu Suvinay Subramanian Sudarshan Srinivasan Souvik Kundu Amir Yazdanbakhsh Midhilesh Elavazhagan Madhu Kumar Tushar Krishna http://arxiv.org/abs/2603.12381v2 OpenDC-STEAM: Realistic Modeling and Systematic Exploration of Composable Techniques for Sustainable Datacenters 2026-03-20T15:54:39Z The need to reduce datacenter carbon footprint is urgent. While many sustainability techniques have been proposed, they are often evaluated in isolation, using limited setups or analytical models that overlook real-world dynamics and interactions between methods. This makes it challenging for researchers and operators to understand the effectiveness and trade-offs of combining such techniques. We design OpenDC-STEAM, an open-source customizable datacenter simulator, to investigate the individual and combined impact of sustainability techniques on datacenter operational and embodied carbon emissions, and their trade-off with performance. 
Using STEAM, we systematically explore three representative techniques - horizontal scaling, leveraging batteries, and temporal shifting - with diverse representative workloads, datacenter configurations, and carbon-intensity traces. Our analysis highlights that datacenter dynamics can influence their effectiveness and that combining strategies can significantly lower emissions, but introduces complex cost-emissions-performance trade-offs that STEAM can help navigate. STEAM supports the integration of new models and techniques, making it a foundation framework for holistic, quantitative, and reproducible research in sustainable computing. Following open-science principles, STEAM is available as FOSS: https://github.com/atlarge-research/OpenDC-STEAM. 2026-03-12T18:59:30Z This is an extended version of a paper published at CCGRID 2026 Dante Niewenhuis Sacheendra Talluri Alexandru Iosup Tiziano de Matteis http://arxiv.org/abs/2603.19980v1 Stone-in-Waiting: A Cloud-Based Accelerator for the Quantum Approximate Optimization Algorithm 2026-03-20T14:23:57Z The Quantum Approximate Optimization Algorithm (QAOA) and its advanced variant, the Quantum Alternating Operator Ansatz (QAOA), are major research topics in the current era of Noisy Intermediate-Scale Quantum (NISQ) computing. However, the problem of initializing their parameters remains unresolved. Motivated by the combinatorial optimization task in the 6th MindSpore Quantum Computing Hackathon (2024), this paper proposes Stone-in-Waiting, a cloud-based accelerator for obtaining high-quality initial parameters for QAOA. Internally, the accelerator builds on state-of-the-art theories and methods for parameter determination and integrates four self-developed algorithms for QAOA parameter initialization, mainly based on Bayesian methods, nearest-neighbor methods, and metric learning. Compared with the Baseline Algorithm, the generated parameters improve the score by 40.19%. 
Externally, the accelerator offers both a web interface and an API, providing flexible and convenient access for users to test and develop related experiments and applications. This paper presents the design principles and methods of Stone-in-Waiting, demonstrates its functional characteristics, compares the strengths and weaknesses of the four proposed algorithms, and validates the overall system performance through experiments. 2026-03-20T14:23:57Z Shuai Zeng http://arxiv.org/abs/2602.24152v2 Advanced Scheduling Strategies for Distributed Quantum Computing Jobs 2026-03-20T09:34:58Z Distributed quantum computing (DQC) is being actively investigated as a means of scaling the number of qubits across multiple connected quantum devices. This includes quantum circuit compilation and execution management on multiple quantum devices in the network. The latter aspect is very challenging because, while reducing the makespan of job batches remains a relevant objective, novel quantum-specific constraints must be considered, including QPU utilization, non-local gate rate, and the latency associated with queued DQC jobs. In this work, a range of scheduling strategies is proposed, simulated, and evaluated, including heuristics that prioritize resource maximization for QPU utilization, node selection based on heterogeneous network connectivity, asynchronous node release upon job completion, and a scheduling strategy based on reinforcement learning with proximal policy optimization. These approaches are benchmarked against traditional FIFO and LIST schedulers under varying DQC job types and network conditions for the allocation of DQC jobs to devices within a network. 
2026-02-27T16:35:32Z 14 pages, 10 figures, 9 tables Gongyu Ni Davide Ferrari Lester Ho Michele Amoretti http://arxiv.org/abs/2603.19787v1 Kumo: A Security-Focused Serverless Cloud Simulator 2026-03-20T09:23:04Z Serverless computing abstracts infrastructure management but also obscures system-level behaviors that can introduce security risks. Prior work has shown that serverless platforms are vulnerable to attacks exploiting shared execution environments, including attacker--victim co-location and denial-of-service through resource contention, yet analyzing these risks on production platforms is difficult due to limited observability, high cost, and lack of experimental control, while existing simulators primarily focus on performance and cost rather than security. We present Kumo, a security-focused simulator for serverless platforms that enables controlled, reproducible analysis of security risks arising from scheduling and resource sharing decisions. Kumo models invocation arrivals, scheduler placement, container reuse, resource contention, and queuing within a discrete-event framework, explicitly representing attackers and victims as first-class entities and providing metrics such as co-location probability, time to first co-location, invocation drop rate, and tail latency. Through two case studies, we show that scheduler choice is a first-order factor for co-location attacks, inducing orders-of-magnitude differences under identical workloads, while Denial-of-Service behavior is largely governed by system-level factors such as service time, queuing policy, and cluster capacity once contention dominates. These results highlight the need to distinguish scheduler-driven isolation risks from broader resource exhaustion vulnerabilities and position Kumo as a flexible foundation for systematic, security-aware exploration of serverless platforms. 
2026-03-20T09:23:04Z In the proceedings of IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGRID) 2026 Wei Shao Khaled Khasawneh Setareh Rafatirad Houman Homayoun Chongzhou Fang http://arxiv.org/abs/2505.04269v2 Accelerating Triangle Counting with Real Processing-in-Memory Systems 2026-03-20T08:24:38Z Triangle Counting (TC) is a procedure that involves enumerating the number of triangles within a graph. It has important applications in numerous fields, such as social or biological network analysis and network security. TC is a memory-bound workload that does not scale efficiently in conventional processor-centric systems due to several memory accesses across large memory regions and low data reuse. However, recent Processing-in-Memory (PIM) architectures present a promising solution to alleviate these bottlenecks. Our work presents the first TC algorithm that leverages the capabilities of the UPMEM system, the first commercially available PIM architecture, while at the same time addressing its limitations. We use a vertex coloring technique to avoid expensive communication between PIM cores and employ reservoir sampling to address the limited amount of memory available in the PIM cores' DRAM banks. In addition, our work makes use of the Misra-Gries summary to speed up counting triangles on graphs with high-degree nodes and uniform sampling of the graph edges for quicker approximate results. Our PIM implementation surpasses state-of-the-art CPU-based TC implementations when processing dynamic graphs in Coordinate List format, showcasing the effectiveness of the UPMEM architecture in addressing TC's memory-bound challenges. 2025-05-07T09:20:03Z Proc. IPDPS Workshop on Graphs, Architectures, Programming, and Learning (GrAPL), 2025 Lorenzo Asquini Manos Frouzakis Juan Gómez-Luna Mohammad Sadrosadati Onur Mutlu Francesco Silvestri 10.1109/IPDPSW66978.2025.00126 http://arxiv.org/abs/2501.03227v5 When Should Selfish Miners Double-Spend? 
2026-03-20T04:05:22Z Conventional double-spending attack models ignore the revenue losses stemming from orphan blocks. On the other hand, the selfish mining literature usually ignores the attacker's chance to double-spend at no cost in each attack cycle. In this paper, we give a rigorous stochastic analysis of an attack where the goal of the adversary is to double-spend while mining selfishly. To do so, we first combine stubborn and selfish mining attacks, i.e., construct a strategy where the attacker acts stubborn until its private branch reaches a certain length and then switches to act selfish. We provide the optimal stubbornness for each parameter regime. Next, we provide the maximum stubbornness that is still more profitable than honest mining and argue a connection between the level of stubbornness and the $k$-confirmation rule. We show that, at each attack cycle, if the level of stubbornness is higher than $k$, the adversary gets a free shot at double-spending. At each cycle, for a given stubbornness level, we rigorously formulate the probability of double-spending. We further modify the attack in the stubborn regime in order to conceal the attack and increase the double-spending probability. 2025-01-06T18:59:26Z Mustafa Doger Sennur Ulukus http://arxiv.org/abs/2603.28787v1 Smartphone-Based Identification of Unknown Liquids via Active Vibration Sensing 2026-03-20T03:05:44Z Traditional liquid identification instruments are often unavailable to the general public. This paper shows the feasibility of identifying unknown liquids with commercial lightweight devices, such as a smartphone. The key insight is that different liquid molecules have different viscosity coefficients and therefore must overcome different energy barriers during relative motion. With this intuition in mind, we introduce a novel model that measures liquids' viscosity based on active vibration. 
However, building a robust system using built-in smartphone accelerometers is challenging. Practical issues include under-sampling, self-interference, and the impact of liquid-volume changes. Instead of machine learning, we tackle these issues through multiple signal processing stages to reconstruct the original signals and cancel out the interference. Our approach estimates liquid viscosity with a mean relative error of 2.9% and distinguishes 30 types of liquids with an average accuracy of 95.47%. 2026-03-20T03:05:44Z Conference on Mobile Computing and Networking (MobiCom), 10 pages, 5 figures Proc. of the 27th Annual International Conference on Mobile Computing and Networking (MobiCom 2021), pages 174-187, 2021 Yongzhi Huang 10.1145/3447993.3448621