https://arxiv.org/api/NhCPMieT/kSpvBzG99tjVs5tqkI 2026-04-14T19:57:05Z 28013 630 15 http://arxiv.org/abs/2603.02642v1 cuNRTO: GPU-Accelerated Nonlinear Robust Trajectory Optimization 2026-03-03T06:17:08Z Robust trajectory optimization enables autonomous systems to operate safely under uncertainty by computing control policies that satisfy the constraints for all bounded disturbances. However, these problems often lead to large Second Order Conic Programming (SOCP) constraints, which are computationally expensive. In this work, we propose the CUDA Nonlinear Robust Trajectory Optimization (cuNRTO) framework by introducing two dynamic optimization architectures that have direct application to robust decision-making and are implemented on CUDA. The first architecture, NRTO-DR, leverages the Douglas-Rachford (DR) splitting method to solve the SOCP inner subproblems of NRTO, thereby significantly reducing the computational burden through parallel SOCP projections and sparse direct solves. The second architecture, NRTO-FullADMM, is a novel variant that further exploits the problem structure to improve scalability using the Alternating Direction Method of Multipliers (ADMM). Finally, we provide a GPU implementation of the proposed methodologies using custom CUDA kernels for SOC projection steps and cuBLAS GEMM chains for feedback gain updates. We validate the performance of cuNRTO through simulated experiments on unicycle, quadcopter, and Franka manipulator models, demonstrating speedups of up to 139.6$\times$. 2026-03-03T06:17:08Z Jiawei Wang Arshiya Taj Abdul Evangelos A. Theodorou http://arxiv.org/abs/2601.16536v2 W4A16 Mixed-Precision Matrix Multiplication on Decoupled Architecture: Kernel Design and Memory Bottleneck Analysis for Ascend NPUs 2026-03-03T06:05:02Z As Large Language Models (LLMs) scale, weight-only quantization (W4A16: 4-bit weights, 16-bit activations) becomes critical for reducing memory footprint with minimal accuracy loss. 
However, its efficient deployment on Huawei's Ascend 910 Neural Processing Unit (NPU) is challenging due to limited native mixed-precision support and the accelerator's decoupled compute architecture. To enable quantization on such an architecture, we present the first practical W4A16 matrix multiplication kernel tailored for the Ascend 910 NPU. Our design leverages vector cores for on-the-fly INT4-to-FP16 dequantization, cube cores for high-throughput GEMM, and Split-K parallelization to mitigate memory latency. Performance evaluations across diverse matrix shapes and batch sizes show our method outperforms data-parallel approaches when K >> N, a typical scenario in LLM decoding. Specifically, our method achieves a speedup ranging from 1.01x to 1.74x. In addition, our profiling reveals that the primary bottleneck is not the dequantization computation itself, but the extra global memory transfer for the weights, so W4A16 reaches a maximum speedup of only 1.48x over native FP16xFP16 matrix multiplication in PyTorch. In the long run, our method lays a solid foundation and provides insightful views for the efficient deployment of quantized large language models on various domain-specific accelerators. 2026-01-23T08:15:13Z Yuanhong He Peiyu Niu Jun Chen Chenchen Zhang Chao Yang http://arxiv.org/abs/2603.02636v1 Undecided State Dynamics with Many Opinions 2026-03-03T06:04:15Z We study the Undecided-State Dynamics (USD), a fundamental consensus process in which each vertex holds one of $k$ decided opinions or the undecided state. We consider both the gossip model and the population protocol model. Prior work established tight bounds on the consensus time of this process only for the regime $k = O(\sqrt{n}/(\log n)^2)$ (for the population protocol model) and $k = O((n/\log n)^{1/3})$ (for the gossip model), often under restrictive assumptions on the initial configuration. 
In this paper, we obtain the first consensus-time guarantees for USD that hold for \emph{arbitrary} $2\le k\le n$ and for \emph{arbitrary} initial configurations in both the gossip model and the population protocol model. In the gossip model, USD reaches consensus within $\widetilde O(\min\{k,\sqrt n\})$ synchronous rounds with probability $1-p_{\bot}-n^{-c}$, where $p_{\bot}$ is the gossip-specific probability of collapsing to the all-undecided state in the first round. In the population protocol model, USD reaches consensus within $\widetilde O(\min\{kn,n^{3/2}\})$ asynchronous interactions with high probability. We also present lower bounds that match the upper bounds up to polylogarithmic factors for a specific initial configuration and show that our upper bounds are essentially optimal. 2026-03-03T06:04:15Z Colin Cooper Frederik Mallmann-Trenn Tomasz Radzik Nobutaka Shimizu Takeharu Shiraga http://arxiv.org/abs/2510.08976v2 MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition 2026-03-03T05:34:36Z To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching against segmented images. They still suffer from sub-optimal accuracy and efficiency, overlooking alignment between the query and varying image objects and redundant fine-grained image segments. In this work, we present an efficient scheduling framework for image retrieval - MIRAGE. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we minimize redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to minimize unnecessary matching computation. 
Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that MIRAGE not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system. 2025-10-10T03:36:18Z Will appear in DAC'2026 Maoliang Li Ke Li Yaoyang Liu Jiayu Chen Zihao Zheng Yinjun Wu Chenchen Liu Xiang Chen http://arxiv.org/abs/2603.02603v1 Why Atomicity Matters to AI/ML Infrastructure: Snapshots, Firmware Updates, and the Cost of the Forward-In-Time-Only Category Mistake 2026-03-03T05:08:55Z Large-scale AI/ML training systems depend on two assumptions that are rarely examined: (1) that checkpoints represent atomic snapshots of global training state, and (2) that infrastructure updates can be applied without inducing mixed-protocol cluster states. Both assumptions are instances of a deeper structural error: the Forward-In-Time-Only (FITO) category mistake, which confuses protocol convergence properties with temporal predicates. We formalize this confusion as a type error: the identification of a temporal snapshot $\mathsf{Snap}(t)$ with a convergence property $\mathsf{Conv}(\mathcal{P},e)$. We model checkpoint execution in a process-algebraic framework and prove that under asynchronous composition with crash-recovery failures, no temporal instant can serve as an atomicity boundary. We reformulate checkpoint inconsistency on an epoch lattice and show that atomicity is a measure-zero event whose complement grows exponentially with the number of independent persistence domains. We formalize mixed-epoch recovery as a type violation in the optimization algebra and show that the resulting update is not a valid step of any standard optimizer. 
For firmware fleet updates, we strengthen the known consensus-hardness result: atomic deployment requires not merely agreement but common knowledge of the epoch transition, which is strictly unattainable in asynchronous systems with unreliable communication. We conclude by sketching a bilateral convergence protocol, inspired by Open Atomic Ethernet, that achieves $\mathsf{Conv}(\mathcal{P},e)$ without requiring $\mathsf{Snap}(t)$ -- replacing the FITO assumption with constraint semantics. 2026-03-03T05:08:55Z Paul Borrill http://arxiv.org/abs/2403.20135v3 Parallel performance of shared memory parallel spectral deferred corrections 2026-03-03T04:53:27Z We investigate the parallel performance of parallel Spectral Deferred Corrections, a numerical approach that provides small-scale parallelism for the numerical solution of initial value problems. The scheme is applied to the shallow-water equation and uses an implicit-explicit splitting that, in order to be efficient, integrates fast modes implicitly and slow modes explicitly. We describe parallel OpenMP-based implementations of parallel Spectral Deferred Corrections for two well-established simulation codes: the finite volume based operational ocean model ICON and the spherical harmonics based research code SWEET. We also develop a performance model and benchmark our implementations on a single node of the JUSUF (SWEET) and JUWELS (ICON) system at Jülich Supercomputing Centre. A reduction of time-to-solution across a range of accuracies is demonstrated. For ICON, we show speedup over the currently used Adams--Bashforth-2 integrator with OpenMP loop parallelization. For SWEET, we show speedup over serial Spectral Deferred Corrections and a second-order implicit-explicit integrator. 
2024-03-29T11:56:58Z 21 pages, 5 figures Philip Freese Sebastian Götschel Thibaut Lunet Daniel Ruprecht Martin Schreiber 10.1177/10943420251400406 http://arxiv.org/abs/2603.02597v1 GPUTOK: GPU Accelerated Byte Level BPE Tokenization 2026-03-03T04:48:28Z As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer produces the same tokens as a CPU version and, for the longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time goes to memory allocation, so adding memory pooling should give the biggest speed boost next. Tests on generation tasks using WikiText103 prompts show that our GPU tokenizer's outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, meaning it keeps output quality while making long-context inference more practical. 2026-03-03T04:48:28Z Venu Gopal Kadamba Kanishkha Jaisankar http://arxiv.org/abs/2603.02510v1 ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution 2026-03-03T01:41:07Z The transition from sequential to parallel computing is essential for modern high-performance applications but is hindered by the steep learning curve of concurrent programming. This challenge is magnified for irregular data structures (such as sparse graphs, unbalanced trees, and non-uniform meshes) where static scheduling fails and data dependencies are unpredictable. 
Current Large Language Models (LLMs) often fail catastrophically on these tasks, generating code plagued by subtle race conditions, deadlocks, and sub-optimal scaling. We bridge this gap with ParEVO, a framework designed to synthesize high-performance parallel algorithms for irregular data. Our contributions include: (1) The Parlay-Instruct Corpus, a curated dataset of 13,820 tasks synthesized via a "Critic-Refine" pipeline that explicitly filters for empirically performant algorithms that effectively utilize Work-Span parallel primitives; (2) specialized DeepSeek, Qwen, and Gemini models fine-tuned to align probabilistic generation with the rigorous semantics of the ParlayLib library; and (3) an Evolutionary Coding Agent (ECA) that improves the "last mile" of correctness by iteratively repairing code using feedback from compilers, dynamic race detectors, and performance profilers. On the ParEval benchmark, ParEVO achieves an average 106x speedup (with a maximum of 1103x) across the suite, and a robust 13.6x speedup specifically on complex irregular graph problems, outperforming state-of-the-art commercial models. Furthermore, our evolutionary approach matches state-of-the-art expert human baselines, achieving up to a 4.1x speedup on specific highly-irregular kernels. Source code and datasets are available at https://github.com/WildAlg/ParEVO. 2026-03-03T01:41:07Z Liu Yang Zeyu Nie Andrew Liu Felix Zou Deniz Altinbüken Amir Yazdanbakhsh Quanquan C. Liu http://arxiv.org/abs/2602.21626v2 Multi-Layer Scheduling for MoE-Based LLM Reasoning 2026-03-02T23:30:35Z Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, but serving them efficiently at scale remains a critical challenge due to their substantial computational and latency demands. 
While most existing inference frameworks rely on simple scheduling strategies such as First-Come-First-Serve (FCFS) at the engine level and Round-Robin (RR) at the scheduler or coordinator level, they often fail to fully utilize system resources and may suffer from issues such as head-of-line blocking and load imbalance. Recent advances in Mixture-of-Experts (MoE) models have also introduced new challenges in scheduling arising from expert parallelism and routing complexity. This research proposes a multi-layer scheduling framework tailored for MoE-based LLM serving. It targets scheduling at three levels: request-level, engine-level, and expert-level. At the request level, we explore algorithms such as Shortest-Job-First (SJF) and priority-aware aging to improve throughput and reduce latency. At the engine level, we design load-aware dispatching strategies that account for the current prefix token load, KV cache utilization, and user stickiness to achieve better resource matching. At the expert level, we focus on alleviating expert hotspots and strategically placing inter-layer expert dependencies to balance load and improve routing efficiency. Extensive experimental results from more than 100 experiments conducted under diverse workload distributions show that our approach consistently outperforms the state-of-the-art inference framework vLLM, achieving up to 17.8% reduction in Time To First Token (TTFT) latency and 13.3% reduction in Time-Per-Output-Token (TPOT) latency. 2026-02-25T06:42:08Z 12 pages, 10 figures Yifan Sun Gholamreza Haffari Minxian Xu Rajkumar Buyya Adel N. Toosi http://arxiv.org/abs/2603.02376v1 CUCo: An Agentic Framework for Compute and Communication Co-design 2026-03-02T20:35:50Z Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a labor-intensive and error-prone process. 
Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks new optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$. 2026-03-02T20:35:50Z Bodun Hu Yoga Sri Varshan Saurabh Agarwal Aditya Akella http://arxiv.org/abs/2602.22593v2 FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving 2026-03-02T17:26:15Z Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool to amortize collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. 
Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance by up to $4.79\times$ under high load and $3.47\times$ under low load while supporting latency- and memory-driven requests. 2026-02-26T03:55:51Z This paper is accepted by the 40th ACM International Conference on Supercomputing (ICS 2026) Shouwei Gao Junqi Yin Feiyi Wang Wenqian Dong http://arxiv.org/abs/2305.04979v2 FedHB: Hierarchical Bayesian Federated Learning 2026-03-02T17:12:02Z We propose a novel hierarchical Bayesian approach to Federated Learning (FL), where our model reasonably describes the generative process of clients' local data via hierarchical Bayesian modeling: constituting random variables of local models for clients that are governed by a higher-level global variate. Interestingly, the variational inference in our Bayesian model leads to an optimisation problem whose block-coordinate descent solution becomes a distributed algorithm that is separable over clients and allows them not to reveal their own private data at all, and is thus fully compatible with FL. We also highlight that our block-coordinate algorithm has particular forms that subsume the well-known FL algorithms including Fed-Avg and Fed-Prox as special cases. Beyond introducing novel modeling and derivations, we also offer a convergence analysis showing that our block-coordinate FL algorithm converges to a (local) optimum of the objective at the rate of $O(1/\sqrt{t})$, the same rate as regular (centralised) SGD, as well as a generalisation error analysis where we prove that the test error of our model on unseen data is guaranteed to vanish as we increase the training data size, and is thus asymptotically optimal. 
2023-05-08T18:21:41Z Minyoung Kim Timothy Hospedales http://arxiv.org/abs/2603.02075v1 Trident: Adaptive Scheduling for Heterogeneous Multimodal Data Pipelines 2026-03-02T17:00:22Z The rapid adoption of large language models and multimodal foundation models has made multimodal data preparation pipelines critical AI infrastructure. These pipelines interleave CPU-heavy preprocessing with accelerator-backed (GPU/NPU/TPU) inference and produce massive intermediate artifacts. Achieving high throughput is difficult because workloads are highly non-stationary: regime shifts, input-dependent inference, and transient memory spikes cause rapid performance fluctuations and out-of-memory (OOM) failures. Existing schedulers typically rely on threshold-based autoscaling or assume synchronous, homogeneous operators, leading to poor efficiency. We present Trident, an adaptive scheduling framework for heterogeneous multimodal pipelines on fixed-resource clusters. Trident closes the loop across three coupled layers: (i) an observation layer that estimates per-operator sustainable throughput for asynchronous operators via Gaussian Process regression with anomaly filtering; (ii) an adaptation layer that detects workload shifts online and performs memory-constrained Bayesian optimization to recommend OOM-safe configurations; and (iii) a scheduling layer that solves a mixed-integer linear program to jointly optimize operator parallelism, placement, and configuration transitions under heterogeneous compute and bandwidth constraints, accounting for cold-start overhead via rolling updates. Decisions trigger sample invalidation and model refresh to keep estimates consistent with the active configuration. Implemented on Ray Data, Trident improves end-to-end throughput by up to 2.01x on a document curation (PDF) pipeline and 1.88x on a video curation pipeline over a static baseline, with low overhead suitable for online re-optimization. 
2026-03-02T17:00:22Z 22 pages, 3 figures Ding Pan Zhuangzhuang Zhou Long Qian Binhang Yuan http://arxiv.org/abs/2603.02071v1 Subcubic Coin Tossing in Asynchrony without Setup 2026-03-02T16:58:44Z We consider an asynchronous network of $n$ parties connected to each other via secure channels, up to $t$ of which are byzantine. We study common coin tossing, a task where the parties try to agree on an unpredictable random value, with some chance of failure due to the byzantine parties' influence. Coin tossing is a well-known and often studied task due to its use in byzantine agreement. In this work, we present an adaptively secure committee-based method to, roughly speaking, turn strong but costly common coins into cheaper but lower-quality ones. For all $k > 2$ and $\varepsilon > 0$, we show how to use a strong (very rarely failing) coin that costs $\widetilde{O}(n^k)$ bits of communication to get a cheaper coin that costs $\widetilde{O}(\varepsilon^{-2k}n^{3 - 2/k})$ bits of communication. This latter coin tolerates $\varepsilon n$ fewer byzantine parties than the former, and it fails with an arbitrarily small constant probability. For any $\varepsilon > 0$, our method allows us to get a perfectly secure binary coin that tolerates $t \leq (\frac{1}{4} - \varepsilon)n$ faults with $O(n^{2.5}(\varepsilon^{-8} + \log n))$ messages of size $O(\log n)$, as well as a setup-free cryptographically secure binary coin that tolerates $t \leq (\frac{1}{3} - \varepsilon)n$ faults with $O(n^{7/3}\varepsilon^{-6}κ\log n)$ bits of communication (where $κ = Ω(\log n)$ is a cryptographic security parameter). These coins both have $O(\log n)$ latency. They are to our knowledge the first setup-free coins that cost $o(n^3)$ bits of communication but still succeed with at least constant probability against $t = Θ(n)$ adaptive byzantine faults. 
As such, they for the first time enable setup-free (and even perfectly secure) asynchronous byzantine agreement with $o(n^3)$ communication against $Θ(n)$ adaptive byzantine faults. 2026-03-02T16:58:44Z 17 pages, preprint Mose Mizrahi Roger Wattenhofer http://arxiv.org/abs/2603.02057v1 Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads 2026-03-02T16:47:28Z Large language model (LLM) services have become an integral part of search, assistance, and decision-making applications. However, unlike traditional web or microservices, the hardware and software stack enabling LLM inference deployment is of higher complexity and far less field-tested, making it more susceptible to failures that are difficult to resolve. Keeping outage costs and quality of service degradations in check depends on shortening mean time to repair, which in practice is gated by how quickly the fault is identified, located, and diagnosed. Automated root cause analysis (RCA) accelerates failure localization by identifying the system component that failed and tracing how the failure propagated. Numerous RCA methods have been developed for traditional services, using request path tracing, resource metric and log data analysis. Yet, existing RCA methods have not been designed for LLM deployments that present distinct runtime characteristics. In this study, we evaluate the effectiveness of RCA methods on a best-practice LLM inference deployment under controlled failure injections. Across 24 methods (20 metric-based, two trace-based, and two multi-source), we find that multi-source approaches achieve the highest accuracy, metric-based methods show fault-type-dependent performance, and trace-based methods largely fail. These results reveal that existing RCA tools do not generalize to LLM systems, motivating tailored analysis techniques and enhanced observability, for which we formulate guidelines. 
2026-03-02T16:47:28Z 13 pages, 8 figures, 1 table Dominik Scheinert Alexander Acker Thorsten Wittkopp Soeren Becker Hamza Yous Karnakar Reddy Ibrahim Farhat Hakim Hacid Odej Kao