http://arxiv.org/abs/2603.02971v1 Scalable Mesh Coupling for Atmospheric Wave Simulation 2026-03-03T13:25:39Z

We describe the application of a scalable algorithm for interpolating solution data in the overlapping mesh region of two solvers. This feature is essential to obtain a globally consistent solution for in-situ coupled atmospheric wave simulation. We provide timings and discuss a real-world application run.

2026-03-03T13:25:39Z 5 pages, 6 figures, presented at SIAM International Meshing Roundtable 2026 Hannes Brandt Tim Griesbach Matthew Zettergren Scott Aiton Jonathan Snively Donna Calhoun Carsten Burstedde http://arxiv.org/abs/2603.02885v1 MuxTune: Efficient Multi-Task LLM Fine-Tuning in Multi-Tenant Datacenters via Spatial-Temporal Backbone Multiplexing 2026-03-03T11:34:49Z

Parameter-Efficient Fine-Tuning (PEFT) is widely applied as the backend of fine-tuning APIs for large language model (LLM) customization in datacenters. Service providers deploy separate instances for individual PEFT tasks, giving rise to prominent resource inefficiencies, including (1) GPU underutilization from small-scale, PEFT-native operators and (2) device stalls from communication delays and data dependencies in parallelized execution. To address these issues, this paper presents MuxTune, a fine-tuning system that enables resource-efficient concurrent execution of multiple PEFT tasks. The key idea is to multiplex the backbone across independent tasks in a spatial-temporal manner for improved utilization and reduced stalls. Building on flexible, modularized backbone sharing via unified PEFT representations, MuxTune proposes a hierarchical co-scheduling scheme with task-, operator-, and data-level optimizations. Specifically, it fuses tasks through a hybrid of spatial and temporal multiplexing, and orchestrates multi-task operator execution in two-tiered hybrid parallelism. Additionally, MuxTune employs chunk-based data alignment to mitigate inter-task ineffective tokens. Experimental results demonstrate that MuxTune achieves up to $2.33\times$ higher throughput and $5.29\times$ memory reduction compared to three state-of-the-art baselines.

2026-03-03T11:34:49Z Chunyu Xue Yi Pan Weihao Cui Quan Chen Shulai Zhang Bingsheng He Minyi Guo http://arxiv.org/abs/2512.22420v4 Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving 2026-03-03T09:33:44Z

Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Existing speculative decoding methods use fixed lengths and cannot adapt to workload changes or decide when to stop speculation. The cost of restarting speculative inference also remains unquantified. Under high load, the benefit of speculation diminishes, while retaining the draft model reduces KV-cache capacity, limiting batch size and degrading throughput. To overcome this, we propose Nightjar, a resource-aware adaptive speculative framework. It first adjusts to the request load by dynamically selecting the optimal speculative length for different batch sizes. Crucially, Nightjar proactively disables speculative decoding when the MAB planner determines that speculation is no longer beneficial, and during the disabled phase, offloads the draft model to the CPU only under GPU memory pressure. This reclaims memory for the KV cache, thereby facilitating larger batch sizes and maximizing overall system throughput. Experiments show that Nightjar achieves on average 27.29% higher throughput and up to 20.18% lower latency compared to standard speculative decoding under dynamic request arrival rates in real-time LLM serving scenarios.
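The length-selection idea in this abstract can be illustrated with a small bandit sketch. The abstract does not say which multi-armed bandit (MAB) algorithm Nightjar uses, so the epsilon-greedy rule, the candidate lengths, and the `toy_throughput` reward model below are all assumptions made purely for illustration; arm 0 stands for "speculation disabled".

```python
import random


class SpecLengthBandit:
    """Epsilon-greedy bandit over candidate speculative lengths (0 = speculation off).

    Illustrative only: the actual planner in the paper may differ.
    """

    def __init__(self, arms=(0, 1, 2, 4), epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}

    def select(self):
        # Try every arm once, then explore with probability epsilon.
        untried = [a for a in self.arms if self.counts[a] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.means[a])

    def update(self, arm, throughput):
        # Incremental running mean of the observed throughput for this arm.
        self.counts[arm] += 1
        self.means[arm] += (throughput - self.means[arm]) / self.counts[arm]


def toy_throughput(spec_len, batch_size):
    # Hypothetical reward model: speculation helps small batches, hurts large ones.
    gain = 1.0 + 0.3 * spec_len if batch_size <= 8 else 1.0 - 0.1 * spec_len
    return batch_size * gain


def best_length(batch_size, steps=200):
    # One bandit per batch size, mirroring "optimal length for different batch sizes".
    bandit = SpecLengthBandit(epsilon=0.1, seed=0)
    for _ in range(steps):
        arm = bandit.select()
        bandit.update(arm, toy_throughput(arm, batch_size))
    return max(bandit.arms, key=lambda a: bandit.means[a])
```

Under this toy reward, the bandit settles on the longest speculative length for a small batch and on length 0 (speculation off) for a large one, which is the qualitative behavior the abstract describes.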

2025-12-27T00:57:55Z Rui Li Zhaoning Zhang Libo Zhang Huaimin Wang Xiang Fu Zhiquan Lai http://arxiv.org/abs/2510.14686v2 xLLM Technical Report 2026-03-03T07:38:09Z

We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address these challenges, xLLM builds a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module also relies on a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture to provide global KV Cache management and robust fault-tolerant capabilities for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode, and xTensor memory management. xLLM-Engine also integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, which together substantially boost throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency. Under identical TPOT constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with Deepseek-series models. The xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.

2025-10-16T13:53:47Z 39 pages Tongxuan Liu Tao Peng Peijun Yang Xiaoyang Zhao Xiusheng Lu Weizhe Huang Zirui Liu Xiaoyu Chen Zhiwei Liang Jun Xiong Donghe Jin Minchao Zhang Jinrong Guo Yingxu Deng Xu Zhang Xianzhe Dong Siqi Wang Siyu Wu Yu Wu Zihan Tang Yuting Zeng Yanshu Wang Jinguang Liu Meng Kang Menxin Li Yunlong Wang Yiming Liu Xiaolong Ma Yifan Wang Yichen Zhang Jinrun Yin Keyang Zheng Jiawei Yin Jun Zhang Ziyue Wang Xiaobo Lin Liangyu Liu Liwei Lan Yang Liu Chunhua Peng Han Liu Songcheng Ren Xuezhu Wang Yunheng Shen Yi Wang Guyue Liu Yitao Hu Hui Chen Tong Yang Hailong Yang Jing Li Guiguang Ding Ke Zhang http://arxiv.org/abs/2603.02661v1 Blockchain Communication Vulnerabilities 2026-03-03T06:50:47Z

Blockchains are diverse in the way they handle communications between their nodes to disseminate information, mitigate attacks, and agree on the next block. While security vulnerabilities have been identified, they rely on an attack custom-made for a specific blockchain communication protocol. To our knowledge, the vulnerabilities of multiple blockchain communication protocols to adversarial conditions have never been compared. In this paper, we compare empirically the vulnerabilities of the communication protocols of five modern in-production blockchains, Algorand, Aptos, Avalanche, Redbelly and Solana, when attacked in five different ways. We conclude that Algorand is vulnerable to packet loss attacks, Aptos is vulnerable to targeted load attacks and leader isolation attacks, Avalanche is vulnerable to transient failure attacks, Redbelly's performance is impacted by packet loss attacks and Solana is vulnerable to stopping attacks and leader isolation attacks. Our system is open source.

2026-03-03T06:50:47Z 17 pages, 11 figures Andrei Lebedev Vincent Gramoli http://arxiv.org/abs/2603.03383v1 Accelerating OpenPangu Inference on NPU via Speculative Decoding 2026-03-03T06:50:31Z

To mitigate the Memory Wall bottleneck encountered by Large Language Models (LLMs) during inference on NPU hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.

2026-03-03T06:50:31Z Yuntao Dai Jing Wu Hang Gu Teng Wang http://arxiv.org/abs/2603.02642v1 cuNRTO: GPU-Accelerated Nonlinear Robust Trajectory Optimization 2026-03-03T06:17:08Z

Robust trajectory optimization enables autonomous systems to operate safely under uncertainty by computing control policies that satisfy the constraints for all bounded disturbances. However, these problems often lead to large Second Order Conic Programming (SOCP) constraints, which are computationally expensive. In this work, we propose the CUDA Nonlinear Robust Trajectory Optimization (cuNRTO) framework by introducing two dynamic optimization architectures that have direct application to robust decision-making and are implemented on CUDA. The first architecture, NRTO-DR, leverages the Douglas-Rachford (DR) splitting method to solve the SOCP inner subproblems of NRTO, thereby significantly reducing the computational burden through parallel SOCP projections and sparse direct solves. The second architecture, NRTO-FullADMM, is a novel variant that further exploits the problem structure to improve scalability using the Alternating Direction Method of Multipliers (ADMM). Finally, we provide a GPU implementation of the proposed methodologies using custom CUDA kernels for SOC projection steps and cuBLAS GEMM chains for feedback gain updates. We validate the performance of cuNRTO through simulated experiments on unicycle, quadcopter, and Franka manipulator models, demonstrating speedups of up to 139.6$\times$.
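The SOC projection step mentioned in this abstract has a well-known closed form, sketched here in plain Python for a single cone. This is only a scalar reference for the standard Euclidean projection onto the second-order cone {(x, t) : ||x||_2 <= t}; it is not the paper's CUDA kernel, and the function name `project_soc` is hypothetical.

```python
import math


def project_soc(x, t):
    """Euclidean projection of the point (x, t) onto the cone {(x, t): ||x||_2 <= t}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= t:
        # Already inside the cone: the point is its own projection.
        return list(x), t
    if norm <= -t:
        # Inside the polar cone: the projection is the origin.
        return [0.0] * len(x), 0.0
    # Otherwise project onto the cone boundary.
    alpha = 0.5 * (norm + t)
    return [alpha * v / norm for v in x], alpha
```

Because each cone is projected independently, a batch of such projections is naturally parallel, which is presumably why it maps well to one GPU thread or block per cone.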

2026-03-03T06:17:08Z Jiawei Wang Arshiya Taj Abdul Evangelos A. Theodorou http://arxiv.org/abs/2601.16536v2 W4A16 Mixed-Precision Matrix Multiplication on Decoupled Architecture: Kernel Design and Memory Bottleneck Analysis for Ascend NPUs 2026-03-03T06:05:02Z

As Large Language Models (LLMs) scale, weight-only quantization (W4A16: 4-bit weights, 16-bit activations) becomes critical for reducing memory footprint with minimal accuracy loss. However, its efficient deployment on Huawei's Ascend 910 Neural Processing Unit (NPU) is challenging due to limited native mixed-precision support and the accelerator's decoupled compute architecture. To enable quantization on such an architecture, we present the first practical W4A16 matrix multiplication kernel tailored for the Ascend 910 NPU. Our design leverages vector cores for on-the-fly INT4-to-FP16 dequantization, cube cores for high-throughput GEMM, and Split-K parallelization to mitigate memory latency. Performance evaluations across diverse matrix shapes and batch sizes show our method outperforms data-parallel approaches when K >> N, a typical scenario in LLM decoding. Specifically, our method achieves a speedup ranging from 1.01x to 1.74x. In addition, our profiling reveals that the primary bottleneck is not the dequantization computation itself but the extra global memory transfer for the weights, so W4A16 reaches a maximum speedup of only 1.48x over native FP16xFP16 matrix multiplication in PyTorch. In the long run, our method lays a solid foundation and provides insights for the efficient deployment of quantized large language models on various domain-specific accelerators.
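The two ingredients named in this abstract, INT4 dequantization and Split-K accumulation, can be sketched on a single dot product. The nibble order (low nibble first), the single per-tensor `scale` and `zero_point`, and all function names below are assumptions for illustration; real kernels typically use per-group scales and run the partial sums on separate cores.

```python
def unpack_int4(packed):
    """Unpack bytes into unsigned 4-bit values, low nibble first (assumed layout)."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out


def dequantize(q, scale, zero_point):
    """Map unsigned 4-bit codes back to real-valued weights."""
    return [(v - zero_point) * scale for v in q]


def split_k_dot(a, w, splits):
    """Dot product over K computed as independent partial sums, then reduced."""
    k = len(a)
    chunk = (k + splits - 1) // splits  # ceil(k / splits) elements per slice
    partials = [
        sum(a[i] * w[i] for i in range(start, min(start + chunk, k)))
        for start in range(0, k, chunk)
    ]
    return sum(partials)
```

For example, `split_k_dot([1.0] * 4, dequantize(unpack_int4(bytes([0x21, 0x43])), 0.5, 8), 2)` splits the four-element reduction into two slices of two elements each and adds the two partial results.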

2026-01-23T08:15:13Z Yuanhong He Peiyu Niu Jun Chen Chenchen Zhang Chao Yang http://arxiv.org/abs/2603.02636v1 Undecided State Dynamics with Many Opinions 2026-03-03T06:04:15Z

We study the Undecided-State Dynamics (USD), a fundamental consensus process in which each vertex holds one of $k$ decided opinions or the undecided state. We consider both the gossip model and the population protocol model. Prior work established tight bounds on the consensus time of this process only for the regime $k = O(\sqrt{n}/(\log n)^2)$ (for the population protocol model) and $k = O((n/\log n)^{1/3})$ (for the gossip model), often under restrictive assumptions on the initial configuration. In this paper, we obtain the first consensus-time guarantees for USD that hold for \emph{arbitrary} $2\le k\le n$ and for \emph{arbitrary} initial configurations in both the gossip model and the population protocol model. In the gossip model, USD reaches consensus within $\widetilde O(\min\{k,\sqrt n\})$ synchronous rounds with probability $1-p_{\bot}-n^{-c}$, where $p_{\bot}$ is the gossip-specific probability of collapsing to the all-undecided state in the first round. In the population protocol model, USD reaches consensus within $\widetilde O(\min\{kn,n^{3/2}\})$ asynchronous interactions with high probability. We also present lower bounds that match the upper bounds up to polylogarithmic factors for a specific initial configuration and show that our upper bounds are essentially optimal.
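A minimal simulation of the process in the population protocol model may help make the dynamics concrete. The update rule below is the commonly stated one-way USD rule (an undecided responder adopts the initiator's state; a decided responder that meets a different decided opinion becomes undecided); the abstract itself does not spell out the rule, so treat it, the balanced initial configuration, and the name `run_usd` as assumptions.

```python
import random

UNDECIDED = 0  # decided opinions are numbered 1..k


def run_usd(n, k, seed=0, max_steps=1_000_000):
    """Simulate USD on n agents with k opinions; return (interactions, winner) or None."""
    rng = random.Random(seed)
    state = [1 + (i % k) for i in range(n)]  # balanced start (assumed for the demo)
    counts = [0] * (k + 1)
    for s in state:
        counts[s] += 1
    for step in range(1, max_steps + 1):
        u, v = rng.sample(range(n), 2)  # initiator u, responder v
        old = state[v]
        if old == UNDECIDED:
            new = state[u]              # adopt whatever the initiator holds
        elif state[u] != UNDECIDED and state[u] != old:
            new = UNDECIDED             # two different decided opinions clash
        else:
            new = old
        if new != old:
            counts[old] -= 1
            counts[new] += 1
            state[v] = new
            if new != UNDECIDED and counts[new] == n:
                return step, new        # every agent now holds the same opinion
    return None
```

Calling `run_usd(60, 3, seed=1)` returns the number of asynchronous interactions until consensus together with the winning opinion; averaging over seeds for growing `k` gives a rough empirical view of how consensus time scales.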

2026-03-03T06:04:15Z Colin Cooper Frederik Mallmann-Trenn Tomasz Radzik Nobutaka Shimizu Takeharu Shiraga http://arxiv.org/abs/2510.08976v2 MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition 2026-03-03T05:34:36Z

To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching against segmented images. They still suffer from sub-optimal accuracy and efficiency, overlooking alignment between the query and varying image objects and redundant fine-grained image segments. In this work, we present MIRAGE, an efficient scheduling framework for image retrieval. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we minimize redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to avoid unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that MIRAGE not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.

2025-10-10T03:36:18Z Will appear in DAC'2026 Maoliang Li Ke Li Yaoyang Liu Jiayu Chen Zihao Zheng Yinjun Wu Chenchen Liu Xiang Chen http://arxiv.org/abs/2603.02603v1 Why Atomicity Matters to AI/ML Infrastructure: Snapshots, Firmware Updates, and the Cost of the Forward-In-Time-Only Category Mistake 2026-03-03T05:08:55Z

Large-scale AI/ML training systems depend on two assumptions that are rarely examined: (1) that checkpoints represent atomic snapshots of global training state, and (2) that infrastructure updates can be applied without inducing mixed-protocol cluster states. Both assumptions are instances of a deeper structural error: the Forward-In-Time-Only (FITO) category mistake, which confuses protocol convergence properties with temporal predicates. We formalize this confusion as a type error: the identification of a temporal snapshot $\mathsf{Snap}(t)$ with a convergence property $\mathsf{Conv}(\mathcal{P},e)$. We model checkpoint execution in a process-algebraic framework and prove that under asynchronous composition with crash-recovery failures, no temporal instant can serve as an atomicity boundary. We reformulate checkpoint inconsistency on an epoch lattice and show that atomicity is a measure-zero event whose complement grows exponentially with the number of independent persistence domains. We formalize mixed-epoch recovery as a type violation in the optimization algebra and show that the resulting update is not a valid step of any standard optimizer. For firmware fleet updates, we strengthen the known consensus-hardness result: atomic deployment requires not merely agreement but common knowledge of the epoch transition, which is strictly unattainable in asynchronous systems with unreliable communication. We conclude by sketching a bilateral convergence protocol, inspired by Open Atomic Ethernet, that achieves $\mathsf{Conv}(\mathcal{P},e)$ without requiring $\mathsf{Snap}(t)$ -- replacing the FITO assumption with constraint semantics.

2026-03-03T05:08:55Z Paul Borrill http://arxiv.org/abs/2403.20135v3 Parallel performance of shared memory parallel spectral deferred corrections 2026-03-03T04:53:27Z

We investigate the parallel performance of parallel Spectral Deferred Corrections, a numerical approach that provides small-scale parallelism for the numerical solution of initial value problems. The scheme is applied to the shallow-water equation and uses an implicit-explicit splitting that, in order to be efficient, integrates fast modes implicitly and slow modes explicitly. We describe parallel OpenMP-based implementations of parallel Spectral Deferred Corrections for two well established simulation codes: the finite volume based operational ocean model ICON and the spherical harmonics based research code SWEET. We also develop a performance model and benchmark our implementations on a single node of the JUSUF (SWEET) and JUWELS (ICON) system at Jülich Supercomputing Centre. A reduction of time-to-solution across a range of accuracies is demonstrated. For ICON, we show speedup over the currently used Adams--Bashforth-2 integrator with OpenMP loop parallelization. For SWEET, we show speedup over serial Spectral Deferred Corrections and a second order implicit-explicit integrator.

2024-03-29T11:56:58Z 21 pages, 5 figures Philip Freese Sebastian Götschel Thibaut Lunet Daniel Ruprecht Martin Schreiber 10.1177/10943420251400406 http://arxiv.org/abs/2603.02597v1 GPUTOK: GPU Accelerated Byte Level BPE Tokenization 2026-03-03T04:48:28Z

As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer produces the same tokens as a CPU version and, for the longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time goes to memory allocation, so adding memory pooling should give the biggest speed boost next. Tests on generation tasks using WikiText103 prompts show that our GPU tokenizer's outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, meaning it keeps output quality while making long-context inference more practical.
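For reference, the merge loop that a byte-level BPE tokenizer has to reproduce can be sketched sequentially as follows. This is a CPU-side illustration with a toy merge table, not the paper's GPU kernel; GPT-2's regex pre-tokenization and vocabulary lookup are omitted, and `bpe_encode` is a hypothetical name.

```python
def bpe_encode(data, merges):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest merge rank."""
    tokens = [bytes([b]) for b in data]          # start from single bytes
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break                                 # no learned merge applies
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

With the toy table `[(b"l", b"l"), (b"h", b"e"), (b"he", b"ll")]`, the input `b"hello"` is merged step by step into `[b"hell", b"o"]`. Each pass depends on the result of the previous one, which is the sequential bottleneck the abstract refers to.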

2026-03-03T04:48:28Z Venu Gopal Kadamba Kanishkha Jaisankar http://arxiv.org/abs/2603.02510v1 ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution 2026-03-03T01:41:07Z

The transition from sequential to parallel computing is essential for modern high-performance applications but is hindered by the steep learning curve of concurrent programming. This challenge is magnified for irregular data structures (such as sparse graphs, unbalanced trees, and non-uniform meshes) where static scheduling fails and data dependencies are unpredictable. Current Large Language Models (LLMs) often fail catastrophically on these tasks, generating code plagued by subtle race conditions, deadlocks, and sub-optimal scaling. We bridge this gap with ParEVO, a framework designed to synthesize high-performance parallel algorithms for irregular data. Our contributions include: (1) The Parlay-Instruct Corpus, a curated dataset of 13,820 tasks synthesized via a "Critic-Refine" pipeline that explicitly filters for empirically performant algorithms that effectively utilize Work-Span parallel primitives; (2) specialized DeepSeek, Qwen, and Gemini models fine-tuned to align probabilistic generation with the rigorous semantics of the ParlayLib library; and (3) an Evolutionary Coding Agent (ECA) that improves the "last mile" of correctness by iteratively repairing code using feedback from compilers, dynamic race detectors, and performance profilers. On the ParEval benchmark, ParEVO achieves an average 106x speedup (with a maximum of 1103x) across the suite, and a robust 13.6x speedup specifically on complex irregular graph problems, outperforming state-of-the-art commercial models. Furthermore, our evolutionary approach matches state-of-the-art expert human baselines, achieving up to a 4.1x speedup on specific highly-irregular kernels. Source code and datasets are available at https://github.com/WildAlg/ParEVO.

2026-03-03T01:41:07Z Liu Yang Zeyu Nie Andrew Liu Felix Zou Deniz Altinbüken Amir Yazdanbakhsh Quanquan C. Liu http://arxiv.org/abs/2602.21626v2 Multi-Layer Scheduling for MoE-Based LLM Reasoning 2026-03-02T23:30:35Z

Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, but serving them efficiently at scale remains a critical challenge due to their substantial computational and latency demands. While most existing inference frameworks rely on simple scheduling strategies such as First-Come-First-Serve (FCFS) at the engine level and Round-Robin (RR) at the scheduler or coordinator level, they often fail to fully utilize system resources and may suffer from issues such as head-of-line blocking and load imbalance. Recent advances in Mixture-of-Experts (MoE) models have also introduced new challenges in scheduling arising from expert parallelism and routing complexity. This research proposes a multi-layer scheduling framework tailored for MoE-based LLM serving. It targets scheduling at three levels: request-level, engine-level, and expert-level. At the request level, we explore algorithms such as Shortest-Job-First (SJF) and priority-aware aging to improve throughput and reduce latency. At the engine level, we design load-aware dispatching strategies that account for the current prefix token load, KV cache utilization, and user stickiness to achieve better resource matching. At the expert level, we focus on alleviating expert hotspots and strategically placing inter-layer expert dependencies to balance load and improve routing efficiency. Extensive experimental results from more than 100 experiments conducted under diverse workload distributions show that our approach consistently outperforms the state-of-the-art inference framework vLLM, achieving up to 17.8% reduction in Time To First Token (TTFT) latency and 13.3% reduction in Time-Per-Output-Token (TPOT) latency.
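The request-level policy (Shortest-Job-First with aging) can be sketched as a single scoring function. The abstract does not give the scoring formula, so the linear aging term, the `aging_rate` parameter, and the request fields `est_tokens` and `arrival` are hypothetical choices for illustration.

```python
def pick_next(queue, now, aging_rate=0.5):
    """Pick the request with the smallest score = estimated length - aging_rate * wait.

    Pure SJF is the special case aging_rate = 0; a positive rate lets long
    requests eventually overtake short ones, which prevents starvation.
    """
    return min(
        queue,
        key=lambda r: r["est_tokens"] - aging_rate * (now - r["arrival"]),
    )
```

For instance, a 100-token request that has waited 100 time units still loses to a 10-token request that arrived 10 units ago, but after waiting 300 units it wins against an equally fresh short request.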

2026-02-25T06:42:08Z 12 pages, 10 figures Yifan Sun Gholamreza Haffari Minxian Xu Rajkumar Buyya Adel N. Toosi