https://arxiv.org/api/c2DHgV34a9XJ4MgPQcsZ7yMw9M4 2026-03-22T08:36:25Z 27724 75 15 http://arxiv.org/abs/2603.13019v1 ARL-Tangram: Unleash the Resource Efficiency in Agentic Reinforcement Learning 2026-03-13T14:25:20Z

Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve complex problems through interactions with the real world. However, unlike traditional RL, agentic RL demands substantial external cloud resources, e.g., CPUs for code execution and GPUs for reward models, that exist outside the primary training cluster. Existing agentic RL frameworks typically rely on static over-provisioning, i.e., resources are often tied to long-lived trajectories or isolated by tasks, which leads to severe resource inefficiency. We propose action-level orchestration and incorporate it into ARL-Tangram, a unified resource management system that enables fine-grained external resource sharing and elasticity. ARL-Tangram utilizes a unified action-level formulation and an elastic scheduling algorithm to minimize action completion time (ACT) while satisfying heterogeneous resource constraints. Further, heterogeneous resource managers are tailored to efficiently support action-level execution on resources with heterogeneous characteristics and topologies. Evaluation on real-world agentic RL tasks demonstrates that ARL-Tangram improves average ACT by up to 4.3$\times$, speeds up the step duration of RL training by up to 1.5$\times$, and saves up to 71.2$\%$ of external resources. This system has been deployed to support the training of the MiMo series models.

2026-03-13T14:25:20Z Bangjun Xiao Yihao Zhao Xiangwei Deng Shihua Yu Yuxing Xiang Huaqiu Liu Qiying Wang Liang Zhao Hailin Zhang Xuanzhe Liu Xin Jin Fuli Luo http://arxiv.org/abs/2603.07974v2 ZK-ACE: Identity-Centric Zero-Knowledge Authorization for Post-Quantum Blockchain Systems 2026-03-13T14:07:40Z

Post-quantum signature schemes introduce kilobyte-scale authorization artifacts when applied directly to blockchain transaction validation. A widely considered mitigation is to verify post-quantum signatures inside zero-knowledge circuits and publish only succinct proofs on-chain. However, this approach preserves the signature-centric authorization model, merely relocating the verification cost, and embeds expensive high-dimensional lattice arithmetic into prover circuits. We present ZK-ACE (Zero-Knowledge Authorization for Cryptographic Entities), an authorization layer that replaces transaction-carried signature objects entirely with identity-bound zero-knowledge authorization statements. Rather than proving the correctness of a specific post-quantum signature, the prover demonstrates in zero knowledge that a transaction is authorized by an identity consistent with an on-chain commitment and bound replay state. The construction assumes a deterministic identity derivation primitive (DIDP) as a black box and uses a compact identity commitment as the primary on-chain identity anchor, supplemented by per-transaction replay-prevention state. We formalize ZK-ACE with explicit game-based security definitions for authorization soundness, replay resistance, substitution resistance, and cross-domain separation. We present a complete circuit constraint specification, define two replay-prevention models, and provide reduction-based security proofs under standard assumptions (knowledge soundness, collision resistance, and DIDP identity-root recovery hardness). A structural, protocol-level data accounting demonstrates an order-of-magnitude reduction in consensus-visible authorization data relative to direct post-quantum signature deployment. The design supports batch aggregation and recursive proof composition, and is compatible with account-abstraction and rollup-based deployment architectures.

2026-03-09T05:21:44Z 24 pages Jian Sheng Wang http://arxiv.org/abs/2512.21137v2 Declarative distributed algorithms as axiomatic theories in three-valued modal logic over semitopologies 2026-03-13T13:23:13Z

We illustrate how to formally specify distributed algorithms as declarative axiomatic theories in a modal logic, using as illustrative examples a simple voting protocol, a simple broadcast protocol (Bracha Broadcast), and a simple agreement protocol (Crusader Agreement). The methods scale well and have been used to find errors in a proposed industrial protocol. The key novelty is to use modal logic to capture a declarative, high-level representation of essential system properties -- the logical essence of the algorithm -- while abstracting away from explicit state transitions of an abstract machine that implements it. It is like the difference between specifying code in a functional or logic programming language, versus specifying code in an imperative one. Thus we present axiomatisations of Declarative Bracha Broadcast and Declarative Crusader Agreement. A logical axiomatisation in the style we propose provides a precise, compact, human-readable specification that abstractly captures essential system properties, while eliding low-level implementation details; it is more precise than a natural language description, yet more abstract than source code or a logical specification thereof. This creates new opportunities for reasoning about correctness, resilience, and failure, and could serve as a foundation for human and machine verification efforts, design improvements, and even alternative protocol implementations. The proofs in this paper have been formalised in Lean 4.

2025-12-24T12:07:25Z Murdoch J. Gabbay http://arxiv.org/abs/2602.15510v2 On the Geometric Coherence of Global Aggregation in Federated Graph Neural Networks 2026-03-13T10:56:01Z

Federated Learning (FL) enables distributed training across multiple clients without centralized data sharing, while Graph Neural Networks (GNNs) model relational data through message passing. In federated GNN settings, client graphs often exhibit heterogeneous structural and propagation characteristics. When standard aggregation mechanisms are applied to such heterogeneous updates, the global model may converge numerically while exhibiting degraded relational behavior. Our work identifies a geometric failure mode of global aggregation in Cross-Domain Federated GNNs. Although GNN parameters are numerically represented as vectors, they encode relational transformations that govern the direction, strength, and sensitivity of information flow across graph neighborhoods. Aggregating updates originating from incompatible propagation regimes can therefore introduce destructive interference in this transformation space. This leads to loss of coherence in global message passing. Importantly, this degradation is not necessarily reflected in conventional metrics such as loss or accuracy. To address this issue, we propose GGRS (Global Geometric Reference Structure), a server-side framework that regulates client updates prior to aggregation based on geometric admissibility criteria. GGRS preserves directional consistency of relational transformations and maintains diversity of admissible propagation subspaces. It also stabilizes sensitivity to neighborhood interactions, without accessing client data or graph topology. Experiments on heterogeneous GNN-native, Amazon Co-purchase datasets demonstrate that GGRS preserves global message-passing coherence across training rounds, highlighting the necessity of geometry-aware regulation in federated graph learning.

2026-02-17T11:34:04Z This is a developing preprint of an 18-page journal manuscript (6 figures), currently being prepared for formal peer-review submission Chethana Prasad Kabgere Shylaja SS http://arxiv.org/abs/2603.12838v1 A New Kernel Regularity Condition for Distributed Mirror Descent: Broader Coverage and Simpler Analysis 2026-03-13T09:40:15Z

Existing convergence analyses of distributed optimization methods in non-Euclidean geometries typically rely on two kernel assumptions: (i) global Lipschitz smoothness and (ii) bi-convexity of the associated Bregman divergence function. Unfortunately, these conditions are violated by nearly all kernels used in practice, leaving a huge theory-practice gap. This work closes this gap by developing a unified analytical tool that guarantees convergence under mild conditions. Specifically, we introduce Hessian relative uniform continuity (HRUC), a regularity condition satisfied by nearly all standard kernels. Importantly, HRUC is closed under concatenation, positive scaling, composition, and various kernel combinations. Leveraging the geometric structure induced by HRUC, we derive convergence guarantees for mirror descent-based gradient tracking without imposing any restrictive assumptions. More broadly, our analysis techniques extend seamlessly to other decentralized optimization methods in genuinely non-Euclidean and non-Lipschitz settings.

2026-03-13T09:40:15Z 25 pages, 4 figures Junwen Qiu Ziyang Zeng Leilei Mei Junyu Zhang http://arxiv.org/abs/2510.12196v2 GPU-Accelerated Algorithms for Process Mapping 2026-03-13T07:35:25Z

Process mapping asks to assign vertices of a task graph to processing elements of a supercomputer such that the computational workload is balanced while the communication cost is minimized. Motivated by the recent success of GPU-based graph partitioners, we propose two GPU-accelerated algorithms for this optimization problem. The first algorithm employs hierarchical multisection, which partitions the task graph alongside the hierarchy of the supercomputer. The method utilizes GPU-based graph partitioners to accelerate the mapping process. The second algorithm integrates process mapping directly into the modern multilevel graph partitioning pipeline. Vital phases like coarsening and refinement are accelerated by exploiting the parallelism of GPUs. The first algorithm has, on average, about 12 percent higher communication costs than the state-of-the-art solver and thus remains competitive with it. However, in terms of speed, it vastly outperforms the competitor with a geometric mean speedup of 22 times and a maximum speedup of 934 times. The second approach is even faster, with a geometric mean speedup of 1454 times and a peak speedup of 12376 times. Compared to other algorithms that prioritize speed over solution quality, this approach has the same quality but much greater speedups. To our knowledge, these are the first GPU-based algorithms for process mapping.

2025-10-14T06:42:20Z Petr Samoldekin Christian Schulz Henning Woydt http://arxiv.org/abs/2511.01863v2 SPHERE: Spherical partitioning for large-scale routing optimization 2026-03-13T06:49:27Z

We study shortest-path routing in large weighted, undirected graphs, where expanding search frontiers raise time and memory costs for exact solvers. We propose \emph{SPHERE}, a query-aware partitioning heuristic that adaptively splits the problem by identifying \emph{source-target} ($s$--$t$) overlaps of hop-distance spheres. Selecting an anchor node $a$ within this overlap partitions the task into independent subproblems for $s\to a$ and $a\to t$, each restricted to its own induced subgraph. If a resulting subgraph remains large, the procedure recurses on that specific subgraph. We provide a formal guarantee that by using the partition cut within the shared overlap, the resulting subpaths preserve feasibility, thereby avoiding the need for boundary repair. Furthermore, \emph{SPHERE} acts as a solver-agnostic framework that naturally exposes parallelism across subproblems. On million-scale road networks, \emph{SPHERE} achieves faster runtimes and smaller optimality gaps than contemporary state-of-the-art partitioning and community-based routing pipelines. Crucially, it also substantially mitigates heavy-tail runtime outliers suffered by standard exact methods, yielding highly stable and predictable execution times across varying queries.
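
The overlap-and-anchor idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `hop_sphere` and `pick_anchor`, the equal-radius growth of both spheres, and the tie-breaking rule for choosing the anchor are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import deque

def hop_sphere(adj, center, radius):
    """Return {node: hop distance} for all nodes within `radius` hops of `center`."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the sphere boundary
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pick_anchor(adj, s, t):
    """Grow equal-radius spheres around s and t until they overlap.

    Returns (anchor, radius), or (None, None) if s and t are disconnected.
    The anchor rule (most balanced hop distances, then smallest node id)
    is a hypothetical choice, not taken from the paper.
    """
    for r in range(len(adj) + 1):
        ds = hop_sphere(adj, s, r)
        dt = hop_sphere(adj, t, r)
        overlap = ds.keys() & dt.keys()
        if overlap:
            anchor = min(overlap, key=lambda n: (abs(ds[n] - dt[n]), n))
            return anchor, r
    return None, None

# Path graph 0 - 1 - 2 - 3 - 4: the spheres first meet at the middle node.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(pick_anchor(path, 0, 4))
```

Once an anchor is chosen, the two subproblems ($s\to a$ and $a\to t$) could each be handed to any exact solver restricted to the nodes of the corresponding sphere, which is where the parallelism mentioned in the abstract would come from.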

2025-10-12T19:13:13Z Changed abstract, revised chapters 1-5, adjusted bibliography Robert Fabian Lindermann Paul-Niklas Ken Kandora Simon Caspar Zeller Adrian Asmund Fessler Steffen Rebennack http://arxiv.org/abs/2603.12707v1 Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity 2026-03-13T06:42:35Z

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{\mathrm{ctx}})$ bytes (GB-scale KV caches under stage-level disaggregation) to $O(N_v \cdot d)$ bytes (MB-scale embeddings), an $O(L)$ reduction where $L$ is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicts 31.4% savings; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4xA100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster (\$38k) improves Tokens/\$ by 37% over a homogeneous baseline (\$64k) without degrading latency.
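
A rough back-of-the-envelope calculation shows why the two transfer sizes differ by orders of magnitude. The Python sketch below uses the standard KV-cache size formula (keys plus values, per layer, per token); the concrete numbers (32 layers, 32 KV heads of dimension 128, a 2048-token context, 576 vision tokens, hidden size 4096, fp16) are illustrative assumptions for a 7B-class model and are not taken from the abstract.

```python
def kv_cache_bytes(num_layers, ctx_tokens, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: keys and values for every layer and token."""
    return 2 * num_layers * ctx_tokens * num_kv_heads * head_dim * bytes_per_elem

def vision_embedding_bytes(num_vision_tokens, hidden_dim, bytes_per_elem=2):
    """Bytes of vision-encoder output handed to the language model."""
    return num_vision_tokens * hidden_dim * bytes_per_elem

# Assumed, illustrative configuration (fp16, multi-head attention).
kv = kv_cache_bytes(num_layers=32, ctx_tokens=2048, num_kv_heads=32, head_dim=128)
emb = vision_embedding_bytes(num_vision_tokens=576, hidden_dim=4096)

print(f"KV cache:          {kv / 2**30:.2f} GiB")
print(f"Vision embeddings: {emb / 2**20:.2f} MiB")
print(f"Ratio:             {kv / emb:.1f}x")
```

Under these assumptions the KV cache is 1 GiB while the embeddings are 4.5 MiB. With multi-head attention the ratio simplifies to 2 * L * s_ctx / N_v, so it scales linearly with the depth L, in line with the O(L) reduction stated above.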

2026-03-13T06:42:35Z Donglin Yu http://arxiv.org/abs/2603.07917v2 SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity 2026-03-13T06:25:30Z

Efficient LLM inference scheduling is crucial for user experience. However, LLM inference exhibits remarkable demand uncertainty (the output length is unknown beforehand) and hybridity (being both compute- and memory-intensive). Existing LLM schedulers rely on simple heuristics or focus purely on compute resources, and thus suffer from suboptimal performance. In this work, we propose SageSched, an efficient LLM scheduler that properly handles the demand uncertainty and hybridity of inference workloads. SageSched combines prompt contents with past inference results to predict the output-length distribution in a lightweight yet accurate manner. Meanwhile, it models the true service cost of an inference request with both compute and memory aspects considered. Finally, SageSched employs an uncertainty-aware scheduling policy that can yield the best overall efficiency given the request cost distributions. Testbed experiments over diverse setups confirm that SageSched can attain an efficiency improvement of over 28.7%.

2026-03-09T03:20:51Z Zhenghao Gan Yichen Bao Yifei Liu Chen Chen Quan Chen Minyi Guo http://arxiv.org/abs/2603.12684v1 Federated Hierarchical Clustering with Automatic Selection of Optimal Cluster Numbers 2026-03-13T05:58:35Z

Federated Clustering (FC) is an emerging and promising solution for exploring data distribution patterns from distributed and privacy-protected data in an unsupervised manner. Existing FC methods implicitly rely on the assumption that clients hold a known number of uniformly sized clusters. However, the true number of clusters is typically unknown, and cluster sizes are naturally imbalanced in real scenarios. Furthermore, the privacy-preserving transmission constraints in federated learning inevitably reduce usable information, making the development of robust and accurate FC extremely challenging. Accordingly, we propose a novel FC framework named Fed-$k^*$-HC, which can automatically determine an optimal number of clusters $k^*$ based on the data distribution explored through hierarchical clustering. To obtain the global data distribution for $k^*$ determination, we let each client generate micro-subclusters. Their prototypes are then uploaded to the server for hierarchical merging. The density-based merging design allows exploring clusters of varying sizes and shapes, and the progressive merging process can self-terminate according to the neighboring relationships among the prototypes to determine $k^*$. Extensive experiments on diverse datasets demonstrate the FC capability of the proposed Fed-$k^*$-HC in accurately exploring a proper number of clusters.

2026-03-13T05:58:35Z 29 pages, 7 figures Information Sciences 733 (2026) 122957 Yue Zhang Chuanlong Qiu Xinfa Liao Yiqun Zhang 10.1016/j.ins.2025.122957 http://arxiv.org/abs/2512.17023v2 LLM-HPC++: Evaluating LLM-Generated Modern C++ and MPI+OpenMP Codes for Scalable Mandelbrot Set Computation 2026-03-13T03:27:42Z

Parallel programming remains one of the most challenging aspects of High-Performance Computing (HPC), requiring deep knowledge of synchronization, communication, and memory models. While modern C++ standards and frameworks like OpenMP and MPI have simplified parallelism, mastering these paradigms is still complex. Recently, Large Language Models (LLMs) have shown promise in automating code generation, but their effectiveness in producing correct and efficient HPC code is not well understood. In this work, we systematically evaluate leading LLMs including ChatGPT 4 and 5, Claude, and LLaMA on the task of generating C++ implementations of the Mandelbrot set using shared-memory, directive-based, and distributed-memory paradigms. Each generated program is compiled and executed with GCC 11.5.0 to assess its correctness, robustness, and scalability. Results show that ChatGPT-4 and ChatGPT-5 achieve strong syntactic precision and scalable performance.

2025-12-18T19:37:33Z Patrick Diehl Noujoud Nader Deepti Gupta http://arxiv.org/abs/2411.10406v3 How to Build a Quantum Supercomputer: Scaling from Hundreds to Millions of Qubits 2026-03-13T03:26:12Z

In the span of four decades, quantum computation has evolved from an intellectual curiosity to a potentially realizable technology. Today, small-scale demonstrations have become possible for quantum algorithmic primitives on hundreds of physical qubits. Nevertheless, there are significant outstanding challenges in quantum hardware, fabrication, software architecture, and algorithms on the path towards a full-stack scalable quantum computing technology. Here, we provide a comprehensive review of these scaling challenges. We show how to facilitate scaling by adopting existing semiconductor technology to build much higher-quality qubits, employing systems engineering approaches, and performing distributed heterogeneous quantum-classical computing. We provide a detailed resource and sensitivity analysis for quantum applications on surface-code error-corrected quantum computers given current, target, and desired hardware specifications based on superconducting qubits, accounting for a realistic distribution of errors. We provide comprehensive resource estimates for several utility-scale applications including quantum chemistry calculations, catalyst design, NMR spectroscopy, and Fermi-Hubbard simulation. We show that orders of magnitude enhancement in performance could be obtained by a combination of hardware improvements and tight quantum-HPC integration. Furthermore, we introduce high-performance architectures for quantum-probabilistic computing with custom-designed accelerators to tackle today's industry-scale classical optimization, machine learning, and quantum simulation tasks in a cost-effective manner.

2024-11-15T18:22:46Z 71 pages, 53 figures. General revision, added new sections, added figures, added references, added appendices Masoud Mohseni Artur Scherer K. Grace Johnson Oded Wertheim Matthew Otten Namit Anand Navid Anjum Aadit Yuri Alexeev Gilad Ben-Shach Kirk M. Bresniker Kerem Y. Camsari Barbara Chapman Soumitra Chatterjee Shuvro Chowdhury Gebremedhin A. Dagnew Tom Dvir Aniello Esposito Farah Fahim Michael Ferguson Marco Fiorentino Archit Gajjar Katerina Gratsea Gaurav Gyawali Christian Heiter Ali H. Z. Kavaki Abdullah Khalid Xiangzhou Kong Bohdan Kulchytskyy Elica Kyoseva Ruoyu Li P. Aaron Lott Igor L. Markov Robert F. McDermott Lucas Morais Giacomo Pedretti Pooja Rao Eleanor Rieffel Allyson Silva John Sorebo Panagiotis Spentzouris Ziv Steiner Boyan Torosov Davide Venturelli Robert J. Visser Zak Webb Xin Zhan Yonatan Cohen Pooya Ronagh Alan Ho Raymond G. Beausoleil John M. Martinis http://arxiv.org/abs/2601.14608v2 Exploring Performance-Productivity Trade-offs in AMT Runtimes: A Task Bench Study of Itoyori, ItoyoriFBC, HPX, and MPI 2026-03-13T03:11:04Z

Asynchronous Many-Task (AMT) runtimes offer a productive alternative to the Message Passing Interface (MPI). However, the diverse AMT landscape makes fair comparisons challenging. Task Bench, proposed by Slaughter et al., addresses this challenge through a parameterized framework for evaluating parallel programming systems. This work integrates two recent cluster AMTs, Itoyori and ItoyoriFBC, into Task Bench for comprehensive evaluation against MPI and HPX. Itoyori employs a Partitioned Global Address Space (PGAS) model with RDMA-based work stealing, while ItoyoriFBC extends it with future-based synchronization. We evaluate these systems in terms of both performance and programmer productivity. Performance is assessed across various configurations, including compute-bound kernels, weak scaling, and both imbalanced and communication-intensive patterns. Performance is quantified using application efficiency, i.e., the percentage of maximum performance achieved, and the Minimum Effective Task Granularity (METG), i.e., the smallest task duration before runtime overheads dominate. Programmer productivity is quantified using Lines of Code (LOC) and the Number of Library Constructs (NLC). Our results reveal distinct trade-offs. MPI achieves the highest efficiency for regular, communication-light workloads but requires verbose, low-level code. HPX maintains stable efficiency under load imbalance across varying node counts, yet ranks last in productivity metrics, demonstrating that AMTs do not inherently guarantee improved productivity over MPI. Itoyori achieves the highest efficiency in communication-intensive configurations while leading in programmer productivity. ItoyoriFBC exhibits slightly lower efficiency than Itoyori, though its future-based synchronization offers potential for expressing irregular workloads.
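
The two performance metrics named in this abstract can be sketched in a few lines of Python. This is only an illustration: the function names, the measurement format, and the 50% efficiency threshold are assumptions (a threshold of this kind is conventional for METG, but the abstract does not state one), and real evaluations may interpolate between measurements rather than pick the smallest qualifying sample.

```python
def application_efficiency(achieved, peak):
    """Fraction of the maximum achievable performance that a run attained."""
    return achieved / peak

def metg(measurements, threshold=0.5):
    """Smallest task granularity whose efficiency still meets `threshold`.

    `measurements` is an iterable of (task_granularity_seconds, efficiency)
    pairs, with efficiency given as a fraction in [0, 1].
    Returns None if no measurement reaches the threshold.
    """
    qualifying = [g for g, eff in measurements if eff >= threshold]
    return min(qualifying) if qualifying else None

# Hypothetical sweep: efficiency drops as tasks get shorter and
# runtime overheads start to dominate.
runs = [(1e-3, 0.95), (1e-4, 0.80), (1e-5, 0.52), (1e-6, 0.10)]
print(metg(runs))                  # smallest granularity with >= 50% efficiency
print(metg(runs, threshold=0.9))   # a stricter threshold needs longer tasks
```

A lower METG means a runtime can keep hardware busy with finer-grained tasks, which is why it complements raw efficiency numbers when comparing systems with different scheduling overheads.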

2026-01-21T02:57:36Z Torben R. Lahnor Mia Reitz Jonas Posner Patrick Diehl http://arxiv.org/abs/2603.12566v1 Streaming REST APIs for Large Financial Transaction Exports from Relational Databases 2026-03-13T01:56:55Z

Financial platforms and enterprise systems frequently provide transaction export capabilities to support reporting, reconciliation, auditing, and regulatory compliance workflows. In many environments, these exports involve very large datasets containing hundreds of thousands or even millions of transaction records. Traditional REST API implementations often construct the entire export payload in application memory before transmitting the response to the client, which can lead to high memory consumption and delayed response initiation when processing large datasets. This paper presents a streaming-based REST API architecture that retrieves transaction records incrementally from relational databases and writes them directly to the HTTP response output stream. By integrating database cursor retrieval with progressive HTTP transmission, the proposed design allows export data to be delivered continuously as records are processed rather than after the full dataset has been assembled. The architecture is implemented using a Java-based JAX-RS framework with the StreamingOutput interface and supports multiple financial export formats including CSV, OFX, QFX, and QBO. In practice, the streaming approach significantly reduces memory buffering requirements and allows large export downloads to begin immediately, improving responsiveness and scalability for high-volume export operations.
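
The streaming pattern described here can be sketched in Python. Note that this swaps in a different stack from the paper, which uses Java with the JAX-RS `StreamingOutput` interface: the sketch below uses the standard-library `sqlite3` and `csv` modules, and the function name `stream_csv`, the table `tx`, and its columns are hypothetical. The point it illustrates is the same: rows are pulled from a database cursor in batches and emitted as chunks, so the full export is never held in memory.

```python
import csv
import io
import sqlite3

def stream_csv(conn, query, batch_size=1000):
    """Yield CSV text chunks while rows are fetched incrementally from a cursor."""
    cur = conn.execute(query)
    buf = io.StringIO()
    writer = csv.writer(buf)

    def flush():
        # Hand the buffered text to the caller and reset the buffer.
        chunk = buf.getvalue()
        buf.seek(0)
        buf.truncate(0)
        return chunk

    # Emit the header first so the download can begin before any rows are read.
    writer.writerow([col[0] for col in cur.description])
    yield flush()

    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        yield flush()

# Hypothetical in-memory table standing in for a transaction store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (id INTEGER, amount_cents INTEGER)")
conn.executemany("INSERT INTO tx VALUES (?, ?)", [(i, i * 100) for i in range(1, 6)])

for chunk in stream_csv(conn, "SELECT id, amount_cents FROM tx ORDER BY id", batch_size=2):
    print(repr(chunk))
```

In a web framework, a generator like this would typically be passed to a streaming response object so that each chunk is written to the HTTP response as soon as it is produced. Memory use is then bounded by `batch_size` rather than by the size of the export.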

2026-03-13T01:56:55Z 6 pages, 2 figures, includes illustrative evaluation Abhiram Kandiraju http://arxiv.org/abs/2603.12465v1 TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition 2026-03-12T21:30:07Z

Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or the device-side workload execution.

2026-03-12T21:30:07Z Accepted at IEEE ISPASS 2026. Copyright assigned to IEEE Prabhu Vellaisamy Shreesh Tripathi Vignesh Natarajan Surya Santhan Thenarasu Shawn Blanton John P. Shen