https://arxiv.org/api/Z1+1W0GSxEWgIm1Ym9TsiQ/Xy68 2026-06-10T09:51:21Z 28838 180 15 http://arxiv.org/abs/2606.01472v1 Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study 2026-05-31T22:17:44Z

High-stakes production document-generation systems require language models to be adaptive, evidence-grounded, and auditable. We present HOPM, a hierarchical online prompt mutation framework evaluated on a real marketplace dispute-evidence workflow. HOPM treats prompts as online policies: a family/version router selects a prompt, deterministic guardrails attribute failures to mutable prompt-token categories, and dual feedback from human review and an automated judge updates both routing and mutation priorities. The primary evidence is an observed matched production-evaluation ablation: seven variants are evaluated on the same 600 cases each, enabling component comparisons against static prompting, manual iteration, bandit-only routing, mutation-only adaptation, human-only feedback, auto-judge-only feedback, and full dual-loop HOPM. Full HOPM improves count win rate over a static control from 34.7% to 45.7% (+11.0 pp; paired McNemar p = 1.31e-11) and amount-weighted win rate from 22.3% to 41.4% (+19.1 pp; 95% paired bootstrap CI [10.3, 28.9] pp). It also increases mean Likert quality from 3.18 to 4.40 and reduces issue-flag rate from 15.3% to 5.2%. Supporting review artifacts cover 770 generated-text reviews, 318 labeled reviewer exports, a 10-case/61-rating calibration slice, and a 70-case/350-rating OCR benchmark; these artifacts calibrate rubric, guardrail, title-risk, and OCR-risk interpretation rather than substituting for the production ablation. The paper includes control setup, sample sizes, confidence intervals, paired tests, prompt-token categories, pseudocode, schema, rubric, guardrail taxonomy, and a constructed example so the evaluation structure can be reproduced without exposing proprietary evidence.

2026-05-31T22:17:44Z 7 pages. Production-evaluation case study of guardrailed LLM evidence-document generation Nataraj Agaram Sundar Tejas Morabia http://arxiv.org/abs/2606.01440v1 Understanding Cross-Cloud Interconnects: Hands-On Measurements and Cost Optimization 2026-05-31T20:19:12Z

New services such as Google Cross-Cloud Interconnect (CCI) address the rise in fast and large-scale cross-cloud data transfers. CCI offers dedicated high-throughput links with low per-GB transfer costs, but also involves high fixed leasing fees and multi-day provisioning delays. This combination makes cost optimization difficult because traffic patterns are unpredictable. This paper presents the first comprehensive study of CCI-like services. We begin with an empirical characterization of CCI and its alternatives using direct measurements across AWS-GCP interconnects. We then introduce ToggleCCI, a new dynamic cost-optimization algorithm designed to handle provisioning delays and uncertainty in future demand. ToggleCCI adapts by switching between VPN and CCI based on cost trends observed over a sliding time window. We prove that ToggleCCI achieves asymptotic optimality under sustained high-demand or low-demand regimes. Finally, using real-world traffic traces, we show that ToggleCCI consistently tracks the best static policy for each scenario and delivers substantial cost savings.
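The sliding-window switching idea can be illustrated with a toy rule. This is a minimal sketch under stated assumptions, not the ToggleCCI algorithm itself: the function name `toggle_policy`, the cost model (a per-GB VPN price versus a per-GB CCI price plus a fixed lease per time step), and all parameter names are hypothetical, and the abstract does not specify the actual decision rule.

```python
from collections import deque


def toggle_policy(demand_gb, window, delay, vpn_per_gb, cci_per_gb, cci_lease_per_step):
    """Return the link used at each time step: 'vpn' or 'cci'.

    Hypothetical rule: compare what the last `window` steps of traffic would
    have cost on each link, and request CCI when it would have been cheaper.
    A requested CCI link becomes usable only after `delay` further steps.
    """
    recent = deque(maxlen=window)   # demand seen in the sliding window
    ready_at = None                 # step at which a requested CCI link is usable
    used = []
    for step, gb in enumerate(demand_gb):
        active = "cci" if ready_at is not None and step >= ready_at else "vpn"
        used.append(active)

        recent.append(gb)
        vpn_cost = vpn_per_gb * sum(recent)
        cci_cost = cci_per_gb * sum(recent) + cci_lease_per_step * len(recent)

        if cci_cost < vpn_cost:
            if ready_at is None:            # start provisioning once
                ready_at = step + 1 + delay
        else:
            ready_at = None                 # release (or cancel) the CCI link
    return used


if __name__ == "__main__":
    # Six busy steps followed by four quiet ones (illustrative numbers only).
    trace = [100] * 6 + [1] * 4
    print(toggle_policy(trace, window=2, delay=2,
                        vpn_per_gb=0.10, cci_per_gb=0.02, cci_lease_per_step=5.0))
```

In this toy trace the link switches to CCI only after the provisioning delay has elapsed and falls back to VPN one step after demand drops, which is the lag a windowed rule has to tolerate.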

2026-05-31T20:19:12Z Accepted to IEEE CLOUD 2026 Eitan Eliav Isaac Keslassy David Breitgand Dean H. Lorenz Avi Weit http://arxiv.org/abs/2606.01387v1 Fail-Closed Lowering of Resident KV Claims onto LLM Serving Runtimes 2026-05-31T18:23:21Z

LLM serving runtimes increasingly expose KV-cache primitives that resemble future-reuse controls: retention priority, TTL-like duration, host or storage offload, block events, active no-evict scheduling, and KV-aware routing. This paper argues that such primitives are weaker than accepted future-KV obligations. A runtime can expose priority, offload, events, and routing without accepting responsibility for a future reuse claim. We study ResidentClaim lowering: when a runtime primitive, trusted adapter, or patch can be treated as satisfying an accepted claim about future KV reuse. A conformant lowering must bind behavior to accepted claim identity, a materialization predicate, ordered lifecycle events, and claim-scoped outcomes. We contribute a fail-closed lowering relation, checker, descriptor format, and bad-lowering suite that classify runtime/mode mappings as native conformance, adapter-observational evidence, adapter-policy evidence under controlled pressure, approximation substrate, rejected mapping, or unknown evidence. The checker validates manually curated, anchored runtime descriptors against obligation bundles; it does not prove that unaudited runtime behavior is complete. Public TensorRT-LLM, SGLang/HiCache, and Dynamo expose strong substrates and selected adapter positives, but not native ResidentClaim conformance. The positive systems witness is a local patched vLLM connector/scheduler-boundary mechanism: claim metadata flows through real in-process offload/load behavior, and controlled same-claim restoration failure reaches vLLM's invalid-KV-load path and becomes an ordered claim-scoped fail-closed outcome. The result is a calibrated semantics boundary, not a production performance claim or a compatibility survey.

2026-05-31T18:23:21Z 24 pages, 2 figures. Public artifact: https://github.com/gustavgauge/resident-kv-lowering-artifact Lukas Stepanek http://arxiv.org/abs/2606.01386v1 GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning 2026-05-31T18:20:25Z

We present GuidaPA, a privacy-preserving chatbot for the Italian Public Administration (PA) trained via Federated Learning (FL) on documentation from two national PA platforms, SIGESON and SIDFORS. Our corpus includes approximately 8 pages of SIGESON manuals and 31 pages of SIDFORS manuals/FAQs; while this study uses public documentation as a safe proxy, the intended deployment extends to restricted internal sources (e.g., tickets, officer manuals, database extracts) that cannot be centrally pooled due to regulatory and organizational constraints. GuidaPA integrates role-based access control, secure client-side preprocessing, explicit monitoring of non-IID effects, and parameter-efficient federated fine-tuning of large language models. Using QLoRA (4-bit) over 15 federated rounds with an 80/20 train-test split per client, we evaluate answer quality with ROUGE, BLEU-4, and METEOR. The best federated model achieves ROUGE-1/2/L of 61.10/55.77/59.44, BLEU-4 of 45.02, and METEOR of 63.94, close to private centralized fine-tuning while keeping data on-site. Compared to the general-purpose baseline, domain fine-tuning improves ROUGE-1 from 41.45 to 62.18 and BLEU-4 from 26.97 to 50.90. Overall, the results indicate that FL can deliver high-quality conversational AI for public services without centralized data sharing.

2026-05-31T18:20:25Z Accepted to the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026) Daniel M. Jimenez-Gutierrez Albenzio Cirillo Raffaele Nicolussi Alessio Beltrame Andrea Vitaletti http://arxiv.org/abs/2603.11797v2 The Carnot Bound: Limits and Possibilities for Bandwidth-Efficient Consensus 2026-05-31T16:27:59Z

In leader-based State Machine Replication (SMR), the leader's outgoing bandwidth is a natural throughput bottleneck. Erasure coding can alleviate this by letting the leader send each processor one fragment of each block rather than a full copy. The data expansion rate, the ratio of total data sent to payload size, determines how close throughput can get to network bandwidth. We investigate the fundamental limits of bandwidth-efficient leader-based consensus. We prove that protocols with 2-round finality (one voting round) cannot achieve a data expansion rate below approximately~$2.5$, matching existing protocols. Protocols with 3-round finality (two voting rounds) can do significantly better: the second voting round provides a recovery mechanism, letting leaders attempt aggressive erasure codes and safely fall back to conservative ones when reconstruction fails, without compromising consistency. We present two 3-round protocols realising this. Carnot~1 solves Extractable SMR, in which any correct processor can efficiently reconstruct any finalised block from fragments held by correct processors, but processors need not hold full blocks locally; this suffices for settings such as data availability layers. Carnot~1 assumes $n \geq 4f+1$ (at most $f$ Byzantine) and requires no fragment dissemination beyond the initial messages. Carnot~2 solves full SMR, where every correct processor eventually receives every finalised transaction. It operates under optimal resilience $n \geq 3f+1$, at the cost of additional fragment dissemination when Byzantine processors interfere. Both protocols support stable leaders. Under favourable conditions, leaders can use expansion rates approaching $1$; under adversarial conditions, they revert to safe rates of approximately $1.33$ and $1.5$, respectively, both well below the $2.5$ lower bound for 2-round finality.
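As a worked illustration of the data expansion rate, suppose the leader uses a $k$-of-$n$ erasure code and sends each of the $n$ processors one fragment of size $|B|/k$ for a block $B$. The choice $k = n - f$ below is an assumption made here for illustration (the abstract does not state the code parameters); under it, the arithmetic reproduces the quoted safe rates.

```latex
\[
\text{expansion rate}
\;=\; \frac{\text{total data sent by the leader}}{\text{payload size}}
\;=\; \frac{n \cdot |B|/k}{|B|}
\;=\; \frac{n}{k}.
\]
\[
n = 4f+1,\ k = 3f+1:\quad \frac{4f+1}{3f+1} \xrightarrow{\,f \to \infty\,} \frac{4}{3} \approx 1.33,
\qquad
n = 3f+1,\ k = 2f+1:\quad \frac{3f+1}{2f+1} \xrightarrow{\,f \to \infty\,} \frac{3}{2} = 1.5 .
\]
```

A more aggressive code (larger $k$) pushes $n/k$ toward $1$, at the price of needing more fragments to reconstruct a block.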

2026-03-12T10:59:35Z Andrew Lewis-Pye Patrick O'Grady http://arxiv.org/abs/2606.01232v1 Residual-Weighted Randomized Jacobi: Sharpened Bounds via Residual Concentration and Asynchronous Extension 2026-05-31T13:29:29Z

We study randomized stationary methods for symmetric positive definite linear systems in which component $j$ is selected with probability proportional to $|r_j|^\ell$. This power-weighted family interpolates continuously between uniform randomized Jacobi as $\ell \to 0$ and Gauss--Southwell greedy relaxation as $\ell \to \infty$. For the central case $\ell = 2$, we sharpen the standard one-step convergence analysis using the inverse participation ratio (IPR) $ν^2(r) = n\|r\|_4^4/\|r\|_2^4$, which equals $1$ when the residual is uniform and grows toward $n$ as it concentrates. The resulting bound amplifies the expected per-step progress by exactly $ν^2$ over the uniform-sampling baseline. The IPR can be computed online at $O(n)$ cost and doubles as a per-iteration diagnostic. We extend the analysis to asynchronous power-weighted Jacobi via the Avron--Druinsky--Gupta framework, obtaining an epoch-based convergence theorem in which the IPR controls both the progress coefficient and the allowed-delay window. Numerical experiments on shared-memory hardware support the sharpened bound and show the IPR trajectory is essentially concurrency-insensitive. Unexpectedly, consistent-reads execution, the easier case for the ADG analysis, destabilizes power-weighted sampling at high concurrency while inconsistent reads remain stable; the same IPR that amplifies progress amplifies a thread-collision rate that inconsistent reads appear to absorb. We propose a feedback-damping mechanism and verify two predictions about its dependence on problem size.
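The two quantities defined above, the power-weighted selection probabilities and the inverse participation ratio, are simple to compute. The following is a minimal sketch that follows the formulas in the abstract; the function names `ipr` and `power_weights` are illustrative and not taken from the paper.

```python
def ipr(r):
    """Inverse participation ratio nu^2(r) = n * ||r||_4^4 / ||r||_2^4.

    Costs O(n): one pass for the sum of squares and the sum of fourth powers.
    """
    n = len(r)
    s2 = sum(x * x for x in r)
    s4 = sum(x ** 4 for x in r)
    if s2 == 0:
        raise ValueError("residual is zero; the IPR is undefined")
    return n * s4 / (s2 * s2)


def power_weights(r, ell):
    """Selection probabilities p_j proportional to |r_j| ** ell."""
    w = [abs(x) ** ell for x in r]
    total = sum(w)
    return [x / total for x in w]


if __name__ == "__main__":
    uniform = [1.0, -1.0, 1.0, -1.0]
    spiked = [0.0, 0.0, 3.0, 0.0]
    print(ipr(uniform))                        # uniform residual
    print(ipr(spiked))                         # fully concentrated residual
    print(power_weights([1.0, -1.0, 2.0], 2))  # the central case ell = 2
```

A uniform-magnitude residual gives an IPR of 1 and a residual concentrated on one component gives n, matching the two extremes described in the abstract; setting `ell = 0` recovers uniform sampling.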

2026-05-31T13:29:29Z Evan Coleman http://arxiv.org/abs/2606.01211v1 GPU Acceleration of Learning With Errors KEMs Using OpenACC for Post-Quantum Cryptography 2026-05-31T13:01:18Z

Shor's algorithm proved that asymmetric cryptographic protocols based on the integer factorization and discrete logarithm problems are no longer safe in a world with large-scale quantum computers. As a result, Post-Quantum Cryptography (PQC) has been developed over the last few years, seeking cryptographic primitives resistant to quantum attacks. One of the main hard problems underlying PQC schemes is the Learning with Errors (LWE) problem, which is significantly more computationally intensive than its classical predecessors. In this work, we present a Key Encapsulation Mechanism (KEM) based on plain LWE and develop a GPU-oriented implementation using OpenACC. We evaluate the performance of our accelerated application in terms of both time-to-solution and energy-to-solution, considering bare-metal and containerized executions across multiple NVIDIA GPU models and generations. Our implementation achieves significant acceleration across all tested GPU platforms. In particular, on the NVIDIA Grace Hopper Superchip, it attains up to a $208\times$ speedup over a multithreaded CPU baseline and enables the execution of problem sizes that are impractical on CPU architectures due to memory and synchronization constraints. Energy consumption analysis also shows $\approx 2\times$ better efficiency when using the Superchip compared to systems equipped with x86-based CPUs and NVIDIA H100 GPUs. These results highlight the effectiveness of GPU acceleration for computationally demanding LWE-based cryptographic workloads.

2026-05-31T13:01:18Z Pre-print version of the manuscript submitted to NUMTA2026 Tiziana Liberati Nitin Shukla Matteo Barbieri Gabriella Bettonte Elisabetta Boella Simone Rizzo Daniele Gregori Marco Pedicini http://arxiv.org/abs/2606.01161v1 AcOrch: Accelerating Sampling-based GNN Training under CPU-NPU Heterogeneous Environments 2026-05-31T11:08:51Z

Graph Neural Networks (GNNs) have achieved remarkable success in various applications. Sampling-based GNN training, which conducts mini-batch training on sampled subgraphs, has become a promising solution for large-scale graphs. Given the resource-intensive nature of sampling-based GNN training, Neural Processing Units (NPUs), such as the Ascend AI processor, offer a promising alternative due to their high throughput and energy efficiency, making them well-suited for GNN workloads. However, sampling-based training is multi-stage: it involves subgraph sampling, feature gathering, and model training, each with different resource requirements and computation volumes. This requires careful coordination to fully utilize the heterogeneous computation resources of CPUs and NPUs. In this work, we present AcOrch, a sampling-based GNN training system optimized for CPU-NPU heterogeneous platforms. AcOrch offers fine-grained task orchestration and adopts a two-level pipelined execution model to overlap sampling, gathering, and training. It analyzes the heterogeneous compute features of NPUs and maps tasks to AI Cube (AIC) units, AI Vector (AIV) units, and CPU cores accordingly. Moreover, the two-level pipeline enables overlapping execution not only between the CPU and NPU, but also among different types of compute units within the NPU (e.g., AIC and AIV units), thereby maximizing the utilization of available resources. Experiments on an Ascend 910B AI processor show that AcOrch achieves an average speedup of 2.31x over the state-of-the-art NPU-native graph learning system, MindSporeGL.

2026-05-31T11:08:51Z 18 pages, 22 figures, to be published in Frontiers of Computer Science Frontiers of Computer Science, 2027, 21(5): 2105103 Kefu Chen Xin Ai Qiange Wang Yanfeng Zhang Ge Yu 10.1007/s11704-025-50893-0 http://arxiv.org/abs/2606.01114v1 Magnum.np.distributed: Accelerating Finite Difference Micromagnetic Simulations with Multiple GPUs 2026-05-31T09:21:58Z

Micromagnetic simulations are essential tools in nanomagnetism and spintronics research. Although widely adopted solvers like Mumax3 and the Python-native magnum.np use GPU acceleration to improve performance, these tools are limited to single-device computation. In this work, we present the first Python-native multi-GPU micromagnetic framework by extending magnum.np with PyTorch Distributed. This leverages high-speed communication and computation across multiple GPUs while retaining the benefits of ease of installation, platform-agnostic design, and compatibility with Python. For computationally intensive demagnetisation effective-field calculations, we achieve a 7.0x speedup across 8 GPUs connected via NVLink, whereas the halo exchange required for Heisenberg exchange shows limited scaling due to kernel dispatch latency. We also demonstrate the framework's versatility by achieving a 6.8x speedup in demagnetisation field computation on CPU with NUMA pinning via the MPI backend of PyTorch Distributed. Faster turnaround times will enable researchers to explore larger, more complex systems and accelerate the design cycle for novel spintronic devices.

2026-05-31T09:21:58Z Tsz Chung Cheng Yuichiro Kurokawa Hiromi Yuasa http://arxiv.org/abs/2606.01065v1 Leyline: KV Cache Directives for Agentic Inference 2026-05-31T07:13:15Z

Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward-only eviction are correct by construction. Agentic LLMs break this assumption. Their conversations evolve through policy-driven editing: failed tool calls are retried, stale outputs dropped, trajectories pivoted. Two distinct cache problems result. First, identical content moves to new positions between turns, invalidating exact-prefix caches even though the underlying KV would still be valid; recent work on position-independent caching for MLA addresses this reuse problem. Second, and this paper's focus, a policy may need to direct the serving system to actively remove or replace a span of cached content and continue without re-prefilling everything that came after. No existing primitive offers this. Production agentic harnesses fall back to re-prefill on every edit, paying full prefix-recomputation cost; kernel-level eviction methods make their own decisions and cannot accept policy directives from outside the kernel. We introduce Leyline, a serving-side primitive that closes this gap. A declarative directive 4-tuple separates what to edit from how to preserve position correctness. The policy declares the edit and its mode (in-place splice or prefix-trimmed re-prefill for semantic forgetting); an architecture-agnostic interface routes to a per-architecture kernel that restores attention math via a closed-form RoPE-rotation correction. The splice kernel lifts replay cache-hit by +11.2 pp and cuts latency by up to 241 ms. A ten-line truncation rule routed through the same interface lifts agentic solve rate by +14.3 pp on debug-gym. The mechanism is open; the policy space it enables is the agenda.
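Why a closed-form correction is possible at all follows from a standard property of rotary position embeddings: RoPE rotations compose, so a key cached at position p can be moved to position p + delta by applying one more rotation by delta, without recomputing the key. The sketch below demonstrates only that property. It is not Leyline's kernel; the function name `rope_rotate`, the interleaved-pair layout, and the base of 10000 are assumptions chosen for illustration, and real architectures may use a different layout.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply RoPE to `vec` (even length) at `position`.

    Assumes the interleaved-pair convention: dimensions (2k, 2k+1) are
    rotated together by the angle position * base ** (-2k / d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


if __name__ == "__main__":
    key = [0.3, -1.2, 0.7, 2.0]
    cached = rope_rotate(key, 17)        # key as stored in the cache at position 17
    shifted = rope_rotate(cached, -5)    # correction after 5 earlier tokens are spliced out
    direct = rope_rotate(key, 12)        # what a full re-prefill would have produced
    print(max(abs(a - b) for a, b in zip(shifted, direct)))
```

The printed difference is at floating-point noise level, which is why an edit that shifts later tokens can in principle be repaired by re-rotating cached keys rather than re-prefilling them.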

2026-05-31T07:13:15Z Bole Ma Jan Eitzinger Harald Koestler http://arxiv.org/abs/2601.09037v2 Probabilistic Computers for MIMO Detection: From Sparsification to 2D Parallel Tempering 2026-05-31T02:27:56Z

Probabilistic computers built from p-bits offer a promising path for combinatorial optimization, but the dense connectivity required by real-world problems scales poorly in hardware. Here, we address this through graph sparsification with auxiliary copy variables and demonstrate two fully on-chip parallel tempering solvers on an FPGA. Targeting MIMO detection, a dense, NP-hard problem central to wireless communications, we first fit 11 temperature replicas of a 128-node sparsified system (1,408 p-bits) on-chip and achieve bit error rates significantly below conventional linear detectors on $64 \times 64$ BPSK MIMO. We report complete end-to-end solution times of 3~ms per instance, including all loading, sampling, readout, and verification overheads. ASIC projections in 7~nm technology indicate 103~MHz operation at 285.8~mW, suggesting that massive parallelism across multiple chips could approach the throughput demands of next-generation wireless systems. Sparsification, however, introduces a sharp sensitivity to the copy-constraint strength $P$ that requires manual tuning. To eliminate this bottleneck, we utilize Two-Dimensional Parallel Tempering (2D-PT), which exchanges replicas across both temperature ($β$) and constraint ($P$) dimensions. On Sherrington--Kirkpatrick spin glasses, 2D-PT converges roughly $250\times$ faster than optimally tuned 1D-PT, and on $128 \times 128$ MIMO it reaches zero bit errors at high SNR where 1D-PT exhibits an error floor. We further validate 2D-PT entirely on-chip with 54 replicas (1,728 p-bits) on a $16 \times 16$ MIMO instance, where it tracks the maximum-likelihood bound in just 50 Monte Carlo steps -- $10\times$ fewer than 1D-PT -- at projected 111~MHz and 124~mW in 7~nm. Together, these results establish an on-chip p-bit architecture and a scalable, tuning-free algorithmic framework for dense combinatorial optimization.
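For readers unfamiliar with replica exchange, the sketch below shows the textbook Metropolis swap rule along each of the two axes mentioned above. It assumes each replica samples from exp(-beta * (E(x) + P * C(x))), where C(x) is a copy-constraint violation count; that energy form and the function names are assumptions for illustration, and the abstract does not state the exact acceptance rule used on-chip.

```python
import math


def _accept(exponent):
    """min(1, exp(exponent)) without overflowing for large positive exponents."""
    return 1.0 if exponent >= 0 else math.exp(exponent)


def temperature_swap_prob(beta_a, beta_b, energy_a, energy_b):
    """Swap probability for two replicas that differ only in inverse temperature."""
    return _accept((beta_a - beta_b) * (energy_a - energy_b))


def penalty_swap_prob(beta, p_a, p_b, violation_a, violation_b):
    """Swap probability for two replicas at the same beta but different P.

    Only the penalty term P * C(x) differs between the two replicas, so the
    problem energy E(x) cancels out of the acceptance ratio.
    """
    return _accept(beta * (p_a - p_b) * (violation_a - violation_b))


if __name__ == "__main__":
    # Colder replica (beta = 2) already holds the lower energy: swap is unlikely.
    print(temperature_swap_prob(2.0, 1.0, -5.0, -3.0))
    # Colder replica holds the higher energy: swap is always accepted.
    print(temperature_swap_prob(2.0, 1.0, -3.0, -5.0))
    # Strongly constrained replica (P = 4) is violation-free: swap is unlikely.
    print(penalty_swap_prob(1.0, 4.0, 1.0, 0, 2))
```

Exchanging along the P axis lets a configuration found under a weak constraint migrate to a replica where the constraint is enforced strongly, which is one way to read the claim that 2D-PT removes the need to hand-tune P.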

2026-01-14T00:01:58Z MMHS and KC-C are equally contributing first authors M Mahmudul Hasan Sajeeb Kevin Callahan-Coray Corentin Delacour Sanjay Seshan Tathagata Srimani Kerem Y. Camsari http://arxiv.org/abs/2606.00946v1 Lodestar: An Online-Learning LLM Inference Router 2026-05-31T01:31:02Z

Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency such as time-to-first-token (TTFT) and for GPU utilization. However, LLM request routing, that is, assigning each inference request to a GPU instance, is particularly challenging: execution is highly input-dependent; batching and KV-cache reuse create strong cross-request coupling; and latency responds nonlinearly to context length, model/engine settings, and heterogeneous accelerators. As a result, simple traditional load-balancing algorithms, and even heuristics tailored for LLM inference, fail to achieve good performance. We present Lodestar, a novel learning-based request routing system for distributed GPU clusters. Lodestar continuously collects a snapshot of the cluster at the per-request level, including real-time instance state, request characteristics, and observed performance, and trains an online reward predictor that it uses to route each inference request to the instance that will maximize a given reward (e.g., minimizing TTFT). Lodestar is cloud-native and works seamlessly with existing serving stacks (vLLM). With continuous online adaptation to changing workloads and infrastructure conditions, Lodestar achieves 1.41x lower average TTFT and 1.47x lower P99 TTFT on average (up to 2.15x/1.86x on homogeneous and 4.38x/4.42x on heterogeneous clusters) compared to a state-of-the-art prefix-cache-aware and load-aware heuristic, and learns these efficient routing strategies within about 5 minutes, based on experiments in a public cloud GPU cluster.

2026-05-31T01:31:02Z Gangmuk Lim Wanyu Zhao Brighten Godfrey Jiaxin Shan Le Xu Liguang Xie http://arxiv.org/abs/2512.04449v2 AXLE: Coordinated Offloading with Asynchronous Back-Streaming in Computational Memory Systems 2026-05-30T21:46:08Z

CXL-based Computational Memory (CCM) enables near-memory processing within expanded remote memory, offering opportunities to address data movement costs in disaggregated memory systems and to accelerate overall performance. However, existing offloading mechanisms do not fully leverage the trade-offs of different offload models based on different CXL protocols. This work first examines these trade-offs and their impact on end-to-end performance and system efficiency for workloads with diverse data and computation characteristics. We propose Asynchronous Back-Streaming, a new offloading protocol that coordinates CXL.io and CXL.mem to enable result back-streaming and asynchronous pipelining across CCM and host tasks. We further design AXLE, a system that realizes this protocol with lightweight host-CCM interaction. Overall, AXLE reduces end-to-end runtime by up to 50.14%, reduces CCM and host idle times by an average of 14.53x and 3.93x, respectively, and achieves up to a 6x reduction in host core stall time.

2025-12-04T04:43:04Z To appear at the International Symposium on Computer Architecture (ISCA) 2026 Suyeon Lee Kangkyu Park Kwangsik Shin Ada Gavrilovska http://arxiv.org/abs/2507.13833v4 DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training 2026-05-30T14:24:59Z

Effectively scaling Reinforcement Learning (RL) is crucial for enhancing the reasoning and alignment of Large Language Models. The massive data and complex execution flows inherent in these tasks require a distributed architecture capable of efficient scaling. However, to simplify programming and dependency management, mainstream frameworks often rely on a centralized architecture where a single node dispatches both control and data. This inherent coupling creates significant communication bottlenecks, severely limiting system scalability and efficiency. We present DISTFLOW, a novel, fully distributed RL framework that adopts a multi-controller paradigm. By decoupling data transmission from control dispatch, DISTFLOW establishes a parallelism-aware, decentralized Data Coordinator that leverages local caching, load balancing, and asynchronous double buffering to minimize communication overhead and mitigate straggler effects. For control logic, it introduces a task scheduler built upon a Directed Acyclic Graph (DAG) that facilitates fine-grained, independent execution. Experimental results demonstrate that DISTFLOW achieves near-linear scalability up to 512 GPUs and delivers up to a 2.63x throughput improvement over state-of-the-art (SOTA) frameworks. The source code is available at: https://github.com/sii-research/siiRL.

2025-07-18T11:41:49Z Zhixin Wang Jiaming Xu Tianyi Zhou Mingjun Zhang Liming Liu Jiarui Hu Dian Yang Tongyu Wang Ping Zhang Jinlong Hou Siyuan Feng Yuan Qi Yuan Cheng http://arxiv.org/abs/2606.00735v1 ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving 2026-05-30T13:57:09Z

In distributed Mixture-of-Experts (MoE) inference, input-dependent token routing interacts with GPU performance variability to create persistent stragglers under synchronized execution, where the slowest GPU determines layer latency. This performance variability is inherent to modern accelerators: manufacturing variation, power limits, and thermal conditions introduce measurable execution-time differences across nominally identical GPUs. The core challenge is that MoE execution-time imbalance arises from the interaction of workload skew and hardware asymmetry. Token routing produces uneven and layer-varying expert loads, while GPU throughput depends on device-specific operating characteristics and workload intensity. Prior work mitigates routing skew but assumes homogeneous hardware, optimizing token balance rather than execution latency. As a result, even balanced token assignments can leave hardware-induced stragglers unaddressed. Thus, we propose Variability-Informed Binning of Experts (ViBE), a hardware-aware expert placement framework that minimizes execution-time imbalance across GPUs. ViBE combines per-GPU performance modeling with expert activation profiling to assign high-load experts to faster devices and low-load experts to slower ones, reducing layer-level stragglers without modifying model semantics or hardware. Because both workload characteristics and effective GPU throughput can shift across serving conditions, ViBE supports lightweight recalibration under workload/performance drift to refresh its routing and performance estimates when needed. Results show that ViBE consistently reduces execution-time imbalance and improves SLO attainment by 14%, while lowering P90 TTFT by up to 45%. We further show that the impact of hardware variability increases at scale, making variability-aware placement important for efficient, high-utilization LLM serving.
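The placement idea, heavier experts on faster devices so that estimated execution time rather than token count is balanced, can be sketched with a greedy longest-processing-time heuristic. This is a stand-in for illustration, not ViBE's actual algorithm: the function name `place_experts`, the scalar per-GPU speed model, and the absence of per-GPU capacity limits are all assumptions.

```python
def place_experts(expert_load, gpu_speed):
    """Greedily assign experts to GPUs to reduce the slowest GPU's finish time.

    expert_load[i] is the profiled token load of expert i; gpu_speed[g] is the
    relative throughput of GPU g (hypothetical scalar model). Returns the
    expert-to-GPU placement and the estimated finish time of each GPU.
    """
    placement = {}
    assigned = [0.0] * len(gpu_speed)
    # Visit experts from heaviest to lightest.
    for e in sorted(range(len(expert_load)), key=lambda i: -expert_load[i]):
        # Pick the GPU whose estimated finish time after taking this expert is smallest.
        best = min(range(len(gpu_speed)),
                   key=lambda g: (assigned[g] + expert_load[e]) / gpu_speed[g])
        placement[e] = best
        assigned[best] += expert_load[e]
    times = [assigned[g] / gpu_speed[g] for g in range(len(gpu_speed))]
    return placement, times


if __name__ == "__main__":
    loads = [8, 4, 2, 2]      # illustrative per-expert token loads
    speeds = [2.0, 1.0]       # GPU 0 is twice as fast as GPU 1
    placement, times = place_experts(loads, speeds)
    print(placement, times)
```

With these numbers a token-balanced split (8 tokens per GPU) would leave the slow GPU finishing at time 8, whereas the speed-aware placement finishes at time 6, which is the gap between optimizing token balance and optimizing execution latency.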

2026-05-30T13:57:09Z Seokjin Go Marko Scrbak Ephrem Wu Srilatha Manne Divya Mahajan