https://arxiv.org/api/e7/vkoZaocwP5JOPGEnRBqWnQGo 2026-06-10T01:49:19Z 28834 90 15 http://arxiv.org/abs/2606.05933v1 Beyond Greedy Chunking: SLO-Aware Sliding-Window Scheduling for LLM Inference 2026-06-04T09:36:40Z

With the rapid growth of interactive applications in large language model (LLM) online services, maintaining high system throughput while meeting user-perceived latency targets has become a key issue in inference scheduling. Existing LLM service systems rely on coarse-grained output constraints, which makes it difficult to handle resource contention among multiple requests and results in low resource utilization and limited support for fine-grained quality of service (QoS) differentiation. We present SlidingServe, a sliding-window-driven, SLO-aware scheduling system for online LLM inference. SlidingServe uses a lightweight batch latency predictor to estimate the execution time of a batch. Based on this estimate, SlidingChunker combines information from the current and next iterations to chunk dynamically, improving overall system throughput while maintaining strict QoS guarantees. A Multi-Level Priority Sorter orders candidate requests to balance fairness and efficiency. Additionally, when multiple requests within the same batch are at risk of violating their SLOs, BatchConstructor uses dynamic programming to select the set of requests to execute in the current round, mitigating the SLO violation risk of critical requests. Our evaluation demonstrates that SlidingServe can improve service capacity by up to 30% compared to advanced scheduling systems under various load conditions, and further reduces the SLO violation rate by 16%-53% under heavy-load inference mode.
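
The abstract above says only that a dynamic program selects which requests run in the current round; it does not give the formulation. As a minimal sketch of one way such a selection could be posed, the Python below treats it as a 0/1 knapsack: each candidate has a predicted latency cost and an urgency score, and the goal is the highest total urgency within a per-iteration latency budget. The function name `select_batch`, the integer millisecond costs, the urgency scores, and the budget are all illustrative assumptions, not SlidingServe's BatchConstructor.

```python
def select_batch(requests, budget_ms):
    """Pick a subset of requests maximizing total urgency within a latency budget.

    requests:  list of (request_id, cost_ms, urgency) tuples; cost_ms is a
               hypothetical predicted latency contribution (positive int).
    budget_ms: hypothetical per-iteration latency budget (non-negative int).
    Returns (total_urgency, tuple_of_chosen_ids).
    """
    # best[b] holds the best (urgency, ids) achievable with total cost <= b.
    best = [(0, ())] * (budget_ms + 1)
    for rid, cost, urgency in requests:
        # Iterate budgets downward so each request is used at most once.
        for b in range(budget_ms, cost - 1, -1):
            candidate = best[b - cost][0] + urgency
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + (rid,))
    return best[budget_ms]


if __name__ == "__main__":
    # Assumed toy workload: (id, predicted cost in ms, urgency score).
    candidates = [("a", 30, 5), ("b", 20, 4), ("c", 25, 4), ("d", 10, 1)]
    print(select_batch(candidates, budget_ms=50))
```

The table has one entry per millisecond of budget, so the cost is O(number of requests × budget); a real scheduler would likely quantize costs more coarsely to keep this cheap.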

2026-06-04T09:36:40Z Yuansheng Chen Yue Zhang Xuan Mo Weigang Wu Jialun Li http://arxiv.org/abs/2606.05914v1 IN2P3 Computing Center 2024 Workload Dataset 2026-06-04T09:20:57Z

This paper provides and analyzes a dataset detailing the characteristics and execution data of all jobs submitted to the IN2P3 Computing Center (Villeurbanne, France), a national research and support unit of the CNRS, in 2024. Compared to previously available datasets, the main added value of this contribution is the combination of an extended time interval, the inclusion of memory usage data, and its recency, on top of improving the diversity of dataset provenance. This allows researchers to simulate and evaluate scheduling algorithms on a real workload over a large time window. Thus, specificities due to seasonal, monthly, and weekly user behaviors can be taken into account, which is not possible with smaller or synthetic datasets. It is composed of 44M jobs submitted by 1k users running on a cluster of a maximum of 312 machines supporting 46k concurrent threads and providing 105 TB of RAM.

2026-06-04T09:20:57Z Guillaume Cochard Bertrand Simon http://arxiv.org/abs/2605.30507v2 A Virtual Processor brings back the Free Lunch 2026-06-04T08:39:06Z

This work introduces a self-optimizing virtual processor (VP) for numerical array programs that shifts parallelization from a manual developer task to a cooperative, agent-like runtime mechanism. Instead of relying on centralized task-graph scheduling, static compiler optimization, or explicitly annotated parallel constructs, the VP uses a decentralized network of cooperative execution segments, derived from the stream of numerical instructions and their data dependencies at runtime. Each segment makes only local decisions about when, where, and how to prepare and execute its computation, including task placement, kernel preparation, and data movement. No central scheduler or mapper instance determines the execution globally; instead, scheduling itself is parallelized and distributed over time, asynchronously and strictly dependency-driven. The overall execution strategy emerges from concurrently executing local segments, continuously responding to data availability, cost estimates, system state, hardware capabilities, and problem size. While preserving the sequential semantics of the program, our VP automatically exploits parallelism across large program regions rather than being limited to individual loop bodies, modules, or explicitly marked parallel sections; developers are not required to design or encode a parallelization strategy. The current VP primarily targets low-latency strong scaling on local heterogeneous hardware, covering workloads from small, latency-sensitive array operations to large data-parallel computations. The current implementation targets the predefined array instruction set of the ILNumerics ONAL domain-specific language, accessible at https://github.com/ILNumerics/ILNumerics.ONAL, while the underlying concept is applicable to general array-based numerical programming models such as MATLAB and NumPy.

2026-05-28T19:43:49Z 10 pages + appendix (3 pages), 7 figures, 4 benchmarks at https://github.com/hokb/decentralized-array-execution-artifacts2026 (GitHub) or https://doi.org/10.5281/zenodo.20407801 (DOI Zenodo) Haymo Kutschbach http://arxiv.org/abs/2506.19260v2 Topology-Aware Differential Privacy in Federated Learning 2026-06-04T07:20:02Z

Federated learning transmits only model updates to protect client data, and differentially private SGD (DP-SGD) bounds content-level leakage through those updates. Neither mechanism accounts for what the communication topology of the federation itself reveals. In cross-silo deployments, a passive adversary with knowledge of the topology and organisational structure has access to information channels that DP-SGD leaves entirely unaddressed. We formalise this threat and derive a principled defense. We introduce TADI (Topology-Aware Distributional Inference), a shadow-trained channel decomposition that isolates per-client leakage into parameter, structural, and organisational components via four channel ablations, and prove an additive per-client mutual-information bound separating a controllable mechanism term from an uncontrollable prior-coupling floor. From this bound we derive Fulcrum, a closed-form balanced min-max optimal noise allocation that strictly dominates uniform DP-SGD whenever the federation's leverage profile is asymmetric, and degenerates exactly to uniform DP-SGD when it is not, making it safe to adopt unconditionally. Evaluated on Fed-ISIC2019, Fed-Heart-Disease, and synthetic CIFAR-10 across six topology families, Fulcrum delivers privacy gains of up to 1.967 nats at no measurable utility cost. The TADI channel decomposition confirms that the parameter channel is bounded by DP-SGD across all settings, the prior-coupling channel is empirically attained under matched-prior conditions, and the bound is conservative in a deployment-favourable direction under realistic cross-silo threat models.

2025-06-24T02:42:08Z 16 pages, 6 figures, 2 tables. Data from the experiments and source code can be found here: https://doi.org/10.5281/zenodo.20507155 Murtaza Rangwala Richard O. Sinnott Rajkumar Buyya http://arxiv.org/abs/2606.07666v1 Hardware-aware Low-latency Quantum Compilation with Data-driven Lightweight Error Detection for Early Fault-Tolerant Systems 2026-06-04T04:30:28Z

Noisy intermediate-scale quantum (NISQ) processors are entering an early fault-tolerance regime where full quantum error correction carries prohibitive resource costs, yet lightweight error detection can meaningfully improve algorithmic success rates. Existing compilation and error-detection toolchains treat these concerns in isolation, with no principled way to balance detection overhead against success probability under latency constraints. We present an integrated hardware-aware compilation and data-driven quantum error-detection (QED) framework that jointly optimises qubit mapping, SWAP insertion, and syndrome-schedule placement via a noise-weighted cost function and a learned multi-objective scheduler. Simulation experiments on an HPC cluster using GPU-accelerated density-matrix simulation (NVIDIA cuQuantum SDK) across VQE, phase-estimation, and Grover benchmarks, three noise profiles, and circuit sizes of 6-20 qubits (depths 10-160), show that joint co-design raises algorithmic success probability by up to 68 percent (95 percent CI: 60 percent to 76 percent) over SABRE on an 8-qubit VQE instance with post-selection.

2026-06-04T04:30:28Z 16 pages, 15 figures, Springer LNCS format. Code available at https://github.com/Sumitchongder/quantum-hw-aware-pipeline Sumit Chongder Indian Institute of Technology Jodhpur http://arxiv.org/abs/2602.02987v2 Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control 2026-06-04T03:39:17Z

Large Language Models (LLMs) are rapidly becoming critical infrastructure for enterprise applications, driving unprecedented demand for GPU-based inference services. A key operational challenge arises from the two-phase nature of LLM inference: a compute-intensive \emph{prefill} phase that processes user input, followed by a memory-bound \emph{decode} phase that generates output tokens. When these phases share GPU resources, prefill tasks throttle the processing speed of concurrent decodes, creating state-dependent contention. This contention is further complicated by workload heterogeneity, as different applications exhibit vastly different input and output lengths. We develop a stochastic control framework for scheduling heterogeneous LLM workloads across large GPU clusters. We formulate LLM inference as a multiclass many-server queueing network with state-dependent service rates, grounded in empirical iteration-time measurements. We analyze the fluid approximation of this system and solve steady-state linear programs that characterize optimal resource allocation. We design gate-and-route policies that regulate prefill admission and decode routing, and prove that they are asymptotically optimal in the many-GPU limit under both bundled and separate token-pricing schemes. We further extend the framework to incorporate Service Level Indicators (SLIs) such as latency and fairness, providing a general approach to constrained scheduling. Numerical experiments calibrated to empirical iteration-time data demonstrate that our policies outperform standard serving heuristics.

2026-02-03T01:47:37Z Ruihan Lin Zezhen Ding Zean Han Jiheng Zhang http://arxiv.org/abs/2606.05642v1 PoCQ: Proof of Contribution Quality as a Lightweight Blockchain Consensus for Secure Federated Learning 2026-06-04T03:13:40Z

Decentralized Federated Learning (FL) removes reliance on centralized coordinators but remains vulnerable to model poisoning, unreliable validation, and high validation overhead. This paper introduces Proof of Contribution Quality (PoCQ), a blockchain-based consensus framework designed to secure decentralized FL through reputation-aware validation and aggregation. PoCQ evaluates client updates using cryptographic commitments and lightweight norm-based validation, enabling efficient detection of malicious contributions while limiting validation cost. A reputation-driven consensus mechanism dynamically adjusts the influence of participants based on their historical contribution quality, while the blockchain stores only compact audit metadata to preserve scalability. Extensive experiments under poisoning scenarios across three benchmark datasets demonstrate that PoCQ outperforms the strongest state-of-the-art methods, achieving accuracy gains of 34.1% on challenging medical datasets in highly non-iid settings and an 11% improvement in global average accuracy. In addition, PoCQ reduces validation time by 21.27% on average per round, highlighting its effectiveness in jointly enhancing robustness and efficiency for fully decentralized federated learning.

2026-06-04T03:13:40Z Sudad Abed Nasser Sabar Abdun Mahmood Mohammad Jabed Morshed Chowdhury http://arxiv.org/abs/2602.00898v2 Fast Sparse Matrix Permutation for Mesh-Based Direct Solvers 2026-06-04T01:26:29Z

We present a fast sparse matrix permutation algorithm tailored to linear systems arising from triangle meshes. Our approach produces nested-dissection-style permutations while significantly reducing permutation runtime overhead. Rather than enforcing strict balance and separator optimality, the algorithm deliberately relaxes these design decisions to favor fast partitioning and efficient elimination-tree construction. Our method decomposes permutation into patch-level local orderings and a compact quotient-graph ordering of separators, preserving the essential structure required by sparse Cholesky factorization while avoiding its most expensive components. We integrate our algorithm into vendor-maintained sparse Cholesky solvers on both CPUs and GPUs. Across a range of graphics applications, including single factorizations and repeated factorizations, our method reduces permutation time and improves the sparse Cholesky solve performance by up to 6.27x. Our code is available at https://github.com/BehroozZare/fast-permute.

2026-01-31T20:56:42Z SIGGRAPH 2026 In Proceedings of the SIGGRAPH 2026 Conference Papers, SIGGRAPH Conference Papers '26, New York, NY, USA, July 2026 Behrooz Zarebavami Ahmed H. Mahmoud Ana Dodik Changcheng Yuan Serban D. Porumbescu John D. Owens Maryam Mehri Dehnavi Justin Solomon 10.1145/3799902.3811189 http://arxiv.org/abs/2606.05518v1 Latent Reasoning Guidance for Parallel Code Translation 2026-06-03T23:45:49Z

Tackling complex coding tasks often requires autonomous agents and iterative repair pipelines. These increasingly rely on large amounts of test-time computation, often spending many decoding and repair steps before discovering whether a program compiles, runs, or validates. Executable parallel-code translation is an effective setting for earlier guidance because success is behavioral rather than textual. However, most guidance methods act only after complete programs or textual traces are decoded. This motivates the question: can latent reasoning provide an earlier intervention point, before the model commits to code? We study a test-time latent guidance method for this setting that trains a smaller Process Reward Model (PRM) over continuous latent prefixes and uses it to select among alternate hidden-state trajectories before final code decoding, separately from but compatible with post-decoding optimization. On a 76-task ParaTrans benchmark evaluation, latent PRM guidance improves mean validation rate from 32.89% with unguided latent reasoning to 42.1%, outperforming fine-tuned and vanilla baselines in the same setting. These gains persist under the same three-iteration repair loop. These results provide bounded evidence that useful alternative latent continuations exist and that PRM-scored latent branch selection can improve executable outcomes in this setting without retraining the main generative model.

2026-06-03T23:45:49Z Tomer Bitan Erel Kaplan Roee Bar-Yadin Lian Ghrayeb Le Chen Samyak Jhaveri Niranjan Hasabnis Gal Oren http://arxiv.org/abs/2606.05503v1 Bitcoin After Block Rewards 2026-06-03T22:58:41Z

Bitcoin's block reward is scheduled to decline to zero, raising concerns about whether the network can remain secure once miners rely solely on transaction fees. This paper seeks to identify the conditions under which large-scale and persistent deviation from honest mining can arise. We analyze and compare the payoffs of honest and deviating miners in a sequential decision model, and identify a deviation threshold $G_t$ at which honest mining ceases to be privately optimal. Around the 2024 Bitcoin halving, we show that current mining behavior does not exhibit large-scale or structural deviation. However, when the block reward is removed, the $G_t$ criterion implies that deviation can arise even with a very small fraction of transaction fees. Finally, we evaluate three protocol-level mechanisms: Base Fee, Fee Floor, and an adaptive maximum block size rule, and show that their combination raises the deviation threshold and mitigates incentive breakdown in a fee-only regime. These results provide a practical benchmark for assessing Bitcoin's security as block rewards disappear.

2026-06-03T22:58:41Z 30 pages, 9 figures Junhyuk Lee http://arxiv.org/abs/2606.05495v1 SET: Stream-Event-Triggered Scheduling for Efficient CUDA Graph Pipelines 2026-06-03T22:38:01Z

Achieving peak GPU performance remains a significant challenge because system throughput is constrained by host-device synchronization delays and kernel scheduling overheads, even with aggressive kernel optimizations and batch processing. Furthermore, existing approaches often underutilize hardware resources such as compute cores and copy engines due to scheduling overheads. To address these problems, we propose a CUDA runtime framework for task-parallel pipelines that minimizes synchronization overheads and the gaps between kernel executions. The proposed solution combines two innovations: (1) a multi-stream task-parallel pipeline programming model that leverages event-chaining and work-stealing mechanisms to fully utilize available hardware resources; (2) a graph-based execution flow with per-stream buffers to ensure memory safety for multiple in-flight jobs running concurrently. Extensive evaluations on representative real-world workloads show a 1.15--1.44X speedup and an 18--54% reduction in scheduling overheads compared to state-of-the-art CUDA graph baselines.

2026-06-03T22:38:01Z Accepted by Euro-Par 2026 Zhengxiong Li Tsung-Wei Huang Umit Ogras http://arxiv.org/abs/2603.02376v2 CUCo: An Agentic Framework for Compute and Communication Co-design 2026-06-03T20:59:27Z

Computation and communication in distributed LLM training and inference are traditionally optimized in isolation. Expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design but require deep systems expertise and hardware-specific tuning. CUCo is an agentic framework that automates compute-communication co-design of CUDA kernels by combining a structured design-space formalization with a correctness-first fast-path agent for reliable baselines and an evolution-driven slow-path agent for high-performance strategies. It achieves up to 1.57x speedup across four multi-GPU workloads and discovers a two-stream overlap strategy on a DeepSeek-V3 MoE layer that hides dispatch behind local compute, at an LLM inference cost under $10 per workload.

2026-03-02T20:35:50Z Yoga Sri Varshan Varadharajan Bodun Hu Saurabh Agarwal Aditya Akella http://arxiv.org/abs/2604.01489v2 CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe 2026-06-03T20:09:20Z

High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work: targeting CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement, and a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71$\times$ over PyTorch and outperforms the prior agentic baseline CudaForge (0.89$\times$) at comparable per-task generation cost. Code available at https://github.com/taratt/cutegen.git

2026-04-01T23:55:23Z Tara Saba Zhiyang Chen Jikai Jason Li Anne Ouyang Xujie Si Fan Long http://arxiv.org/abs/2603.27397v3 Benchmarking Quantum Computers via Protocols, Comparing Superconducting and Ion-Trap Quantum Technology 2026-06-03T17:40:39Z

Superconducting and ion-trap devices are both leading quantum architectures in the current quantum computing landscape, each with distinct characteristics and operational constraints. Understanding and measuring the underlying \underline{quantumness} of these devices is essential for assessing their readiness for practical applications and guiding future progress and research. Building on earlier work (Meirom, Mor and Weinstein, arXiv:2505.12441), we use a benchmarking strategy suited to comparing these two architectures by measuring "quantumness" directly on optimal sub-chips. Distinct from existing metrics, our approach employs rigorous binary fidelity thresholds derived from the classical limits of state transfer. This enables us to definitively establish the quantum advantage of a designated sub-region. Here we apply this quality assurance methodology to platforms from both technologies. This comparison provides a protocol-based evaluation of quantumness advantage, revealing not only the strengths and weaknesses of each tested chip and its sub-chips but also offering a common language for their assessment. By abstracting away technical differences in the final result, we demonstrate a benchmarking strategy that bridges the gap between disparate quantum-circuit technologies, enabling fair performance comparisons and establishing a critical foundation for evaluating future claims of quantum advantage. This work was made possible by the policies of two companies that enable independent and objective assessment of their quantum computers and sub-chips. In the name of science, we encourage other companies to emulate the independent qubit availability and fair pricing that allow researchers to perform such assessments.

2026-03-28T20:29:23Z 28 body pages, 10 appendix pages, 34 figures Nitay Mayo Tal Mor Yossi Weinstein http://arxiv.org/abs/2606.05081v1 Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs 2026-06-03T16:37:08Z

Modern GPUs have Tensor Cores (TCs) capable of extremely high-throughput matrix operations, yet graph algorithms remain difficult to accelerate because of their irregular and data-dependent execution patterns. This work presents BLEST, a TC-accelerated framework that reformulates Breadth-First Search (BFS) as a bit-level sparse matrix-vector computation while addressing the load imbalance, memory inefficiency, and synchronization overheads that limit prior approaches. BLEST introduces Binarized Virtual Slice Sets (BVSS), a graph representation that partitions work into balanced warp-level units and schedules only frontier-relevant regions of the graph. It further employs an optimized TC layout that maps neighbour checks onto binary MMA instructions without wasted outputs, reducing the number of required MMA calls by 8$\times$ compared with prior layouts. To mitigate atomic and cache bottlenecks, BLEST incorporates a lazy vertex-update scheme. We revisit the switching terminology for BFS and propose a mechanism that dynamically transitions from TCs to CUDA cores when it becomes more efficient. We also extend BLEST to multi-source BFS and closeness centrality workloads. Finally, we introduce a scalable graph reordering method that improves compression for scale-free-like graphs, while using RCM to improve locality for others. Across a broad set of real-world graphs, BLEST achieves average speedups of 22.0$\times$, 7.7$\times$, 8.1$\times$, and 5.9$\times$ over GAP, Gunrock, GSWITCH, and BerryBees, respectively, establishing a new BFS baseline on GPUs. Thanks to its high performance, BLEST can compute the exact closeness centralities of 65.6M vertices in a social network with 3.6B edges in an hour using 100 H100 GPUs.
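
To make the idea of BFS as a bit-level matrix-vector computation concrete, here is a minimal CPU sketch in Python. It assumes each row of a binary adjacency matrix is stored as an integer bitmask, so one frontier expansion becomes a bitwise AND of each row with the frontier bit-vector. It illustrates the general formulation only; it is not BLEST, does not use Tensor Cores or BVSS, and the helper names `build_rows` and `bfs_levels_bitset` are hypothetical.

```python
def build_rows(num_vertices, edges):
    """Build one bitmask per vertex: bit u of rows[v] is set if u is adjacent to v.

    Edges are treated as undirected in this sketch.
    """
    rows = [0] * num_vertices
    for u, v in edges:
        rows[v] |= 1 << u
        rows[u] |= 1 << v
    return rows


def bfs_levels_bitset(rows, source):
    """Return BFS levels from `source` (-1 for unreachable vertices)."""
    n = len(rows)
    levels = [-1] * n
    levels[source] = 0
    frontier = 1 << source   # bit-vector of vertices at the current depth
    visited = frontier       # bit-vector of all vertices seen so far
    depth = 0
    while frontier:
        depth += 1
        next_frontier = 0
        for v in range(n):
            # Binary "dot product": does v have any neighbour in the frontier?
            if not (visited >> v) & 1 and rows[v] & frontier:
                next_frontier |= 1 << v
                levels[v] = depth
        visited |= next_frontier
        frontier = next_frontier
    return levels


if __name__ == "__main__":
    rows = build_rows(6, [(0, 1), (1, 2), (2, 3), (0, 4)])
    print(bfs_levels_bitset(rows, source=0))
```

Each `rows[v] & frontier` test is the boolean analogue of one row-times-vector product, the kind of operation that binary matrix instructions can evaluate for many rows at once.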

2026-06-03T16:37:08Z 15 pages, 5 figures, 8 tables, 5 algorithms Deniz Elbek Kamer Kaya