http://arxiv.org/abs/2603.02376v2 CUCo: An Agentic Framework for Compute and Communication Co-design 2026-06-03T20:59:27Z

Computation and communication in distributed LLM training and inference are traditionally optimized in isolation. Expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design but require deep systems expertise and hardware-specific tuning. CUCo is an agentic framework that automates compute-communication co-design of CUDA kernels by combining a structured design-space formalization with two agents: a correctness-first fast-path agent that produces reliable baselines, and an evolution-driven slow-path agent that searches for high-performance strategies. CUCo achieves up to 1.57x speedup across four multi-GPU workloads and discovers a two-stream overlap strategy on a DeepSeek-V3 MoE layer that hides dispatch behind local compute, at an LLM inference cost under $10 per workload.

2026-03-02T20:35:50Z Yoga Sri Varshan Varadharajan Bodun Hu Saurabh Agarwal Aditya Akella http://arxiv.org/abs/2604.01489v2 CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe 2026-06-03T20:09:20Z

High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work: targeting CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement, and a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71$\times$ over PyTorch and outperforms the prior agentic baseline CudaForge (0.89$\times$) at comparable per-task generation cost. Code available at https://github.com/taratt/cutegen.git

2026-04-01T23:55:23Z Tara Saba Zhiyang Chen Jikai Jason Li Anne Ouyang Xujie Si Fan Long http://arxiv.org/abs/2603.27397v3 Benchmarking Quantum Computers via Protocols, Comparing Superconducting and Ion-Trap Quantum Technology 2026-06-03T17:40:39Z

Superconducting and ion-trap devices are both leading quantum architectures in the current quantum computing landscape, each with distinct characteristics and operational constraints. Understanding and measuring the underlying quantumness of these devices is essential for assessing their readiness for practical applications and guiding future progress and research. Building on earlier work (Meirom, Mor and Weinstein, arXiv:2505.12441), we utilize a benchmarking strategy suitable for comparing these two architectures by measuring "quantumness" directly on optimal sub-chips. Distinct from existing metrics, our approach employs rigorous binary fidelity thresholds derived from the classical limits of state transfer. This enables us to definitively establish quantum advantage of a designated sub-region. Here we apply this quality assurance methodology to platforms from both technologies. This comparison provides a protocol-based evaluation of quantumness advantage, revealing not only the strengths and weaknesses of each tested chip and its sub-chips but also offering a common language for their assessment. By abstracting away technical differences in the final result, we demonstrate a benchmarking strategy that bridges the gap between disparate quantum-circuit technologies, enabling fair performance comparisons and establishing a critical foundation for evaluating future claims of quantum advantage. This work was made possible by the policies of two companies that enable independent and objective assessment on their quantum computers and sub-chips. In the name of science, we encourage other companies to emulate the independent qubit availability and the fair pricing which allow researchers to perform such assessments.

2026-03-28T20:29:23Z 28 body pages, 10 appendix pages, 34 figures Nitay Mayo Tal Mor Yossi Weinstein http://arxiv.org/abs/2606.05081v1 Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs 2026-06-03T16:37:08Z

Modern GPUs have Tensor Cores (TCs) capable of extremely high-throughput matrix operations, yet graph algorithms remain difficult to accelerate because of their irregular and data-dependent execution patterns. This work presents BLEST, a TC-accelerated framework that reformulates Breadth-First Search (BFS) as a bit-level sparse matrix-vector computation while addressing the load imbalance, memory inefficiency, and synchronization overheads that limit prior approaches. BLEST introduces Binarized Virtual Slice Sets (BVSS), a graph representation that partitions work into balanced warp-level units and schedules only frontier-relevant regions of the graph. It further employs an optimized TC layout that maps neighbour checks onto binary MMA instructions without wasted outputs, reducing the number of required MMA calls by 8$\times$ compared with prior layouts. To mitigate atomic and cache bottlenecks, BLEST incorporates a lazy vertex-update scheme. We revisit the switching terminology for BFS and propose a mechanism that dynamically transitions from TCs to CUDA cores when it becomes more efficient. We also extend BLEST to multi-source BFS and closeness centrality workloads. Finally, we introduce a scalable graph reordering method that improves compression for scale-free-like graphs, while using RCM to improve locality for others. Across a broad set of real-world graphs, BLEST achieves average speedups of 22.0$\times$, 7.7$\times$, 8.1$\times$, and 5.9$\times$ over GAP, Gunrock, GSWITCH, and BerryBees, respectively, establishing a new BFS baseline on GPUs. Thanks to its high performance, BLEST can compute the exact closeness centralities of 65.6M vertices in a social network with 3.6B edges in an hour using 100 H100 GPUs.
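
The reformulation of BFS as a bit-level matrix-vector computation can be illustrated with a small CPU-only sketch. This is not BLEST's BVSS representation or its Tensor Core layout; it only shows the underlying idea that one BFS level is a Boolean "row AND frontier, then OR-reduce" step. The function names and the use of Python integers as bit vectors are assumptions made for illustration.

```python
def build_in_neighbour_rows(num_vertices, edges):
    """Row v is an int bitmask whose bit u is set when there is an edge u -> v."""
    rows = [0] * num_vertices
    for u, v in edges:
        rows[v] |= 1 << u
    return rows


def bfs_levels_bitwise(rows, source):
    """BFS where each level is a Boolean matrix-vector product over bitmasks."""
    levels = [-1] * len(rows)
    levels[source] = 0
    frontier = 1 << source   # bit vector of vertices in the current frontier
    visited = frontier
    depth = 0
    while frontier:
        depth += 1
        next_frontier = 0
        for v, row in enumerate(rows):
            already_seen = (visited >> v) & 1
            # "row AND frontier" is non-zero iff some in-neighbour of v is in the frontier
            if not already_seen and (row & frontier):
                next_frontier |= 1 << v
                levels[v] = depth
        visited |= next_frontier
        frontier = next_frontier
    return levels


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]  # vertex 5 is isolated
rows = build_in_neighbour_rows(6, edges)
print(bfs_levels_bitwise(rows, source=0))
```

Unreached vertices keep level -1. A GPU implementation would process many rows per instruction instead of looping over vertices one at a time.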

2026-06-03T16:37:08Z 15 pages, 5 figures, 8 tables, 5 algorithms Deniz Elbek Kamer Kaya http://arxiv.org/abs/2604.17709v2 DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models 2026-06-03T15:24:49Z

Existing work on large language model (LLM) decomposition mainly focuses on improving performance on downstream tasks but ignores the poor parallel inference performance that arises when scaling up the model size. To mitigate this performance issue, this paper introduces DeInfer, a high-performance inference system dedicated to parallel inference of decomposed LLMs. It combines multiple optimizations to maximize performance while remaining compatible with state-of-the-art optimization techniques. Extensive experiments evaluate DeInfer's performance; the results demonstrate its superiority, suggesting it can greatly facilitate the parallel inference of decomposed LLMs.

2026-04-20T01:47:48Z accepted by DAC'26, latest version fixes a minor mistake You-Liang Huang Xinhao Huang Chengxi Liao Zeyi Wen 10.1145/3770743.3804360 http://arxiv.org/abs/2605.01910v2 Stochastic Sparse Attention for Memory-Bound Inference 2026-06-03T14:38:34Z

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from the KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified and systematic sampling to design variance-reduced, GPU-friendly variants. Evaluated on Llama-3.1-8B-Instruct at 32k-token contexts, S$^2$ANTA matches baseline accuracy while achieving up to $1.5\times$ decode-step attention-kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada. In batched long-context generation, these kernel gains translate to up to $1.25\times$ end-to-end decode-latency speedup. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are complementary to upstream quantization, low-rank projection, KV-cache compression, and KV-cache selection methods. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git
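
The sampling idea in this abstract can be sketched in a few lines of plain Python. The sketch draws $S$ indices i.i.d. from the softmax distribution and averages the corresponding value rows, which has the exact attention output as its expectation. It uses simple i.i.d. sampling rather than the stratified or systematic variants mentioned above, and the function names are assumptions for illustration, not the authors' kernel API.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def exact_attention(scores, values):
    """Dense value aggregation: sum_i p_i * V[i]."""
    probs = softmax(scores)
    dim = len(values[0])
    return [sum(p * row[d] for p, row in zip(probs, values)) for d in range(dim)]


def sampled_attention(scores, values, num_samples, rng):
    """Estimate the same output by gathering sampled value rows and adding them."""
    probs = softmax(scores)
    picks = rng.choices(range(len(values)), weights=probs, k=num_samples)
    dim = len(values[0])
    acc = [0.0] * dim
    for i in picks:
        for d in range(dim):
            acc[d] += values[i][d]  # gather-and-add, no per-row multiply
    return [a / num_samples for a in acc]  # one scaling per output element


rng = random.Random(0)
scores = [0.1, 2.0, -1.0, 0.5]
values = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.2, 0.8]]
print(exact_attention(scores, values))
print(sampled_attention(scores, values, num_samples=2000, rng=rng))
```

The estimate fluctuates around the exact output, and its variance shrinks as `num_samples` grows, which is the motivation for variance-reduced sampling schemes.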

2026-05-03T14:44:14Z Code available at https://github.com/OPUSLab/SANTA ICML 2026 Kyle Lee Corentin Delacour Kevin Callahan-Coray Kyle Jiang Can Yaras Samet Oymak Tathagata Srimani Kerem Y. Camsari http://arxiv.org/abs/2606.04934v1 The local complexity of certifying parity 2026-06-03T14:24:47Z

In this paper, we consider the problem of locally certifying that the size of a network is even, or more generally, congruent to some fixed number. The parity property is one of the simplest global properties, and it plays an intriguing role in local certification. On the one hand, it is one of the simplest properties in cycles because it is equivalent to 2-colorability, and hence can be certified with a single bit. On the other hand, in general graphs, no non-trivial lower bound on the size of the certificates is known, and the known upper bound basically consists of certifying the \emph{exact} value of $n$. In addition, the nature of the problem makes all the known lower bound approaches fail. We uncover a surprising landscape for parity across different models and graph structures:
* In general graphs equipped with identifiers, when allowing verification radius 2, parity can be certified with a constant number of bits.
* But in the model of anonymous graphs and allowing verification radius only 1, parity requires $Ω(\log \log^*n)$ bits.
* Finally, in bounded expansion graph classes (such as bounded-degree graphs and planar graphs), the lower bound does not apply: in the same restricted model we can design a constant-size certification.
We introduce several new tools that we expect to be useful in other contexts, in particular ways to \emph{encode a parent at each node with a constant number of bits} (via implicit use of the IDs and conflict-free colorings) and a new lower bound technique, with complex topologies and higher-order Ramsey-type arguments.

2026-06-03T14:24:47Z Nicolas Bousquet Laurent Feuilloley Jorge Valenzuela Sébastien Zeitoun http://arxiv.org/abs/2512.10236v3 Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap 2026-06-03T13:48:53Z

Modern ML workloads demand distributing training and inference across multiple GPUs. However, these parallelization techniques often suffer from exposed critical-path communication, leaving a potential 1.7x speedup on the table through compute-communication overlap. Prior overlapping methods harness the fact that ML model state and inputs are already sharded across GPUs, and overlap the compute and communication at shard granularity. However, such coarse-grained overlap suffers from limited network topology support and suboptimal dataflows. In this work, we instead make a case for finer-grain compute-communication overlap, which we term FiCCO. FiCCO operates one level deeper than traditional sharding, unlocks overlap for a wider set of network topologies, and enables finer-grain dataflow. We show that FiCCO opens up a wider design space of execution schedules than is possible at shard level alone. To walk the design space of schedules, we study and characterize the performance inefficiencies of overlap and overlay the schedules with the associated inefficiency signatures. Our characterization reveals decomposition- and contention-based slowdowns to be the major performance limiters, and we correlate the slowdown factors with the static compute/communication operator sizes. This helps us design heuristics (that frameworks and runtimes can harness) to select bespoke FiCCO schedules based on the nature of the underlying ML operations. Finally, to further minimize the contention inefficiencies inherent in operation overlap, we offload communication to GPU DMA engines. We evaluate several scenarios from realistic ML deployments and demonstrate that our proposed heuristics-driven bespoke schedules deliver up to 1.6x speedup. Further, our heuristics provide accurate guidance to pick the optimal schedule in 81% of unseen scenarios.

2025-12-11T02:43:27Z Shagnik Pal Shaizeen Aga Suchita Pati Mahzabeen Islam Lizy K. John http://arxiv.org/abs/2606.04687v1 Clownfish: Scaling DAG-based BFT Consensus via Sparse Edges 2026-06-03T10:11:50Z

Directed Acyclic Graph (DAG) based BFT protocols have demonstrated the capability to achieve significantly high throughput in practice. Recent advancements focused on minimizing the good-case latency of these protocols, approaching the theoretical lower bound. However, the high communication complexity inherent in existing DAG-based protocols limits their scalability. This primarily arises because each vertex in the DAG must include a linear number of edges (references) to vertices from previous rounds. We present Clownfish, a partially synchronous DAG-based BFT protocol designed to address the scalability bottleneck. Clownfish achieves lower communication complexity by selectively reducing the number of edges in DAG vertices. When using a communication-optimal consistent broadcast, Clownfish attains quadratic total communication complexity per round, outperforming prior DAG-based protocols. Clownfish also reduces the additional latency in failure cases by optimizing the round advancement rule. Additionally, Clownfish supports multiple leaders per round to reduce average latency while maintaining its lower communication complexity. Our experimental evaluation demonstrates that Clownfish provides significantly better scalability than existing DAG-based protocols.

2026-06-03T10:11:50Z Feifan Wang Jingfan Yu Zixi Cai Zhixuan Fang http://arxiv.org/abs/2606.04652v1 Rectangular Matrix Multiplication in the Low-Bandwidth Model 2026-06-03T09:22:49Z

We study rectangular matrix multiplication in the low-bandwidth model of distributed computing. There are $n$ computers; initially the input matrices are distributed evenly between computers, and in each communication round every computer can send and receive an $O(\log n)$-bit message. Eventually each computer must output its designated part of the product matrix. While prior work has focused primarily on square $n \times n$ multiplication under various sparsity assumptions, we study rectangular instances with no sparsity assumption. We denote by $\langle a,b,c\rangle$ the task of multiplying an $a\times b$ matrix by a $b\times c$ matrix in this model. We concentrate on two natural aspect ratios, $\langle n,d,n\rangle$ and $\langle d,n,d\rangle$, for $d \le n$, and we study how the round complexity depends on $n$ and $d$. When $d \to n$, both $\langle n,d,n\rangle$ and $\langle d,n,d\rangle$ approach $\langle n,n,n\rangle$, which is the usual task of multiplying square matrices. If we consider multiplication over semirings, the current best upper bound in that case is $O(n^{4/3})$ rounds, and there is a trivial unconditional lower bound of $Ω(n)$. We show that for $\langle d,n,d\rangle$, we can achieve the complexity of $\tilde O(d^{4/3})$, which seems like a natural generalization of the upper bound $\tilde O(n^{4/3})$ when $d=n$. However, the case of $\langle n,d,n\rangle$ is fundamentally different, and also exhibits a phase transition. We show that for $d \le \sqrt{n}$, the complexity of $\langle n,d,n\rangle$ is $Θ(d \sqrt{n})$; we have matching upper and lower bounds. However, the behavior is genuinely different in the region $d \ge \sqrt{n}$, where the upper bound is $O(d^{2/3} n^{2/3})$.

2026-06-03T09:22:49Z Chetan Gupta Jukka Suomela Hossein Vahidi http://arxiv.org/abs/2606.04594v1 Ekka: Automated Diagnosis of Silent Errors in LLM Inference 2026-06-03T08:32:13Z

LLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process can introduce silent errors where output quality silently degrades without any explicit error signals. Diagnosing silent errors is notoriously difficult due to the substantial semantic gap between the high-level symptoms and the low-level root causes. We observe that diagnosis of silent errors can be effectively framed as a differential debugging problem by leveraging the existence of semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target and a reference framework. We constructed a benchmark of real-world silent errors from popular serving frameworks, where Ekka shows 80% pass@1 diagnosis accuracy and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors from serving frameworks, all of which have been confirmed by the developers.
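
The differential-debugging framing can be sketched as a search for the first intermediate state at which a target run departs from a reference run. The sketch assumes the two traces have already been aligned into matching `(name, values)` pairs, which in practice is likely the hard part; the function name, the trace format, and the tolerance are all hypothetical and are not taken from Ekka.

```python
def first_divergence(reference_states, target_states, tol=1e-3):
    """Return the name of the first aligned state that differs by more than tol, else None."""
    for (ref_name, ref_vals), (tgt_name, tgt_vals) in zip(reference_states, target_states):
        if ref_name != tgt_name or len(ref_vals) != len(tgt_vals):
            raise ValueError(f"traces are not aligned at {ref_name!r} / {tgt_name!r}")
        max_diff = max((abs(r - t) for r, t in zip(ref_vals, tgt_vals)), default=0.0)
        if max_diff > tol:
            return ref_name
    return None


reference = [("embed", [1.0, 2.0]), ("attn_0", [0.5, 0.25]), ("mlp_0", [3.0, 4.0])]
target = [("embed", [1.0, 2.0]), ("attn_0", [0.5, 0.35]), ("mlp_0", [3.2, 4.0])]
print(first_divergence(reference, target))
```

Reporting the earliest divergent state matters because later states differ only as a consequence of the first one.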

2026-06-03T08:32:13Z ICML 2026 Yile Gu Zhen Zhang Shaowei Zhu Xinwei Fu Jun Wu Yida Wang Baris Kasikci http://arxiv.org/abs/2606.01138v2 memorywire: A Vendor-Neutral Wire Format for Agent Memory Operations 2026-06-03T07:59:47Z

Agent-memory frameworks -- mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor -- each ship their own SDK, storage layout, and operational vocabulary. There is no shared wire format: every integration is bespoke, every migration rebuilds memory from scratch, and no framework ships a governance surface that lets a human review writes before they enter long-term storage. We present memorywire, a JSON-Schema 2020-12 wire format for five memory operations (remember, recall, forget, merge, expire) over four memory types (semantic, episodic, procedural, emotional), with a MemoryStore interface, a fan-out router, and an optional HITL governance channel. We describe an open-source reference implementation with five backend adapters (sqlite-vec, mem0, Letta, Cognee, pgvector); a microbenchmark on a 100-fact / 50-query labelled corpus (42 with non-empty gold ids + 8 no-match probes) achieving recall@5 = 1.000 on the 42 gold-id queries with ingest p50 = 37.8 ms and recall p50 = 40.6 ms; an adversarial-fusion experiment showing Reciprocal Rank Fusion holds recall@5 = 1.000 across a 1-of-N rank-0 injection sweep (K in {0, 5, ..., 50}) where max fusion collapses to 0.500 with 80% leak at K >= 5; and a 16-scenario cross-adapter conformance suite passing 68 of 80 cells with zero failures. The contribution is not a new algorithm; it is a packaging of established components (RRF, FSMs, STM/LTM consolidation, diff-and-approve workflows) into a vendor-neutral protocol with an empirically validated reference, positioned to compose with the Model Context Protocol rather than compete with it.
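
The contrast between Reciprocal Rank Fusion and max fusion under a rank-0 injection can be illustrated with a toy example. RRF scores each item as the sum of $1/(k + \text{rank})$ over the result lists that contain it; the constant `k = 60` used here is the conventional default from the RRF literature and is an assumption, since the abstract does not state the value used. The backend outputs are made up and do not reproduce the paper's experiment.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) across lists, ranks starting at 1."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))


def max_fuse(scored_lists):
    """Max fusion: keep each item's single highest backend score."""
    best = {}
    for scored in scored_lists:
        for item, score in scored:
            best[item] = max(score, best.get(item, float("-inf")))
    return sorted(best, key=lambda item: (-best[item], item))


honest = [("a", 0.9), ("b", 0.8), ("c", 0.7)]
injected = [("junk", 0.99), ("a", 0.5), ("b", 0.4)]  # one backend puts junk at rank 0
backends = [honest, honest, honest, injected]

print(rrf_fuse([[item for item, _ in backend] for backend in backends]))
print(max_fuse(backends))
```

In this toy case RRF keeps the injected item last because only one backend vouches for it, while max fusion promotes it to the top on the strength of a single inflated score.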

2026-05-31T10:18:56Z v2: title corrected from pre-launch name "AMP" to "memorywire"; abstract clarifies recall@5 = 1.000 is on the 42 gold-id queries (50 total; 8 no-match probes excluded). 17 pages, 1 figure, 6 tables. Code: github.com/mthamil107/memorywire. Companion to arXiv:2604.18248 (Prompt Injection Detection) Thamilvendhan Munirathinam http://arxiv.org/abs/2606.04446v1 D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models 2026-06-03T04:48:00Z

Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, resulting in a limited acceptance rate. Naively batching more draft candidate sequences only introduces a marginal improvement, as redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens. We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree, where the first diffusion drafter generates a block along with per-position confidence scores that are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery; the second variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass; the resulting shared-prefix candidates are jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.
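
The single-sequence limitation described above, where everything after the first mismatch is discarded, can be made concrete with a sketch of greedy draft verification. The code is not D^2SD's confidence-guided prefix tree or its cascade-attention verifier: it takes the target model's per-position choices as given inputs, and the function names are assumptions for illustration.

```python
def verify_draft(draft_tokens, target_tokens):
    """Accept draft tokens up to the first mismatch, then take the target's own token.

    target_tokens[i] is the target model's greedy choice at position i given the
    context plus draft_tokens[:i], so it has len(draft_tokens) + 1 entries.
    """
    accepted = []
    for position, token in enumerate(draft_tokens):
        if token != target_tokens[position]:
            break  # every later draft token is discarded
        accepted.append(token)
    # The target's token at the first mismatch (or the bonus token if all matched)
    accepted.append(target_tokens[len(accepted)])
    return accepted


def best_candidate(candidates):
    """Among several (draft, target) pairs, return the longest accepted run."""
    return max((verify_draft(draft, target) for draft, target in candidates), key=len)


single = ([5, 7, 9, 2], [5, 7, 3, 8, 1])     # mismatch at position 2
recovered = ([5, 7, 3, 8], [5, 7, 3, 8, 6])  # alternative continuation from prefix [5, 7]
print(verify_draft(*single))
print(best_candidate([single, recovered]))
```

The second candidate shares the prefix `[5, 7]` with the first and yields more accepted tokens, which is the benefit that well-placed alternative branches are meant to capture.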

2026-06-03T04:48:00Z Liyuan Zhang Jiarui Zhang Jinwei Yao Ran Yan Yuchen Yang Jiahao Zhang Tongkai Yang Yi Wu Binhang Yuan http://arxiv.org/abs/2606.01143v3 Schedule-Level Shared-Prefix Reuse for LLM RL Training 2026-06-03T04:16:23Z

GRPO-based LLM post-training commonly samples multiple trajectories from the same prompt and then trains on the resulting group. In long-context GRPO workloads, this shared prompt-side prefix can contain retrieved passages, visual tokens, tool schemas, system instructions, or task context, while the full rollout group is still too large to pack into one training microbatch. Standard dense trainers therefore recompute the same prefix forward and backward for every trajectory. We present a schedule-level reuse mechanism that decouples prefix and suffix computation. The schedule runs prefix forward once, executes suffixes as ordinary microbatches while reading prefix K/V and accumulating prefix-side gK/gV, and then runs prefix backward once on the accumulated gradient cache. This reordered schedule is equivalent to baseline training over real arithmetic and aligns numerically within finite-precision tolerance. Because only K/V and gK/gV are hot during suffix computation, the approach offloads dormant prefix activations, integrates with TP/EP/CP/PP and DP-style placement at the execution level, and preserves aux-loss-based MoE router semantics through logical prefix-token accounting. On dense Llama3-8B, Qwen3-8B, and MoE Qwen3-MoE-30B-A3B configurations, the schedule matches optimizer updates across TP/CP/PP/EP combinations, aligns on a 100-step real GRPO actor-update trace replay, reaches up to 4.395x speedup (2.930x under a conservative compile-on comparison) as prefix ratio and GRPO group size grow, and reduces Phase-B peak HBM by up to 59.1%, extending the Llama3-8B capacity frontier from 17,920 to 29,696 total tokens.

2026-05-31T10:24:10Z Pengbo Li Feiyuan Zhang Guangming Sheng Guangxin He Di Chai Ziniu Li Taiqiang Wu Wenyu Mao Binhang Yuan Kai Chen http://arxiv.org/abs/2606.04268v1 ACEAPEX: Parallel LZ77 Decoding via Encode-Time Absolute Offset Resolution 2026-06-02T22:44:22Z

LZ77-based codecs exhibit a fundamental sequential bottleneck in decoding: each back-reference depends on previously decompressed data, preventing multi-core scaling. We present ACEAPEX, a parallel LZ77 codec that stores all back-references as absolute positions in the decompressed output and organizes data into self-contained 1 MB blocks, enabling embarrassingly parallel block-level decoding. Integrated into lzbench, ACEAPEX achieves 10,160 MB/s on EPYC 4344P (8 cores) and 10,869 MB/s on EPYC 9575F for FASTQ genomic data -- up to 3.1x faster than zstd -3 at comparable compression ratios. We further implement a GPU wavefront decoder on NVIDIA H100 SXM, measuring 44.0 GB/s on enwik9 and 20.3 GB/s on FASTQ (wavefront match phase, BIT-PERFECT verified). With a depth-limited encoder variant (-1.5% ratio on enwik9), GPU throughput reaches 77.2 GB/s on a single H100 and 249.9 GB/s on two H100s in NVLink configuration. To our knowledge, this is the first reported GPU LZ77 decode with near-standard compression ratio verified byte-for-byte.
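
Why absolute positions and self-contained blocks remove the cross-block dependency can be shown with a toy decoder. The token format here, `("lit", bytes)` and `("copy", position, length)` with positions counted from the start of the block's own output, is a hypothetical stand-in and not ACEAPEX's bitstream. The thread pool only illustrates that blocks can be decoded independently; CPython threads will not reproduce the reported throughput.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_block(tokens):
    """Decode one self-contained block whose copies use absolute output positions."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out += token[1]
        else:
            _, position, length = token
            # Byte-by-byte copy so a match may overlap the bytes it is producing
            for offset in range(length):
                out.append(out[position + offset])
    return bytes(out)


def decode_blocks(blocks, workers=4):
    """Decode blocks independently and concatenate the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(decode_block, blocks))


blocks = [
    [("lit", b"abc"), ("copy", 0, 6)],  # overlapping copy repeats "abc" twice
    [("lit", b"xy"), ("copy", 1, 1)],   # refers only to this block's own output
]
print(decode_blocks(blocks))
```

Because no token refers to data outside its block, any block can be decoded without waiting for an earlier one, and `pool.map` preserves block order for the final concatenation.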

2026-06-02T22:44:22Z 6 pages, 5 tables Yakiv Shavidze 10.5281/zenodo.20440965