https://arxiv.org/api/zOtIRDtcF7SgC2mmtv0b4G8qneo2026-06-10T02:58:38Z2883810515http://arxiv.org/abs/2603.02376v2CUCo: An Agentic Framework for Compute and Communication Co-design2026-06-03T20:59:27ZComputation and communication in distributed LLM training and inference are traditionally optimized in isolation; expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design but require deep systems expertise and hardware-specific tuning; CUCo is an agentic framework that automates compute-communication co-design of CUDA kernels by combining a structured design-space formalization with a correctness-first fast-path agent for reliable baselines and an evolution-driven slow-path agent for high-performance strategies, achieving up to 1.57x speedup across four multi-GPU workloads and discovering a two-stream overlap strategy on a DeepSeek-V3 MoE layer that hides dispatch behind local compute at an LLM inference cost under $10 per workload.2026-03-02T20:35:50ZYoga Sri Varshan VaradharajanBodun HuSaurabh AgarwalAditya Akellahttp://arxiv.org/abs/2604.01489v2CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe2026-06-03T20:09:20ZHigh-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work: targeting CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement, and a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71$\times$ over PyTorch and outperforms the prior agentic baseline CudaForge (0.89$\times$) at comparable per-task generation cost. Code available at https://github.com/taratt/cutegen.git2026-04-01T23:55:23ZTara SabaZhiyang ChenJikai Jason LiAnne OuyangXujie SiFan Longhttp://arxiv.org/abs/2603.27397v3Benchmarking Quantum Computers via Protocols, Comparing Superconducting and Ion-Trap Quantum Technology2026-06-03T17:40:39ZBoth Superconducting and Ion-Trap are leading quantum architectures common in the current landscape of the quantum computing field, each with distinct characteristics and operational constraints. Understanding and measuring the underlying \underline{quantumness} of these devices is essential for assessing their readiness for practical applications and guiding future progress and research. Building on earlier work (Meirom, Mor and Weinstein Arxiv 2505.12441), we utilize a benchmarking strategy applicable for comparing these two architectures by measuring "quantumness" directly on optimal sub-chips. Distinct from existing metrics, our approach employs rigorous binary fidelity thresholds derived from the classical limits of state transfer. This enables us to definitively establish quantum advantage of a designated sub-region. Here we apply this quality assurance methodology to platforms from both technologies. This comparison provides a protocol-based evaluation of quantumness advantage, revealing not only the strengths and weaknesses of each tested chip and its sub-chips but also offering a common language for their assessment. By abstracting away technical differences in the final result, we demonstrate a benchmarking strategy that bridges the gap between disparate quantum-circuit technologies, enabling fair performance comparisons and establishing a critical foundation for evaluating future claims of quantum advantage. This work was made possible by policies of two companies who enable independent and objective assessment on their quantum computers and sub-chips. In the name of science, we encourage other companies to emulate the independent qubit availability and the fair pricing which allow researchers to preform such assessments.2026-03-28T20:29:23Z28 body pages, 10 appendix pages, 34 figuresNitay MayoTal MorYossi Weinsteinhttp://arxiv.org/abs/2606.05081v1Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs2026-06-03T16:37:08ZModern GPUs have Tensor Cores (TCs) capable of extremely high-throughput matrix operations, yet graph algorithms remain difficult to accelerate because of their irregular and data-dependent execution patterns. This work presents BLEST, a TC-accelerated framework that reformulates Breadth-First Search (BFS) as a bit-level sparse matrix-vector computation while addressing the load imbalance, memory inefficiency, and synchronization overheads that limit prior approaches. BLEST introduces Binarized Virtual Slice Sets (BVSS), a graph representation that partitions work into balanced warp-level units and schedules only frontier-relevant regions of the graph. It further employs an optimized TC layout that maps neighbour checks onto binary MMA instructions without wasted outputs, reducing the number of required MMA calls by 8$\times$ compared with prior layouts. To mitigate atomic and cache bottlenecks, BLEST incorporates a lazy vertex-update scheme. We revisit the switching terminology for BFS and propose a mechanism that dynamically transitions from TCs to CUDA cores when it becomes more efficient. We also extend BLEST to multi-source BFS and closeness centrality workloads. Finally, we introduce a scalable graph reordering method that improves compression for scale-free-like graphs, while using RCM to improve locality for others. Across a broad set of real-world graphs, BLEST achieves average speedups of 22.0$\times$, 7.7$\times$, 8.1$\times$, and 5.9$\times$ over GAP, Gunrock, GSWITCH, and BerryBees, respectively, establishing a new BFS baseline on GPUs. Thanks to its high performance, BLEST can compute the exact closeness centralities of 65.6M vertices in a social network with 3.6B edges in an hour using 100 H100 GPUs.2026-06-03T16:37:08Z15 pages, 5 figures, 8 tables, 5 algorithmsDeniz ElbekKamer Kayahttp://arxiv.org/abs/2604.17709v2DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models2026-06-03T15:24:49ZExisting works on large language model (LLM) decomposition mainly focus on improving performance on downstream tasks, but they ignore the poor parallel inference performance when trying to scale up the model size. To mitigate this important performance issue, this paper introduces DeInfer, a high-performance inference system dedicated to parallel inference of decomposed LLMs. It consists of multiple optimizations to maximize performance and be compatible with state-of-the-art optimization techniques. Extensive experiments are carried out to evaluate DeInfer's performance, where the results demonstrate its superiority, suggesting it can greatly facilitate the parallel inference of decomposed LLMs.2026-04-20T01:47:48Zaccepted by DAC'26, latest version fixs a minor mistakeYou-Liang HuangXinhao HuangChengxi LiaoZeyi Wen10.1145/3770743.3804360http://arxiv.org/abs/2605.01910v2Stochastic Sparse Attention for Memory-Bound Inference2026-06-03T14:38:34ZAutoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified and systematic sampling to design variance-reduced, GPU-friendly variants. Evaluated on Llama-3.1-8B-Instruct at 32k-token contexts, S$^2$ANTA matches baseline accuracy while achieving up to $1.5\times$ decode-step attention-kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada. In batched long-context generation, these kernel gains translate to up to $1.25\times$ end-to-end decode-latency speedup. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are complementary to upstream quantization, low-rank projection, KV-cache compression, and KV-cache selection methods. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git2026-05-03T14:44:14ZCode available at https://github.com/OPUSLab/SANTAICML 2026Kyle LeeCorentin DelacourKevin Callahan-CorayKyle JiangCan YarasSamet OymakTathagata SrimaniKerem Y. Camsarihttp://arxiv.org/abs/2606.04934v1The local complexity of certifying parity2026-06-03T14:24:47ZIn this paper, we consider the problem of locally certifying that the size of a network is even, or more generally, congruent to some fixed number. The parity property is one of the simplest global properties, and it plays an intriguing role in local certification. On the one hand, it is one of the simplest properties in cycles because it is equivalent to 2-colorability, and hence can be certified with a single bit. On the other hand, in general graphs, no non-trivial lower bound on the size of the certificates is known, and the known upper bound basically consists in certifying the \emph{exact} value of $n$. In addition, the nature of the problem makes all the known lower bound approaches fail.
We uncover a surprising landscape for parity across different models and graph structures:
* In general graphs equipped with identifiers, when allowing verification radius 2, parity can be certified with a constant number of bits.
* But in the model of anonymous graphs and allowing verification radius only 1, parity requires $Ω(\log \log^*n)$ bits.
* Finally, in bounded expansion graph classes (such as bounded-degree graphs and planar graphs), the lower bound does not apply: in the same restricted model we can design a constant-size certification.
We introduce several new tools that we expect to be useful in other contexts, in particular ways to \emph{encode a parent at each node with a constant number of bits} (via implicit use of the IDs and conflict-free colorings) and a new lower bound technique, with complex topologies and higher-order Ramsey-type arguments.2026-06-03T14:24:47ZNicolas BousquetLaurent FeuilloleyJorge ValenzuelaSébastien Zeitounhttp://arxiv.org/abs/2512.10236v3Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap2026-06-03T13:48:53ZModern ML workloads demand distributing training and inference across multiple GPUs. However, these parallelization techniques often suffer from exposed critical-path communication, leaving a potential 1.7x speedup on the table through compute-communication overlap. Prior overlapping methods harness the fact that ML model state and inputs are already sharded into the number of GPUs, and overlap the compute and communication at shard granularity. However, such coarse-grained overlap suffers from limited network topology support, and suboptimal dataflows. In this work, we instead make a case for finer-grain compute-communication overlap which we term FiCCO. FiCCO operates one level deeper than traditional sharding, and unlocks overlap for a wider set of network topologies and enables finer-grain dataflow.
We show that FiCCO opens up a wider design space of execution schedules than possible at shard-level alone. To walk the design space of schedules, we study and characterize the performance inefficiencies on doing overlap and overlay the schedules with the associated inefficiency signatures. Our characterization reveals decomposition and contention based slowdowns to be the major performance limiters, and we correlate the slowdown factors with the static compute/communication operator sizes. This helps us design heuristics (that frameworks and runtimes can harness) to select bespoke FiCCO schedules based on the nature of underlying ML operations. Finally, to further minimize contention inefficiencies inherent with operation overlap, we offload communication to GPU DMA engines. We evaluate several scenarios from realistic ML deployments and demonstrate that our proposed heuristics driven bespoke schedules deliver up to 1.6x speedup. Further, our heuristics provide accurate guidance to pick the optimal schedule in 81% of unseen scenarios.2025-12-11T02:43:27ZShagnik PalShaizeen AgaSuchita PatiMahzabeen IslamLizy K. Johnhttp://arxiv.org/abs/2606.04687v1Clownfish: Scaling DAG-based BFT Consensus via Sparse Edges2026-06-03T10:11:50ZDirected Acyclic Graph (DAG) based BFT protocols have demonstrated the capability to achieve significantly high throughput in practice. Recent advancements focused on minimizing the good-case latency of these protocols, approaching the theoretical lower bound. However, the high communication complexity inherent in existing DAG-based protocols limits their scalability. This primarily arises because each vertex in the DAG must include a linear number of edges (references) to vertices from previous rounds.
We present Clownfish, a partially synchronous DAG-based BFT protocol designed to address the scalability bottleneck. Clownfish achieves lower communication complexity by selectively reducing the number of edges in DAG vertices. When using a communication-optimal consistent broadcast, Clownfish attains quadratic total communication complexity per round, outperforming prior DAG-based protocols. Clownfish also reduces the additional latency in failure cases by optimizing the round advancement rule. Additionally, Clownfish supports multiple leaders per round to reduce average latency while maintaining its lower communication complexity. Our experimental evaluation demonstrates that Clownfish provides significantly better scalability than existing DAG-based protocols.2026-06-03T10:11:50ZFeifan WangJingfan YuZixi CaiZhixuan Fanghttp://arxiv.org/abs/2606.04652v1Rectangular Matrix Multiplication in the Low-Bandwidth Model2026-06-03T09:22:49ZWe study rectangular matrix multiplication in the low-bandwidth model of distributed computing. There are $n$ computers; initially the input matrices are distributed evenly between computers, and in each communication round every computer can send and receive an $O(\log n)$-bit message. Eventually each computer must output its designated part of the product matrix.
While prior work has focused primarily on square $n \times n$ multiplication under various sparsity assumptions, we study rectangular instances with no sparsity assumption. We denote by $\langle a,b,c\rangle$ the task of multiplying an $a\times b$ matrix by a $b\times c$ matrix in this model. We concentrate on two natural aspect ratios, $\langle n,d,n\rangle$ and $\langle d,n,d\rangle$, for $d \le n$, and we study how the round complexity depends on $n$ and $d$.
When $d \to n$, both $\langle n,d,n\rangle$ and $\langle d,n,d\rangle$ approach $\langle n,n,n\rangle$, which is the usual task of multiplying square matrices. If we consider multiplication over semirings, the current best upper bound in that case is $O(n^{4/3})$ rounds, and there is a trivial unconditional lower bound of $Ω(n)$.
We show that for $\langle d,n,d\rangle$, we can achieve the complexity of $\tilde O(d^{4/3})$, which seems like a natural generalization of the upper bound $\tilde O(n^{4/3})$ when $d=n$. However, the case of $\langle n,d,n\rangle$ is fundamentally different, and also exhibits a phase transition. We show that for $d \le \sqrt{n}$, the complexity of $\langle n,d,n\rangle$ is $Θ(d \sqrt{n})$; we have matching upper and lower bounds. However, the behavior is genuinely different in the region $d \ge \sqrt{n}$, where the upper bound is $O(d^{2/3} n^{2/3})$.2026-06-03T09:22:49ZChetan GuptaJukka SuomelaHossein Vahidihttp://arxiv.org/abs/2606.04594v1Ekka: Automated Diagnosis of Silent Errors in LLM Inference2026-06-03T08:32:13ZLLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process can introduce silent errors where output quality silently degrades without any explicit error signals. Diagnosing silent errors is notoriously difficult due to the substantial semantic gap between the high-level symptoms and the low-level root causes. We observe that diagnosis of silent errors can be effectively framed as a differential debugging problem by leveraging the existence of semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target and a reference framework. We constructed a benchmark of real-world silent errors from popular serving frameworks, where Ekka shows 80% pass@1 diagnosis accuracy and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors from serving frameworks, all of which have been confirmed by the developers.2026-06-03T08:32:13ZICML 2026Yile GuZhen ZhangShaowei ZhuXinwei FuJun WuYida WangBaris Kasikcihttp://arxiv.org/abs/2606.01138v2memorywire: A Vendor-Neutral Wire Format for Agent Memory Operations2026-06-03T07:59:47ZAgent-memory frameworks -- mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor -- each ship their own SDK, storage layout, and operational vocabulary. There is no shared wire format: every integration is bespoke, every migration rebuilds memory from scratch, and no framework ships a governance surface that lets a human review writes before they enter long-term storage. We present memorywire, a JSON-Schema 2020-12 wire format for five memory operations (remember, recall, forget, merge, expire) over four memory types (semantic, episodic, procedural, emotional), with a MemoryStore interface, a fan-out router, and an optional HITL governance channel. We describe an open-source reference implementation with five backend adapters (sqlite-vec, mem0, Letta, Cognee, pgvector); a microbenchmark on a 100-fact / 50-query labelled corpus (42 with non-empty gold ids + 8 no-match probes) achieving recall@5 = 1.000 on the 42 gold-id queries with ingest p50 = 37.8 ms and recall p50 = 40.6 ms; an adversarial-fusion experiment showing Reciprocal Rank Fusion holds recall@5 = 1.000 across a 1-of-N rank-0 injection sweep (K in {0, 5, ..., 50}) where max fusion collapses to 0.500 with 80% leak at K >= 5; and a 16-scenario cross-adapter conformance suite passing 68 of 80 cells with zero failures. The contribution is not a new algorithm; it is a packaging of established components (RRF, FSMs, STM/LTM consolidation, diff-and-approve workflows) into a venue-neutral protocol with an empirically validated reference, positioned to compose with the Model Context Protocol rather than compete with it.2026-05-31T10:18:56Zv2: title corrected from pre-launch name "AMP" to "memorywire"; abstract clarifies recall@5 = 1.000 is on the 42 gold-id queries (50 total; 8 no-match probes excluded). 17 pages, 1 figure, 6 tables. Code: github.com/mthamil107/memorywire. Companion to arXiv:2604.18248 (Prompt Injection Detection)Thamilvendhan Munirathinamhttp://arxiv.org/abs/2606.04446v1D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models2026-06-03T04:48:00ZSpeculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, resulting in a limited acceptance rate. Naively batching more draft candidate sequences only introduces a marginal improvement, as redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens.
We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree, where the first diffusion drafter generates a block along with per-position confidence scores that are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery; the second variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass; the resulting shared-prefix candidates are jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.2026-06-03T04:48:00ZLiyuan ZhangJiarui ZhangJinwei YaoRan YanYuchen YangJiahao ZhangTongkai YangYi WuBinhang Yuanhttp://arxiv.org/abs/2606.01143v3Schedule-Level Shared-Prefix Reuse for LLM RL Training2026-06-03T04:16:23ZGRPO-based LLM post-training commonly samples multiple trajectories from the same prompt and then trains on the resulting group. In long-context GRPO workloads, this shared prompt-side prefix can contain retrieved passages, visual tokens, tool schemas, system instructions, or task context, while the full rollout group is still too large to pack into one training microbatch. Standard dense trainers therefore recompute the same prefix forward and backward for every trajectory. We present a schedule-level reuse mechanism that decouples prefix and suffix computation. The schedule runs prefix forward once, executes suffixes as ordinary microbatches while reading prefix K/V and accumulating prefix-side gK/gV , and then runs prefix backward once on the accumulated gradient cache. This reordered schedule is equivalent to baseline training over real arithmetic and aligns numerically within finite-precision tolerance. Because only K/V and gK/gV are hot during suffix computation, the approach offloads dormant prefix activations, integrates with TP/EP/CP/PP and DP-style placement at the execution level, and preserves aux-loss-based MoE router semantics through logical prefix-token accounting. On dense Llama3-8B, Qwen3-8B, and MoE Qwen3-MoE-30B-A3B configurations, the schedule matches optimizer updates across TP/CP/PP/EP combinations, aligns on a 100-step real GRPO actor-update trace replay, reaches up to 4.395x speedup (2.930x under a conservative compile-on comparison) as prefix ratio and GRPO group size grow, and reduces Phase-B peak HBM by up to 59.1%, extending the Llama3-8B capacity frontier from 17,920 to 29,696 total tokens.2026-05-31T10:24:10ZPengbo LiFeiyuan ZhangGuangming ShengGuangxin HeDi ChaiZiniu LiTaiqiang WuWenyu MaoBinhang YuanKai Chenhttp://arxiv.org/abs/2606.04268v1ACEAPEX: Parallel LZ77 Decoding via Encode-Time Absolute Offset Resolution2026-06-02T22:44:22ZLZ77-based codecs exhibit a fundamental sequential bottleneck in decoding: each back-reference depends on previously decompressed data, preventing multi-core scaling. We present ACEAPEX, a parallel LZ77 codec that stores all back-references as absolute positions in the decompressed output and organizes data into self-contained 1 MB blocks, enabling embarrassingly parallel block-level decoding. Integrated into lzbench, ACEAPEX achieves 10,160 MB/s on EPYC 4344P (8 cores) and 10,869 MB/s on EPYC 9575F for FASTQ genomic data -- up to 3.1x faster than zstd -3 at comparable compression ratios. We further implement a GPU wavefront decoder on NVIDIA H100 SXM, measuring 44.0 GB/s on enwik9 and 20.3 GB/s on FASTQ (wavefront match phase, BIT-PERFECT verified). With a depth-limited encoder variant (-1.5% ratio on enwik9), GPU throughput reaches 77.2 GB/s on a single H100 and 249.9 GB/s on two H100s in NVLink configuration. To our knowledge, this is the first reported GPU LZ77 decode with near-standard compression ratio verified byte-for-byte.2026-06-02T22:44:22Z6 pages, 5 tablesYakiv Shavidze10.5281/zenodo.20440965