Multiple-Bases Belief Propagation List Decoding for Quantum LDPC Codes

2026-05-13T22:49:35Z

In this paper, we propose a belief-propagation (BP)-based decoder, termed the Multiple-Bases Belief-Propagation List Decoder (MBBP-LD), for quantum low-density parity-check (QLDPC) codes. The key idea is to generate \emph{structured decoding diversity} by constructing multiple redundant parity-check representations via cycle-free subtree decompositions of the Tanner graph, and running BP decoding in parallel across these representations. This extends the classical Multiple-Bases Belief-Propagation (MBBP) framework to the quantum setting while preserving the linear-time complexity and efficiency of standard BP decoding and avoiding the need for super-linear post-processing. Simulation results demonstrate that MBBP-LD improves upon existing BP-based decoders, including BP with ordered statistics decoding (BP-OSD) and belief propagation with guided decimation (BPGD), across several QLDPC codes, while requiring substantially fewer total BP iterations. For bivariate bicycle codes $[[144,12,12]]$ and $[[288,12,18]]$, MBBP-LD achieves up to a $20\%$ reduction in error rate compared to BPGD and up to $30\%$ compared to BP-OSD in the low- and moderate-error regimes. For the larger B1 code $[[882, 24, 18 \leq d \leq 24]]$, MBBP-LD attains comparable or improved performance relative to BPGD while maintaining BP-like decoding latency under parallel implementation.

Probabilistic Gradient Coding via Structure-Preserving Sparsification

2026-05-13T22:18:06Z

Gradient coding is a distributed computing technique aiming to provide robustness against slow or non-responsive computing nodes, known as stragglers, while balancing the computational load for responsive computing nodes. Among existing gradient codes, a construction based on combinatorial designs, called BIBD gradient code, achieves the best trade-off between robustness and computational load in the worst-case adversarial straggler setting. However, the range of system parameters for which BIBD gradient codes exist is limited. In this paper, we overcome these limitations by proposing two new probabilistic gradient codes, termed the \emph{Sparse Gaussian} (SG) gradient code and the \emph{Expansion-Preserving} (EP) gradient code. Through probabilistic constructions, the former preserves the combinatorial structure of BIBDs, while the latter preserves key spectral properties. Both codes are based on a common two-step framework: first generating a random matrix and then applying distinct sparsification procedures. The SG gradient code constructs its encoding matrix from a correlated multivariate Gaussian distribution masked by Bernoulli random variables, while the EP gradient code derives its encoding matrix from sparsified expander-like graph structures that preserve key spectral properties. Experimentally, both codes achieve worst-case error performance comparable to that of the BIBD gradient code (when such a code with the same parameters exists). Moreover, they substantially extend the feasible range of system parameters beyond BIBD and soft BIBD gradient codes, offering practical and theoretically grounded solutions for large-scale distributed computing tasks.

Construction of Non-special Divisors on Kummer Covers With Arbitrary Ramification For LCP Codes

2026-05-13T19:08:10Z

Linear Complementary Pairs (LCP) of algebraic geometry (AG) codes offer strong resistance against side-channel and fault-injection attacks, but their construction depends critically on the explicit identification of non-special divisors of degree $g$ and $g-1$. Existing constructions are restricted to Kummer extensions where divisors are supported exclusively on totally ramified places, significantly limiting the range of applicable function fields and codes. We remove this restriction by developing a framework for general Kummer extensions $y^m = \prod_{i=1}^r (x-α_i)^{λ_i}$ over finite fields with arbitrary ramification. Using Galois group actions and invariant divisor techniques, we establish necessary and sufficient conditions for non-speciality with no constraint on the support, yielding explicit constructions where previous methods fail. Our approach replaces the computationally intensive Weierstrass semigroup machinery with a more direct and efficient framework. As an application, we construct new explicit families of LCP AG codes with determined parameters $[n,k,d]$, covering three ramification regimes. The resulting codes meet or approach the Goppa designed distance, offering greater flexibility for cryptographic applications.

Function-Correction with Optimal Data Protection for the General Hamming Code Membership

2026-05-13T18:35:53Z

This paper investigates single-error-correcting function-correcting codes~(SEFCCs) for the $[2^n-1,\,2^n-1-n,\,3]$-Hamming code membership function~(HCMF) for general $n\geq 2$. Necessary and sufficient conditions for valid parity assignments are established, and the distance-$3$ codeword graph is shown to induce a connected bipartite structure for all $n\geq 2$, which is exploited to develop a systematic SEFCC construction achieving the largest possible minimum distance of~$2$. A novel framework is then developed that reduces the minimization of distance-$2$ codeword pairs to a max-cut problem on the distance-$4$ graphs of the two partite sets. Eigenvectors corresponding to the minimum eigenvalue of these graphs are shown to directly yield optimal parity assignments. We reduce the problem of finding these eigenvectors to an optimization problem involving moments of the Walsh coefficients of a related function, which we solve for even~$n$ by deriving a tight lower bound shown to be attained by bent functions, establishing a precise connection between optimal SEFCC design and bent Boolean functions.
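
The membership function itself can be illustrated with a minimal sketch. It assumes one particular (but standard) representative of the Hamming code, in which column $j$ of the parity-check matrix is the binary expansion of $j$, so the syndrome is the XOR of the 1-indexed positions holding a 1. The function names are hypothetical and are not taken from the paper, and the sketch covers only the membership test, not the SEFCC construction.

```python
from itertools import product


def hamming_syndrome(word):
    """XOR of the 1-indexed positions that hold a 1.

    Assumption: column j of the parity-check matrix is the binary
    expansion of j, so this value equals H @ word (mod 2) read as an integer.
    """
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    return syndrome


def is_hamming_codeword(word, n):
    """Membership function: True iff `word` lies in the [2^n - 1, 2^n - 1 - n, 3] code."""
    if len(word) != 2**n - 1:
        raise ValueError("word must have length 2^n - 1")
    return hamming_syndrome(word) == 0


def count_codewords(n):
    """Brute-force count of codewords; only feasible for small n."""
    length = 2**n - 1
    return sum(is_hamming_codeword(w, n) for w in product((0, 1), repeat=length))
```

For $n=3$, `count_codewords(3)` enumerates all $2^7$ binary words and should return $2^{7-3}$, matching the code dimension $2^n-1-n$.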

Enhanced quantum capacity thresholds from symmetry

2026-05-13T17:21:15Z

The quantum capacity captures the value of a quantum channel for transmitting quantum information, establishing the fundamental limits on quantum communication. In spite of its central role in quantum information theory, the quantum capacity of most channels is unknown, with wide gaps between the best upper and lower bounds. Even deciding whether a channel has nonzero capacity -- finding its capacity threshold -- is difficult. In this paper we report significant increases in the capacity thresholds of two prototypical noise models: the depolarizing channel and Pauli channels. In the case of the depolarizing channel, this is the first improvement in 18 years, giving a bigger increase beyond the hashing bound than all previous improvements combined. Our starting point is the representation theoretic framework recently proposed by Bhalerao and Leditzky (2025) to compute coherent information for special permutation invariant states. We generalize their framework to the full symmetric subspace, which allows us to optimize coherent information over rank two states in that space. A representation theoretic calculation shows that exponentially many Kraus operators of the channel annihilate the symmetric space, corresponding to a massive decrease in environment entropy for states on the symmetric space compared to the maximally mixed state. This explains the enhanced coherent information as a manifestation of degeneracy for the resulting codes.

Local Information-Theoretic Security via Euclidean Geometry

2026-05-13T17:11:58Z

This paper introduces a methodology based on Euclidean information theory to investigate local properties of secure communication over discrete memoryless wiretap channels. We formulate a constrained optimization problem that maximizes a legitimate user's information rate while imposing explicit upper bounds on both the information leakage to an eavesdropper and the informational cost of encoding the secret message. By leveraging local geometric approximations, this inherently non-convex problem is transformed into a tractable quadratic programming structure. It is demonstrated that the optimal Lagrange multipliers governing this approximated problem can be found by solving a linear program. The constraints of this linear program are derived from Karush-Kuhn-Tucker conditions and are expressed in terms of the generalized eigenvalues of channel-derived matrices. This framework facilitates the derivation of an analytical formula for an approximate local secrecy capacity. Furthermore, we define and analyze a new class of secret local contraction coefficients. These coefficients, characterized as the largest generalized eigenvalues of a matrix pencil, quantify the maximum achievable ratio of approximate utility to approximate leakage, thus measuring the intrinsic local leakage efficiency of the channel. We establish bounds connecting these local coefficients to their global counterparts defined over true mutual information measures. The efficacy of the proposed framework is demonstrated through detailed analysis and numerical illustrations for both general multi-mode channels and the canonical binary symmetric wiretap channel.

High-Rate Quantized Matrix Multiplication I

2026-05-13T16:41:48Z

This paper investigates the problem of quantized matrix multiplication (MatMul), which has become crucial for the efficient deployment of large language models (LLMs). We consider a Generic MatMul setting, where both matrices must be quantized (weight+activation quantization) without specific a priori (calibration) statistical information about the factors. We review the fundamental information-theoretic tradeoff between quantization rate and distortion (high-rate theory), and contrast it with the performance of popular quantization schemes (absmax INT and floating-point (FP)), for which we also derive accurate heuristic approximations. Part II of this paper studies the weight-only quantization setup where second-order statistics of the activation matrices are available at the encoder.
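
For readers unfamiliar with absmax INT quantization, the following is a minimal sketch of one common variant: symmetric, per-tensor scaling by the largest absolute entry. The per-tensor granularity, the function names, and the default bit width are assumptions made for illustration; the paper may analyze per-row or per-block variants.

```python
import numpy as np


def absmax_quantize(x, bits=8):
    """Symmetric per-tensor absmax quantization (assumed variant).

    Returns integer codes q and a scale such that x is approximately q * scale.
    """
    x = np.asarray(x, dtype=float)
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros(x.shape, dtype=np.int32), 1.0
    scale = max_abs / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale


def quantized_matmul(a, b, bits=8):
    """Quantize both factors, multiply the integer codes, then rescale."""
    qa, sa = absmax_quantize(a, bits)
    qb, sb = absmax_quantize(b, bits)
    return (qa @ qb) * (sa * sb)
```

Because the scale is tied to the largest entry, each element is reconstructed to within half a quantization step (`scale / 2`), so a single outlier coarsens the grid for the whole tensor.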

Scale-Sensitive Shattering: Learnability and Evaluability at Optimal Scale

2026-05-13T15:41:30Z

We study the optimal scale at which real-valued function classes exhibit uniform convergence and learnability. Our main result establishes a scale-sensitive generalization of the fundamental theorem of PAC learning: for every bounded real-valued class and every $γ>0$, uniform convergence at scale $γ$, agnostic learnability at scale $γ/2$, and finiteness of the fat-shattering dimension at every scale $γ'>γ$ are equivalent. This resolves a question by Anthony and Bartlett (Cambridge Univ. Press 1999) on the precise scales governing learnability, refuting a conjecture attributed there to Phil Long that a multiplicative 2-factor gap is unavoidable, and improves the upper bounds of Bartlett and Long (JCSS 1998), which incur such a loss. The key technical ingredient is a direct bound on empirical $\ell_\infty$ covering numbers, avoiding the standard detour through packing numbers. As a consequence, we obtain sharp asymptotic metric-entropy bounds in terms of the fat-shattering scale $γ$: an $O(\log^2 n)$ bound holds already at scale $γ/2$, while an $O(\log n)$ bound holds at scale $2γ$. We further show that the $O(\log^2 n)$ bound is sometimes tight. These results resolve open questions by Alon et al. (JACM 1997) and Rudelson and Vershynin (Ann. of Math. 2006). As an application, we establish a sharp dichotomy for bounded integral probability metrics: every such IPM is either estimable or cannot be weakly evaluated within any multiplicative factor $c<3$, while $3$-weak evaluability always holds, resolving an open question from Aiyer et al. (ICML 2026). We also highlight several open questions on quantitative sample complexity and evaluability.

Stabilizer-Code Channel Transforms Beyond Repetition Codes for Improved Hashing Bounds

2026-05-13T15:39:42Z

The quantum hashing bound guarantees that rates up to $1-H(p_I, p_X, p_Y, p_Z)$ are achievable for memoryless Pauli channels, but it is not generally tight. A known way to improve achievable rates for certain asymmetric Pauli channels is to apply a small inner stabilizer code to a few channel uses, decode, and treat the resulting logical noise as an induced Pauli channel; reapplying the hashing argument to this induced channel can beat the baseline hashing bound. We generalize this induced-channel viewpoint to arbitrary stabilizer codes used purely as channel transforms. Given any $ [\![ n, k ]\!] $ stabilizer generator set, we construct a full symplectic tableau, compute the induced joint distribution of logical Pauli errors and syndromes under the physical Pauli channel, and obtain an achievable rate via a hashing bound with decoder side information. We perform a structured search over small transforms and report instances that improve the baseline hashing bound for a family of Pauli channels with skewed and independent errors studied in prior work.
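
The baseline quantity $1-H(p_I, p_X, p_Y, p_Z)$ is simple to evaluate numerically. The sketch below computes it for a single-qubit Pauli channel and, as one illustrative special case, for the depolarizing channel with $p_X=p_Y=p_Z=p/3$. The function names are hypothetical, and the sketch covers only the baseline bound, not the induced-channel construction described above.

```python
import math


def hashing_bound(p_i, p_x, p_y, p_z):
    """Return 1 - H(p_I, p_X, p_Y, p_Z), with entropy measured in bits."""
    probs = (p_i, p_x, p_y, p_z)
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    # Terms with p == 0 contribute nothing (0 * log 0 is taken as 0).
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 1.0 - entropy


def depolarizing(p):
    """Pauli error probabilities of a depolarizing channel with total error rate p."""
    return (1.0 - p, p / 3.0, p / 3.0, p / 3.0)
```

Evaluating `hashing_bound(*depolarizing(p))` over a grid of `p` shows the bound crossing zero near $p \approx 0.189$, beyond which the baseline hashing argument no longer certifies a positive rate.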

Channels with Input-Correlated Synchronization Errors

2026-05-13T13:38:36Z

"Independent and identically distributed" errors do not accurately capture the noisy behavior of real-world data storage and information transmission technologies. Motivated by this, we study channels with input-correlated synchronization errors, meaning that the distribution of synchronization errors (such as deletions and insertions) applied to the $i$-th input $x_i$ may depend on the whole input string $x$. We begin by identifying conditions on the input-correlated synchronization channel under which the channel's information capacity is achieved by a stationary ergodic input source and is equal to its coding capacity. These conditions capture a wide class of channels, including channels with correlated errors observed in DNA-based data storage systems and their multi-trace versions, and generalize prior work. To showcase the usefulness of the general capacity theorem above, we combine it with techniques of Pernice-Li-Wootters (ISIT 2022) and Brakensiek-Li-Spang (FOCS 2020) to obtain explicit capacity-achieving codes for multi-trace channels with runlength-dependent deletions, motivated by error patterns observed in DNA-based data storage systems.

The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

2026-05-13T13:18:37Z

Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of algebraic updates that obscure geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping, and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models. Source code is available at https://github.com/IST-DASLab/GPTQ-Babai.
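
As background for the equivalence, here is a minimal sketch of Babai's nearest plane algorithm for a generic full-rank lattice basis, written with a QR factorization in place of explicit Gram-Schmidt. It is not the authors' implementation and does not build the Hessian-defined lattice from the paper; the function name and the convention that basis vectors are the columns of `basis` are assumptions.

```python
import numpy as np


def babai_nearest_plane(basis, target):
    """Approximate closest lattice point to `target`.

    Assumptions: the columns of `basis` are linearly independent lattice
    basis vectors. Coordinates are fixed from the last basis vector to the
    first, rounding to the nearest hyperplane at each step.
    """
    basis = np.asarray(basis, dtype=float)
    target = np.asarray(target, dtype=float)
    q, r = np.linalg.qr(basis)          # basis = q @ r, r upper triangular
    y = q.T @ target                    # target in the orthonormal frame
    n = basis.shape[1]
    coeffs = np.zeros(n)
    for i in range(n - 1, -1, -1):      # back to front
        # Remove the contribution of coordinates that are already fixed.
        residual = y[i] - r[i, i + 1:] @ coeffs[i + 1:]
        coeffs[i] = np.round(residual / r[i, i])
    return coeffs.astype(int), basis @ coeffs
```

With `basis = [[2, 1], [0, 3]]` and `target = [4.2, -5.9]`, the last coordinate is rounded first, and the correction it induces is then carried into the remaining coordinate. This propagation of rounding error is the step the abstract says gains a geometric interpretation.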

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

2026-05-13T13:08:08Z

Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization -- BPE, WordPiece, and related methods -- which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.

Revisiting CUR Perturbation Analysis: A Local Tangent-Space Expansion

2026-05-13T12:33:17Z

CUR decompositions approximate a matrix using selected columns, rows, and their intersection. Classical CUR theory provides exactness results for low-rank matrices and perturbation bounds controlled by the size of the noise. In this work we develop a local perturbation expansion for a fixed-index rank-truncated CUR map near an admissible rank-$r$ matrix. We show that the Fréchet derivative of the rank-truncated CUR map is a sampling-induced oblique tangent-space projector determined by the selected rows and columns. Consequently, the local recovery error for an underlying low-rank matrix is governed not by the full perturbation norm alone, but by the image of the perturbation under this sampling-induced tangent projector. In particular, perturbations that are invisible to the selected rows and columns are removed to first order. We compare this behavior with the classical local expansion of the rank-$r$ SVD truncation. SVD removes orthogonal-normal perturbations to first order, whereas rank-truncated CUR removes perturbations in the kernel of the sampling-induced oblique tangent projector. Numerical experiments illustrate these regimes and confirm the predicted first- and second-order local rates.
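
The objects involved can be made concrete with a minimal sketch of a fixed-index CUR map, $A \approx C\,W^{+}R$, with an optional rank truncation of the intersection block $W$. Truncating $W$ by its SVD is an assumption about how "rank-truncated" is realized; the paper's exact definition may differ, and the function name is hypothetical.

```python
import numpy as np


def cur_approx(a, rows, cols, rank=None):
    """Fixed-index CUR approximation C @ pinv(W) @ R.

    Assumption: when `rank` is given, W is replaced by its best rank-`rank`
    approximation (via SVD) before pseudo-inversion.
    """
    a = np.asarray(a, dtype=float)
    c = a[:, cols]                      # selected columns
    r = a[rows, :]                      # selected rows
    w = a[np.ix_(rows, cols)]           # intersection block
    if rank is None:
        w_pinv = np.linalg.pinv(w)
    else:
        u, s, vt = np.linalg.svd(w, full_matrices=False)
        # Pseudo-inverse built from the top `rank` singular triplets.
        w_pinv = (vt[:rank].T / s[:rank]) @ u[:, :rank].T
    return c @ w_pinv @ r
```

For a rank-$r$ matrix whose intersection block also has rank $r$, this map reproduces the matrix exactly. A perturbation supported entirely outside the selected rows and columns leaves $C$, $W$, and $R$ unchanged, so in this sketch it is removed exactly, which is a special case of the first-order statement above.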

The Diffusion Encoder

2026-05-13T11:54:43Z

We construct a new kind of encoder, leveraging the expressive power of diffusion models. In a traditional variational autoencoder, the encoder and decoder jointly negotiate a latent representation of the input. This is made possible by the reparameterization trick, which simplifies training at the cost of restricting the encoder to a simple family of distributions. Replacing this encoder with a diffusion model requires rethinking how the decoder pressure can be transmitted back to the encoder, given that they tend to update their internal estimates of the latent in opposing directions. We solve this problem with an alternating training scheme, inspired by the expectation-maximization algorithm. Our method enables more reliable synchronization between encoder and decoder, while preserving the simple and efficient training objective of standard diffusion models.

Efficient compression of neural networks and datasets

2026-05-13T10:49:50Z

Compression and generalization are fundamentally related through Solomonoff induction and the minimum description length principle (MDL), which predict that simpler models generalize better when data arises from low-complexity distributions. In this article, we combine insights from algorithmic information theory and techniques from neural network pruning to improve model generalization by identifying the most effective data compression method. Since exact MDL optimization is intractable, we cast it as $\ell_0$ regularized learning and explain why parameter sparsity provides an effective computable approximation of model description length. To identify the best practical approach, we systematically compare and refine complementary sparse optimization methods. In particular, we improve probabilistic pruning through a procedure that does not require Monte Carlo sampling and refine smooth $\ell_0$ approximations with a binary search routine that reduces hyperparameter complexity. Across convolutional networks and transformers evaluated on image and text datasets, our refined methods improve upon their predecessors, achieve substantial model compression with minimal accuracy loss, and yield short data description lengths. Finally, we use these methods in a controlled teacher-student setting to empirically verify the prediction of Solomonoff induction that compressed models learn more sample-efficiently and generalize better.