https://arxiv.org/api/2EPZaYWQnUCzsJprQL1T5fr1xuQ
2026-06-27T17:10:49Z
54938106515

http://arxiv.org/abs/2605.07105v1
Theoretical Limits of Language Model Alignment
2026-05-08T01:32:22Z
Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of the reward error and as the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal.
Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
2026-05-08T01:32:22Z
Lucas Monteiro Paes, Natalie Mackraz, Barry-John Theobald, Federico Danieli

http://arxiv.org/abs/2605.00834v2
Polynomial-Time Optimal Group Selection via the Double-Commutator Eigenvalue Problem
2026-05-08T00:02:42Z
The algebraic diversity framework generalizes temporal averaging over multiple observations to algebraic group action on a single observation for second-order statistical estimation. The central open problem in this framework is $\textit{group selection}$: given an $M$-dimensional observation with unknown covariance structure, find the finite group whose spectral decomposition best matches the covariance. Naive enumeration of all subgroups of the symmetric group $S_M$ requires exponential time in $M$. We prove that this combinatorial problem reduces to a generalized eigenvalue problem derived from the double commutator of the covariance matrix, yielding a polynomial-time algorithm with complexity $O(d^2M^2 + d^3)$, where $d$ is the dimension of a generator basis. The minimum eigenvector of the double-commutator matrix directly constructs the optimal group generator in closed form, with no iterative optimization. The reduction is exact: the double-commutator minimum eigenvalue is zero if and only if the optimal generator lies in the span of the basis, and its magnitude provides a certifiable optimality gap when it does not. This problem does not appear in the standard catalogs of computational complexity (Garey and Johnson, 1979) and represents a new class linking group theory, matrix analysis, and statistical estimation.
We establish connections to independent component analysis (JADE), structured matrix nearness problems, and simultaneous matrix diagonalization, and we show that the double-commutator formulation is the unique approach that is simultaneously polynomial-time, closed-form, and certifiable. We extend the framework to non-Abelian symmetry recovery via a Sequential GEVP with deflation, and add two identifiability theorems characterizing the commutant-lattice ambiguity and the dichotomy on whether $\mathrm{Aut}(\mathbf{R})$ recovers a generative subgroup or only a supergroup.
2026-04-04T09:02:16Z
v2: 2 theorems, 4 open problems, §X.A correction added; 1 reference added
Mitchell A. Thornton

http://arxiv.org/abs/2605.08263v1
Decentralized Conformal Novelty Detection via Quantized Model Exchange
2026-05-07T23:46:42Z
This work studies decentralized novelty detection with global false discovery rate (FDR) control across heterogeneous composite null distributions, without sharing the raw data due to privacy and bandwidth considerations. We propose a framework based on the exchange of quantized surrogate models, allowing independent agents to share low-precision representations of locally learned non-conformity score functions. We prove that evaluating data against these quantized composite scores preserves conditional exchangeability, providing rigorous finite-sample guarantees for global FDR control. Empirical studies on synthetic datasets confirm our theoretical results, demonstrating that the proposed approach maintains competitive statistical power while drastically reducing the communication cost.
2026-05-07T23:46:42Z
Kyle Loh, Yu Xiang

http://arxiv.org/abs/2605.06988v1
The Cost of Consensus: Malignant Epistemic Herding and Adaptive Gating in Distributed Multi-Agent Search
2026-05-07T22:07:25Z
Distributed agents in real-world settings frequently must coordinate under uncertainty with only partial observations.
Coordination is necessary to share beliefs to aid in task completion, but communication costs bandwidth, introduces latency, and if done poorly, can degrade collective reasoning. This tension is especially acute in bandwidth-constrained deployments such as distributed sensing networks, autonomous reconnaissance, and collaborative cyber defense, where excessive transmission carries direct operational costs. Existing work has focused on multi-agent exploration and communication strategies, but not on how communication frequency and content jointly shape the collective belief state. Central to this challenge is the degree to which agents maintain compatible internal beliefs about the environment, a property we term \textit{epistemic alignment}. When agents share beliefs effectively, they converge on correct hypotheses; when communication is poorly designed, agents may converge confidently on wrong ones. We formalize this distinction and show it is not detectable from coordination metrics alone such as Jensen-Shannon Divergence or rate to consensus.
2026-05-07T22:07:25Z
David Farr, Iain Cruickshank, Kate Starbird, Jevin West

http://arxiv.org/abs/2605.06977v1
$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
2026-05-07T21:48:26Z
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. While most existing approaches rely on the reverse KL-regularization, recent empirical studies have begun exploring alternative divergences (e.g., forward KL, chi-squared) as regularizers in RLHF. However, a unified theoretical understanding of general $f$-divergence regularization remains under-explored. To fill this gap, this work develops a comprehensive theoretical framework for online RLHF with a general $f$-divergence regularized objective.
Rather than treating each possible divergence function individually, we adopt a holistic perspective across the entire function class and propose two algorithms based on distinct sampling principles. The first extends the classical optimism principle with a carefully designed exploration bonus, while the second introduces a new method that exploits the sensitivity of the optimal policy to reward perturbations under $f$-divergence regularization. Theoretical analysis shows that $O(\log T)$ regret and $O(1/T)$ sub-optimality gap are achievable, establishing provable efficiency of both algorithms and, to the best of our knowledge, the first performance bounds for online RLHF under general $f$-divergence regularization.
2026-05-07T21:48:26Z
ICML 2026
Di Wu, Chengshuai Shi, Jing Yang, Cong Shen

http://arxiv.org/abs/2509.25584v2
Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models
2026-05-07T21:48:21Z
Vision-language models achieve incredible performance across a wide range of tasks, but their large size makes inference costly. Recent work has shown that multimodal processing contains significant redundancies, making it possible to skip certain layers with minimal performance loss. Yet current pruning techniques remain ad-hoc, relying on heuristics or hyperparameter sweeps rather than principled criteria for determining when layer skipping is beneficial. In this paper, we propose a unified framework that characterizes the redundancy conditions under which pruning can enhance efficiency without sacrificing performance. Central to our approach are experimentally verifiable and interpretable notions of redundancy that can be evaluated without requiring downstream task performance as a metric. Applying this framework, we corroborate prior findings that both early and late vision tokens are redundant across models, and we validate our conditions by showing they align with actual performance degradation.
Beyond these empirical results, our framework provides a theoretically grounded understanding of redundancy in VLMs and unifies many of the ideas behind modern layer-skipping techniques.
2025-09-29T23:16:44Z
Max Hartman, Vidhata Jayaraman, Moulik Choraria, Akhil Bhimaraju, Lav R. Varshney

http://arxiv.org/abs/2601.21424v3
Lossy Common Information in a Learnable Gray-Wyner Network
2026-05-07T20:57:31Z
Many computer vision tasks share substantial overlapping information, yet conventional codecs tend to ignore this, leading to redundant and inefficient representations. The Gray-Wyner network, a classical concept from information theory, offers a principled framework for separating common and task-specific information. Inspired by this idea, we develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. We characterize the limits of this approach through the notion of lossy common information, and propose an optimization objective that balances inherent tradeoffs in learning such representations. Through comparisons of three codec architectures on two-task scenarios spanning six vision benchmarks, we demonstrate that our approach substantially reduces redundancy and consistently outperforms independent coding. These results highlight the practical value of revisiting Gray-Wyner theory in modern machine learning contexts, bridging classic information theory with task-driven representation learning.
2026-01-29T09:00:43Z
Anderson de Andrade, Alon Harell, Ivan V. Bajić

http://arxiv.org/abs/2509.21112v3
Adapt or Regress: Rate-Memory-Compatible Spatially-Coupled Codes
2026-05-07T19:49:41Z
Spatially-coupled (SC) codes are a class of low-density parity-check (LDPC) codes that have excellent performance thanks to the degrees of freedom they offer. An SC code is designed by partitioning a base matrix into components, the number of which implies the code memory, then coupling and lifting them.
In the same system, various error-correction coding schemes are typically needed. For example, in wireless communication standards, several channel conditions and data rates should be supported. In storage and computing systems, stronger codes should be adopted as the device ages. Adaptive code design enables switching from one code to another when needed, ensuring reliability while reducing hardware cost. In this paper, we introduce a class of reconfigurable SC codes named rate-memory-compatible SC (RMC-SC) codes, which we design probabilistically. In particular, rate compatibility in RMC-SC codes is achieved via increasing the SC code memory, which also makes the codes memory-compatible and improves performance. We express the expected number of short cycles in the SC code protograph as a function of the fixed probability distribution characterizing the already-designed SC code as well as the unknown distribution characterizing the additional components. We use the gradient-descent algorithm to find a locally-optimal distribution, in terms of cycle count, for the new components. The method can be recursively used to design any number of SC codes needed, and we show how to extend it to other cases. Next, we perform the finite-length optimization using a Markov chain Monte Carlo (MC$^2$) approach that we update to design the proposed RMC-SC codes. 
Experimental results demonstrate significant reductions in cycle counts and remarkable performance gains achieved by RMC-SC codes compared with a literature-based straightforward scheme.
2025-09-25T12:57:54Z
11 pages (double column), 4 figures, submitted to the IEEE Information Theory Workshop (ITW)
Bade Aksoy, Doğukan Özbayrak, Ahmed Hareedy

http://arxiv.org/abs/2605.06829v1
A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models
2026-05-07T18:32:15Z
We survey continuous-time generative modeling methods based on transporting a simple reference distribution to a data distribution via stochastic or deterministic dynamics. We present a unified framework in which diffusion models, score-based generative models, and flow matching are instances of learning a time-dependent vector field that induces a family of marginals $(ρ_t)_{t \in [0,1]}$ governed by continuity and Fokker-Planck equations. Such a unified theory is timely because these methods are converging methodologically, yet fragmented notation and competing derivations continue to obscure their shared structure and the practical tradeoffs governing sampling, stability, and computation. Within this framework, we (i) derive reverse-time sampling for diffusion and score-based models as controlled stochastic dynamics, (ii) show that the probability flow ODE yields identical marginals and connects diffusion to likelihood-based normalizing flows, and (iii) interpret flow matching as direct regression of the velocity field under a chosen interpolation, clarifying when it coincides with or differs from score-based training.
We compare objectives, sampling schemes, and discretization errors under unified notation, discuss connections to Schrödinger bridges and entropic optimal transport, and summarize theoretical guarantees and open problems on approximation, stability, and scalability.
2026-05-07T18:32:15Z
62 pages, 1 figure, jmlr preprint
Aditya Ranganath, Mukesh Singhal

http://arxiv.org/abs/2605.06826v1
How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models
2026-05-07T18:28:01Z
We study the spectral properties of sample covariance matrices constructed from pooled sequence representations, where token embeddings are drawn from a fixed two-class Gaussian mixture table and pooled via (fixed) attention weights. Working in the high-dimensional regime $d,V,N\to\infty$ with $d/V\toδ$ and $d/N\toγ$, we derive exact characterizations of the limiting eigenvalue distribution, outlier eigenvalues, and eigenvector alignment with the hidden signal. The bulk spectrum follows a non-Marchenko--Pastur law given by the free multiplicative convolution $κ(MP_δ\boxtimes MP_γ)$, reflecting the finite vocabulary structure. Signal recovery undergoes two successive BBP-type phase transitions characterized by the scalars: $δ,γ,α=w^{\top} R w$ and $κ=\|w\|^2$, where $w$ denotes the attention pooling weights and $R$ the positional correlation matrix. A consequence of our analysis is that the optimal attention weights maximizing the signal-to-noise ratio $α/κ$ are given by the (normalized) top eigenvector of $R$, and we show (as a particular case of our analysis) that parameter-free causal self-attention with $τ/d$ score scaling yields deterministic harmonic weights that improve signal recovery over mean pooling whenever early tokens carry more signal.
Extensive simulations confirm sharp agreement between theory and finite-dimensional experiments.
2026-05-07T18:28:01Z
Mohamed El Amine Seddik

http://arxiv.org/abs/2411.10179v2
Explicit constructions of optimal blocking sets and minimal codes
2026-05-07T18:09:31Z
A strong $s$-blocking set in a projective space is a set of points that intersects each codimension-$s$ subspace in a spanning set of the subspace. We present an explicit construction of such sets in a $(k - 1)$-dimensional projective space over $\mathbb{F}_q$ of size $O_s(q^s k)$, which is optimal up to the constant factor depending on $s$. This also yields an optimal explicit construction of affine blocking sets in $\mathbb{F}_q^k$ with respect to codimension-$(s+1)$ affine subspaces, and of $s$-minimal codes. Our approach is motivated by a recent construction of Alon, Bishnoi, Das, and Neri of strong $1$-blocking sets, which uses expander graphs with a carefully chosen set of vectors as their vertex set. The main novelty of our work lies in constructing specific hypergraphs on top of these expander graphs, where tree-like configurations correspond to strong $s$-blocking sets. We also discuss some connections to size-Ramsey numbers of hypergraphs, which might be of independent interest.
2024-11-15T13:28:30Z
19 pages, 4 figures, 1 appendix. This version contains a detailed proof of Lemma 8 and minor corrections
Combinatorica 46, 13 (2026)
Anurag Bishnoi, István Tomon
10.1007/s00493-026-00202-5

http://arxiv.org/abs/2510.08539v4
On the optimization dynamics of RLVR: Gradient gap and step size thresholds
2026-05-07T17:44:57Z
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels.
Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.
2025-10-09T17:53:41Z
Joe Suk, Yaqi Duan

http://arxiv.org/abs/2511.08416v3
Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications
2026-05-07T17:16:30Z
Semantic communications mark a paradigm shift from bit-accurate transmission toward meaning-centric communication, essential as wireless systems approach theoretical capacity limits. The emergence of generative AI has catalyzed generative semantic communications, where receivers reconstruct content from minimal semantic cues by leveraging learned priors. Among generative approaches, diffusion models stand out for their superior generation quality, stable training dynamics, and rigorous theoretical foundations. However, the field currently lacks systematic guidance connecting diffusion techniques to communication system design, forcing researchers to navigate disparate literatures.
This article provides the first comprehensive tutorial on diffusion models for generative semantic communications. We present score-based diffusion foundations and systematically review three technical pillars: conditional diffusion for controllable generation, efficient diffusion for accelerated inference, and generalized diffusion for cross-domain adaptation. In addition, we introduce an inverse problem perspective that reformulates semantic decoding as posterior inference, bridging semantic communications with computational imaging. Through analysis of human-centric, machine-centric, and agent-centric scenarios, we illustrate how diffusion models enable extreme compression while maintaining semantic fidelity and robustness. By bridging generative AI innovations with communication system design, this article aims to establish diffusion models as foundational components of next-generation wireless networks and beyond.
2025-11-11T16:27:43Z
Accepted by IEEE COMST, GitHub repository: https://github.com/qin-jingyun/Awesome-DiffComm, project page: https://qin-jingyun.github.io/Awesome-DiffComm
Hai-Long Qin, Jincheng Dai, Guo Lu, Shuo Shao, Sixian Wang, Tongda Xu, Wenjun Zhang, Ping Zhang, Khaled B. Letaief

http://arxiv.org/abs/2605.06452v1
Tight Contraction Rates for Primitive Channels under Quantum $f$-Divergences
2026-05-07T15:48:23Z
Data-processing inequalities capture the phenomenon that two probability distributions can only become less distinguishable under any common post-processing. For more fine-grained inequalities, one turns to strong data-processing inequality (SDPI) constants, which give the strongest inequalities for a given channel and reference state for a fixed measure of distinguishability. These quantities have been used to quantify the rate at which time-homogeneous Markov chains contract towards a fixed point both in the classical and quantum setting.
In this work, we establish that quantum $f$-divergences satisfy a local reverse Pinsker inequality, which implies the asymptotic contraction rate of a primitive channel to its stationary state is upper bounded by the SDPI constant of any non-commutative $χ^2$-divergence. Using quantum-detailed balance, we establish a sufficient condition for these bounds to be tight. Finally, we apply these results to Petz, Matsumoto, and Hirche-Tomamichel $f$-divergences, establishing new and strengthening previously known results.
2026-05-07T15:48:23Z
6+1 pages
Matthew Simon Tan, Marco Tomamichel, Ian George

http://arxiv.org/abs/2604.28153v2
Optimal Transmitter Placement in Realistic Urban Environments
2026-05-07T15:24:24Z
In a wireless network, the spatial location of the transmitters has a large impact on the achievable rate at each user location. The optimal placement of -- for example -- cellular base stations is a difficult non-convex problem, and is usually addressed with simplified propagation models and simplified heuristics that may account for specifics such as the site topology, building locations, and user density. We propose a mathematically rigorous framework for optimal transmitter placement that explicitly integrates detailed site-specific maps, spatial material properties, and realistic signal attenuation. We introduce a novel aggregated network quality functional which captures the essential trade-off between maximizing network coverage and minimizing cost, and establish the problem's sub-modularity under certain practical conditions. To solve the resulting resource-constrained optimization problem for sparse, discrete transmitter configurations, we propose the Interference-Aware Submodular Placement Algorithm (IA-SPA) and prove theoretical performance guarantees on its gap from optimality. IA-SPA is general and can incorporate existing BS locations and prohibited areas (e.g. a lake), making it useful for either clean-slate or incremental deployments.
We show the utility of our approach using a ray tracing-based simulation framework applied to 3D maps of San Francisco and Florence, where we compare to known base station deployments by AT&T, T-Mobile and Iliad. We demonstrate that our proposed placement strategy achieves significant increases in mean data rate (about 2x) and edge rate ($2-8$x) compared to existing tower deployments, using the same number of transmitters.
2026-04-30T17:40:22Z
This work has been submitted to the IEEE for possible publication
Lukas Taus, Richard Tsai, Jeffrey G. Andrews