Rate-Distortion Theory for Deductive Sources under Closure Fidelity

2026-05-28T10:07:23Z

We study lossy compression of a finite statement source generated in a fixed deductive environment. The source symbols are statements in a knowledge base endowed with a shared proof system, and reconstruction fidelity is measured by preservation of deductive closure rather than by symbolwise equality. Fixing the proof system and a canonical scan order yields a decomposition of the source alphabet into an irredundant core and redundant stored consequences. At zero distortion, each core symbol induces a set of distortion-free reconstructions. In the nonconfusable (disjoint-core) regime, we show that the minimum zero-distortion rate equals the source mass of the core times the entropy of the source conditioned on that core. In the general confusable-core regime, we characterise the exact zero-distortion rate via a hypergraph-entropy quantity induced by jointly realisable core subsets, with a reduction to Körner-style graph entropy under a natural pairwise realisability condition. For reconstruction alphabets contained in the deductive closure of the source knowledge base, we further prove that the full rate-distortion function depends only on the core, so redundant states are invisible to both rate and distortion. Finally, when the decoder is limited to a bounded inference-depth budget (a bounded number of iterations of the immediate-consequence operator), we obtain an exact rate-depth-distortion characterisation. Under an additional order-robustness assumption identifying the chosen core with the order-free essential set, this characterisation interpolates between classical symbolwise compression and unconstrained deductive compression.

Envy-Free Allocation of Indivisible Goods via Noisy Queries

2026-05-28T10:02:53Z

We introduce a problem of fairly allocating indivisible goods (items) in which the agents' valuations cannot be observed directly, but instead can only be accessed via noisy queries. In the two-agent setting with Gaussian noise and bounded valuations, we derive upper and lower bounds on the required number of queries for finding an envy-free allocation in terms of the number of items, $m$, and the negative-envy of the optimal allocation, $Δ$. In particular, when $Δ$ is not too small (namely, $Δ\gg m^{1/4}$), we establish that the optimal number of queries scales as $\frac{\sqrt m }{(Δ/ m)^2} = \frac{m^{2.5}}{Δ^2}$ up to logarithmic factors. Our upper bound is based on non-adaptive queries and a simple thresholding-based allocation algorithm that runs in polynomial time, while our lower bound holds even under adaptive queries and arbitrary computation time.
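
The two forms of the query scaling quoted above are algebraically identical, which a few lines of Python can confirm. This is only an arithmetic sketch: the function names are hypothetical, constants and logarithmic factors are dropped, and reading $Δ/m$ as a per-item gap is an interpretation rather than something the abstract states.

```python
import math


def query_scaling(m: int, delta: float) -> float:
    """Order-of-magnitude query count sqrt(m) / (delta / m)^2.

    Constants and logarithmic factors are ignored; this is not the
    paper's algorithm, only the scaling expression from the abstract.
    """
    if m <= 0 or delta <= 0:
        raise ValueError("m and delta must be positive")
    per_item_gap = delta / m  # assumed reading: negative envy per item
    return math.sqrt(m) / per_item_gap ** 2


def query_scaling_closed_form(m: int, delta: float) -> float:
    """Equivalent closed form m^2.5 / delta^2."""
    if m <= 0 or delta <= 0:
        raise ValueError("m and delta must be positive")
    return m ** 2.5 / delta ** 2


if __name__ == "__main__":
    m, delta = 100, 20.0
    print(query_scaling(m, delta), query_scaling_closed_form(m, delta))
```

For instance, with m = 100 and Δ = 20 both expressions evaluate to 250, and doubling Δ cuts the value by a factor of four.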

Using Set Shaping Theory to Trade RAM Accesses for CPU Computation

2026-05-28T09:59:55Z

This paper studies Set Shaping Theory (SST) in a database-index setting under a revised interpretation: SST is not treated as a competing hashing method, but as a structural preprocessing layer that can be applied before an existing indexing algorithm. The experimental question is therefore whether a method improves when it is used with SST rather than without it. The study compares linear probing, double hashing, quadratic probing, and Robin Hood hashing against their corresponding SST-augmented variants for shaping orders K = 2, 4, 8. Beyond mean time, the benchmark reports mean successful probes, 95th and 99th percentile probes, collisions per stored record, and maximum cluster length. Experiments cover load factors from 0.75 to 0.95, database sizes from M = 5000 to M = 500000, query multipliers up to 200 lookups per stored record, and both uniform and hotspot query distributions. The results highlight two fundamental advantages. First, SST reduces the number of RAM accesses required during retrieval. Because clusters and long probe chains are prevented from forming at insertion time, the lookup phase requires fewer memory jumps, lower probe counts, and reduced tail latency. Second, the method introduces a new way of thinking about data storage: the data are not treated as fixed objects that must be placed passively into a table, but as reversible representations that can be structurally adapted before being written. A small metadata tag records which transformation was selected, allowing the original key to remain recoverable and the lookup process to remain deterministic. This article is connected to the Set Shaping Theory simulator project, available online at https://sst-simulator.github.io/Set-Shaping-Theory-Simulator/ where it is possible to simulate part of the results presented in the article.

A Unified Two-Stage Generative Diffusion Framework for Channel Estimation and Port Selection in Multiuser MIMO-FAS

2026-05-28T09:40:32Z

Fluid antenna systems (FAS) have emerged as a promising technology for next-generation wireless systems. However, practical multiuser multiple-input multiple-output FAS (MIMO-FAS) faces two inherently coupled challenges: acquiring accurate high-dimensional channel state information (CSI) from limited RF chains and solving the combinatorial port selection problem, where the effectiveness of the latter highly depends on the result of the former. In this paper, we propose a unified two-stage diffusion framework that formulates the joint task as a maximum-a-posteriori (MAP) inference problem and decomposes it into two sequential sampling stages through a plug-in approximation. For Stage I, a continuous flow-based diffusion model serves as a powerful implicit prior for 2D FAS channels, and a parallel guided generation scheme realizes approximate posterior sampling, enabling accurate multiuser channel recovery even under severely low sub-sampling ratios. For Stage II, a discrete diffusion model is trained to approximate the conditional port selection distribution by combining supervised learning on heuristic labels with reinforcement fine-tuning, effectively overcoming the local optima of conventional heuristic algorithms. Extensive simulations demonstrate that the proposed framework simultaneously achieves exceptional channel estimation accuracy and globally optimized port selection, substantially improving the minimum achievable rate.

Matching Rates and Optimal Allocation for Federated Probe-Logit Distillation under Heterogeneous Bandwidth Budgets

2026-05-28T09:08:39Z

In federated language modeling, $K$ nodes each hold $n$ samples but cannot pool data or exchange full-precision gradients or weights. We study the minimax rate at which a conditional distribution over $V$ tokens can be estimated when each node may upload at most $B$ bits per query in a public probe set. In federated probe-logit distillation (FPLD), each node transmits a scalar-quantized logit vector on the probe set, and an aggregator distills a global parametric student. Prior work (Dubey and Huo, 2026) establishes a high-probability KL rate $O(d/(Kn) + ρ\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$ plus optimization slack, with the bandwidth term in its trace-sharpened form. Whether this bandwidth-term rate is tight, and how the upper bound generalizes to heterogeneous per-node bandwidths, are left open. We close both gaps. First, the dithered FPLD construction has a matching single-round lower bound $Ω(K^{-1} \cdot 2^{-2B/V})$ under non-degeneracy, pinning the bandwidth-axis rate at $Θ(K^{-1} \cdot 2^{-2B/V})$. $T$-round sequential refinement with nested/scaled residual quantizers achieves $O(K^{-1} \cdot 2^{-2TB/V})$; vanilla FPLD's $T$-independent bandwidth term is suboptimal for every $T > 1$. Second, we establish a heterogeneous-bandwidth upper bound for per-node budgets $B_i$, paired with a closed-form optimal allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$, a log-tilted water-filling rule that is the per-node analogue of reverse water-filling for distortion-rate optimization. A plug-in adaptive variant estimates the weights from a short warm-up phase and attains $1 + O(\sqrt{\log(K/δ)/(m T_0)})$ relative suboptimality. Synthetic n-gram simulations confirm that empirical KL is bracketed by the upper and lower bounds and that the optimal allocation strictly dominates uniform and inverse-weighted baselines under heterogeneous clipping.
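
The closed-form allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$ can be evaluated directly. The sketch below assumes that $\bar{w}_g$ denotes the geometric mean of the per-node weights, which the abstract does not state explicitly; under that assumption the log terms cancel and the allocations sum to $B_{\mathrm{tot}}$. The function name is hypothetical, and the sketch does not round to integers or clip negative allocations, which a practical scheme would need to handle.

```python
import math


def optimal_bit_allocation(weights, total_bits, vocab_size):
    """Evaluate B_i = B_tot/K + (V/2) * log2(w_i / geometric_mean(w)).

    Assumption: w_bar_g in the abstract is the geometric mean of the weights.
    Returns real-valued (possibly negative) allocations; no rounding or clipping.
    """
    k = len(weights)
    if k == 0:
        raise ValueError("at least one node is required")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive")
    # log2 of the geometric mean is the arithmetic mean of the log2 weights.
    log2_geo_mean = sum(math.log2(w) for w in weights) / k
    uniform_share = total_bits / k
    return [
        uniform_share + (vocab_size / 2) * (math.log2(w) - log2_geo_mean)
        for w in weights
    ]


if __name__ == "__main__":
    print(optimal_bit_allocation([1.0, 2.0, 4.0], total_bits=300, vocab_size=10))
```

With weights 1, 2, 4, a total budget of 300 bits, and V = 10, the rule shifts 5 bits from the lowest-weight node to the highest-weight one (95, 100, 105), keeping the total fixed.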

A Comprehensive Survey on Semantic Communication in Non-Terrestrial Networks: Architectures, Methodologies, and Challenges

2026-05-28T09:02:39Z

Sixth-generation (6G) wireless networks are envisioned to deliver ubiquitous, seamless, and intelligent connectivity that reaches far beyond the limits of terrestrial infrastructure. Non-terrestrial networks (NTNs) are central to this vision, extending coverage to underserved regions, remote terrain, and disaster zones that terrestrial deployment cannot economically reach. However, NTN architectures face numerous limitations: severe path loss over long distances, long propagation delays, large and time-varying Doppler shifts, limited visibility windows, and tight on-board energy and computing budgets. Semantic communication (SemCom), which conveys the meaning of data rather than its raw bit-level representation, is unusually well matched to these conditions: extreme task-oriented compression eases bandwidth scarcity, deep joint source-channel coding prevents the cliff effect at low signal-to-noise ratio, and generative AI reconstructs content from sparse cues that survive rain-faded or blocked links. This observation, that each NTN limitation maps onto a SemCom property that addresses it, motivates our survey. We first walk through the NTN limitations one by one, pairing each with the SemCom design choices that complement it. We then organize the literature along three axes: the NTN platform, the semantic methodology, and the supporting techniques, and follow this with platform-by-platform deep dives on satellite-centric, UAV/HAPS-centric, and integrated SAGIN systems. The survey concludes by identifying open research problems, gaps in existing standards, and future directions, including the application of foundation models, energy-aware scheduling, and quantum-assisted SemCom for deep space communication.

Rate Maximization for Multi-Waveguide PASS: A Hierarchical User Scheduling and Joint Optimization Framework

2026-05-28T09:00:28Z

Pinching-antenna systems (PASS) have emerged as a promising flexible-antenna architecture capable of dynamically reconfiguring wireless channels by activating dielectric particles along waveguides. The sum rate maximization problem in multi-waveguide PASS is investigated in this study. Both in-waveguide propagation loss and coupling effects are explicitly modeled. To tackle the optimization problem, a hierarchical user scheduling (HUS) algorithm is proposed. The HUS algorithm minimizes the sum of squared distances between users and their associated waveguides to mitigate path loss. Additionally, spatially separated users are assigned within each time slot to reduce inter-user interference. Furthermore, a joint optimization framework integrating power allocation and pinching-antenna (PA) positioning is developed to further improve system sum rate. Specifically, PAs' positions are optimized via one-dimensional search, while the power allocation problem is solved by using the Lagrangian duality and fractional programming. Numerical results show that the HUS algorithm clearly outperforms random pairing, and the proposed power allocation algorithm shows a marked performance improvement over the maximum ratio transmission algorithm. Moreover, the results explicitly demonstrate the considerable impact of in-waveguide propagation loss and coupling effects on the performance of PASS.

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

2026-05-28T06:40:29Z

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This class also includes determinantal (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that fixed-size subsets drawn uniformly at random, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.
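
For orientation, the Vendi Score in its standard form is the exponential of the von Neumann (quantum) entropy of the normalised similarity matrix $K/n$. The sketch below computes it with a full eigendecomposition, assuming a symmetric positive semidefinite kernel with unit diagonal. It is the naive baseline, not the secular-equation-based update described in the abstract, and the function name is hypothetical.

```python
import numpy as np


def vendi_score(kernel: np.ndarray) -> float:
    """Exponential of the entropy of the eigenvalues of kernel / n.

    Assumes `kernel` is a symmetric PSD similarity matrix with ones on the
    diagonal, so the eigenvalues of kernel / n are non-negative and sum to 1.
    Uses a full eigendecomposition (naive baseline, O(n^3) per call).
    """
    n = kernel.shape[0]
    eigvals = np.linalg.eigvalsh(kernel / n)
    # Drop numerically zero eigenvalues; 0 * log(0) is taken as 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


if __name__ == "__main__":
    print(vendi_score(np.eye(4)))        # four mutually dissimilar items
    print(vendi_score(np.ones((4, 4))))  # four identical items
```

The two extreme cases give a useful sanity check: n mutually orthogonal items score n, while n identical items score 1, so the score behaves like an effective number of distinct items.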

User-Centric Clustering for uRLLC in Cell-Free RAN via Extreme Value Theory

2026-05-28T06:34:48Z

Ultra-reliable low-latency communication (uRLLC) is a pivotal enabler for B5G/6G networks, yet it faces severe challenges from rare but critical extreme events, which are characterized by heavy tails in the delay distribution. While the cell-free radio access network (CF-RAN) architecture offers essential spatial diversity to combat these uncertainties, conventional user-centric clustering designs typically focus on average metrics, thereby inadequately addressing such tail behaviors. We propose a novel, tail-risk-aware, user-centric clustering framework operating within the finite blocklength (FBL) regime. Our approach employs extreme value theory (EVT), specifically the peaks-over-threshold (POT) model, to accurately quantify the probability of queue latency violations. This framework is applied to formulate an energy efficiency (EE) maximization problem under strict tail latency constraints. The problem is solved via an efficient online algorithm that integrates Lyapunov optimization with successive convex approximation (SCA). Simulation results demonstrate that the proposed scheme, through its dynamic adaptation of cluster formation to mitigate tail risks, achieves a superior reliability-efficiency trade-off and leads to a significant suppression of extreme latency events.
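
As background on the peaks-over-threshold model mentioned above, the standard POT tail approximation combines the empirical exceedance rate of a threshold with a generalised Pareto fit of the excesses. The sketch below evaluates that textbook formula in Python. All parameter names and values are hypothetical, and it does not reproduce the paper's queue model or its fitting procedure.

```python
import math


def pot_tail_probability(x, threshold, exceed_rate, scale, shape):
    """Standard POT estimate of P(X > x) for x >= threshold.

    exceed_rate: empirical P(X > threshold)
    scale, shape: generalised Pareto parameters fitted to the excesses
    (all inputs are illustrative assumptions, not values from the paper).
    """
    if x < threshold:
        raise ValueError("the POT approximation only applies above the threshold")
    if scale <= 0:
        raise ValueError("scale must be positive")
    z = (x - threshold) / scale
    if abs(shape) < 1e-12:
        # Exponential tail: the shape -> 0 limit of the generalised Pareto.
        return exceed_rate * math.exp(-z)
    base = 1.0 + shape * z
    if base <= 0.0:
        # For negative shape the distribution has a finite upper endpoint.
        return 0.0
    return exceed_rate * base ** (-1.0 / shape)


if __name__ == "__main__":
    # Probability that latency exceeds 3.0 given a threshold of 1.0.
    print(pot_tail_probability(3.0, threshold=1.0, exceed_rate=0.1, scale=1.0, shape=0.5))
```

A positive shape parameter corresponds to a heavy (polynomially decaying) tail, which is the regime the abstract associates with extreme delay events.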

Björck Sequences: Extension to Arbitrary Lengths, Correlation Analysis, and Applications to Wireless Systems

2026-05-28T05:13:17Z

In this paper, we propose a sequence construction framework that extends prime-length Björck sequences, a class of Constant Amplitude Zero Autocorrelation (CAZAC) sequences, to arbitrary lengths using Goldbach's conjecture for even and odd integers. The framework is generic: it applies to any CAZAC family defined for prime lengths and supports extensions to both cyclically shifted sequences and sequences with different root indices. We analytically characterize the resulting correlation behavior and show that the construction preserves orthogonality among cyclic shifts while maintaining favorable zero-lag cross-correlation across different root-index sequences. We further investigate Björck sequences as candidates for reference signals in next-generation wireless systems. Using the proposed framework, we extend Björck sequences to arbitrary lengths and evaluate their time- and frequency-offset estimation performance in terrestrial networks (TNs) and non-terrestrial networks (NTNs). Results show performance comparable to Zadoff--Chu (ZC) sequences in low-Doppler TN environments and improved robustness in high-Doppler NTN scenarios due to superior ambiguity-function properties. We also identify an inherent Doppler-dependent behavior that can cause sequence misidentification under large Doppler shifts. To address this, we propose two mitigation strategies: (i) leveraging coarse Doppler estimates prior to detection, and (ii) selecting appropriately spaced subsets of orthogonal sequences. Ambiguity function-based analysis demonstrates the effectiveness of these approaches in improving estimation reliability. Overall, this work enables practical arbitrary-length CAZAC sequence design and establishes Björck sequences as a strong alternative for reference signal design in high-Doppler environments.

Constacyclic codes with best-known parameters

2026-05-28T04:36:01Z

In this paper, we construct several infinite families of $q$-ary constacyclic codes over a finite field $\mathbb{F}_q$ with length $n$, dimension around $n/2$, and minimum distance at least $cn/\log_q n$ for some positive constant $c$. They contain many constacyclic codes with optimal, or almost-optimal, or best-known parameters. We also consider constacyclic codes of various lengths.

Transient Acceleration and Cross-Dissipation Interference in Fisher-Regularized Wasserstein Gradient Flows

2026-05-28T03:59:49Z

We study transient nonequilibrium dynamics in Fisher-regularized Wasserstein gradient flows and identify a sign-changing cross-dissipation mechanism generated by the coupling between transport dissipation and Fisher-information geometry. Using the Ornstein--Uhlenbeck Fokker--Planck system as an analytically tractable setting, we derive an exact reduced variance dynamics on the Gaussian manifold, \[ \dot{u}=2(1-u)+\frac{\varepsilon}{u}, \] where $u(t)=σ^2(t)$ is the variance and $\varepsilon>0$ is the Fisher regularization strength. The reduced dynamics reveal distinct transient regimes induced by the interaction between transport relaxation and information-geometric curvature. The associated cross-dissipation term changes sign at the critical scale $σ=1$, separating cooperative acceleration for localized states with $σ<1$ from transient interference at larger variance scales. In the subcritical regime, Fisher curvature accelerates the descent of the baseline free energy; beyond the critical transition, it partially opposes the Ornstein--Uhlenbeck pullback and generates transient overshoot toward a displaced Fisher-regularized equilibrium. We also establish a bounded transient-acceleration-window result, showing that the cooperative acceleration phase has finite duration with an upper bound depending only on the Fisher regularization strength. Finite-difference simulations support the analytical predictions and suggest that qualitatively similar sign-transition behavior may persist beyond Gaussian closure for non-Gaussian initial conditions, including bimodal and Laplace distributions. Overall, the results provide a transient dynamical perspective on Fisher-regularized dissipative systems and show how information-geometric curvature can reorganize intermediate-time Wasserstein relaxation while preserving the globally dissipative structure of the flow.
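
The reduced variance dynamics $\dot{u}=2(1-u)+\varepsilon/u$ are simple enough to check numerically. Setting the right-hand side to zero gives $2u^2 - 2u - \varepsilon = 0$, whose positive root is $u^* = (1+\sqrt{1+2\varepsilon})/2 > 1$, consistent with the displaced equilibrium mentioned above. The Python sketch below integrates the ODE with forward Euler and compares against this root. The step size, horizon, and function names are illustrative choices, not the paper's finite-difference scheme.

```python
import math


def variance_rhs(u: float, eps: float) -> float:
    """Right-hand side of du/dt = 2(1 - u) + eps / u."""
    return 2.0 * (1.0 - u) + eps / u


def equilibrium_variance(eps: float) -> float:
    """Positive root of 2u^2 - 2u - eps = 0."""
    return (1.0 + math.sqrt(1.0 + 2.0 * eps)) / 2.0


def integrate_variance(u0: float, eps: float, t_end: float = 10.0, dt: float = 1e-3) -> float:
    """Forward-Euler integration of the reduced dynamics from u(0) = u0 > 0."""
    if u0 <= 0 or eps <= 0:
        raise ValueError("u0 and eps must be positive")
    u = u0
    for _ in range(int(round(t_end / dt))):
        u += dt * variance_rhs(u, eps)
    return u


if __name__ == "__main__":
    eps = 0.5
    print(equilibrium_variance(eps))
    print(integrate_variance(0.2, eps), integrate_variance(3.0, eps))
```

Trajectories started below and above the equilibrium both settle at the same value, so the sketch reproduces only the long-time limit; resolving the transient acceleration window would require inspecting the trajectory at intermediate times.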

NOVA: Fundamental Limits of Knowledge Discovery Through AI

2026-05-28T03:12:35Z

Can AI systems discover genuinely new knowledge through iterative self-improvement, and if so, at what cost? We introduce the NOVA framework, which models the common ``generate, verify, accumulate, retrain'' loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good--Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent $α>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=Θ(c_{\mathrm{gen}}D^α)$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.

TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

2026-05-27T22:16:59Z

Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as climate, observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets remain underrepresented in public benchmarks due to proprietary restrictions and privacy concerns. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks such as anomaly detection, root cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit absolute scale information and provides a diverse suite of downstream tasks, including anomaly detection, root cause analysis, and multi-modal question-answering. Benchmarking state-of-the-art time series, language, reasoning, and multi-modal foundation models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics characteristic of observability data. Our experiments further underscore the importance of preserving covariates' absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical real-world observability applications. The code is available at: https://github.com/Ali-maatouk/TelecomTS.

MIMO-AFDM Outperforms MIMO-OFDM in the Face of Hardware Impairments

2026-05-27T21:13:52Z

The impact of both multiplicative and additive hardware impairments (HWIs) on multiple-input multiple-output affine frequency division multiplexing (MIMO-AFDM) systems is investigated. For small-scale MIMO-AFDM systems, a tight bit error rate (BER) upper bound associated with the maximum likelihood (ML) detector is derived. By contrast, for large-scale systems, a closed-form BER approximation associated with the linear minimum mean squared error (LMMSE) detector is presented, including realistic imperfect channel estimation scenarios. Our first key observation is that the full diversity order of a hardware-impaired AFDM system remains unaffected, which is a unique advantage. Furthermore, our analysis shows that: 1) the derived BER results accurately predict the simulated ML performance at moderate-to-high signal-to-noise ratios (SNRs), while the theoretical BER curve of the LMMSE detector closely matches its Monte-Carlo simulated counterpart; 2) MIMO-AFDM is more resilient to multiplicative distortions, such as phase noise and carrier frequency offset, than its orthogonal frequency division multiplexing (OFDM) counterpart, which is attributed to its inherent chirp signal characteristics; 3) MIMO-AFDM consistently achieves superior BER performance compared to conventional MIMO-OFDM systems under the same additive HWI conditions and across different velocity values, because MIMO-AFDM is also resilient to the additional inter-carrier interference (ICI) imposed by the nonlinear distortions of additive HWIs. In a nutshell, compared to OFDM, AFDM demonstrates stronger ICI resilience and achieves the full attainable diversity gain even under HWIs, thanks to its intrinsic chirp signalling structure as well as to the beneficial spreading effect of the discrete affine Fourier transform.