Repeated Sequences Reveal Gaps between Large Language Models and Natural Language

2026-05-24T03:49:20Z

Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge. Existing evaluation methods, largely based on task performance or short-context behavior, provide limited insight into the long-range statistical organization of generated text. We propose a complementary evaluation framework based on repeated subsequences. By analyzing their distribution across scales and relating it to higher-order Rényi entropies, we probe how texts reuse previously established structure under finite-length conditions. Experiments on human-written texts and length-matched GPT-generated texts show that, while power-law models can describe restricted ranges of block length, the observed entropy growth is often equally or better characterized by logarithmic--power forms. Across datasets, natural language exhibits stable entropy-growth patterns over accessible ranges, with consistent average behavior despite variability across individual texts. In contrast, GPT-generated texts show systematic and statistically significant shifts in estimated exponents with model size. These results demonstrate that repeated-subsequence entropy provides a quantitative structural diagnostic that reveals systematic differences in long-range organization, distinguishing natural language from state-of-the-art LLM outputs beyond surface-level fluency.
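
The link between repeated subsequences and higher-order Rényi entropy can be illustrated for order 2, where the entropy is the negative logarithm of the probability that two randomly chosen blocks of length n coincide. The following is a minimal sketch under that assumption, not the estimator used in the paper; the function names `block_counts` and `renyi2_block_entropy` are hypothetical, and the pair-counting estimate of the collision probability is a standard choice swapped in for illustration.

```python
from collections import Counter
from math import log


def block_counts(text, n):
    """Count every overlapping block (substring) of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def renyi2_block_entropy(text, n):
    """Estimate the order-2 Renyi entropy of length-n blocks, in nats.

    The collision probability is estimated as the fraction of ordered
    pairs of block positions that hold identical blocks. Returns None
    when the text contains no repeated block of this length, because
    the estimate is then undefined.
    """
    counts = block_counts(text, n)
    total = sum(counts.values())
    repeated_pairs = sum(c * (c - 1) for c in counts.values())
    if total < 2 or repeated_pairs == 0:
        return None
    collision = repeated_pairs / (total * (total - 1))
    return -log(collision)


# Illustrative input only: a short periodic string.
example_entropy = renyi2_block_entropy("abababab", 2)
```

Evaluating this estimate over a range of block lengths n gives an entropy-growth curve of the kind the abstract describes; the estimate stops being defined once n exceeds the longest repeated block, which is one way the finite-length condition shows up.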

Multivariate Multicycle Codes for Complete Single-Shot Decoding

2026-05-24T01:40:16Z

We introduce multivariate multicycle (MM) codes, a new family of quantum error-correcting codes (QECCs) that unifies bivariate bicycle, multivariate bicycle, abelian two-block group algebra, generalized bicycle, trivariate tricycle, and toric codes. MM codes are Calderbank-Shor-Steane (CSS) codes defined from length-$(t+1)$ chain complexes with $t \ge 4$. The chief advantage of these codes is that they possess metachecks and high confinement that permit complete single-shot decoding. We offer a framework that facilitates the construction of long chain complexes through the use of the Koszul complex. In particular, explicit boundary maps (parity check and metacheck matrices) are straightforward to obtain in our approach. This simple but very general parameterization of codes permits an efficient numerical search, in which we identify several MM code candidates that demonstrate these capabilities at high rates and high code distances. Examples of new codes with parameters $[[n,k,d]]$ include $[[96, 12, 8]]$, $[[144, 12, 12]]$, $[[216, 12, 14]]$, $[[288, 12, 16]]$, $[[324, 12, 20]]$, $[[432, 12, 27]]$, $[[486, 24, 12]]$, $[[630, 70, 9]]$, and $[[648, 18, 23]]$. Notably, our codes achieve confinement profiles that surpass all known single-shot-decodable quantum CSS codes of practical blocksize. Our codes are also the first explicit instances of collapsed 5D through 9D higher dimensional QECCs, with check weights significantly lower than those of recent small instances of quantum Tanner codes.
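
To make the role of the Koszul complex concrete, the sketch below builds the shortest such complex, on two commuting maps, and checks the defining property that consecutive boundary maps compose to zero over GF(2). This is only the two-variable analogue, not an MM code: the abstract's codes use longer complexes that also supply metachecks. The helper names `cyclic_shift` and `koszul_two_term`, the 3-by-3 size, and the choice A = 1 + x, B = 1 + y are assumptions made for illustration.

```python
import numpy as np


def cyclic_shift(m):
    """m-by-m cyclic shift permutation matrix with 0/1 entries."""
    return np.roll(np.eye(m, dtype=np.uint8), 1, axis=1)


def koszul_two_term(A, B):
    """Boundary maps of the Koszul complex on two commuting matrices.

    d1 = [A | B] and d2 = [B ; A], so d1 @ d2 = AB + BA, which is
    zero mod 2 whenever A and B commute.
    """
    d1 = np.hstack([A, B]) % 2
    d2 = np.vstack([B, A]) % 2
    return d1, d2


# Two commuting cyclic shifts acting on an l-by-m grid (illustrative sizes).
l, m = 3, 3
x = np.kron(cyclic_shift(l), np.eye(m, dtype=np.uint8))
y = np.kron(np.eye(l, dtype=np.uint8), cyclic_shift(m))
identity = np.eye(l * m, dtype=np.uint8)

A = (identity + x) % 2  # the polynomial 1 + x
B = (identity + y) % 2  # the polynomial 1 + y

d1, d2 = koszul_two_term(A, B)

# CSS condition: taking H_X = d1 and H_Z = d2.T gives H_X @ H_Z.T = 0 mod 2.
boundary_squared = (d1.astype(int) @ d2.astype(int)) % 2
```

Here `d1` acts as an X-type parity-check matrix on 18 qubits and the transpose of `d2` as the Z-type one. Adding further commuting variables lengthens the complex, which is where additional boundary maps such as metacheck matrices would come from.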

Equivalence of Families of Polycyclic Codes over Finite Fields

2026-05-24T01:38:22Z

We study the equivalence of families of polycyclic codes associated with polynomials of the form $x^n - a_{n-1}x^{n-1} - \ldots - a_1x - a_0$ over a finite field. We begin with the specific case of polycyclic codes associated with a trinomial $x^n - a_{\ell} x^{\ell} - a_0$ (for some $0< \ell

Secure Semantic Communication over Wiretap Channels: Rate-Distortion-Equivocation Tradeoff

2026-05-24T00:24:14Z

This paper investigates an information-theoretic model of secure semantic-aware communication. For this purpose, we consider the lossy joint source-channel coding (JSCC) of a memoryless semantic source transmitted over a memoryless wiretap channel. The source consists of two correlated parts that represent semantic and observed aspects of the information. Our model assumes separate fidelity and secrecy constraints on each source component and, in addition, encompasses two cases for the source output, in order to evaluate the performance gains if the encoder has an extended access to the source. Specifically, in Case 1, the encoder has direct access only to the samples from a single (observed) source component, while in Case 2 it has additional direct access to the samples of the underlying semantic information. We derive single-letter converse and achievability bounds on the rate-distortion-equivocation region. The converse bound explicitly contains rate-distortion functions, making it easy to evaluate, especially for some common distributions. The proposed achievability coding scheme involves novel stochastic superposition coding with two private parts to enable analysis of the equivocation for each source component, separately. Our results generalise some of the previously established source and source-channel coding problems. The general results are further specialised to Gaussian and Bernoulli sources transmitted over Gaussian and binary wiretap channels, respectively. The numerical evaluations illustrate the derived bounds for these distributions.

Secure Joint Source-Channel Coding of Multimodal Semantic Sources

2026-05-23T23:16:56Z

We study the problem of secure joint source-channel coding for multimodal semantic sources transmitted over noisy wiretap channels. The source model consists of $m$ modalities (e.g., image, audio, and sensor data), all represented as random variables. The encoder observes independent and identically distributed samples of an arbitrary non-empty subset of modalities. The samples are encoded and transmitted over a discrete memoryless wiretap channel. The legitimate receiver reconstructs all modalities. We extend the rate-distortion-perception problem formulation to multimodal sources. We establish converse and achievability bounds on the fundamental limits of transmission rate, fidelity, and secrecy, under per-modality distortion and perception constraints, and per-subset equivocation constraints. We show that the fundamental limit for secrecy consists of three operationally distinct components: the level of compression, the secret key rate, and the statistics of the wiretap channel.

On the Sample Complexity of Robust Binary Hypothesis Testing

2026-05-23T21:29:23Z

We study the sample complexity of robust binary hypothesis testing under three standard contamination models: $\varepsilon$-additive (Huber), $\varepsilon$-subtractive, and $\varepsilon$-total variation (TV), with sample complexities denoted by $n^*_{\mathrm{Hub}}(\varepsilon)$, $n^*_{\mathrm{Sub}}(\varepsilon)$, and $n^*_{\mathrm{TV}}(\varepsilon)$, respectively. For subtractive contamination, we show that least favourable distributions exist and provide explicit formulas for them, bringing this model in line with the classical Huber and TV models. Next we show that in all three models, sample complexity may be highly unstable in the contamination parameter $\varepsilon$, increasing by polynomial factors even for $o(\varepsilon)$ perturbations. Similarly, there may be polynomial factor gaps between the sample complexities when $\varepsilon$ is known exactly versus when it is known up to $o(\varepsilon)$ error. Despite the instability of the sample complexity in all models, we show that the sample complexities across models are comparable up to constant-factor rescaling of $\varepsilon$. Specifically, for any fixed $\delta_0>0$, the following hold for all distributions $p$ and $q$: (i) $n^*_{\mathrm{Hub}}(\varepsilon) \lesssim n^*_{\mathrm{TV}}(\varepsilon) \lesssim n^*_{\mathrm{Hub}}(2\varepsilon)$, (ii) $n^*_{\mathrm{Sub}}(\varepsilon) \lesssim n^*_{\mathrm{TV}}(\varepsilon) \lesssim n^*_{\mathrm{Sub}}((2+\delta_0)\varepsilon)$, and (iii) $n^*_{\mathrm{Sub}}(\varepsilon) \lesssim n^*_{\mathrm{Hub}}(\varepsilon) \lesssim n^*_{\mathrm{Sub}}((1+\delta_0)\varepsilon)$, and the scaling constants are tight. Finally, we extend our results to adaptive versions of the contamination models.

Age of Information Optimization for Status Updates in Integrated Sensing and Communication Systems

2026-05-23T19:59:59Z

In this paper, we study age of information (AoI) optimization for status updating in an integrated sensing and communication (ISAC) system. We consider a discrete-time architecture in which a base station interacts with a physical environment and a remote monitor, and at each time slot can operate in one of three modes: sensing, communication, or joint sensing and communication. Each mode is unreliable and incurs a different operational cost. The objective is to minimize a discounted infinite-horizon cost that combines the AoI at the monitor with action-dependent sensing and communication costs. For the single-source scenario, we formulate the problem as a Markov decision process with a two-dimensional AoI state and prove that the optimal stationary policy admits an ordered threshold structure in the AoI state space. Since the AoI evolves over an infinite space, we truncate the state space to reduce complexity and rigorously bound the resulting error. We determine analytically the truncation size needed to keep the error below a given threshold. For the multi-source scenario, we formulate the scheduling problem as a restless multi-armed bandit. We develop both a Whittle index policy and an approximate Whittle index policy for scheduling under two different regimes, one where indexability is guaranteed, and one where it is not. Numerical results illustrate the structure of the optimal policy in the single-source case and show that the proposed approximate Whittle index policy performs comparably to the Whittle index policy in the indexable regime, while remaining effective beyond it.

Joint Service Placement and Resource Optimization in Hierarchical Edge-Cloud Networks

2026-05-23T16:15:05Z

Hierarchical edge-cloud computing-aided Internet of Things (IoT) networks offer low-latency and cost-efficient services to a growing number of data-intensive IoT devices. However, optimizing service placement, which involves determining the most suitable locations within a network to deploy various services, is critical to balancing workloads dynamically and ensuring efficient resource utilization. In this paper, we jointly optimize service placement, edge/cloud cooperation, task offloading, and bandwidth allocation to enhance processing efficiency and response times. The main objective is to minimize both the overall end-to-end latency and the system cost, including service deployment and operational costs. The formulated problem belongs to the class of non-convex mixed-integer nonlinear programming, where finding a feasible solution is already challenging. Towards a stable system, we first transform the original problem into a more tractable form and then decompose it into sub-problems which are solved at different timescales. Combining tools from relaxation and the successive convex approximation method, we develop iterative algorithms to solve these problems efficiently. With an appropriate penalty parameter, the proposed algorithms guarantee convergence to at least a local optimum. We produce extensive numerical results to demonstrate the superior performance of the proposed algorithms over benchmark schemes as well as emphasize the significance of the joint service placement and resource allocation in enhancing system performance and efficiency.

Privacy-Preserving Proof of Human Authorship via Zero-Knowledge Process Attestation

2026-05-23T12:55:30Z

Process attestation verifies human authorship by collecting behavioral biometric evidence, including keystroke dynamics, typing patterns, and editing behavior, during the creative process. However, the very data needed to prove authenticity can reveal intimate details about an author's cognitive state, health conditions, and identity, constituting sensitive biometric data under GDPR Article 9. We resolve this privacy-attestation paradox using zero-knowledge proofs. We present ZK-PoP, a construction that allows a verifier to confirm that (a) sequential work function chains were computed correctly, (b) behavioral feature vectors fall within human population distributions, and (c) content evolution is consistent with incremental human editing, all without learning the underlying behavioral data, exact timing, or intermediate content. Our construction uses Groth16 proofs over arithmetic circuits with Pedersen commitments and Bulletproof range proofs. We prove that ZK-PoP is computationally zero-knowledge, computationally sound, and achieves unlinkability across sessions. Evaluation shows proof generation in under 30 seconds for a 1-hour writing session, with 192-byte proofs verifiable in 8.2 ms, while incurring less than 5% accuracy loss in simulation at practical privacy levels (epsilon >= 1.0) compared to non-private baselines.

On Constructing and Decoding Quantum Triorthogonal Codes

2026-05-23T11:13:06Z

A triorthogonal code is a binary quantum Calderbank-Shor-Steane (CSS) code defined by a triorthogonal matrix. Triorthogonal codes are a key ingredient in magic-state distillation, since they allow for transversal $\mathsf{T}$ gates, a non-Clifford logical operation useful for achieving universal fault-tolerant quantum computation. Their construction is challenging because it must satisfy simultaneous pairwise and triple-wise overlap constraints, as well as row-weight requirements. In this work, we study the construction and decoding of triorthogonal codes with prescribed dual-distance properties. We derive an existence criterion for even-weight triorthogonal generator matrices with a target dual minimum distance. The criterion combines triorthogonality constraints with MacWilliams identities via Krawtchouk-polynomial conditions on the dual weight distribution, yielding an integer linear programming formulation for the construction problem. We find new nontrivial triorthogonal codes that are not necessarily generated by classical triply-even codes. The decoding performance of high-distance triorthogonal codes obtained via the doubling construction is then evaluated over the dephasing channel. We compare bounded-distance decoding, belief propagation plus ordered-statistics post-processing, and a GRAND-based decoder adapted to the quantum setting, which turns out to be a promising option.

The Normalized Maximum Likelihood for Regular Non-Smooth Models: Measure-Theoretic Foundations and Geometric Sampling

2026-05-23T08:57:48Z

The Normalized Maximum Likelihood (NML) codelength, or stochastic complexity, represents a principled criterion for universal coding. While recent coarea-based formulations provided a calculation method for smooth models, this framework collapses for the non-smooth estimators ubiquitous in modern machine learning (e.g., Lasso, Sparse SVMs). In this work, we provide a rigorous framework for computing the NML for regular path-differentiable Lipschitz (PDL) estimators. By applying classical geometric measure theory and bridging the coarea formula with conservative Jacobians, we prove that the stochastic complexity for non-smooth models is well-posed and theoretically consistent with the outputs of modern Automatic Differentiation. To compute this quantity exactly, we introduce the Propose-and-Project Metropolis-Hastings (PDL-PPMH) sampler, a geometric MCMC algorithm capable of traversing the non-differentiable level sets of the maximum likelihood estimator. We theoretically justify its components, including a stochastic tangent space proposal and a provably convergent non-smooth projection solver. We demonstrate the method's robustness by sampling from a high-dimensional Lasso posterior ($P=2000$), while simultaneously quantifying the computational scaling that governs the trade-off between exactness and mixing time. Crucially, we empirically demonstrate that our exact NML criterion provides a highly data-efficient alternative to cross-validation, achieving statistically indistinguishable predictive optima without requiring data splitting. Altogether, our work paves the way for the theoretical analysis of the NML codelength for regular non-smooth models.

Reed-Solomon Codes with Optimal Repair Bandwidth: A Basis-Transformation Approach

2026-05-23T05:45:58Z

Maximum distance separable (MDS) codes are widely used in distributed storage, but naively repairing a single failure in an $(n,k)$ MDS code requires downloading the full contents of $k$ surviving nodes. Minimum storage regenerating (MSR) codes, introduced by Dimakis et al., minimize repair bandwidth while preserving the MDS property by contacting $d>k$ helper nodes and downloading only a fraction of each helper. For scalar MDS codes, Guruswami and Wootters established a linear repair framework, and Tamo, Ye, and Barg subsequently gave the first explicit Reed-Solomon (RS) codes achieving the MSR point. Their construction yields RS-MSR codes with subpacketization $\ell=s\prod_{i=1}^n p_i$, where $s=d+1-k$ and the distinct primes $p_i$ satisfy $p_i\equiv 1\pmod{s}$. In this paper, we show that this congruence condition is not intrinsic to the RS repair problem. We develop a basis-transformation approach to the construction of repair-enabling subspaces. The approach consists of three deterministic operations -- Euclidean Square Partition, Transposition, and Column Aggregation -- which construct the required repair-enabling subspaces directly from the standard monomial basis of the repair field. Consequently, we obtain RS-MSR codes with subpacketization $\ell=s\prod_{i=1}^n p_i$ for arbitrary distinct primes $p_i>s$. For fixed $s$, this improves the subpacketization of the Tamo--Ye--Barg construction by a factor asymptotic to $\varphi(s)^{n+\mathrm{o}(n)}$, where $\varphi(\cdot)$ denotes Euler's totient function.
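
The effect of dropping the congruence condition on the subpacketization $\ell=s\prod_{i=1}^n p_i$ can be checked with simple arithmetic. The sketch below compares the two prime-selection rules from the abstract; picking the smallest admissible primes, the values s = 4 and n = 3, and the helper names `is_prime`, `smallest_primes`, and `subpacketization` are all assumptions made for illustration.

```python
from math import prod


def is_prime(q):
    """Trial-division primality test, adequate for small q."""
    if q < 2:
        return False
    i = 2
    while i * i <= q:
        if q % i == 0:
            return False
        i += 1
    return True


def smallest_primes(n, accept):
    """Return the n smallest primes p for which accept(p) is true."""
    found, q = [], 2
    while len(found) < n:
        if is_prime(q) and accept(q):
            found.append(q)
        q += 1
    return found


def subpacketization(s, primes):
    """Evaluate l = s * p_1 * ... * p_n for distinct primes p_i."""
    return s * prod(primes)


s, n = 4, 3  # illustrative values; s = d + 1 - k in the abstract

# Rule with the congruence condition: distinct primes p = 1 (mod s).
congruent_primes = smallest_primes(n, lambda p: p % s == 1)
# Relaxed rule: arbitrary distinct primes p > s.
relaxed_primes = smallest_primes(n, lambda p: p > s)

l_congruent = subpacketization(s, congruent_primes)
l_relaxed = subpacketization(s, relaxed_primes)
```

For these values the congruent rule selects 5, 13, 17 and the relaxed rule selects 5, 7, 11, giving subpacketization 4420 versus 1540. The gap widens with n because primes congruent to 1 modulo s make up only a fraction of all primes.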

SinFormer: A Tailored Transformer for Robust Radio Frequency Fingerprint Identification

2026-05-23T04:18:32Z

With the rapid proliferation of wireless and Internet of Things (IoT) devices, ensuring secure and reliable device identification has become a significant challenge. Traditional security techniques, such as IP or MAC address-based authentication, are susceptible to spoofing, whereas Radio Frequency Fingerprint Identification (RFFI) offers a more secure alternative by exploiting the unique hardware imperfections in devices' RF signals. In this paper, we propose a novel deep learning-based framework for RFFI that enhances both accuracy and reliability in challenging RF environments. The core of our approach is the Signal Inception Transformer (SinFormer), which leverages a specialized multi-scale self-attention mechanism to effectively capture both large-scale and fine-grained fingerprints in signals, significantly improving identification accuracy. To further enhance robustness and reliability, we introduce a two-stage training strategy that enables the model to learn general signal features and maintain performance under adverse conditions, such as low Signal-to-Noise Ratio (SNR) or channel variations. The effectiveness of the proposed method is validated using a real-world dataset. Experimental results show that the SinFormer framework consistently outperforms existing methods in accuracy and robustness across diverse and challenging scenarios.

Airy Beam Dispersion in Near-Field Wideband Terahertz Communications

2026-05-23T03:54:20Z

This letter investigates Airy beam dispersion in near-field wideband terahertz communications. Unlike conventional focusing beams, whose dispersion mainly appears as focal-point migration, Airy beams exhibit frequency-dependent shifts of both the reference focusing point and the self-bending main-lobe trajectory. Based on the Fresnel diffraction integral, a closed-form trajectory expression is derived to characterize the dispersion behavior across subcarriers. Furthermore, a true-time-delay (TTD)-assisted Airy beamforming structure is developed to actively control the trajectory dispersion. By properly designing the time delay parameters, the proposed scheme can either generate frequency-dependent curved trajectory clusters for sensing-oriented scanning or suppress trajectory drift for reliable communication.

Designs, linear codes, plateaued functions, and their interconnections

2026-05-23T02:31:55Z

In this paper, we mainly investigate profound interconnections between combinatorial designs, linear codes, and Boolean functions.