http://arxiv.org/abs/2607.14304v1 Spectral Concentration and Recovery in Sparse High-Dimensional Random Geometric Graphs 2026-07-15T19:09:14Z

We study sparse random geometric graphs generated by connecting pairs of high-dimensional vectors whose inner product exceeds a threshold. The latent vectors are sampled either uniformly from the sphere or from a standard Gaussian distribution. Although every edge appears with probability $p$, the edges are dependent through their shared latent vectors. For the spherical model, at the connectivity scale $np=Ω(\log n)$, we prove $\|A-\mathbb E A\|=O\left(\sqrt{np\log n}+npτ\right)$, with high probability, where $τ$ is the cap threshold. This sharpens the spectral norm bound of Liu, Mohanty, Schramm, and Yang (2023) under weaker assumptions. An analogous result holds for the Gaussian model after removing the fluctuations of the vector norms, yielding improved global synchronization guarantees for the homogeneous Kuramoto model. We then recover the latent geometry from the leading eigenspace. When $np\gg\log n$, both the latent vector and relative Gram matrix errors vanish provided $d\ll np\log(1/p)/\log n$. The required lower dimension is only $d\gg\log(1/p)$ for the spherical model and $d\gg\log^2(1/p)\log n$ for the Gaussian model, improving the recovery guarantees of Li and Schramm (2023). Finally, we prove the first exact recovery result for the Gaussian mixture block model of Li and Schramm (2023). At the optimal connectivity scale $np=Ω(\log n)$, a polynomial-time semidefinite program exactly recovers all labels in a moderate-separation regime, whereas larger separation makes exact recovery impossible because isolated vertices appear with high probability. Our proofs combine orthogonal polynomial expansions, decoupling, and matrix concentration, avoiding the trace-moment arguments used in previous work.
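
To make the spherical model in the abstract above concrete, the following is a minimal sketch, not the authors' code. It samples latent vectors uniformly on the unit sphere and connects two vertices when their inner product is at least a threshold. The function names, the parameter `tau`, and the convention that the threshold applies to unit vectors are assumptions made here for illustration; the paper's normalization may differ.

```python
import math
import random

def sample_sphere(d, rng):
    """Draw a uniform point on the unit sphere in R^d by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def geometric_graph(n, d, tau, seed=0):
    """Adjacency matrix of a random geometric graph: i ~ j iff <x_i, x_j> >= tau.

    The threshold convention (unit vectors, >= tau) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    xs = [sample_sphere(d, rng) for _ in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            inner = sum(a * b for a, b in zip(xs[i], xs[j]))
            if inner >= tau:
                A[i][j] = A[j][i] = 1
    return A

def edge_density(A):
    """Fraction of ordered vertex pairs joined by an edge (an empirical stand-in for p)."""
    n = len(A)
    return sum(map(sum, A)) / (n * (n - 1))
```

Raising `tau` lowers the empirical edge density, which plays the role of $p$. Every edge touching vertex `i` depends on the same latent vector `xs[i]`, which is the source of the edge dependence the abstract mentions.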

2026-07-15T19:09:14Z 62 pages Manuel Fernandez Yizhe Zhu http://arxiv.org/abs/2504.14659v3 Markovian Continuity of the MMSE 2026-07-15T18:51:10Z

Minimum mean square error (MMSE) estimation is widely used in signal processing, information theory, and related fields. Despite its practical robustness, the MMSE can be discontinuous under standard notions of stochastic convergence. To bridge this gap, we review classical counterexamples to the continuity of the MMSE and observe that they share a common pathology: along the approximating sequence, the observation is strictly more informative about the limit estimand than the limit observation is. Motivated by practical acquisition mechanisms, we study MMSE continuity under two natural constraints: (1) continuity of the second moment, and (2) a degradedness (Markov) restriction ensuring that each approximating observation is no more informative than the limit observation is about the limit estimand. Under these conditions, we establish continuity of the MMSE and of the MMSE estimator. We provide complementary semicontinuity results and continuity guarantees in related settings and establish continuity under linear estimation. We further extend the analysis to the families of Bregman divergences and continuous metric cost functions, including the Kullback-Leibler and Jensen-Shannon divergences as special cases.

2025-04-20T15:42:41Z This work has been submitted to the IEEE for possible publication Elad Domanovitz Anatoly Khina http://arxiv.org/abs/2607.14064v1 Minimax Theory of Likelihood-Based Deep Learning for Speckle Regression 2026-07-15T17:36:23Z

Speckle noise is a multiplicative noise commonly encountered in coherent imaging modalities such as synthetic aperture radar, optical coherence tomography, and digital holography. Although deep learning methods have achieved state-of-the-art performance for speckle denoising in practice, their fundamental statistical limits remain largely unexplored. Unlike additive noise models, multiplicative speckle noise makes the regression function unidentifiable from the conditional mean, rendering conventional least-squares-based deep learning approaches inapplicable. We study the minimax estimation of smooth nonparametric regression functions using likelihood-based deep neural network (DNN) estimators under a model with both multiplicative speckle noise and additive Gaussian noise. Our framework accommodates both low-dimensional and sparse high-dimensional features. We establish finite-sample upper bounds on the estimation error of the proposed DNN estimators and derive minimax lower bounds for nonparametric function recovery under our model, showing that they match up to logarithmic factors in the sample size. Moreover, these minimax rates coincide, up to logarithmic factors, with those for nonparametric regression under additive Gaussian noise alone, demonstrating that the intrinsic difficulty of estimation remains essentially unchanged despite the challenges posed by multiplicative speckle noise. Numerical experiments further support the consistency of our DNN-based despeckling methods and demonstrate their effectiveness.

2026-07-15T17:36:23Z 34 pages, 2 figures Soham Jana http://arxiv.org/abs/2607.07513v2 Fast Rates for Semi-Supervised Learning via Data-Augmentation Graph Regularization 2026-07-15T17:32:34Z

Self-supervised learning matches supervised accuracy from a fraction of the labels, but the labeled-sample efficiency behind this has lacked a theoretical explanation. We provide one. Data augmentation induces a similarity graph on the unlabeled data, so downstream learning on that graph is graph-Laplacian-regularized learning. We prove a fast transductive rate, $O(1/n_L)$ in the number of labels, in place of the supervised $O(1/\sqrt{n_L})$, by carrying the leave-one-out stability apparatus of Johnson and Zhang (JMLR 2007) over to the augmentation graph, and without the unrealistic assumptions of limit-based analyses (exact kernel, generalizing features). The bound makes augmentation quality explicit: the expected error is at most $C/n_L + R_{\mathrm{DA}}(y)$, where the data-augmentation alignment error $R_{\mathrm{DA}}(y)$ is proportional to the graph-cut mass of augmentations that cross a label boundary, so good augmentations let few labels suffice. The analysis uses a streamlined loss that drops the projector, negative-sample, and orthogonality overhead of standard objectives yet still recovers the top-$K$ ideal features in the infinite-data limit, the augmentation-kernel eigenspace studied by Zhai et al. The bound gives a mechanistic account of the accuracy-versus-label-count curve through augmentation quality, verified in a controlled model where the constants are known.

2026-07-08T15:11:21Z 23 pages, 2 figures Adam M. Oberman http://arxiv.org/abs/2501.10898v2 High-dimensional Sobolev tests on hyperspheres 2026-07-15T15:36:02Z

We derive the limit null distribution of the class of Sobolev tests of uniformity on the hypersphere as the dimension and the sample size diverge to infinity at arbitrary rates. The limiting non-null behavior of these tests is also explored: (i) a general consistency result for Sobolev tests in high dimensions is shown and (ii) the asymptotic power of the tests under sequences of integrated von Mises-Fisher local alternatives is obtained. The asymptotic results are applied to test for high-dimensional rotational symmetry and spherical symmetry. Numerical experiments illustrate the derived behavior of the uniformity and spherical symmetry tests under the null and under local and fixed alternatives. A real data application tests the high-dimensional normality of the cosmic microwave background.

2025-01-18T23:15:09Z 25 pages, 1 figure, 4 tables. Supplementary materials: 6 pages, 5 figures Bruno Ebner Eduardo García-Portugués Thomas Verdebout http://arxiv.org/abs/2607.13918v1 Partially Correlated Verifier Cascades in LLM Harnesses: Concave Log-Odds, Polynomial Reliability, and Blind-Spot Ceilings 2026-07-15T14:58:37Z

Serial verification gates are a core reliability primitive in LLM harnesses: a candidate answer is returned only if $k$ verifier calls all accept it. Under conditionally independent gates, the recent Odds Law (arXiv:2606.15712) shows that posterior log-odds grow linearly in $k$, so failure decays exponentially, and states that "a tight theory of partially correlated verifier cascades remains open." This note gives a minimal such theory. Modeling the per-instance false-accept rate on the generator's own errors as a latent variable $α\sim G$ (de Finetti), the exact cascade posterior is $\ell_k = \ell_0 - \ln m_k$, with $m_k$ the $k$-th moment of $G$. Then: (i) $\ell_k$ is concave in $k$ for every non-degenerate $G$ -- the Odds Law is its tangent at the first gate and an upper bound; (ii) for Beta$(a,b)$ latents, failure decays polynomially, $1-r_k \asymp k^{-b}$, with correlation parameter $ρ_v = 1/(a+b+1)$; (iii) a blind-spot atom of mass $1-π$ at $α=1$ caps the evidence extractable from any number of gates at $-\ln(1-π)$ nats, so reliability saturates below 1; (iv) letting the true-accept rate also vary ($β\sim H$) yields a trichotomy -- gates eventually always help, plateau, or actively harm -- decided by the upper-tail exponents of $G$ and $H$, with closed-form crossover $k^\dagger$. The mechanism is survivorship: errors surviving gates are the high-$α$ ones. The theory is measurable: $R$ repeated verdicts per instance identify the first $R$ moments of $G$, so two verdicts identify $ρ_v$; beta-binomial likelihood and NPMLE recover the reliability curve and the ill-posed ceiling. In synthetic tests, independence-based extrapolation underestimates failure by 20x at $k=5$ and ~3000x at $k=10$; the correlated fit at $R=8$ tracks held-out depths. The practical lever is decorrelation -- changing model family, modality, or evidence source -- not adding gates.
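
The closed form $\ell_k = \ell_0 - \ln m_k$ stated in the abstract above can be evaluated directly for a Beta$(a,b)$ latent, whose $k$-th raw moment is $\prod_{i=0}^{k-1}(a+i)/(a+b+i)$. The following is a minimal sketch under that assumption, not the author's released code. The function names are hypothetical, and the independence baseline is taken here to be $\ell_0 - k\ln m_1$, which is how the linear Odds Law reads when each gate uses only the mean false-accept rate.

```python
import math

def beta_moment(a, b, k):
    """k-th raw moment of Beta(a, b): prod_{i<k} (a + i) / (a + b + i)."""
    m = 1.0
    for i in range(k):
        m *= (a + i) / (a + b + i)
    return m

def cascade_log_odds(l0, a, b, k):
    """Posterior log-odds after k accepting gates with alpha ~ Beta(a, b): l0 - ln m_k."""
    return l0 - math.log(beta_moment(a, b, k))

def independent_log_odds(l0, a, b, k):
    """Linear extrapolation that treats gates as independent: l0 - k * ln m_1."""
    return l0 - k * math.log(beta_moment(a, b, 1))

def verdict_correlation(a, b):
    """Correlation parameter rho_v = 1 / (a + b + 1) quoted in the abstract."""
    return 1.0 / (a + b + 1.0)

if __name__ == "__main__":
    for k in range(1, 11):
        exact = cascade_log_odds(0.0, 1.0, 2.0, k)
        linear = independent_log_odds(0.0, 1.0, 2.0, k)
        print(k, round(exact, 3), round(linear, 3))
```

Each extra gate adds $\ln\bigl(1 + b/(a+k)\bigr)$ to the log-odds. That increment shrinks as $k$ grows, so the curve is concave and stays below the linear extrapolation after the first gate.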

2026-07-15T14:58:37Z 14 pages, 2 figures. Code and synthetic-recovery experiments: https://github.com/jianganghan/harness-verifier-cascades Jiangang Han http://arxiv.org/abs/2508.14400v3 Gaussian Multiplier Bootstrap Procedure for the $k$th Largest Coordinate of High-Dimensional Statistics 2026-07-15T14:47:22Z

We consider Gaussian multiplier bootstrap procedures for the $k$th largest statistics and functions of the top $k$ order statistics, which are commonly encountered in high-dimensional statistical inference. This problem has been studied previously for $k=1$ (i.e., maxima). However, in many applications, a general $k$ ($k\geq 1$) is of great interest. We provide upper bounds on the errors between Gaussian approximations and Gaussian multiplier approximations. The dimension $p$ is allowed to be larger than the sample size $n$. The effectiveness of the proposed methods is demonstrated via numerical results and a real-world data analysis.
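
As an illustration of the kind of procedure the abstract above refers to, here is a minimal sketch of a generic Gaussian multiplier bootstrap for the $k$th largest coordinate of a normalized, centered sum. It follows the textbook construction (i.i.d. standard normal multipliers applied to centered observations) and is not taken from the paper. The function names, the centering by the sample mean, and the quantile rule are all assumptions of this sketch.

```python
import math
import random

def kth_largest(values, k):
    """Return the k-th largest entry of a list (k = 1 is the maximum)."""
    return sorted(values, reverse=True)[k - 1]

def multiplier_bootstrap_kth(X, k, B, seed=0):
    """Draw B bootstrap replicates of the k-th largest coordinate of
    n^{-1/2} * sum_i e_i * (X_i - column mean), with e_i i.i.d. N(0, 1).

    X is a list of n rows, each a list of p floats; p may exceed n.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    draws = []
    for _ in range(B):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Gaussian multipliers
        stat = [sum(e[i] * centered[i][j] for i in range(n)) / math.sqrt(n)
                for j in range(p)]
        draws.append(kth_largest(stat, k))
    return draws

def empirical_quantile(draws, level):
    """Smallest order statistic whose empirical CDF is at least `level`."""
    s = sorted(draws)
    idx = min(len(s) - 1, max(0, math.ceil(level * len(s)) - 1))
    return s[idx]
```

A critical value for a test based on the $k$th largest coordinate could then be approximated by `empirical_quantile(multiplier_bootstrap_kth(X, k, B), 0.95)`, with `B` chosen large enough for the quantile to stabilize.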

2025-08-20T03:53:20Z Yixi Ding Qizhai Li Yuke Shi Liuquan Sun Luobin Zhang http://arxiv.org/abs/2601.11347v2 Optimal e-values for testing the mean of a bounded random variable against a composite alternative 2026-07-15T14:21:42Z

We derive explicitly the e-values with optimal (relative) growth rate in the worst case for testing the mean of a bounded random variable, thereby providing the first application of the (RE)GROW quality criteria beyond the assumption of mutually absolutely continuous hypotheses for e-values originally proposed by Grünwald et al. (2024). For both criteria, we explicitly characterise the alternatives that are most difficult to test against and show that they admit a meaningful interpretation. We give two important examples in which REGROW provides a powerful quality criterion to choose optimal e-variables whereas GROW leads to trivial solutions.

2026-01-16T14:53:15Z Sebastian Arnold Eugenio Clerico http://arxiv.org/abs/2607.13852v1 A Frequentist Approach to Change Point Detection: Methods and Applications 2026-07-15T13:59:53Z

In this paper we study change point detection in functional time series whose observations may be recorded on either sparse or dense support. We address changes in the mean as well as in the process volatility. Our methodology is based on maximizing the conditional probability of the change point given all the other parameters. We further prove that the proposed estimator is consistent.

2026-07-15T13:59:53Z Debanjana Datta http://arxiv.org/abs/2607.13761v1 Nonstandard likelihood-ratio limits under semidefinite rank constraints 2026-07-15T12:21:26Z

We study likelihood-ratio tests for the hypothesis that a positive-semidefinite matrix has rank at most a prescribed value. The null hypothesis is stratified: points of maximal allowed rank lie on a regular boundary stratum, whereas lower-rank points are singular. Consequently, the usual chi-bar-square calibration on the top stratum does not by itself describe the whole composite null, especially along sequences whose rank changes at the local $n^{-1/2}$ scale. After profiling regular nuisance parameters, we derive a common reduced Gaussian experiment for every fixed null rank and for all admissible local rank transitions. On the top stratum, the classical chi-bar-square law is recovered. At lower ranks, the limit generally involves projection onto a nonconvex rank-constrained semidefinite set. Our main calibration result shows that, under isotropy, the top-stratum law is least favourable over all fixed null strata and all local null rank transitions. We also prove the corresponding transition dominance under arbitrary anisotropy when the active corank is one. Finally, on the top stratum, we obtain a conditional shape derivative for the limiting distribution and its critical value. Gaussian covariance models and finite-sample experiments illustrate nuisance profiling, rank transitions, anisotropy, and orientation sensitivity.

2026-07-15T12:21:26Z 24 pages, 23 pages supplement, 1 R code Didier Concordet http://arxiv.org/abs/2412.08475v3 Rethinking Mean Square Error: Information, Generalized Estimation, and the James-Stein Paradox 2026-07-15T10:31:53Z

The James-Stein estimator's dominance over maximum likelihood in mean square error has been called a paradox because maximum likelihood is known to be superior in many other respects. One response, due to Efron, is to question maximum likelihood. Another is to question MSE. We pursue the second and compare MSE with $Λ$-information (Vos and Wu, 2025) as criteria for assessing estimators. The comparison rests on two distinctions: between point estimators and generalized estimators -- functions of the sample and parameter jointly, with the score as archetype -- as inferential objects, and between pointwise and family-aware assessment criteria. An elementary lemma shows that no pointwise criterion, MSE or any other risk built from a loss function, admits a uniformly optimal estimator; $Λ$-information, which is family-aware and parameter-invariant, is uniformly maximized by the score. A point estimator is assessed through the generalized estimators it induces, and under the score map its $Λ$-efficiency is the fraction of Fisher information the statistic retains, placing the criterion in Fisher's information-loss tradition. On unbiased estimators, $Λ$-efficiency coincides with variance-based efficiency. Returning to James-Stein, the paradox dissolves: maximum likelihood is fully efficient because it is sufficient, while the James-Stein statistic is exactly two-to-one in the sample, and the information it destroys -- computed exactly -- is concentrated precisely where its MSE advantage is greatest. MSE retains its proper domain under genuine squared-error loss.

2024-12-11T15:42:35Z 16 pages Paul W. Vos http://arxiv.org/abs/2607.13667v1 On the existence and non-existence of centres of mass on Hilbert spheres 2026-07-15T10:12:29Z

Fréchet means and $L^p$ centres of mass provide notions of average location in metric spaces. On finite-dimensional spheres, existence follows from compactness. On infinite-dimensional spheres, it is not known whether a centre of mass always exists. We show that this is not always the case, and give a simple assumption under which a centre of mass exists. We then show that finding the sample centre of mass of data $x_1, \ldots, x_n$ on the sphere is always an optimisation problem on a subsphere of manifold dimension at most $n$, regardless of the potentially infinite dimension of the sphere. We conclude with some statistical implications.

2026-07-15T10:12:29Z Shahin Tavakoli http://arxiv.org/abs/2403.13750v3 Data integration of non-probability and probability samples with deterministic predictive mean matching 2026-07-15T09:10:44Z

We study deterministic predictive mean matching mass imputation estimators to integrate data from probability and non-probability samples. We consider two approaches: predicted-to-predicted (PMM A) and predicted-to-observed (PMM B) matching. We prove the consistency of mean estimators, derive a variance decomposition, and propose estimators of variance. We establish consistency of the PMM A estimator under model misspecification and underline key differences from the nearest neighbour method. Our PMM B approach can be employed with non-parametric regression techniques, such as kernel regression, and the analytical expression for variance applies to nearest neighbour matching for non-probability samples. Extensive simulation studies compare properties of the proposed estimators with existing alternatives and examine the effects of model misspecification. The paper concludes with an empirical study on the integration of a job vacancy survey with vacancies submitted to public employment offices (admin and online data). Open-source software is available.

2024-03-20T16:58:49Z 63 pages; codes: https://github.com/ncn-foreigners/paper-nonprob-pmm Aniela Czerniawska Piotr Chlebicki Łukasz Chrostowski Maciej Beręsewicz http://arxiv.org/abs/2607.13564v1 Manipulation testing based on Benford's Law for discrete scores 2026-07-15T08:04:52Z

This paper addresses the problem of running variable manipulation in Regression Discontinuity Designs. Leveraging the observation that manipulation often alters the density balance around the cutoff, we detect these structural imbalances using Benford's Law (BL), a natural statistical regularity widely applied in fraud detection. Our framework serves as a vital precautionary safeguard alongside traditional McCrary-type tests. It eliminates researcher-chosen parameters that can skew outcomes, while delivering a deeper diagnostic breakdown of the density's behavior. Crucially, whereas the classic McCrary test can overlook systemic imbalances due to its rigid symmetric setup, our method separates the data into directional components. This allows researchers to pinpoint the exact origin of a deviation and spot hidden manipulation that standard frameworks fail to capture. To achieve this, we introduce an innovative method for selecting a bandwidth consistent with BL, and construct two distinct, complementary tests using threshold values adapted from Nigrini (2012) that successfully transition the law's application from digits to probabilities. Empirical applications confirm the enhanced protective value of this diagnostic framework.
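
For readers unfamiliar with the regularity invoked in the abstract above, Benford's Law assigns probability $\log_{10}(1 + 1/d)$ to a leading decimal digit $d \in \{1,\dots,9\}$. The following is a minimal sketch of that reference distribution together with a generic mean absolute deviation between observed and expected first-digit proportions. It does not reproduce the paper's two tests or its bandwidth selection, and the function names are hypothetical.

```python
import math

def benford_first_digit_probs():
    """Benford's Law for the leading decimal digit: P(d) = log10(1 + 1/d), d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading nonzero decimal digit of a nonzero number, read off scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def mean_absolute_deviation(values):
    """Average over d = 1..9 of |observed proportion - Benford probability|.

    Zeros are skipped because they have no leading nonzero digit.
    """
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    probs = benford_first_digit_probs()
    return sum(abs(digits.count(d) / n - probs[d]) for d in range(1, 10)) / 9
```

As a quick sanity check, the powers of two are a classical sequence whose leading digits track Benford's Law closely, so `mean_absolute_deviation([2 ** k for k in range(100)])` should be small.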

2026-07-15T08:04:52Z 5 figures, 22 pages Roy Cerqueti Marco Ventura http://arxiv.org/abs/2604.17136v2 On the normality of the concatenated Fibonacci constant 2026-07-14T23:57:43Z

We study the concatenated Fibonacci constant $\mathcal{F} := 0.F_{1}F_{2}F_{3}\cdots = 0.11235813\cdots$, obtained by concatenating the Fibonacci numbers in the fractional part, and ask whether it is normal. We show that several classical sufficient conditions for normality by concatenation do not apply to the Fibonacci sequence because of its exponential growth, while a criterion of Pollack and Vandehey implies that the normality of $\mathcal{F}$ in base $10$ would follow if almost all Fibonacci numbers were $(\varepsilon,k)$-normal in base $10$. The Benford bias of leading digits and the Pisano periodicity of trailing digits are shown to contribute asymptotically negligible fractions of the total digits, isolating the distribution of the deep digits of large Fibonacci numbers as the remaining obstruction. Large-scale numerical experiments on the first $500{,}000$ Fibonacci numbers in bases $10$ and $2$ indicate that global single-digit counts and $k$-block statistics for $k = 2, 3, 4$ are compatible with i.i.d.-like fluctuations at the scales tested, and that a positional decomposition concentrates the visible structured deviation at the boundaries between consecutive Fibonacci numbers, while pooled interior blocks remain close to uniform. Our computations suggest that any obstruction to normality lies in the asymptotic behavior of the deep digits of $F_{n}$.
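
The digit string behind $\mathcal{F}$ in the abstract above is easy to generate, which makes the global single-digit counts it mentions simple to explore at small scale. The following is a minimal sketch, not the author's experimental code. The function names are hypothetical, and only bases 2 through 10 are handled.

```python
from collections import Counter

def fibonacci_numbers(count):
    """Return [F_1, ..., F_count] with F_1 = F_2 = 1."""
    out, a, b = [], 1, 1
    for _ in range(count):
        out.append(a)
        a, b = b, a + b
    return out

def to_base(n, base):
    """Digit string of a non-negative integer in a base between 2 and 10."""
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(str(r))
    return "".join(reversed(digits)) or "0"

def concatenated_digits(count, base=10):
    """Digits after the radix point of 0.F_1 F_2 ... F_count in the given base."""
    return "".join(to_base(f, base) for f in fibonacci_numbers(count))

def digit_frequencies(digit_string):
    """Relative frequency of each digit character in the string."""
    counts = Counter(digit_string)
    total = len(digit_string)
    return {d: c / total for d, c in sorted(counts.items())}

if __name__ == "__main__":
    print(concatenated_digits(7))  # 11235813
    print(digit_frequencies(concatenated_digits(2000)))
```

Normality in base 10 would require every digit to have limiting frequency $1/10$, and likewise every block of length $k$ to have frequency $10^{-k}$. A finite count such as this one can only show compatibility with that behaviour and cannot prove it.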

2026-04-18T20:24:11Z AMSart style, 21 pages, 8 tables, no figures, 26 references José Ricardo G. Mendonça