https://arxiv.org/api/eiHP3x/sT0sf4jt7fzffZ+HnsFM 2026-06-13T18:42:06Z 36171 630 15 http://arxiv.org/abs/2605.25452v1 Different Statistical Perspectives for Understanding Generalisation in Graph Neural Networks 2026-05-25T06:02:31Z

Graph Neural Networks (GNNs) are currently the most popular approach for learning and prediction on graph-structured data and are deployed in various fields, from social network analysis to drug discovery. However, there is limited mathematical understanding of the performance of GNNs. We discuss the various perspectives used to study statistical generalisation in GNNs. We identify three broad frameworks. The first approach, rooted in learning theory, relies on uniform convergence bounds and the complexity of the hypothesis class of specific GNN architectures. This approach also builds on the expressivity of GNNs, typically studied through the lens of graph isomorphism tests. The second principle is to simplify the neural architecture by analysing GNNs under the asymptotics of infinitely many parameters or infinite graph size. This approach approximates GNNs using Gaussian processes, neural tangent kernels or graphon neural network operators, which allow studying the generalisation or stability of trained GNNs. The third framework studies GNNs under random graph models, often the contextual stochastic block model, and derives non-asymptotic error rates using tools from high-dimensional statistics. We highlight some key theoretical results and discuss a few limitations and open research questions for each perspective.

2026-05-25T06:02:31Z 15 pages, 4 figures, submission for Special Issue in AStA Advances in Statistical Analysis Nil Ayday Mahalakshmi Sabanayagam Debarghya Ghoshdastidar http://arxiv.org/abs/2605.25380v1 Rank-Based Tests for Mutual Independence of High-Dimensional Random Vectors via $L_q$ Norm 2026-05-25T03:09:44Z

We consider the problem of testing mutual independence among the components of a high-dimensional random vector. Building on the rank-based max-sum framework, we introduce fixed finite-$L_q$ power-sum statistics under three general classes of rank-based correlations: simple linear rank statistics, non-degenerate rank-based U-statistics and degenerate rank-based U-statistics. The proposed statistics interpolate between the dense-alternative sensitivity of the $L_2$ statistic and the sparse-alternative sensitivity of the $L_\infty$ statistic. We establish the asymptotic independence between any fixed finite-$L_q$ block and the corresponding $L_\infty$ statistic, and combine $L_2,L_4,L_6$ and $L_\infty$ p-values through a Cauchy rule. Numerical studies show that the resulting $L_{2,4,6,\infty}$ procedure is highly robust to the sparsity of the alternative and has strong empirical power across the considered designs.
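
The Cauchy rule for combining p-values can be illustrated with the standard Cauchy combination test. The sketch below is a minimal illustration with equal weights, not the authors' implementation; the function name `cauchy_combine` and the example p-values are assumptions made for illustration only.

```python
import math

def cauchy_combine(pvalues, weights=None):
    """Combine p-values with the Cauchy combination rule (equal weights by default)."""
    if weights is None:
        weights = [1.0 / len(pvalues)] * len(pvalues)
    # Map each p-value to a standard Cauchy quantile and take the weighted sum.
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    # Upper-tail probability of the standard Cauchy distribution at `stat`.
    return 0.5 - math.atan(stat) / math.pi

# Hypothetical p-values standing in for the L_2, L_4, L_6 and L_infinity tests.
p_combined = cauchy_combine([0.20, 0.04, 0.01, 0.30])
```

The combined p-value is driven mostly by the smallest individual p-values, which is why a rule of this kind can adapt to both dense and sparse alternatives.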

2026-05-25T03:09:44Z Ping Zhao Hongfei Wang Long Feng http://arxiv.org/abs/2605.05076v2 High-Dimensional Statistics: Reflections on Progress and Open Problems 2026-05-24T22:21:17Z

Over the past two decades, the field of high-dimensional statistics has experienced substantial progress, driven largely by technological advances that have dramatically reduced the cost and effort for data collection and storage across a broad range of domains, including biology, medicine, astronomy, and the social and environmental sciences. Modern datasets are increasingly complex, often exhibiting rich dependency, heterogeneity, and other features that challenge traditional statistical methods. In response, high-dimensional statistics has evolved to address more sophisticated estimation and inference problems. This evolution has, in turn, fostered deep connections with and contributions to a wide range of research areas, including optimization, concentration of measure, random matrix theory, information theory, and theoretical computer science. Given the rapid pace of recent developments in high-dimensional statistics, our goal is to synthesize representative advances, highlight common themes and open problems, and point to important works that offer entry points into the field.

2026-05-06T16:11:09Z Arian Maleki Subhabrata Sen Sivaraman Balakrishnan Verena Zuber Chao Gao Rishabh Dudeja Christos Thrampoulidis Anru Zhang Weijie Su Jason M. Klusowski Po-Ling Loh Ali Shojaie http://arxiv.org/abs/2605.04536v2 Transversality and Geometric Regularisation in Distributional Statistical Models 2026-05-24T19:16:59Z

The distributional statistical framework replaces classical probability densities by distribution-kernel pairs $(T, \varphi)$, where $T$ is a tempered distribution and $\varphi$ is a rapidly decaying kernel. We develop the thesis that the kernel acts as a geometric regulariser, placing parametric statistical models in generic (transversal) position relative to degeneracy loci encoding non-identifiability, singular information, moment indeterminacy, and representation failure. Using the transversality theorems of Whitney, Thom, and Mather, we prove a finite-dimensional weak transversality theorem: for a generic kernel in any sufficiently rich family, the kernel-induced feature map avoids degeneracy strata of sufficiently high codimension. We establish verifiable conditions -- formulated as rank conditions on the Jacobian of the joint feature map -- under which the transversality hypothesis can be checked, and verify them for location families, the log-normal, Stein discrepancies, and graphical models. The present results apply to parametric models; extensions to semiparametric and nonparametric settings are discussed. The degeneracy classification includes representation degeneracy (Type 0) for models without closed-form densities and higher-order instabilities (Type IV) in non-chordal graphical models. Identifiability, robustness, moment determinacy, Fisher information regularity, Stein discrepancy, inferential separation, and the Behrens-Fisher problem all admit a unified geometric interpretation as transversality conditions on the feature map. This paper serves as a geometric companion to a series of papers developing the distributional framework.

2026-05-06T06:24:41Z 22 pages, no figures, no tables. In the second version, some sketches were replaced by proofs and an example of M-determinacy was added R. Labouriau http://arxiv.org/abs/2509.25507v2 One-shot Conditional Sampling: MMD meets Nearest Neighbors 2026-05-24T17:48:33Z

How can we generate samples from a conditional distribution that we never fully observe? This question arises across a broad range of applications in both modern machine learning and classical statistics, including image post-processing in computer vision, approximate posterior sampling in simulation-based inference, and conditional distribution modeling in complex data settings. In such settings, compared with unconditional sampling, additional feature information can be leveraged to enable more adaptive and efficient sampling. Building on this, we introduce Conditional Generator using MMD (CGMMD), a novel framework for conditional sampling. Unlike many contemporary approaches, our method frames the training objective as a simple, adversary-free direct minimization problem. A key feature of CGMMD is its ability to produce conditional samples in a single forward pass of the generator, enabling practical one-shot sampling with low test-time complexity. We establish rigorous theoretical bounds on the loss incurred when sampling from the CGMMD sampler, and prove convergence of the estimated distribution to the true conditional distribution. In the process, we also develop a uniform concentration result for nearest-neighbor based functionals, which may be of independent interest. Finally, we show that CGMMD performs competitively on synthetic tasks involving complex conditional densities, as well as on practical applications such as image denoising and image super-resolution.
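
For readers unfamiliar with the maximum mean discrepancy (MMD) that the method's name refers to, the following is a minimal sketch of a plain two-sample MMD estimate with a Gaussian kernel on one-dimensional data. It is not the CGMMD objective: the nearest-neighbour conditioning and the generator are not reproduced, and the function names and the bandwidth are assumptions.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two 1-D samples."""
    k_xx = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(gaussian_kernel(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy

rng = random.Random(0)
same_a = [rng.gauss(0.0, 1.0) for _ in range(100)]
same_b = [rng.gauss(0.0, 1.0) for _ in range(100)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(100)]

mmd_same = mmd_squared(same_a, same_b)      # small: same distribution
mmd_shifted = mmd_squared(same_a, shifted)  # larger: distributions differ
```

Because this quantity is a plain average of kernel evaluations, it can be minimized directly without an adversary, which is the property the abstract emphasizes.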

2025-09-29T21:04:50Z Accepted at the 43rd International Conference on Machine Learning (ICML 2026) Anirban Chatterjee Sayantan Choudhury Rohan Hore http://arxiv.org/abs/2605.25169v1 Learning Treatment Effects during Resource Allocation via Priority-Queue Randomization 2026-05-24T17:01:17Z

Public service programs often allocate limited resources under uncertainty about their benefits, creating a need for randomization to support credible evaluation. In practice, however, applicants commonly enter waitlists where resources are prioritized toward individuals judged to have higher need through tiered priority queues, making direct randomization difficult. Motivated by this, we develop an experimental design framework for learning treatment effects while treating those most in need, in which incoming applicants are randomized into priority queues based on their assessed risk scores. Treatments are then provided across queues in priority order and first-in-first-out within each queue as budget becomes available. Our contributions are two-fold. First, we characterize what causal effects are identified under this priority-queue allocation. When arrivals are exogenous, treatments are conditionally randomized, and hence standard estimands are identified; when arrivals are endogenous, queue randomization instead provides an instrument for treatment, identifying local treatment effects induced by the queuing process. Second, we develop optimized queue-assignment designs that trade off statistical efficiency against prioritizing higher-need applicants. We show in the process that, despite dependence in treatment assignments induced by the design, usual iid efficiency bounds remain well-justified design objectives. We illustrate the proposed designs using data from a housing allocation program in a large U.S. county.
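
The allocation mechanism described above (random assignment to a priority queue based on a risk score, then service in priority order and first-in-first-out within a queue) can be sketched as follows. This is a simplified static illustration with two queues and a single budget, not the authors' design: the rule "high-priority queue with probability equal to the risk score" and all names are hypothetical.

```python
import random
from collections import deque

def assign_queue(risk_score, rng):
    """Hypothetical rule: queue 0 (high priority) with probability equal to the risk score."""
    return 0 if rng.random() < risk_score else 1

def allocate(applicants, budget, rng):
    """Serve queues in priority order, first-in-first-out within each queue."""
    queues = [deque(), deque()]
    for applicant_id, risk in applicants:  # applicants are listed in arrival order
        queues[assign_queue(risk, rng)].append(applicant_id)
    treated = []
    for queue in queues:  # queue 0 is exhausted before queue 1 is served
        while queue and len(treated) < budget:
            treated.append(queue.popleft())
    return treated

rng = random.Random(0)
applicants = [("a", 0.9), ("b", 0.2), ("c", 0.7), ("d", 0.1)]
treated = allocate(applicants, budget=2, rng=rng)
```

Because queue assignment is random given the risk score, two applicants with the same score can end up with different treatment chances, which is the source of variation the design exploits.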

2026-05-24T17:01:17Z JungHo Lee Johnna Sundberg Pim Welle Bryan Wilder http://arxiv.org/abs/2603.10941v2 Covariate-adjusted statistical dependence representation through partial copulas: bounds and new insights 2026-05-24T17:00:38Z

In this paper, we revisit the notion of partial copula, originally introduced to test conditional independence, highlighting its capability to represent the dependence between two random variables after removing their dependence on a covariate. Building upon results previously presented in the literature, we show that partial copulas can be seen as a nonlinear analogue of partial correlation. Then, we prove several results showing how dependence properties of the conditional copulas constrain the form of the partial copula. Finally, a simulation study is conducted to illustrate the results and to show the potential of the partial copula as a way to describe covariate-adjusted statistical dependence. This highlights the potential of the method to be used in causal inference problems and to recover the true sign of a causal effect.

2026-03-11T16:28:00Z Vinícius Litvinoff Justus Felipe Fontana Vieira http://arxiv.org/abs/2505.01357v3 Weight-calibrated estimation for factor models of high-dimensional time series 2026-05-24T14:43:30Z

Factor modeling for high-dimensional time series is a powerful tool for discovering latent common components for dimension reduction and information extraction. Most available estimation methods fall into two categories: covariance-based methods under an asymptotic identifiability assumption, and autocovariance-based methods with white idiosyncratic noise. This paper follows the autocovariance-based framework and develops a novel weight-calibrated method to improve estimation performance. It adopts a linear projection to tackle high dimensionality and employs a reduced-rank autoregression formulation. The asymptotic theory of the proposed method is established, relaxing the white-noise assumption. Additionally, we make a first attempt in the literature at a systematic theoretical comparison among the covariance-based, the standard autocovariance-based, and our proposed weight-calibrated autocovariance-based methods in the presence of factors with different strengths. Extensive simulations are conducted to showcase the superior finite-sample performance of our proposed method, as well as to validate the newly established theory. The superiority of our proposal is further illustrated through the analysis of one financial and one macroeconomic data set.

2025-05-02T15:52:42Z This version is the accepted version by Journal of the American Statistical Association Xinghao Qiao Zihan Wang Qiwei Yao Bo Zhang http://arxiv.org/abs/2605.19938v2 Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation 2026-05-24T14:07:56Z

Uniform sampling on implicitly defined manifolds is a core primitive in motion planning, constrained simulation, and probabilistic machine learning. MASEM addresses this problem by entropy-maximizing resampling, but its resampling weights depend on a local k-nearest-neighbour density estimate whose errors can be amplified by aggressive resampling temperatures. We ask whether a polynomial-maximization moment estimator can replace the plug-in density rule without changing the surrounding MASEM architecture. The proposed PMM-MASEM module computes shell spacings from nested k-nearest-neighbour radii, estimates their standardized cumulants, and uses a gated PMM2/PMM3 estimator only when the spacing distribution departs from the flat Exp(1) regime; otherwise it falls back to the plug-in/MLE rule. This fallback is essential: on a flat homogeneous manifold the plug-in estimator is already the MLE, so PMM should not outperform it. A local Known-DGP Monte Carlo experiment confirms this gate: the selector returns MLE on flat Exp(1) spacings and reduces density MSE by 22--36% on asymmetric gamma and boundary-spacing regimes. The evidence is not uniformly positive: PMM3 worsens a platykurtic uniform spacing law, and a lightweight resampling-proxy experiment improves seven-lobes coverage but degrades the sine and swiss-roll proxies. The current evidence therefore supports an applicability-boundary result rather than a general MASEM improvement claim.

2026-05-19T14:57:39Z 16 pages, 5 figures, 3 tables. Code supplement: https://github.com/SZabolotnii/Ku-PMM-MASEM-code-supplement Serhii Zabolotnii http://arxiv.org/abs/2605.25079v1 Trans-dimensional Bayesian model averaging for $^{13}$C-based metabolic flux analysis: Evidence-based flux inference under structural model uncertainty 2026-05-24T13:42:07Z

Accurate quantification of intracellular metabolic fluxes is central to systems biology and biotechnology. Flux estimation relies on biochemical network models, with $^{13}$C metabolic flux analysis (MFA) being the state-of-the-art approach. However, isotope labeling data are often insufficient to uniquely support a single network formulation. In such cases, flux estimates become model-dependent, highlighting the need for methods that explicitly account for structural uncertainty. Bayesian model averaging (BMA) provides a principled framework for this purpose, but its application to $^{13}$C-MFA has so far been restricted to uncertainty in reaction bidirectionality within fixed network topologies. We introduce a scalable Bayesian inference framework for $^{13}$C-MFA, Bayesian model set averaging, that applies BMA to encompass uncertainty in reactions and pathways. Our approach combines reversible jump Markov chain Monte Carlo for trans-dimensional exploration of model spaces with diffusive nested sampling for robust estimation of model evidences, enabling averaging over large families of metabolic network models. Using illustrative and application-scale synthetic case studies, we demonstrate that the method yields robust flux estimates, reveals when multiple network configurations are statistically indistinguishable, and recovers data-supported model structures. Importantly, rather than committing to a single model, the framework manages structural uncertainty: under limited data, competing models are retained, whereas increasing data informativeness improves model and flux recovery. The approach scales to billions of model variants, providing a practical foundation for uncertainty- and misspecification-aware quantitative flux inference in $^{13}$C-MFA.
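
The basic arithmetic of Bayesian model averaging, turning model evidences into posterior model probabilities and averaging an estimate across models, can be sketched as below. This is a generic illustration only: the reversible jump MCMC and diffusive nested sampling steps are not reproduced, and the function names, the uniform model prior, and the numbers are assumptions.

```python
import math

def posterior_model_probs(log_evidences, log_priors=None):
    """Posterior model probabilities from log evidences (uniform model prior by default)."""
    if log_priors is None:
        log_priors = [0.0] * len(log_evidences)
    scores = [le + lp for le, lp in zip(log_evidences, log_priors)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def bma_estimate(model_estimates, probs):
    """Model-averaged point estimate of a quantity shared by all models."""
    return sum(p * m for p, m in zip(probs, model_estimates))

# Hypothetical log evidences for two competing network models.
probs = posterior_model_probs([-1000.0, -1000.0 + math.log(3.0)])
flux = bma_estimate([1.0, 2.0], probs)
```

When the evidences are close, several models keep non-negligible weight, which corresponds to the situation in which competing network configurations are retained rather than discarded.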

2026-05-24T13:42:07Z Johann F. Jadebeck Anton Stratmann Martin Beyß Katharina Nöh http://arxiv.org/abs/2503.05632v2 A Functional Approach to Curve Alignment and Shape Analysis 2026-05-24T13:04:23Z

In many image analysis problems, the contours of objects carry important statistical information about shape. Such contours are typically affected by deformation variables including scaling, translation, rotation, and reparametrization. Previous studies in statistical shape analysis have mainly focused on analyzing contours and shapes through discrete observations. While this approach might offer computational advantages, it overlooks the continuous nature of these objects and their underlying geometric structure. It also ignores potential dependencies between the deformation variables and their effect on the shape, which may result in a loss of statistical information and reduced interpretability. In this paper, we introduce a novel framework for analyzing shapes within the context of Functional Data Analysis (FDA). Basis expansion techniques are employed to derive analytic solutions for the estimation of deformation variables, namely scaling, translation, rotation, and reparametrization, thereby achieving curve alignment. A generative model for random contours is then developed using principal component analysis techniques. Numerical experiments on simulated data and the MPEG-7 database demonstrate that our method successfully identifies deformation parameters and captures the underlying distribution of random contours in settings where traditional FDA methods fail.

2025-03-07T17:55:14Z Issam-Ali Moindjié Cédric Beaulac Marie-Hélène Descary http://arxiv.org/abs/2605.24995v1 Information-Theoretic Reliability is Robust to Analytic Choice: A 24-Specification Multiverse on Public Cognitive Test-Retest Data 2026-05-24T10:48:31Z

Background. The reliability paradox describes the empirical observation that cognitive tasks producing robust group-level effects often yield poor between-individual reliability. Existing approaches rely predominantly on the intraclass correlation coefficient (ICC), which captures only linear, second-moment dependence between test and retest. Methods. We introduce a normalized, information-theoretic complement to ICC, NLRΔ, defined as the difference between empirically estimated mutual information and the analytic Gaussian baseline implied by the test-retest correlation. We pair NLRΔ with ICC(2,1), bias-corrected and accelerated (BCa) bootstrap intervals, Benjamini-Hochberg false discovery rate (FDR) control, and a 24-cell multiverse over the KSG nearest-neighbour parameter, correlation method, and minimum-sample threshold. The full pipeline is governed by pre-specified claim contracts, content-addressed provenance, and SHA-256-verified raw data ingestion, and is released as the MixMind Reliability Framework. Results. Across 50 estimable primary measures from the Flanker, Stroop, Stop-Signal, Go/No-Go, and Posner task families, the median NLRΔ is -0.138 nats, with interquartile range [-0.257, -0.034]. Zero of 50 primary measures exceed the headline rule. The companion ICC(2,1) analysis recovers the classical reliability paradox pattern, and the 24-specification multiverse yields 0 of 1,200 estimable cells passing the headline rule. Conclusions. On these two public datasets, replacing or augmenting ICC with an information-theoretic reliability measure does not rescue cognitive tasks from the reliability paradox. The robust null is invariant to the analytic choices examined here. We release the full pipeline, raw-data hashes, and contracts to enable exact replication and extension to other datasets and tasks.
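
The analytic Gaussian baseline mentioned in the Methods is the mutual information of a bivariate Gaussian with correlation r, which equals -0.5 * ln(1 - r^2) nats. The sketch below computes that baseline and the raw difference from a supplied mutual information estimate. The abstract describes NLRΔ as normalized but does not state the normalization, so none is applied here; the function names and example values are assumptions.

```python
import math

def gaussian_mi_nats(r):
    """Mutual information (in nats) of a bivariate Gaussian with correlation r."""
    if not -1.0 < r < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    return -0.5 * math.log(1.0 - r * r)

def mi_excess_nats(mi_estimate_nats, r):
    """Estimated MI minus the Gaussian baseline; unnormalized by assumption."""
    return mi_estimate_nats - gaussian_mi_nats(r)

# Hypothetical inputs: an MI estimate of 0.10 nats and a test-retest correlation of 0.6.
excess = mi_excess_nats(0.10, 0.6)
```

A negative difference means the estimated dependence falls below what a Gaussian with the same correlation would carry, which is the direction of the median value reported in the Results.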

2026-05-24T10:48:31Z 12 pages, 2 figures, 3 tables; software and reproducibility materials archived at Zenodo DOI 10.5281/zenodo.20207371 Maria Westrin http://arxiv.org/abs/2605.24858v1 Optimal Estimation of Discrete Multiview Distributions under Heteroskedastic Multinomial Sampling 2026-05-24T04:28:38Z

Multiview latent-variable models provide a fundamental framework for discrete data analysis, with applications to latent structure models, topic models, and mixtures of product distributions. In the discrete setting, the joint distribution of the observed views can be represented as a nonnegative low-rank tensor, which we call a multiview density tensor. We study the problem of estimating this tensor from multinomial count data. A key challenge is that multinomial sampling induces heteroskedastic and dependent noise, so the difficulty of estimation depends not only on the ambient dimensions and rank, but also on how the probability mass is distributed across different locations of the sample space. We propose a general scaling framework for density tensor estimation under multinomial sampling. This framework leads to a spectral estimator for which we prove a Frobenius-norm upper bound that directly handles heteroskedasticity and negative dependence. For the original multiview model, we obtain fiber-mass-dependent Frobenius upper bounds and minimax lower bounds showing that this dependence is unavoidable. Under $\ell_1$ loss, we develop both oracle and feasible data-driven estimators based on the same scaling principle, establish minimax lower bounds, and show near-optimality for the oracle rule at fixed rank and for slice normalization under bounded slice-to-fiber imbalance. Simulations support the theory and demonstrate the robustness of the proposed methods.

2026-05-24T04:28:38Z Runshi Tang Julien Chhor Olga Klopp Alexandre B. Tsybakov Anru R. Zhang http://arxiv.org/abs/2605.24854v1 Deep Regression for Repeated Measurements under Covariate Shift 2026-05-24T04:17:14Z

This paper studies nonparametric regression with repeated measurements when the response in the target domain is unobservable or costly to collect. We adopt a transfer learning framework that leverages a source domain with observable responses under covariate shift. The target regression function is estimated by correcting the distribution shift via the density ratio. We consider both known and unknown density ratio scenarios, which reflect different data available for nonparametric regression estimation. In both cases, we further address two settings: the uniformly bounded density ratio and the unbounded case with finite moment conditions. Under the unknown density ratio scenario, both the density ratio and the target regression function are estimated using rectified linear unit (ReLU) feedforward neural networks (FNNs), whereas under the known density ratio scenario, only the target regression function is estimated by ReLU FNNs. Theoretically, we establish non-asymptotic error bounds for the proposed estimators and prove that they achieve the minimax optimal convergence rate under the repeated measurements setting. Notably, we develop a novel approximation theory where the constants of the network parameters depend polynomially, rather than exponentially as in existing works, on the dimension, thereby mitigating the curse of dimensionality. Consequently, we derive sharper non-asymptotic bounds for the stochastic error. The finite sample performance of the proposed method is demonstrated through numerical simulations and a real data application.
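
The idea of correcting covariate shift with the density ratio can be illustrated by importance-weighted least squares: each source observation is weighted by the ratio of target to source covariate density. The sketch below swaps the ReLU networks of the paper for a simple linear fit and assumes a known ratio for a source N(0, 1) and target N(1, 1) covariate; all names and data are hypothetical.

```python
import math

def density_ratio(x):
    """Assumed known ratio target/source for source N(0, 1) and target N(1, 1)."""
    return math.exp(x - 0.5)

def weighted_linear_fit(xs, ys, weights):
    """Weighted least squares for y = intercept + slope * x."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical source sample; weights emphasize points that are likely under the target.
source_x = [-1.0, 0.0, 1.0, 2.0]
source_y = [2.0 * x + 1.0 for x in source_x]
weights = [density_ratio(x) for x in source_x]
intercept, slope = weighted_linear_fit(source_x, source_y, weights)
```

When the density ratio is unbounded, a few observations can dominate the weighted fit, which is one reason the bounded and unbounded cases are treated separately.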

2026-05-24T04:17:14Z 59 pages, 2 figures, 2 tables, including appendix Yingxuan Wang Xiangyu Xing Wangli Xu http://arxiv.org/abs/2605.24848v1 Distributional Conformal Prediction for Markov Processes 2026-05-24T03:41:28Z

We introduce the Markov Distributional Conformal Prediction (MDCP) method, which extends distributional conformal prediction (previously developed for regression) to the setting of a strictly stationary Markov process. Instead of relying on a specific model structure for prediction, the idea of a distributional conformal prediction interval aligns with the Model-Free (MF) Prediction Principle. In analogy to MF prediction of Markov processes, our method exploits the probability integral transform based on estimated transition distribution functions to transform the Markov data into an i.i.d. dataset. We show a non-asymptotic error bound on MDCP's unconditional coverage rate under a $\beta$-mixing condition and other standard assumptions on the kernel estimators. The asymptotic validity of the conditional prediction interval is also verified. In addition, we show that our conditional prediction interval remains asymptotically valid when the Markov process is $L^p$-$m$-approximable instead of satisfying the mixing property. Numerical simulations and real data experiments are used to empirically illustrate the finite-sample performance of MDCP and to compare it with the MF bootstrap prediction method.
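
The probability integral transform step can be illustrated on a Gaussian AR(1) process, where the one-step transition distribution is known in closed form. This is a sketch under strong assumptions: the paper works with kernel estimates of the transition distribution, whereas here the true transition CDF is plugged in, and the function names and parameter values are hypothetical.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_ar1(n, phi, sigma, rng):
    """Stationary Gaussian AR(1): X_t = phi * X_{t-1} + sigma * noise."""
    x = [rng.gauss(0.0, sigma / math.sqrt(1.0 - phi ** 2))]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma))
    return x

def pit_transform(x, phi, sigma):
    """Apply the known transition CDF F(x_t | x_{t-1}) to each consecutive pair."""
    return [normal_cdf((x[t] - phi * x[t - 1]) / sigma) for t in range(1, len(x))]

rng = random.Random(1)
series = simulate_ar1(2000, phi=0.5, sigma=1.0, rng=rng)
u = pit_transform(series, phi=0.5, sigma=1.0)
```

With the true transition CDF the transformed values are independent Uniform(0, 1) draws, so methods designed for i.i.d. data can be applied to them; with an estimated CDF this holds only approximately.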

2026-05-24T03:41:28Z 54 pages, 5 figures Dehao Dai Kejin Wu Dimitris N. Politis