http://arxiv.org/abs/2605.26000v1 Statistical Inference for Stochastic Gradient Descent Beyond Finite Variance 2026-05-25T16:18:39Z

Stochastic gradient descent (SGD) is a foundational algorithm for large-scale statistical learning and stochastic optimization. However, statistical inference based on SGD iterates remains challenging when stochastic gradients have infinite variance, as the relevant limiting distributions depend on unknown nuisance parameters. In this paper, we develop an efficient, model-agnostic methodology for constructing confidence regions from SGD trajectories that applies in both finite- and infinite-variance regimes. The procedure is based on a joint weak convergence result for the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer constructed from stochastic gradients along the SGD trajectory. This joint limit yields a self-normalized statistic in which the leading tail-dependent scaling terms cancel. We then use a subsampling calibration scheme to estimate the relevant critical values, avoiding explicit estimation of tail indices, slowly varying functions, or stable-law parameters. The resulting confidence regions are straightforward to implement and are asymptotically valid under both the finite- and infinite-second-moment regimes. Simulation studies show reliable coverage in various settings, supporting the proposed method as a practical tool for uncertainty quantification in stochastic optimization.
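
As a rough illustration of the two ingredients named above, the following minimal sketch runs SGD with Polyak-Ruppert averaging on a toy mean-estimation problem and accumulates a second-moment quantity from the stochastic gradients along the trajectory. The objective, the step-size schedule, and the names `averaged_sgd`, `eta0`, and `alpha` are assumptions made for illustration; the abstract does not give the exact form of the paper's normalizer or self-normalized statistic, and the subsampling calibration step is not reproduced here.

```python
import math
import random


def averaged_sgd(sample, n_steps, theta0=0.0, eta0=0.5, alpha=0.75):
    """SGD on f(theta) = E[(theta - X)^2] / 2 with Polyak-Ruppert averaging.

    Returns the averaged iterate and the square root of the summed squared
    stochastic gradients (an illustrative second-moment normalizer only).
    """
    theta = theta0
    running_sum = 0.0
    grad_sq_sum = 0.0
    for t in range(1, n_steps + 1):
        grad = theta - sample()               # stochastic gradient at the current iterate
        theta -= eta0 * t ** (-alpha) * grad  # polynomially decaying step size
        running_sum += theta
        grad_sq_sum += grad * grad
    return running_sum / n_steps, math.sqrt(grad_sq_sum)


rng = random.Random(0)
theta_bar, normalizer = averaged_sgd(lambda: rng.gauss(2.0, 1.0), n_steps=20000)
```

Passing a heavy-tailed sampler instead of the Gaussian one (for example, one built on `random.paretovariate`) is one way to explore the infinite-variance regime with the same code.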

2026-05-25T16:18:39Z Jose Blanchet Peter Glynn Wenhao Yang http://arxiv.org/abs/2606.07561v1 Boundary Variance Inflation Causes Acquisition Bias in Gaussian Processes 2026-05-25T15:59:40Z

Gaussian processes with stationary kernels on bounded domains exhibit inflated posterior variance near the boundary. Despite being a long-recognized artifact in geostatistics and a source of over-exploration in Bayesian optimization, the causes and effects of boundary-induced acquisition bias are underexplored. We trace the root cause to a simple geometric mechanism: the truncation of the kernel correlation neighborhood at the domain boundary creates an observation-independent distortion that worsens with dimensionality. We show how this distortion manifests across three acquisition classes: variance maximization concentrates selections at the corners, whereas negative integrated posterior variance and expected predictive information gain move selections inward to axis-aligned interior shells. These patterns arise without reference to any objective function, meaning that acquisition behavior can be dominated by kernel geometry rather than the desired task-specific uncertainty. To quantify this, we introduce a function-free selection-profile diagnostic for arbitrary acquisitions, kernels, and bounded-domain geometries.
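
The boundary effect described above can be seen in a small one-dimensional sketch. It assumes a squared-exponential kernel, a hypothetical lengthscale of 0.2, a small noise term, and ten evenly spaced observations on [0, 1]; none of these settings come from the paper. The edge point and the interior point are equally far from their nearest observation, so any difference in posterior variance comes from the edge point having neighbors on one side only.

```python
import math


def rbf(x, y, lengthscale=0.2):
    """Squared-exponential (stationary) kernel with unit prior variance."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        residual = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = residual / m[i][i]
    return x


def posterior_variance(x_star, xs, noise=1e-2, lengthscale=0.2):
    """GP posterior variance at x_star given noisy observations at xs."""
    k_mat = [[rbf(a, b, lengthscale) + (noise if i == j else 0.0)
              for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf(x_star, a, lengthscale) for a in xs]
    weights = solve(k_mat, k_star)
    return rbf(x_star, x_star, lengthscale) - sum(
        k * w for k, w in zip(k_star, weights))


xs = [(i + 0.5) / 10 for i in range(10)]   # observations at 0.05, 0.15, ..., 0.95
var_edge = posterior_variance(0.0, xs)     # boundary: neighbors on one side only
var_mid = posterior_variance(0.5, xs)      # interior: neighbors on both sides
```

The variance depends only on the input locations, not on any observed function values, which matches the abstract's point that the distortion is observation-independent.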

2026-05-25T15:59:40Z 14 pages, 8 figures; appendices included Maria Bånkestad Sanna Jarl Jens Sjölund http://arxiv.org/abs/2510.26051v4 Estimation and Inference in Boundary Discontinuity Designs: Distance-Based Methods 2026-05-25T15:41:35Z

We study nonparametric distance-based (isotropic) local polynomial methods for estimating the boundary average treatment effect curve, a causal functional that captures treatment effect heterogeneity in boundary discontinuity designs. We establish identification, estimation, and inference results both pointwise and uniformly along the treatment assignment boundary. We show that the geometric regularity of the boundary, a one-dimensional manifold, plays a central role in determining feasible convergence rates and valid inference procedures. Our theoretical contributions are threefold. First, we derive uniform lower and upper bounds on the convergence rate of the misspecification bias of isotropic local polynomial estimators. Second, we obtain uniform distributional approximations that justify boundary-robust inference. Third, we establish minimax lower bounds for a broad class of nonparametric isotropic regression estimators. These results yield practical guidance for empirical implementation, including new bandwidth selection rules that adapt to local irregularities of the treatment-assignment boundary. We illustrate the proposed methods using simulation evidence and an empirical application, and provide companion general-purpose software.

2025-10-30T01:03:57Z Matias D. Cattaneo Rocio Titiunik Ruiqi Rae Yu http://arxiv.org/abs/2601.09525v2 Sparse covariate-driven factorization of high-dimensional brain connectivity with application to site effect correction 2026-05-25T15:34:55Z

Large-scale neuroimaging studies often collect data from multiple scanners across different sites, where variations in scanners, scanning procedures, and other conditions across sites can introduce artificial site effects. These effects may bias brain connectivity measures, such as functional connectivity (FC), which quantify functional network organization derived from functional magnetic resonance imaging (fMRI). How to leverage high-dimensional network structures to effectively mitigate site effects has yet to be addressed. In this paper, we propose SLACC (Sparse LAtent Covariate-driven Connectome) factorization, a multivariate method that explicitly parameterizes covariate effects in latent subject scores corresponding to sparse rank-1 latent patterns derived from brain connectivity. The proposed method identifies localized site-driven variability within and across brain networks, enabling targeted correction. We develop a penalized Expectation-Maximization (EM) algorithm for parameter estimation, incorporating the Bayesian Information Criterion (BIC) to guide optimization. Extensive simulations validate SLACC's robustness in recovering the true parameters and underlying connectivity patterns. Applied to the Autism Brain Imaging Data Exchange (ABIDE) dataset, SLACC demonstrates its ability to reduce site effects.

2026-01-14T14:48:13Z Rongqian Zhang Elena Tuzhilina Jun Young Park http://arxiv.org/abs/2605.25897v1 Nonparametric Estimation via Expected Order Statistics 2026-05-25T14:25:52Z

The empirical distribution function assigns mass $1/n$ to each of the $n$ observations in a sample. As these are highly variable, estimation error may be reduced by replacing them with estimated observations that are asymptotically less variable. Motivated by this idea, we introduce a nonparametric estimator obtained by assigning mass $1/m$ to $m$ estimated expected order statistics, with $m$ chosen arbitrarily. The estimator enjoys several finite-sample properties and yields a rich asymptotic theory. Its estimation error relative to its population counterpart is controlled by the $L^1$ error of the empirical distribution. Moreover, every $L$-functional of the new estimator corresponds to an $L$-functional of the empirical distribution with updated weights. We establish almost sure convergence in $L^p$ norm and Wasserstein distance as $n \to \infty$, and derive weak convergence of the associated empirical quantile process in $L^p(0,1)$, for $p\in[1,\infty)$ and $m$ fixed, and for $p=1,2$ as $n,m \to \infty$. These results yield asymptotic distributions for distance-based functionals, including $L^p$ and Wasserstein metrics. Bootstrap validity is also established. Simulations show that the estimator often improves on the empirical distribution and remains competitive with kernel methods, with more stable performance across different distributional settings.

2026-05-25T14:25:52Z Tommaso Lando Lorenzo Tedesco http://arxiv.org/abs/2605.25873v1 Bayesian perspectives on exponential random graph models 2026-05-25T14:00:56Z

Exponential random graph models (ERGMs) are a widely used framework for network data, enabling hypothesis testing on the structural mechanisms underlying observed networks. Bayesian ERGMs provide principled uncertainty quantification and enable the incorporation of prior knowledge through fully probabilistic modelling. However, computation remains challenging because the posterior is doubly intractable, with a likelihood normalising constant that depends on unknown parameters. This paper reviews Bayesian approaches to ERGM inference, categorising inference methods into three broad classes: auxiliary variable MCMC methods, adjusted pseudo-likelihood approaches, and variational methods, alongside dedicated treatment of model selection. We also discuss modelling extensions for missing data, longitudinal dynamics, populations of networks, and weighted networks, highlighting applications across various scientific disciplines.

2026-05-25T14:00:56Z 16 pages Alberto Caimo Isabella Gollini http://arxiv.org/abs/2605.25855v1 High-Dimensional Change-Point Detection via Angular Kernel Statistics 2026-05-25T13:45:38Z

We study change-point detection for high-dimensional data in regimes where inference must be performed from small batches of observations. Our primary focus is the high-dimensional, low sample size (HDLSS) regime, where the sequence length is fixed while the ambient dimension diverges. We propose a dimension-averaged angular kernel scan framework for detecting marginal distributional shifts. The statistic aggregates bounded one-dimensional angular discrepancies across coordinates, yielding a fully nonparametric, hyperparameter-free, and moment-agnostic estimator that remains well-defined without specifying, estimating, or assuming finite marginal moments, for example under heavy-tailed or contaminated distributions. For the offline single-change problem, we derive an exact population mean factorization into a universal deterministic shape function and a scalar signal factor, characterize the null covariance structure up to a scalar long-run variance factor, and establish an HDLSS multivariate central limit theorem under cross-coordinate mixing. These results lead to plug-in Gaussian calibration, asymptotic type-I error control, and power and localization guarantees, including a $d^{-1/2}$ local detection scale. We further extend the offline procedure to a fixed-window sequential monitoring procedure for high-dimensional streaming data, and obtain ARL calibration and worst-case EDD bounds. Simulation studies demonstrate that the proposed method can accurately detect and localize changes in challenging HDLSS and streaming settings where moment-based or hyperparameter-sensitive procedures may be unreliable.

2026-05-25T13:45:38Z Jyotishka Ray Choudhury Yao Xie http://arxiv.org/abs/2602.07704v2 Correcting for Nonignorable Nonresponse Bias in Ordinal Observational Survey Data 2026-05-25T13:12:17Z

Many political surveys rely on post-stratification, raking, or related weighting adjustments to align respondents with the target population. But when respondents differ from nonrespondents on the outcome itself (nonignorable nonresponse), these adjustments can fail, introducing bias even into basic descriptives. We provide a practical method that corrects for nonignorable nonresponse by leveraging response-propensity proxies (e.g., interviewer-coded cooperativeness) observed among respondents to extrapolate toward nonrespondents, while directly integrating observable covariates and retaining the benefits of post-stratification with known population shares. The method generalizes the variable-response-propensity (VRP) framework of Peress (2010) from binary to ordinal outcomes, which are widely used to measure trust, satisfaction, and policy attitudes. The resulting estimator is computed by maximum likelihood and implemented in a compact R routine that handles both ordinal and binary outcomes. Using the 2024 American National Election Study (ANES), we show that accounting for nonignorable nonresponse produces substantively meaningful shifts for life satisfaction (estimated latent correlation $\rho \approx 0.53$), while yielding negligible changes for retrospective economic evaluations ($\rho \approx 0$), highlighting when nonignorable nonresponse substantively affects survey estimates.
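
For context on the baseline adjustment mentioned above, here is a minimal sketch of post-stratification with known population shares. The function names, strata, and numbers are hypothetical, and this is not the paper's VRP-based estimator: the weights realign the sample with the population composition, but they correct the estimate only if nonresponse is ignorable within each stratum.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_shares):
    """Weight each respondent by (population share) / (sample share) of their stratum."""
    counts = Counter(sample_strata)
    n = len(sample_strata)
    return [population_shares[s] / (counts[s] / n) for s in sample_strata]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy sample: "young" respondents are over-represented (75% vs 50%).
strata = ["young", "young", "young", "old"]
outcome = [1, 1, 1, 0]
shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(strata, shares)
raw_estimate = sum(outcome) / len(outcome)
adjusted_estimate = weighted_mean(outcome, weights)
```

In this toy sample the unweighted mean is 0.75 and the post-stratified mean is 0.5. If respondents within a stratum differ from nonrespondents on the outcome itself, neither figure need be correct, which is the situation the abstract addresses.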

2026-02-07T21:15:33Z 17 pages Lukáš Lafférs Jozef Michal Mintal Ivan Sutóris http://arxiv.org/abs/2605.25811v1 Geometry Adaptive Counterfactual Distribution Learning with Diffusion-Guided Smoothing 2026-05-25T13:02:56Z

We study counterfactual distribution learning for high-dimensional outcomes whose counterfactual law may concentrate near lower-dimensional structure. Standard isotropic smoothing treats all ambient directions equally, leading to unfavorable scaling and unstable local inference. We propose two diffusion-guided estimators based on semiparametric debiasing: diffusion-informed smoothing for counterfactual densities and diffusion-informed score smoothing for counterfactual scores. The estimators combine causal nuisance adjustment with geometry-adaptive localization driven by diffusion score information, removing first-order nuisance bias while aligning smoothing with local outcome geometry. We establish asymptotic expansions, risk bounds, and inference procedures for smoothed density and score-based targets, with ambient density inference obtained under additional approximation conditions. Under structural geometry conditions, the leading stochastic error is governed by an effective dimension induced by the diffusion-guided kernel, rather than by the ambient dimension. Semi-synthetic experiments based on CelebA show steeper error decay for geometry-adaptive methods, supporting the proposed effective-dimension theory.

2026-05-25T13:02:56Z Kwangho Kim http://arxiv.org/abs/2602.05938v2 DiPPER: A Bayesian approach to differential prevalence analysis with applications in microbiome studies 2026-05-25T12:31:56Z

Recent evidence suggests that analyzing the presence/absence of taxonomic features can offer a compelling alternative to differential abundance analysis in microbiome studies. However, standard approaches to differential prevalence analysis face challenges with boundary cases and multiple testing. To address these limitations, we developed DiPPER (Differential Prevalence via Probabilistic Estimation in R), a method based on Bayesian hierarchical modeling. We benchmarked our method against existing differential prevalence methods, along with two differential abundance tools, using publicly available data from 57 human gut microbiome studies. We observed considerable variation in performance across the evaluated methods. Importantly, DiPPER demonstrated high sensitivity to detect potentially differentially prevalent features while maintaining a well-calibrated family-wise error rate under the global null hypothesis. Most notably, it outperformed the alternatives in the replication of findings across independent studies. Furthermore, DiPPER provides differential prevalence estimates and uncertainty intervals that are inherently adjusted for multiple testing.

2026-02-05T17:49:08Z Source code and datasets: https://github.com/jepelt/differential-prevalence. R package: https://github.com/jepelt/DiPPER Juho Pelto Kari Auranen Janne V. Kujala Leo Lahti http://arxiv.org/abs/2605.25734v1 Stein-Encoder: A White-Box Supervised Encoder via Stein Identities in Multi-Modal Studies 2026-05-25T11:43:09Z

In multi-modal biomedical research, integrating high-dimensional genomic data with clinical baselines is essential for precision medicine. However, standard deep neural network approaches often entangle these modalities, obscuring the specific predictive impact of genetic features and possibly leading to suboptimal predictive performance. Motivated by the landmark METABRIC cohort study of primary breast tumors, we propose the Stein-Encoder, a white-box supervised framework designed to isolate the genetic signal driving clinical outcomes conditional on nuisance covariates. By leveraging Stein's method and residualization techniques, our approach constructs an interpretable single index that summarizes relevant biological heterogeneity while flexibly incorporating clinical factors and can be used to improve downstream prediction. We establish theoretical guarantees for identification, consistency and efficiency improvement. Applied to the METABRIC cohort, the Stein-Encoder outperforms unsupervised benchmarks in predictive accuracy. Crucially, it achieves structural disentanglement by revealing response-specific biological mechanisms: we find that tumor size is driven primarily by mitotic networks, whereas prognostic indices rely on a distinct proliferation-versus-immune axis. This work contributes a unified, computationally efficient framework that bridges statistical rigor with the representational power of neural networks, enabling interpretable, task-specific and efficient compression of multi-modal health data for a wide range of precision medicine applications, beyond biomarker discovery.

2026-05-25T11:43:09Z Jiarui Zhang Shuoxun Xu Jiasheng Shi Xinzhou Guo http://arxiv.org/abs/2404.14328v2 Preserving linear invariants in ensemble filtering methods 2026-05-25T10:23:15Z

Data assimilation combines dynamical models with observations to improve state estimates. Ensemble filters sequentially assimilate observations by updating a set of samples over time, alternating between a forecast and an analysis step. Accurate and robust predictions often require preserving critical invariants such as mass, stoichiometric balance of chemical species, and electrical charge. While modern numerical solvers maintain these invariants, existing invariant-preserving analysis steps are limited to Gaussian settings. Furthermore, they can be incompatible with regularization techniques such as inflation and covariance tapering. In this work, we focus on preserving linear invariants in non-Gaussian filtering problems. Leveraging tools from measure transport theory, we introduce a novel class of nonlinear ensemble filters that preserve any desired linear invariants. Notably, we recover a constrained formulation of the Kalman filter for the special case of the Gaussian setting. We also demonstrate how to combine preserving invariants with regularization techniques in the ensemble Kalman filter. Numerical experiments illustrate the benefits of preserving linear invariants in both ensemble Kalman filters and transport-based nonlinear ensemble filters.

2024-04-22T16:39:32Z 25 pages Journal of Computational Physics (2026) Mathieu Le Provost Jan Glaubitz Youssef Marzouk 10.1016/j.jcp.2026.115048 http://arxiv.org/abs/2605.25496v1 Estimation of Directed Acyclic Graphs by Frequentist Model Averaging 2026-05-25T06:57:18Z

Directed acyclic graphs provide a fundamental tool for representing directed dependence structures in multivariate network data, and are widely used to model financial and economic networks. However, accurate and interpretable estimation remains challenging under graph structural uncertainty. We propose an optimal model averaging method for directed acyclic Gaussian graphs. With a set of candidate models varying by graph structures, we average estimates from candidate models using weights that minimize a penalized negative log-likelihood criterion. In contrast to existing approaches, we not only establish the asymptotic optimality, weight consistency, and parameter consistency of the proposed method, but also explicitly characterize how different candidate models affect the convergence rate. Moreover, we prove parameter consistency even when all candidate graph models are misspecified. Results from simulation studies and a real-data analysis on the banks' international liability data show the promise of the proposed method.

2026-05-25T06:57:18Z 33 pages, 5 figures Huihang Liu Wenhui Li Xinyu Zhang http://arxiv.org/abs/2605.25478v1 Transcripts and Algebraic Distances in Time Series: Stochastic Properties and Nonparametric Dependence Tests 2026-05-25T06:31:04Z

The use of ordinal patterns (OPs) for analyzing the dependence structure of univariate and continuously distributed processes has gained popularity in recent years. This research goes one step further and considers the transcripts being computed from successive OPs in the time series. Transcripts constitute a kind of ``difference'' between successive OPs and thus naturally relate to two algebraic distances between OPs, the Cayley and Kendall edit distances. The original time series is transformed into a sequence of transcripts or distances, respectively, and important stochastic properties thereof are derived. It is shown that these properties differ substantially among different types of original processes. This motivates the development of various statistics based on transcripts and edit distances in order to investigate the dependence structure of the original process. In particular, the asymptotic distribution of these statistics under the null hypothesis of serial independence is derived, which is then used to implement nonparametric tests for serial dependence. A simulation study shows that these novel dependence tests have appealing power properties, often outperforming former OP-based dependence tests. A concluding real-world data example illustrates the application and interpretation of the proposed approaches in practice.
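
The objects named above can be sketched in a few lines. The conventions here are assumptions rather than the paper's definitions: ordinal patterns are encoded as rank vectors of overlapping windows, the transcript from pattern `a` to pattern `b` is taken to be the permutation `t` with `t ∘ a = b`, the Cayley distance is the window length minus the number of cycles of the transcript, and the Kendall distance counts discordant pairs.

```python
from itertools import combinations


def ordinal_pattern(window):
    """Rank vector of a window: entry i is the rank (0 = smallest) of window[i]."""
    order = sorted(range(len(window)), key=lambda i: window[i])
    ranks = [0] * len(window)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return tuple(ranks)


def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def compose(b, a):
    """Composition (b after a): position i is sent to b[a[i]]."""
    return tuple(b[a[i]] for i in range(len(a)))


def transcript(a, b):
    """Permutation t such that compose(t, a) == b."""
    return compose(b, inverse(a))


def cayley_distance(a, b):
    """Minimum number of transpositions turning a into b: n minus cycle count."""
    t = transcript(a, b)
    seen, cycles = set(), 0
    for start in range(len(t)):
        if start not in seen:
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = t[j]
    return len(t) - cycles


def kendall_distance(a, b):
    """Number of index pairs that a and b order differently."""
    return sum(1 for i, j in combinations(range(len(a)), 2)
               if (a[i] - a[j]) * (b[i] - b[j]) < 0)


series = [0.3, 1.2, 0.7, 2.5, 1.9]
patterns = [ordinal_pattern(series[i:i + 3]) for i in range(len(series) - 2)]
first_transcript = transcript(patterns[0], patterns[1])
```

Mapping a series to its sequence of transcripts or distances in this way gives the transformed sequence on which dependence statistics could then be computed.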

2026-05-25T06:31:04Z Christian H. Weiß José M. Amigó http://arxiv.org/abs/2606.07556v1 Selecting New Measurement Locations to Diversify Traffic-Pattern Coverage: A Real-World Evaluation for Total Traffic Volume Estimation 2026-05-25T06:14:39Z

Accurate measurement of traffic volumes and flows is vital for modern intelligent transportation. However, despite recent technological advances in sensor devices, it is still expensive to install and maintain fixed traffic counters. Counters can therefore be installed at only a small fraction of locations, which severely limits the ability to measure and predict total traffic volume at a city-wide level. By contrast, devices with location history such as smartphones and connected vehicles are now widely used and provide much wider spatial coverage. However, the data from these devices are usually partial and noisy, so they are not sufficient on their own to estimate total traffic volumes and flows. In this paper, we use the information from these widely available devices to help decide where to place additional traffic counters, and we study how selecting new measurement locations can improve city-wide traffic estimation performance. To achieve this, we propose an algorithm that chooses additional counter locations to increase the diversity of observed traffic signal patterns, rather than simply spreading counters evenly over space. The goal is to capture traffic-pattern types that are rare in the current counter set and to make the collected observations more representative for later estimation and forecasting. We also present a real-world evaluation: in a target city, we selected new locations expected to improve traffic prediction and then commissioned new field measurements at those locations at our own expense. The resulting data led to an improvement in traffic volume estimation accuracy across different fidelities.
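
The abstract does not specify the selection algorithm, so the following is only a stand-in that illustrates the general idea of diversity-driven placement: a greedy farthest-point rule that repeatedly picks the candidate whose traffic-pattern vector is farthest from every pattern already covered. The function name, the pattern vectors, and the use of Euclidean distance are all hypothetical.

```python
import math


def greedy_diverse_selection(existing, candidates, k):
    """Pick up to k candidate ids whose patterns are farthest from covered ones.

    existing:   list of pattern vectors at current counter locations
    candidates: dict mapping candidate id -> pattern vector
    """
    covered = list(existing)
    remaining = dict(candidates)
    chosen = []
    for _ in range(min(k, len(remaining))):
        def novelty(item):
            # Distance from this candidate to its nearest covered pattern.
            _, pattern = item
            return min((math.dist(pattern, c) for c in covered), default=math.inf)

        best_id, best_pattern = max(remaining.items(), key=novelty)
        chosen.append(best_id)
        covered.append(best_pattern)
        del remaining[best_id]
    return chosen


# Hypothetical two-dimensional pattern summaries (e.g. morning vs evening share).
existing_patterns = [(1.0, 0.0)]
candidate_patterns = {"A": (0.9, 0.1), "B": (0.0, 1.0), "C": (0.5, 0.5)}
selected = greedy_diverse_selection(existing_patterns, candidate_patterns, k=2)
```

Here candidate "A" is skipped because its pattern nearly duplicates the existing counter, whereas a purely spatial rule would have no reason to prefer "B" or "C".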

2026-05-25T06:14:39Z 12 pages, 7 figures Masaaki Inoue Akifumi Okuno Shintaro Fukushima