http://arxiv.org/abs/2606.05954v1 Network model selection: A review of methods 2026-06-04T09:53:12Z

Understanding the processes behind the evolution of complex networks is a key objective in network science. An effective framework for tackling this challenge is network model selection, which involves finding the model from a set of candidates that best explains a given network. This book is a systematic review of methods for this purpose. Each method is outlined in three parts: its core principle (used to organize methods into four categories), other relevant details including my own observations, and software availability. The book provides a comprehensive overview of the state of the art in network model selection and concludes by exploring future directions. A unified, optimal method could identify the mechanisms that shape real-world networks more precisely than any current approach. This work represents the first step toward developing such an optimal method. It will be a valuable resource for students and researchers in network science.

2026-06-04T09:53:12Z This is an Accepted Manuscript version of the book: Zoran Levnajic, Network model selection: A review of methods, 2026, Springer. This version has been accepted for publication, but is not the Version of Record and does not reflect post-acceptance improvements (such as copyediting or typesetting), or any corrections. The final authenticated version is available online at ISBN 978-3-032-30448-3 Zoran Levnajić http://arxiv.org/abs/2606.05935v1 Hessian-informed, Coordinate Friendly Hamiltonian Monte Carlo in Linear Time 2026-06-04T09:37:55Z

Riemannian Hamiltonian Monte Carlo (RHMC) is a promising MCMC methodology thanks to its ability to accommodate position-dependent preconditioning and multi-step proposals. While RHMC performs well in low dimensions, it becomes infeasible in high dimensions due to its $O(d^3)$ cost per fixed-point iteration, where $d$ is the dimension of the target density. Even when the position-dependent preconditioner is based on the diagonal of the Hessian, the cost is still $O(d^2)$ per fixed-point iteration. In this paper, we propose a computational method to reduce the computational complexity of RHMC fixed-point iterations with diagonal preconditioners from $O(d^2)$ to $O(d)$ for targets with ``coordinate friendly'' structures. This distribution class includes generalized linear models as well as other dense and sparse graphical models. The method is expressed as manipulating the compute graph and can therefore be automated to work on black box targets. Finally, we show empirically that our implementation of RHMC results in better sample quality per unit of compute time for various target distributions compared to state-of-the-art HMC NUTS algorithms with both position-independent and position-dependent preconditioners.

2026-06-04T09:37:55Z Son Luu Nikola Surjanovic Zuheng Xu Trevor Campbell Alexandre Bouchard-Côté http://arxiv.org/abs/2606.05871v1 Compositional Boundaries for Density Fusion 2026-06-04T08:45:59Z

Distributed uncertainty-management systems often combine local probabilistic models along aggregation trees chosen by communication, privacy, or scheduling constraints. The final density should depend on the weighted sources, not on the particular order in which intermediate nodes combine them. We study this requirement as an algebraic compositionality problem for binary fusion of weighted probability densities. The central question is when a local fusion rule can be executed hierarchically while remaining order-invariant. We establish a compositional boundary for local segment-valued fusion rules. Within the class of continuous binary rules with additive output weights and weight-only coefficients, order-invariant hierarchical execution characterizes normalized weighted linear pooling; norm-induced segment balancing realizes the corresponding coefficient. Smooth endpoint-to-candidate $f$-divergence balancing has a different local geometry: its quadratic expansion induces square-root effective weights, showing why pairwise solvability alone is insufficient for schedule-independent fusion. We show that this obstruction is local to endpoint-to-candidate binary balancing, whereas global divergence barycenters retain additive-weight local limits. Finally, Gaussian mixtures show how the same issue appears in finite model classes: exact fusion is compositional, whereas stepwise compression is compositional only under a congruence condition on unnormalized component measures. These results distinguish exact schedule-independent fusion from global aggregation objectives and local approximation heuristics.
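As a minimal illustration of the order-invariance requirement, the sketch below implements normalized weighted linear pooling on discrete densities, with the output weight equal to the sum of the input weights, and checks that two different aggregation trees give the same fused density. The function name `fuse` and the toy sources are assumptions made for illustration; they are not taken from the paper.

```python
import math

def fuse(a, b):
    """Binary linear pooling of two weighted discrete densities.

    Each argument is (weight, probs). The output weight is additive and the
    output density is the weight-normalized mixture of the two inputs.
    """
    (wa, pa), (wb, pb) = a, b
    w = wa + wb
    return w, [(wa * x + wb * y) / w for x, y in zip(pa, pb)]

# Three hypothetical weighted sources on a three-point support.
s1 = (1.0, [0.7, 0.2, 0.1])
s2 = (2.0, [0.1, 0.8, 0.1])
s3 = (3.0, [0.3, 0.3, 0.4])

left = fuse(fuse(s1, s2), s3)   # tree ((s1, s2), s3)
right = fuse(s1, fuse(s2, s3))  # tree (s1, (s2, s3))

# Both schedules should agree up to floating-point rounding.
same = all(math.isclose(x, y) for x, y in zip(left[1], right[1]))
print(left[0], right[0], same)
```

Carrying the accumulated weight alongside each intermediate density is what lets a later node pool correctly without knowing how many sources were merged upstream.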

2026-06-04T08:45:59Z Ratan Bahadur Thapa Ali Darijani Jürgen Beyerer Steffen Staab http://arxiv.org/abs/2509.20345v3 General Synthetic-Powered Inference 2026-06-04T08:42:19Z

The rapid proliferation of high-quality synthetic data -- generated by advanced AI models or collected as auxiliary data from related tasks -- presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around a broad class of statistical inference procedures to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard method using only real data when synthetic data are of low quality. The error rate of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction and the comparison of large reasoning models on complex math problems.

2025-09-24T17:37:14Z Meshi Bashari Yonghoon Lee Roy Maor Lotan Edgar Dobriban Yaniv Romano http://arxiv.org/abs/2605.31278v2 Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation 2026-06-04T07:25:36Z

Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide
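To make the debiasing idea concrete, here is a minimal sketch of the basic prediction-powered estimate of a mean: the average proxy score on unlabeled items plus the average human-minus-proxy gap on the labeled items, with a normal-approximation interval. It uses only the Python standard library and is not GLIDE's API; the function name `ppi_mean` and the toy scores are hypothetical.

```python
import math
import statistics

def ppi_mean(y_labeled, pred_labeled, pred_unlabeled, z=1.96):
    """Basic prediction-powered mean estimate with a normal-approximation CI.

    y_labeled      : human annotations on the small labeled set
    pred_labeled   : proxy (e.g. LLM-judge) scores on the same labeled items
    pred_unlabeled : proxy scores on the large unlabeled set
    """
    n, big_n = len(y_labeled), len(pred_unlabeled)
    # Rectifier: how far the proxy is off on items where truth is known.
    rectifier = [y - p for y, p in zip(y_labeled, pred_labeled)]
    estimate = statistics.fmean(pred_unlabeled) + statistics.fmean(rectifier)
    # The two samples are independent, so their variances add.
    var = statistics.variance(pred_unlabeled) / big_n + statistics.variance(rectifier) / n
    half_width = z * math.sqrt(var)
    return estimate, (estimate - half_width, estimate + half_width)

# Toy example (made-up binary scores): the proxy is too lenient on one item.
y_lab = [1, 0, 1, 1]
p_lab = [1, 1, 1, 1]
p_unl = [1, 1, 0, 1, 1, 1, 0, 1]

est, (lo, hi) = ppi_mean(y_lab, p_lab, p_unl)
print(est, lo, hi)
```

The proxy-only average here is 0.75, and the rectifier of -0.25 pulls the estimate down to 0.5. The interval is wide only because the toy labeled set has four items.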

2026-05-29T13:10:35Z 8 pages, Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems, Seoul, South Korea, 2026 Grégoire Martinon Ibrahim Merad Mohammed Raki http://arxiv.org/abs/2605.16846v3 Efficient frequentist fractional polynomials for skewed dose-response and survival data: a variance-reducing alternative to OLS-FP 2026-06-04T06:33:14Z

Fractional polynomials (FP) are a standard tool for modelling nonlinear dose-response and covariate effects, implemented in the widely used mfp package. The conventional FP fit estimates its coefficients by ordinary least squares (OLS-FP), which is statistically inefficient when the regression errors are skewed or heavy-tailed, a common situation for survival times, concentrations and biomarkers. We present a drop-in replacement that keeps the identical FP model and design but estimates the coefficients with a moment-based score tuned to the residual skewness and kurtosis, giving a closed-form efficiency factor g2 = 1 - gamma3^2/(2+gamma4) relative to OLS-FP. Across skewed error laws the method reduces slope-coefficient variance by 10-20% for mildly skewed errors and up to roughly 60% for heavy-tailed log-normal errors, at realistic sample sizes, while keeping confidence-interval coverage close to nominal, and it reverts exactly to OLS-FP under symmetry, so it is never harmful when no gain is available. On the German Breast Cancer Study Group cohort it narrows the tumour-size confidence interval by 26% (bootstrap variance ratio 0.53 against the predicted 0.56), and a primary-biliary-cirrhosis cohort reproduces the gain. The estimator is closed-form, runs in milliseconds, and is released as a reproducible R package (pmm_fp in EstemPMM) with a one-command replication bundle; its core variance identity is machine-checked in Lean 4.
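The efficiency factor quoted above is easy to evaluate numerically. The sketch below assumes that gamma3 is the residual skewness and gamma4 the excess kurtosis (so both are zero for Gaussian errors), and reads g2 as the predicted variance ratio relative to OLS-FP. The function names are hypothetical and this is not the interface of the pmm_fp package.

```python
import statistics

def skewness_and_excess_kurtosis(residuals):
    """Moment-based sample skewness and excess kurtosis of a residual vector."""
    n = len(residuals)
    mean = statistics.fmean(residuals)
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def efficiency_factor(gamma3, gamma4):
    """g2 = 1 - gamma3^2 / (2 + gamma4), as stated in the abstract."""
    return 1.0 - gamma3 ** 2 / (2.0 + gamma4)

# Exponential errors have skewness 2 and excess kurtosis 6.
g2_exponential = efficiency_factor(2.0, 6.0)

# Symmetric residuals have zero skewness, so no gain is predicted.
g3, g4 = skewness_and_excess_kurtosis([-2.0, -1.0, 0.0, 1.0, 2.0])
g2_symmetric = efficiency_factor(g3, g4)

print(g2_exponential, g2_symmetric)
```

Under these assumptions the formula predicts a halving of variance for exponential errors and a factor of exactly 1 under symmetry, which matches the abstract's statement that the method reverts to OLS-FP when no gain is available.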

2026-05-16T07:07:26Z Revised and retitled version prepared for journal submission; applied biostatistical framing strengthened, primary-biliary-cirrhosis confirmation added, and supplementary theory separated. 25 pages, 2 figures, 5 tables Serhii Zabolotnii http://arxiv.org/abs/2605.29972v3 Identification-Robust Testing in Endogenous Functional Linear Regression with Weak or Irrelevant Auxiliary Variables 2026-06-04T04:37:11Z

We develop dimension-reduction-free tests for the slope function in functional linear regression when the functional regressor may be endogenous or measured with error. The tests are based on a functional moment condition induced by an auxiliary functional variable and do not require estimation of the slope function. This feature is particularly useful in infinite-dimensional settings, where the identification and regularization conditions needed for consistent estimation are often strong and difficult to verify. The proposed procedures remain asymptotically valid under weak or even failed relevance of the auxiliary variable, and they are consistent against fixed alternatives that are detectable through the moment operator. We establish the asymptotic null distribution, consistency against detectable alternatives, and local power under drifting alternatives. We also derive the locally optimal test within a class of weighted test statistics. Feasible critical values for implementation of the tests are obtained from data. Simulations show reliable size control and competitive power, including under weak relevance. We illustrate the method using a functional regression analysis of residential electricity demand and temperature distributions in South Korea.

2026-05-28T14:10:27Z Won-Ki Seo http://arxiv.org/abs/2606.05676v1 regcorr: An R Package for Regression Models of Pearson Correlation Coefficients 2026-06-04T03:58:37Z

Pearson's correlation coefficient is commonly used as a single-number summary of association between two responses. In many applications, however, the strength of association is itself heterogeneous and may vary with demographic, biological, experimental, or environmental covariates. The regcorr package implements regression models in which a Pearson correlation coefficient is linked to a linear predictor of covariates. The package supports bivariate normal responses and bivariate Bernoulli responses, provides Newton-Raphson estimation routines, includes data generators for simulation studies, and supplies a bootstrap-based subroutine for assessing the significance and power of covariate effects. The implementation follows the likelihood-based framework of Dufera, Liu, and Xu (2023) and exposes it through a lightweight R interface with no compiled code and minimal dependencies. This paper describes the statistical model, the computational design of regcorr, reproducible usage examples, and practical guidance for interpreting covariate-dependent correlations. The package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=regcorr under the MIT license.

2026-06-04T03:58:37Z 8 pages. R package available on CRAN Ze Lin Bo Li Jinyao Shen http://arxiv.org/abs/2606.05666v1 Weighting a Census as a Non-Probability Sample: A Doubly Robust Framework for Correcting Differential Undercoverage in Uruguay's 2023 Census 2026-06-04T03:48:46Z

The 2023 Uruguayan Census recorded a population of 3,444,451 with an estimated undercoverage of 10.3%. Post-enumeration evidence shows that omission was non-random, concentrated in vulnerable areas, rural territories, and among young adults. Integrating administrative records (AR) recovered aggregate counts but did not resolve selection bias in outcome variables, as AR lack core census variables, exhibit urbanicity and institutional-visibility biases, and do not reconstruct households. Estimates derived from enumerated microdata remain biased. We treat effectively enumerated households as a non-probability sample with an unknown selection mechanism and construct weights using a doubly robust (DR) estimator. This framework combines a segment-level response-propensity model, using the web linkage rate as a contact proxy, with calibration to combined-census demographic totals (sex, age, department). Because the DR estimator is consistent when either model is correctly specified, it provides robustness against undercoverage misspecification. We describe the application at a scale of three million records, document its effect on social indicators, and present a variance approximation based on an equivalent stratified cluster design. Finally, we establish a methodological framework to guide national statistical offices on optimizing non-response adjustments based on their available registers and paradata.

2026-06-04T03:48:46Z Ferreira Juan Pablo Goyeneche Juan Jose http://arxiv.org/abs/2601.13150v3 Propensity Score Propagation: A General Framework for Design-Based Inference with Unknown Propensity Scores 2026-06-04T03:17:28Z

Design-based inference, also known as randomization-based or finite-population inference, provides a principled framework for trustworthy statistical inference by attributing randomness solely to the design mechanism (e.g., treatment assignment, survey sampling, or missingness), without imposing super-population distributional or modeling assumptions on outcome data. From Fisher's and Neyman's seminal work to the recent resurgence of design-based inference, this perspective has played a central role in causal inference, survey sampling, and missing data analysis. However, a fundamental obstacle has limited its use in many modern applications: existing design-based inference theory typically relies on known propensity scores (i.e., known design probabilities), whereas propensity scores are usually unknown in observational studies, real-world survey settings, and missing data problems. We propose propensity score propagation, a general framework for valid design-based inference with unknown propensity scores. The framework introduces a regeneration-and-union procedure that propagates uncertainty from propensity score estimation into downstream design-based inference without imposing super-population outcome assumptions. It accommodates both parametric and nonparametric propensity score models, integrates seamlessly with existing design-based methods developed under known propensity scores, and applies broadly across design-based inference problems. Theoretical results and simulation studies show that the proposed framework achieves nominal coverage, even when existing approaches exhibit substantial under-coverage.

2026-01-19T15:32:09Z Siyu Heng Yanxin Shen Zijian Guo http://arxiv.org/abs/2402.12825v3 Quasi-maximum likelihood estimation for scalable ARMA models 2026-06-04T02:30:33Z

The recently proposed scalable ARMA model preserves the parsimony of traditional VARMA models while achieving greater computational tractability. However, existing studies are limited to regularized least squares estimation (LSE) for high-dimensional settings, which is not only statistically less efficient but also requires the sub-Gaussian assumption for its theoretical guarantees. Moreover, the model still lacks inference tools for real applications. To fill this gap, we develop a quasi-maximum likelihood estimation (QMLE) framework for scalable ARMA models. Its asymptotic normality is established under a finite fourth-order moment condition, and we formally prove its asymptotic efficiency gain over LSE. We also introduce an efficient block coordinate descent algorithm for computation and a consistent Bayesian information criterion for model selection. Simulation studies validate the finite-sample performance of our methodology, and an empirical application to six macroeconomic indicators demonstrates its practical utility.

2024-02-20T08:49:58Z 65 pages, 2 figures, 5 tables Yuchang Lin Wenyu Li Qianqian Zhu http://arxiv.org/abs/2606.05599v1 Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations 2026-06-04T02:24:50Z

This paper establishes a theoretical framework for the uniform convergence of smoothly activated deep neural network (DNN) estimators. While standard ReLU networks achieve minimax-optimal rates in the $L^2(P)$ norm for various nonparametric regression tasks, we establish a theoretical lower bound demonstrating that least-squares ReLU estimators can suffer from the curse of dimensionality in their uniform convergence behavior. Motivated by the need for reliable uniform guarantees in downstream tasks requiring worst-case reliability, we address this limitation by analyzing smoothly activated DNNs (smooth DNNs), encompassing both feedforward and residual structures. We establish novel pseudo-dimension bounds, non-asymptotic approximation guarantees, and Hölder-norm bounds for the approximators of these models. Leveraging these results, we derive non-asymptotic uniform convergence rates for smooth DNN estimators across multiple statistical contexts, including Huber, least-squares, quantile, and logistic regression. We prove that smooth DNNs can mitigate the curse of dimensionality in uniform convergence by adaptively exploiting the low-dimensional hierarchical composition structure of the target function. Supported by both simulation studies and a real-world application, our results position smooth DNNs as a theoretically grounded and practically viable alternative to ReLU networks for statistical learning tasks requiring uniform guarantees.

2026-06-04T02:24:50Z 30 pages, 5 figures Yizhe Ding Runze Li Jia Liu Lingzhou Xue http://arxiv.org/abs/2606.05560v1 Wasserstein Exponential Smoothing 2026-06-04T01:14:22Z

Exponential smoothing (ES) often outperforms other techniques in time series forecasting across a wide range of data-generating processes. While ES has traditionally been applied to time series in $\mathbb{R}$, this paper extends the methodology to distributional time series, where each observation is a probability distribution on $\mathbb{R}$. The primary contribution of this work is twofold. First, we propose a principled and intuitive generalization of ES within the Wasserstein space, which retains the exceptional parsimony of classical ES. Second, we theoretically and empirically demonstrate that the smoothing parameter can be consistently estimated by minimizing a Wasserstein distance. Applications to distributional time series of high-frequency financial returns and household electricity demands confirm the practical effectiveness of our Wasserstein ES model.
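One way to picture exponential smoothing in Wasserstein space on $\mathbb{R}$ is through quantile functions: the 2-Wasserstein distance between two distributions on the real line is the $L^2$ distance between their quantile functions, and geodesics are linear interpolations of quantile functions. The sketch below smooths quantile vectors on a fixed grid of levels and scores a smoothing parameter by one-step-ahead squared Wasserstein error. It is an illustrative reading under these assumptions, not the paper's construction; all function names are hypothetical.

```python
import math

def w2_distance(q1, q2):
    """Approximate 2-Wasserstein distance from quantiles on a shared, evenly spaced grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)) / len(q1))

def wasserstein_es(quantile_series, alpha):
    """Exponentially smooth a series of distributions given as quantile vectors.

    Each update moves the current state a fraction alpha along the
    Wasserstein geodesic toward the new observation.
    """
    state = list(quantile_series[0])
    path = [state]
    for q in quantile_series[1:]:
        state = [alpha * x + (1.0 - alpha) * s for x, s in zip(q, state)]
        path.append(state)
    return path

def one_step_loss(quantile_series, alpha):
    """Sum of squared W2 errors when state t-1 is used to forecast observation t."""
    path = wasserstein_es(quantile_series, alpha)
    return sum(w2_distance(path[t - 1], quantile_series[t]) ** 2
               for t in range(1, len(quantile_series)))

# Toy series: three distributions described by three quantiles each.
series = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0], [2.0, 3.0, 4.0]]
smoothed = wasserstein_es(series, alpha=0.5)
loss_half = one_step_loss(series, alpha=0.5)
print(smoothed[-1], loss_half)
```

A convex combination of nondecreasing quantile vectors is again nondecreasing, so every smoothed state remains a valid quantile function. A grid search over `alpha` using `one_step_loss` would mimic, in a crude way, estimating the smoothing parameter by minimizing a Wasserstein distance.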

2026-06-04T01:14:22Z Takuo Matsubara Peiwen Jiang Minh-Ngoc Tran Wilson Ye Chen http://arxiv.org/abs/2312.01265v4 The optimal sub-Gaussian normalisation for randomised monotone functions 2026-06-04T01:10:33Z

Let $\mathcal{M}$ denote the class of randomised monotone functions on $\mathbb{R}$ with values in $[0,1]$, and let $U_{\mathcal{M}}\colon \mathbb{R}_+\to \mathbb{R}_+$ be the minimal function for which \[ \mathbb{P}\left\{ \sqrt{\eta_f}\, \sup_{t\in\mathbb{R}} \left| f_Z(t) - \mathbb{E}\left[f_Z(t)\right] \right| \ge \varepsilon\sqrt{U_{\mathcal{M}}(\eta_f)} \right\} \le 2\mathrm{e}^{-2\varepsilon^2} \] holds for every member $f_Z$ of $\mathcal{M}$ of finite effective sample size $\eta_f$ and every positive $\varepsilon$. We prove that for every $x> 1$, \[ \left| \sqrt{U_{\mathcal{M}}(x)} - \sqrt{\log_4 x} \right| \le 2 \min\!\left\{ 1,\, \frac{2 \ln(\mathrm{e} + \ln x)}{\sqrt{\ln x}} \right\}\,. \] The optimal scale $\sqrt{U_{\mathcal{M}}(x)}$ is sharply tied, uniformly at finite sample sizes, to $\frac{1}{\sqrt{2\ln 2}}\sqrt{\ln x}$.
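The constant in the final sentence is the centring term of the bound rewritten in natural logarithms, a step that may be worth spelling out:

\[ \sqrt{\log_4 x} = \sqrt{\frac{\ln x}{\ln 4}} = \sqrt{\frac{\ln x}{2\ln 2}} = \frac{1}{\sqrt{2\ln 2}}\sqrt{\ln x}\,. \]

Since $2\ln(\mathrm{e} + \ln x)/\sqrt{\ln x} \to 0$ as $x \to \infty$, the right-hand side of the bound vanishes in the limit, so $\sqrt{U_{\mathcal{M}}(x)}$ and $\frac{1}{\sqrt{2\ln 2}}\sqrt{\ln x}$ differ by a quantity that tends to zero, while the bound never exceeds 2 for any $x > 1$.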

2023-12-03T03:20:09Z The updated paper refines mathematical results to be significantly sharper. Methods are unchanged, specific corrections are implemented. All empirical applications and field experiment results have been removed. Sutanuka Roy is no longer an author as they are writing a separate empirical paper. Thomas Anton is now at Columbia; Rabee Tourky remains at ANU Thomas Anton Rabee Tourky http://arxiv.org/abs/2410.06326v3 Convex Estimation of Gaussian Graphical Regression Models with Covariates 2026-06-04T00:45:28Z

Gaussian graphical models (GGMs) are widely used to recover the conditional independence structure among random variables. Recent work has sought to incorporate auxiliary covariates to improve estimation, particularly in applications such as co-expression quantitative trait locus (eQTL) studies, where both gene expression levels and their conditional dependence structure may be influenced by genetic variants. Existing approaches to covariate-adjusted GGMs either restrict covariate effects to the mean structure or lead to nonconvex formulations when jointly estimating the mean and precision matrix. In this paper, we propose a convex framework that simultaneously estimates the covariate-adjusted mean and precision matrix via a natural parametrization of the multivariate Gaussian likelihood. The resulting formulation enables joint convex optimization and yields improved theoretical guarantees under high-dimensional scaling, where the sparsity and dimension of covariates grow with the sample size. We support our theoretical findings with numerical simulations and demonstrate the practical utility of the proposed method through a reanalysis of an eQTL study of glioblastoma multiforme and an analysis of diet on the human gut microbiome.

2024-10-08T20:02:10Z Ruobin Liu Guo Yu