https://arxiv.org/api/PI6pTiVQfo904hCtCKTmxooRHug
2026-06-21T11:30:11Z

http://arxiv.org/abs/2506.19958v4
RobustiPy: An efficient next generation multiversal library with model selection, averaging, resampling, and explainable artificial intelligence
2026-05-18T18:14:02Z
Scientific inference is often undermined by the vast but rarely explored "multiverse" of defensible modelling choices, which can generate results as variable as the phenomena under study. We introduce RobustiPy, an open-source Python library that systematizes multiverse analysis and model-uncertainty quantification at scale. RobustiPy unifies bootstrap-based inference, combinatorial specification search, model selection and averaging, joint-inference routines, and explainable AI methods within a modular, reproducible framework. Beyond exhaustive specification curves, it supports rigorous out-of-sample validation and quantifies the marginal contribution of each covariate. We demonstrate its utility across five simulation designs and ten empirical case studies spanning economics, sociology, psychology, and medicine, including a re-analysis of widely cited findings with documented discrepancies. Benchmarking on ~672 million simulated regressions shows that RobustiPy delivers state-of-the-art computational efficiency while expanding transparency in empirical research. By standardizing and accelerating robustness analysis, RobustiPy transforms how researchers interrogate sensitivity across the analytical multiverse, offering a practical foundation for more reproducible and interpretable computational science.
2025-06-24T19:11:42Z
Daniel Valdenegro, Jiani Yan, Duiyi Dai, Charles Rahal

http://arxiv.org/abs/2605.14565v2
A Bayesian Longitudinal Spatial Normative Model for Individualized Brain Deviation Mapping
2026-05-18T17:38:32Z
Normative modeling enables individualized characterization of structural brain deviations by evaluating subjects against a reference population rather than a group average.
Most existing implementations treat brain regions independently and remain cross-sectional, despite the availability of repeated neuroimaging measurements and the well-documented spatial organization of neuroanatomical variation. We propose a Bayesian longitudinal spatial normative model that jointly captures within-subject temporal dependence and spatially structured subject-specific deviations within a unified hierarchical framework. The individualized deviation map is treated as a latent spatial process with an explicit posterior distribution, yielding a principled Bayes estimator under squared error loss rather than an ad hoc residual summary. Across six simulation scenarios encompassing varying spatial dependence, nonlinear trajectories, irregular visit schedules, and missing follow-up, the proposed model consistently reduced deviation-map reconstruction error relative to independent cross-sectional and longitudinal non-spatial benchmarks while maintaining stable calibration. In an application to OASIS-3 structural MRI data, the model reduced RMSE by 54% relative to the independent cross-sectional model and by 45% relative to the longitudinal non-spatial model. Regional deviation burden was concentrated in the temporal pole, entorhinal cortex, inferior temporal cortex, posterior cingulate, and parahippocampal cortex, consistent with regions implicated in early Alzheimer-type neurodegeneration. Subject-level profiles revealed substantial heterogeneity in regional abnormality patterns, including marked multiregional deviation with preserved global cognitive scores.
2026-05-14T08:36:45Z
J. T. Korley

http://arxiv.org/abs/2605.18691v1
Finite Population Sampling as n to N: Empirical Evidence for the Transition from Inference to Accuracy
2026-05-18T17:28:22Z
The Central Limit Theorem provides a foundation for inferential statistics and hypothesis testing. It describes how standardized statistics behave under repeated sampling from large populations.
However, if the size of the sample (n) becomes so large that it approaches the size of the population (N), sampling variability becomes very small, and standard errors and margins of error both approach zero. The purpose of this project was to investigate the behavior of estimators as the sampling fraction (f = n/N) approaches 1, motivated by modern data streams from administrative records, transaction logs, sensor systems, and institutional databases that capture large portions of finite populations. We constructed two finite populations with known parameters and drew repeated samples across a range of sampling fractions. We then examined the resulting randomization distributions of the sample mean to understand how sampling variability collapses. Additional experiments were conducted using various CPU- and GPU-based methods to evaluate the deviation of the sample mean from the defined population mean under different computational conditions. The results confirm that sampling variability diminishes as expected under finite population theory and becomes negligible well before full enumeration is reached. Once sampling variability is minimized, remaining deviations in estimators are primarily related to numerical precision and computational structure rather than random sampling. These findings support a reassessment of inferential assumptions in high-coverage, large-scale data settings.
2026-05-18T17:28:22Z
12 pages, 2 Figures, 3 Tables
Mike Crowhurst

http://arxiv.org/abs/2605.18656v1
Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning
2026-05-18T17:01:34Z
Federated Learning is a leading framework for training ML and AI models collaboratively across numerous user devices or databases. We study the trade-offs among estimation accuracy, privacy constraints, and communication cost for differentially private (DP) federated M estimation.
The two standard methods in the literature are FedAvg, which may suffer from high federation bias, and FedSGD, which can incur high communication cost. Aimed at improving accuracy at a reduced communication cost, we propose FedHybrid, which uses FedSGD starting with an improved initialization by the FedAvg estimator. We also propose FedNewton, which averages local Newton iterations to reduce bias in FedAvg, achieving an estimation accuracy comparable to FedSGD with far fewer communication rounds when the number of clients grows sufficiently slowly. We establish finite-sample upper bounds on the mean-squared error rates of the DP versions of these estimators as functions of the number of clients, local sample sizes, privacy budget, and number of iterations. We further derive a minimax lower bound on the MSE of any iterative private federated procedure that provides a benchmark to assess the optimality gap of these methods. We numerically evaluate our methods for training a logistic regression and a neural network on the computer vision datasets MNIST and CIFAR-10.
2026-05-18T17:01:34Z
Arnab Auddy, Xiangni Peng, Subhadeep Paul

http://arxiv.org/abs/2605.18655v1
Self-Supervised Conformal Prediction with Equivariant Bootstrapping for Image Uncertainty Quantification
2026-05-18T17:00:40Z
Inverse problems are ubiquitous in modern scientific studies and involve recovering an underlying signal from noisy observations often transformed by a measurement operator. These problems are frequently ill-posed, particularly in imaging, leading to multiple plausible solutions and considerable uncertainty in reconstructed images. In fields like the physical and biological sciences, accurate uncertainty quantification (UQ) is critical for trustworthy scientific analyses and confident diagnoses.
Current UQ methods for imaging often fall short; they can be inaccurate, or require unavailable or difficult-to-acquire ground truth data for calibration, which can introduce hidden biases due to distribution shifts between calibration and observed data. We introduce a UQ approach that leverages equivariant bootstrapping to generate heuristic coverages by exploiting data symmetries. We then refine these coverages through a conformal prediction calibration step, while crucially employing a self-supervised approach to avoid the need for ground truth calibration data. We demonstrate this method with weak lensing mass-mapping, where we aim to reconstruct the convergence field from shear measurements of distant galaxies weakly lensed by gravitational fields. Mass-mapping in particular benefits from the self-supervised approach, as simulating calibration data is expensive and relies on specific cosmological models that could introduce biases in downstream cosmological inference tasks.
2026-05-18T17:00:40Z
9 pages, 2 figures; submitted conference proceedings for MaxEnt 2025
Henry J. Aldridge, Tobías I. Liaudat, Marcelo Pereyra, Jason D. McEwen

http://arxiv.org/abs/2605.18633v1
Stable Causal Discovery via Directed Acyclic Graph Aggregation
2026-05-18T16:41:31Z
Directed Acyclic Graphs (DAGs) are central to uncovering causal structure in complex systems, yet learning a single DAG from data is often challenging: model uncertainty, finite samples, and a combinatorially large search space frequently yield unstable estimates. We propose DAGgr, a model averaging framework that aggregates multiple candidate DAGs into a single stable representation. Candidate graphs are weighted by their out-of-sample predictive likelihood across repeated data splits, and a thresholding rule on the resulting edge-importance scores guarantees that the aggregated graph is itself acyclic.
We establish a finite-sample risk bound, prove that the procedure preserves acyclicity, and show that edge selection is consistent under mild conditions on the weights. Simulations across random, hub, and chain structures, together with an analysis of the Sachs et al. (2005) protein-signaling network, show that DAGgr matches or exceeds the best individual candidate while consistently outperforming bootstrap-aggregation baselines across structural recovery metrics.
2026-05-18T16:41:31Z
Yunan Wu, Yue Wang, Chunlin Li, Chenglong Ye

http://arxiv.org/abs/2605.18619v1
Random spanning tree Markov random field priors for Bayesian inverse problems in imaging
2026-05-18T16:28:52Z
Markov random fields are common prior distributions used in Bayesian inverse imaging problems. In particular, difference priors assign probability distributions to differences between neighbouring pixels, such as Gaussian, Laplace, or Cauchy distributions. Depending on the chosen difference distribution, these priors have smoothing or edge-preserving properties. In this work, we propose a hyperprior on the connectivity graph of the pixel grid in the form of a random spanning tree, i.e., a random connected graph with the minimal number of edges, thereby coupling continuous and discrete random variables in the prior. By using random spanning trees, only a sparse random subset of edges is regularized, which helps preserve edges in the image with reduced contrast loss compared to standard difference-based Markov random fields. We discuss how fractal-like interfaces arise in high-resolution prior samples due to the random-tree connectivity. Finally, we propose a Gibbs sampler that alternates between the discrete tree updates and continuous pixel updates to efficiently explore the posterior distribution.
We apply the method to various standard test image restoration problems, including denoising, deblurring, and inpainting, to study the impact of the proposed prior in comparison with existing Markov random fields.
2026-05-18T16:28:52Z
Jasper Marijn Everink

http://arxiv.org/abs/2605.18588v1
OSSMM: An Open-Source Sleep Monitor and Modulator
2026-05-18T16:04:23Z
We present the Open-Source Sleep Monitor and Modulator (OSSMM), an open-source hardware and software platform for accessible sleep research. The OSSMM comprises a small wearable headband built from 3D prints and affordable commercial-off-the-shelf (COTS) components at a material cost under 40 euros, supported by a companion Android application. The system requires no conductive gels, disposable electrodes, or specialized equipment, and captures multiple biosignals (movement, pulse, electrooculography (EOG), and putative electroencephalography (EEG)), with wireless connectivity for data storage and potential sleep modulation capability via an onboard vibration motor. A proof-of-concept single-participant evaluation across 15 nights demonstrated that the captured biosignals support four-stage sleep classification (Wake, Light Sleep, Deep Sleep, REM) using conventional machine learning methods, with the best-performing model achieving a Macro F1-score of 0.770 and accuracy of 0.776 against a validated non-contact sleep monitor ($κ = 0.63$ with PSG). Two technical findings are of particular note. First, inexpensive, reusable conductive thermoplastic polyurethane (CTPU) electrodes from commercial fitness chest straps captured a differential signal whose spectral properties in canonical EEG frequency bands, including signatures consistent with sleep spindles, are the principal features driving classification. Second, this signal is obtained from just two frontal electrodes without a dedicated ground reference, suggesting that practical sleep staging is achievable with simpler configurations than typically employed.
All hardware designs, software, and build instructions are openly available to support replication and modification by the research community.
2026-05-18T16:04:23Z
8 pages
Jonny Giordano, Fergal Stapleton, Gabriel Palma, Barak A. Pearlmutter

http://arxiv.org/abs/2211.09284v5
Iterative execution of discrete and inverse discrete Fourier transforms with applications for signal denoising via sparsification
2026-05-18T15:52:58Z
We describe a family of iterative algorithms that involve the repeated execution of discrete and inverse discrete Fourier transforms. One interesting member of this family is motivated by the discrete Fourier transform uncertainty principle and involves the application of a sparsification operation to both the real domain and frequency domain data with convergence obtained when real domain sparsity hits a stable pattern. This sparsification variant has practical utility for signal denoising, in particular the recovery of a periodic spike signal in the presence of Gaussian noise. General convergence properties and denoising performance relative to existing methods are demonstrated using simulation studies. An R package implementing this technique and related resources can be found at https://hrfrost.host.dartmouth.edu/IterativeFT.
2022-11-17T01:04:10Z
H. Robert Frost

http://arxiv.org/abs/2605.18562v1
Estimating Item Difficulty with Large Language Models as Experts
2026-05-18T15:42:13Z
Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help. However, there is limited evidence on the elicitation procedures and prompt configurations used to emulate experts for difficulty estimation.
This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined six domains of primary-school mathematics, with empirical difficulty estimates treated as the reference. The study used a full factorial design crossing three factors: judgement format (absolute vs pairwise), decision type (hard decisions vs token-probability-based estimates), and prompting strategy (zero-shot vs few-shot). LLM-derived difficulty estimates were compared with empirical difficulties using Spearman rank correlations. Across domains, LLM-based estimates exhibited moderate to strong positive correlations with empirical item difficulties. For simpler arithmetic tasks, some configurations approached the upper end of the accuracy range reported for human experts in previous research. Pairwise comparison consistently outperformed absolute judgement in the absence of additional refinements. However, when token-level probabilities were incorporated and examples of items with known empirical difficulty were provided, the absolute judgement configuration likewise demonstrated moderate-to-high alignment. The study positions LLMs as a promising tool for initial item calibration and offers insights into effective workflow configuration.
2026-05-18T15:42:13Z
24 pages, 2 figures, 9 tables
Diana Kolesnikova (Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands), Kirill Fedyanin (Smart Business Technologies, Belgrade, Serbia), Abe D. Hofman (Department of Psychological Methods, University of Amsterdam, Amsterdam, Netherlands; Prowise Learn, Amsterdam, Netherlands), Matthieu J. S. Brinkhuis (Department of Information and Computing Sciences, Utrecht University, Utrecht, Netherlands), Maria Bolsinova (Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands)

http://arxiv.org/abs/2603.17041v2
When Marginals Match but Structure Fails: Covariance Fidelity in Generative Models
2026-05-18T14:56:03Z
Generative models are increasingly deployed as substitutes for real data in downstream scientific workflows, yet standard evaluation criteria remain focused on marginal distribution matching. We argue that this represents a fundamental gap: downstream inference is rarely a marginal operation, and a model that passes every univariate diagnostic can still produce structurally unreliable synthetic data. We introduce covariance-level dependence fidelity, measured by D_Sigma(P,Q) = ||Sigma_P - Sigma_Q||_F, as a principled, computable criterion for evaluating whether a generative model preserves the joint structure of data beyond its univariate marginals. Three results formalise this criterion. First, marginal fidelity provides no constraint on dependence structure: D_Sigma can be made arbitrarily large while all univariate marginals match exactly. Second, covariance divergence induces quantifiable downstream instability, including sign reversals in population regression coefficients. Third, bounding D_Sigma provides positive stability guarantees for dependence-sensitive procedures such as PCA via Davis-Kahan-type bounds.
Empirical validation across three domains, image data (Fashion-MNIST VAE, n = 60,000), bulk RNA-seq (TCGA-BRCA, n = 1,111), and a small-sample stress test (Alzheimer's gene expression, n = 113), shows that D_Sigma/delta consistently distinguishes structure-discarding from structure-preserving generators in cases where standard marginal diagnostics show little separation, confirming that covariance-level fidelity provides information orthogonal to existing evaluation metrics across domains and sample sizes.
2026-03-17T18:24:31Z
44 pages, 25 figures. Extended version of paper accepted at MathAI 2026 (International Conference on Mathematics of Artificial Intelligence), March 30 - April 3, 2026
Nazia Riasat

http://arxiv.org/abs/2601.16022v2
Approximate Likelihood-Based Inference for Spatial Generalized Linear Mixed Models
2026-05-18T14:52:27Z
We study maximum likelihood estimation for spatial generalized linear mixed models with Gaussian process approximations using a stochastic Newton-Raphson algorithm. We consider two Gaussian process approximations in this context: spectral Gaussian process approximations and stochastic partial differential equations (SPDE). We refine the stochastic maximum likelihood algorithm, proposing a new stopping criterion that terminates efficiently and prevents long runs of sampling in the stationary post-convergence phase, together with a Monte Carlo estimator of fixed-effect standard errors. We run a series of simulation comparisons of spatial statistical models alongside the popular Bayesian integrated nested Laplace approximation method, which incorporates SPDE. We show that HSGP provides nominal coverage of fixed and random effect parameters with smooth latent fields, but performance degrades for rough fields. SPDE in a stochastic maximum likelihood framework maintains nominal coverage and matches or improves upon the performance of Bayesian integrated nested Laplace approximation.
2026-01-22T14:46:35Z
Samuel I. Watson, Yixin Wang, Emanuele Giorgi

http://arxiv.org/abs/2604.24660v2
Nonparametric Instrumental Variable Analysis Without Structural Equations: Debiased Inference on Functionals of Inverse Problems with No Solutions
2026-05-18T14:35:36Z
We consider debiased inference on finite-dimensional functionals of infinite-dimensional least-squares solutions to inverse problems as a way to avoid having to assume exact solutions exist. Such assumptions are substantive and not innocuous, and their failure may imperil inference when we impose them on the statistical model. Our approach instead allows us to conduct inference on a quantity that is defined regardless of solutions existing and coincides with the usual estimands when they do. For the case of instrumental variables, this means we can motivate the analysis with structural models but these do not need to hold exactly for the semiparametric inferential procedure to remain valid.
2026-04-27T16:24:35Z
Zikai Shen, Nathan Kallus, Dimitri Meunier, Houssam Zenati, Arthur Gretton, Aurélien Bibaut

http://arxiv.org/abs/2406.16859v2
On the extensions of the Chatterjee-Spearman test
2026-05-18T14:04:39Z
Chatterjee (2021) introduced a novel independence test that is rank-based, asymptotically normal and consistent against all alternatives. One limitation of Chatterjee's test is its low statistical power for detecting monotonic relationships. To address this limitation, in our previous work (Zhang, 2024, Commun. Stat. - Theory Methods), we proposed to combine Chatterjee's and Spearman's correlations into a max-type test and established the asymptotic joint normality. This work examines three key extensions of the combined test. First, motivated by its original asymmetric form, we extend the Chatterjee-Spearman test to a symmetric version, and derive the asymptotic null distribution of the symmetrized statistic.
Second, we investigate the relationships between Chatterjee's correlation and other popular rank correlations, including Kendall's tau and quadrant correlation. We demonstrate that, under independence, Chatterjee's correlation and any of these rank correlations are asymptotically jointly normal and independent. Simulation studies demonstrate that the Chatterjee-Kendall test has better power than the Chatterjee-Spearman test. Finally, we explore two possible extensions to the multivariate case. These extensions expand the applicability of the rank-based combined tests to a broader range of scenarios.
2024-06-24T17:59:33Z
46 pages, 8 figures
Qingyang Zhang

http://arxiv.org/abs/2504.03035v2
High-dimensional ridge regression with random features for non-identically distributed data with a variance profile
2026-05-18T12:53:10Z
Random feature ridge regression is often analyzed in the high-dimensional regime under the homogeneous sampling model $x_i=Σ^{1/2}x_i'$, where the vectors $x_i'$ have iid entries and the same covariance matrix $Σ$ is shared by all samples. In this paper, we move beyond this setting and study non-identically distributed data through a variance-profile model in which the training and test covariates have row-dependent diagonal covariance matrices $Σ_i=\operatorname{diag}(γ_{i1}^2,\ldots,γ_{ip}^2)$ and $\widetilde{Σ}_i=\operatorname{diag}(\tilde{γ}_{i1}^2,\ldots,\tilde{γ}_{ip}^2)$. Our main contribution is the derivation of asymptotic equivalents for the training and test risks of ridge regression with random features when $n$, $p$, and $m$ grow proportionally. The first set of equivalents is obtained by combining the linear-plus-chaos approximation with traffic-probability arguments, whereas the second set is deterministic and follows from operator-valued free probability through an amalgamation-over-the-diagonal argument. These equivalents are sharp in numerical experiments.
They also reveal how heterogeneous variance profiles, including mixture-type profiles inspired by MNIST, can modify generalization and exhibit double-descent behavior when the ridge parameter is small.
2025-04-03T21:20:08Z
Issa-Mbenard Dabo, Jérémie Bigot