http://arxiv.org/abs/2605.21041v1 Conditioning Gaussian Processes on Almost Anything 2026-05-20T11:23:42Z

Gaussian processes (GPs) offer a principled probabilistic model over functions, but exact inference is restricted to the linear-Gaussian regime. We establish an explicit equivalence between GPs and a class of linear diffusion models, recasting predictive sampling as an ODE with closed-form Gaussian dynamics and a likelihood-dependent guidance term that admits a simple Monte Carlo approximation. In the linear-Gaussian setting, we recover standard GP conditioning exactly; beyond conjugacy, the same machinery handles any conditioning statement admitting point-wise likelihood evaluation -- including non-linear physics, and, for the first time, natural language via large language models. Whitening isolates the irreducible non-Gaussian dynamics, minimising Wasserstein-2 transport cost and eliminating numerical stiffness. The result is a general-purpose GP inference scheme requiring no bespoke derivations. Together, these results provide a general mechanism for incorporating the full richness of real-world knowledge as conditioning information, opening a new frontier for the probabilistic modelling of real-world problems.
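
For orientation, the linear-Gaussian baseline that the abstract says is recovered exactly can be sketched in a few lines. This is standard closed-form GP regression, not the paper's diffusion/ODE scheme. The squared-exponential kernel, the noise variance, the toy data and the function names are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b (assumed kernel)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and covariance of a zero-mean GP with Gaussian observation noise."""
    k_xx = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_sx = rbf_kernel(x_test, x_train)
    k_ss = rbf_kernel(x_test, x_test)
    chol = np.linalg.cholesky(k_xx)
    # alpha = (K_xx + noise_var * I)^{-1} y via two triangular-system solves
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_sx @ alpha
    v = np.linalg.solve(chol, k_sx.T)
    cov = k_ss - v.T @ v
    return mean, cov

# Toy data (hypothetical): three noisy-free evaluations of sin(x)
x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
x_test = np.linspace(-2.0, 2.0, 5)
mean, cov = gp_posterior(x_train, y_train, x_test)
```

The Cholesky factorisation is used instead of an explicit matrix inverse for numerical stability. Any non-Gaussian likelihood falls outside this closed form, which is the regime the abstract targets.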

2026-05-20T11:23:42Z Henry Moss Lachlan Astfalck Thomas Cowperthwaite Colin Doumont Sam Willis Philipp Hennig Christopher Nemeth Andrew Zammit-Mangion http://arxiv.org/abs/2605.20943v1 Missing data and cluster graphs: cluster-level missingness vs variable-level missingness 2026-05-20T09:29:24Z

Missing data is pervasive in many scientific domains such as public health, environmental science, and the social sciences. Recoverability from missing data is typically studied using fully specified variable-level missingness models, even though in many applications only coarse structural information is available, for instance when variables are grouped into clusters because of limited knowledge or for interpretability. In this paper, we investigate recoverability from such abstract representations. We introduce two classes of cluster-based missingness graphs: the m-C-DMG, which retains variable-specific missingness indicators, and the cm-C-DMG, which aggregates missingness mechanisms at the cluster level. We formalize the notion of compatibility between these abstract graphs and underlying variable-level missingness models, and study how this abstraction affects the recoverability of probabilistic and causal queries. In particular, we give graphical conditions for recovering the joint distribution as well as graphical conditions for recovering a macro causal effect. Overall, our results clarify when cluster-level missingness information is sufficient for valid inference, and when finer-grained modeling is necessary.

2026-05-20T09:29:24Z Willow Scott Eugenio Valdano Charles Assaad http://arxiv.org/abs/2601.14991v2 Consistency of Honest Decision Trees and Random Forests 2026-05-20T09:24:04Z

We study various types of consistency of honest decision trees and random forests in the regression setting. In contrast to related literature, our proofs are elementary and follow the classical arguments used for smoothing methods. Under mild regularity conditions on the regression function and data distribution, we establish weak and almost sure convergence of honest trees and honest forest averages to the true regression function, and moreover we obtain uniform convergence over compact covariate domains. The framework naturally accommodates ensemble variants based on subsampling and also a two-stage bootstrap sampling scheme. Our treatment synthesizes and simplifies existing analyses, in particular recovering several results as special cases. The elementary nature of the arguments clarifies the close relationship between data-adaptive partitioning and kernel-type methods, providing an accessible approach to understanding the asymptotic behavior of tree-based methods.
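
A minimal sketch of what "honest" usually means may help here. It assumes the common definition in which one subsample chooses the partition and a disjoint subsample estimates the leaf values. The single-split tree (a stump), the 50/50 split, the toy data and the function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def best_split(x, y):
    """Threshold minimising the summed squared error of a two-leaf partition."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = 0.5 * (xs[i - 1] + xs[i]), sse
    return best_threshold

def fit_honest_stump(x, y, rng):
    """Choose the split on one half of the data, estimate leaf means on the other half."""
    idx = rng.permutation(len(x))
    half = len(x) // 2
    structure_idx, estimate_idx = idx[:half], idx[half:]
    threshold = best_split(x[structure_idx], y[structure_idx])
    x_est, y_est = x[estimate_idx], y[estimate_idx]
    left_mask = x_est <= threshold
    fallback = y_est.mean()  # used only if a leaf receives no estimation points
    left_value = y_est[left_mask].mean() if left_mask.any() else fallback
    right_value = y_est[~left_mask].mean() if (~left_mask).any() else fallback
    return threshold, left_value, right_value

# Toy regression function (hypothetical): a step at zero plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=400)
y = np.where(x > 0.0, 1.0, -1.0) + 0.1 * rng.normal(size=400)
threshold, left_value, right_value = fit_honest_stump(x, y, rng)
```

Because the leaf averages never see the responses that selected the split, each leaf behaves like a local average over a data-dependent cell, which is the link to kernel-type smoothing methods that the abstract mentions.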

2026-01-21T13:40:36Z Martin Bladt Rasmus Frigaard Lemvig http://arxiv.org/abs/2511.21836v2 A simple and powerful test of vaccine waning 2026-05-20T09:07:53Z

Determining whether vaccine efficacy wanes is important for individual and public decision making. Yet, quantification of waning is a subtle task. The classical approaches cannot be interpreted as measures of declining efficacy unless we impose unreasonable assumptions. Recently, formal causal estimands designed to quantify vaccine waning have been proposed. These estimands can be bounded under weaker assumptions, but the bounds are often too wide to make claims about the presence of waning. We propose a different approach: a formal test to assess whether a treatment effect is constant over time at the individual level. This test provides a considerable power gain over existing approaches and is valid under interpretable assumptions in vaccine trials. We illustrate the increase in power through real and simulated examples, using three different approaches to compute the test statistics. Two of these approaches are based solely on summary data, accessible from existing clinical trials. Beyond our test, we also give new results that bound the waning effect. We use our methods to reanalyze data from a randomized controlled trial of the BNT162b2 COVID-19 vaccine. While prior analysis did not establish waning, our test rejects the null hypothesis of no waning.

2025-11-26T19:06:15Z Gellért Perényi Matias Janvin Mats J. Stensrud http://arxiv.org/abs/2605.20817v1 Topics in Nonparametric Bayesian Statistics 2026-05-20T07:13:49Z

The intersection set of Bayesian and nonparametric statistics was almost empty until about 1973, but now is growing at a healthy rate. This chapter, for the {\it Highly Structured Stochastic Systems} book (Oxford University Press, 2003), gives an overview of various theoretical and applied research themes inside this field, partly complementing and extending recent reviews of Dey, Müller and Sinha (1998) and Walker, Damien, Laud and Smith (1999). The intention is not to be complete or exhaustive, but rather to touch on research areas of interest, partly by example.

2026-05-20T07:13:49Z 23 pages, no figures. Published, in modified form, as Chapter 15 in the book `Highly Structured Stochastic Systems' (Oxford University Press, 2003, eds. P.J. Green, N.L. Hjort, S. Richardson) Nils Lid Hjort http://arxiv.org/abs/2605.20806v1 Evaluation of the number of clusters in a data set using $p$-values from Multiple Tests of Hypotheses 2026-05-20T06:58:04Z

This paper proposes a novel, nonparametric, interpoint distance-based measure to investigate whether any groups exist in a given data set and, if so, how many there are in total. It is a cluster accuracy index applicable to data sets of arbitrary dimension, in association with any clustering algorithm that requires the number of groups to be specified a priori. We perform univariate, nonparametric, multiple statistical tests of hypotheses, where as many dependent tests as the sample size are carried out using the interpoint distances. Their $p$-values are combined to reach a decision, which is taken in a step-wise process over the possible numbers of clusters. This reduces unnecessary computation compared with other accuracy measures from the literature. A data study establishes the proposed index's efficiency and superiority.

2026-05-20T06:58:04Z Communications in Statistics - Theory and Methods (2024), 53, 8878-8889 Soumita Modak 10.1080/03610926.2024.2309967 http://arxiv.org/abs/2605.20154v2 Component over Composite: Mitigating Type I Error Inflation when Imputing "Days Alive and at Home" 2026-05-20T06:56:01Z

Background: Days Alive and at Home (DAH) over a pre-defined follow-up period is a novel post-intervention composite outcome that combines data from at least three components: (i) initial length of hospital stay, (ii) length of total readmissions or other post-discharge care and (iii) mortality. Missing values bring unique challenges to the analysis of trials with the DAH outcome as the three components may have different rates of missingness caused by distinct missing data mechanisms. Current approaches define DAH as missing if any of the components are missing, and proceed with complete cases or Multiple Imputation (MI) of the composite. Methods: Through a simulation study motivated by the NOTACS trial, we compare several methods of handling missing data, including complete case analysis, MI of the composite, and MI of the components when the primary analysis is a Mann-Whitney-Wilcoxon test. Results: MI on the component level has good properties in terms of type I error control and power. We caution against the use of MI on the composite level with Predictive Mean Matching, which can lead to type I error inflation. Conclusions: Given the complex distributional characteristics of DAH, naive approaches, such as defining missingness on the composite level and directly imputing the composite with Predictive Mean Matching, can lead to type I error inflation. Imputing on the component level is recommended. Suggested future work includes imputation approaches that are compatible with more complex definitions of DAH, as well as recommendations for sensitivity analyses to the Missing at Random assumption.

2026-05-19T17:43:03Z Mia S. Tackney Sarah Dawson Letao Yuan Dominique-Laurent Couturier Sofia S. Villar http://arxiv.org/abs/2605.20767v1 The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study 2026-05-20T06:09:41Z

Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes--attributes that should remain invariant under intervention--to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.

2026-05-20T06:09:41Z Victoria Lin Taedong Yun Maja Matarić John Canny Arthur Gretton Alexander D'Amour http://arxiv.org/abs/2605.20710v1 Assessing Estimate of CATE from Observational Data via an RCT Study 2026-05-20T05:07:54Z

Conditional average treatment effects (CATEs) are increasingly estimated from observational data and used to guide policy and individualized treatment decisions. Before such estimates can be trusted in practice, their predictive fitness needs to be assessed, yet observational data alone offer limited opportunities for doing so. We propose CATE Assessment via Fitness Evaluation (CAFE), a formal framework for directly assessing the goodness-of-fit of a CATE estimate learned from observational data, rather than the full underlying outcome model, using evidence from a randomized trial. CAFE partitions the trial covariate space according to estimated propensity scores (or the like) and compares observationally derived conditional treatment effects with group-level experimental averages. The framework accommodates a broad class of CATE learners, including parametric models and flexible machine learning methods such as causal forest and boosting. We establish theoretical guarantees under both the null and alternative hypotheses, and introduce a maximum-type extension to improve sensitivity to localized lack of fit. When both randomized trial and observational data are available, we further develop a two-stage procedure to detect the existence of unobserved confounders. Extensive numerical studies show the utility of the CAFE approach when assessing observationally derived CATE estimates.

2026-05-20T05:07:54Z 34 pages, 5 figures Bosen Cui Yuhong Yang http://arxiv.org/abs/2605.20692v1 Inferring infectiousness: a joint model of the within-host viral kinetics of SARS-CoV-2 2026-05-20T04:39:53Z

During an infectious disease outbreak, providing accurate answers to policy questions about transmission requires a detailed model of the natural history of infectiousness. Unfortunately, direct measures of infectiousness are generally unavailable. Instead, we often rely on indirect proxies, such as viral load measured by PCR or antigen tests, viral culture to detect replication-competent virus, or symptom onset, each of which reflects different aspects of viral dynamics or host response. However, these proxies vary in terms of the ease of collection, scalability, and their relationship to viral shedding and therefore underlying infectiousness. Here, we use data from five prospective, densely sampled cohorts with longitudinal data on multiple proxies of viral shedding for approximately 2,000 infections to develop a Bayesian joint model for the within-host viral kinetics of SARS-CoV-2 infection. Modeling the joint distribution allows us to infer the trajectory of infectious virus shedding -- the most direct correlate of infectiousness -- for individuals who contribute only PCR data, and to compute derived quantities that are inaccessible from any single proxy alone. These include the population-level probability and expected duration of ongoing infectiousness as a function of time since diagnosis, stratified by variant, vaccination status, and infection history; the residual risk of releasing an individual from isolation; and personalized, real-time estimates of infectiousness that are sequentially updated as new test results become available.

2026-05-20T04:39:53Z Christopher B. Boyer Stephen M. Kissler Seran Hakki Jakob Jonnerby Ajit Lalvani Marc Lipsitch http://arxiv.org/abs/2605.20681v1 Scale-Calibrated Median-of-Means for Robust Distributed Principal Component Analysis 2026-05-20T03:48:31Z

Distributed principal component analysis (PCA) produces node-level estimates of both a mean vector and a principal subspace. Robustly aggregating these heterogeneous objects requires a relative scale between mean error and subspace error. We study a scale-calibrated median-of-means estimator for this problem using the product geometry of Euclidean space and the Grassmann manifold. A node-level PCA expansion shows that the mean component has the usual linear influence, whereas the subspace component is an eigengap-weighted covariance perturbation. We prove a local reduction showing that the proposed product-manifold median-of-means estimator is asymptotically equivalent to a scaled spatial median of node influence errors. This yields fixed-node non-Gaussian limits, growing-node Gaussian limits with finite-block bias, and an explicit scale-dependent covariance formula. We propose robust block-scale and inference-optimal calibration rules, establish high-probability median-of-means bounds, characterize factorwise bad-node influence, and prove node-bootstrap validity. Simulations and large-scale single-cell RNA-seq data show that scale calibration adapts to eigengap-driven subspace uncertainty and provides a robust distributed PCA summary.
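
The basic median-of-means idea behind this abstract can be sketched in plain Euclidean space. This is the classical construction with a spatial (geometric) median of block means, not the paper's scale-calibrated product-manifold estimator. The block count, the contamination pattern and the function names are illustrative assumptions.

```python
import numpy as np

def geometric_median(points, n_iter=100, tol=1e-8):
    """Weiszfeld iterations for the spatial (geometric) median of the rows."""
    estimate = points.mean(axis=0)
    for _ in range(n_iter):
        dist = np.linalg.norm(points - estimate, axis=1)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        weights = 1.0 / dist
        new_estimate = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_estimate - estimate) < tol:
            return new_estimate
        estimate = new_estimate
    return estimate

def median_of_means(samples, n_blocks, rng):
    """Shuffle rows, average within blocks, then take the spatial median of block means."""
    shuffled = samples[rng.permutation(len(samples))]
    blocks = np.array_split(shuffled, n_blocks)
    block_means = np.stack([block.mean(axis=0) for block in blocks])
    return geometric_median(block_means)

# Toy data (hypothetical): Gaussian sample around (1, -2) with five grossly corrupted rows
rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0])
data = rng.normal(loc=true_mean, scale=1.0, size=(1000, 2))
data[:5] = 1e4
mom_estimate = median_of_means(data, n_blocks=50, rng=rng)
plain_mean = data.mean(axis=0)
```

The corrupted rows can spoil at most five of the fifty block means, so the spatial median stays close to the true mean while the plain average is dragged far away. In the distributed setting of the abstract, each node's estimate plays the role of a block mean.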

2026-05-20T03:48:31Z Kisung You http://arxiv.org/abs/2605.20634v1 New Confidence Regions for Linear Regression Parameters with Stationary-Ergodic Dependent Errors 2026-05-20T02:40:35Z

We develop joint confidence regions for linear regression coefficients when the regressors and errors are jointly stationary and ergodic with unspecified serial dependence. The method applies random smoothing, using an independent auxiliary sample and shrinking bandwidth, to a vector of regression and second-moment statistics. Under stationarity, ergodicity, and finite second moments, the estimator is asymptotically normal and yields Wald confidence regions and simultaneous confidence intervals without direct long-run variance estimation or a parametric dependence model. For implementation, we introduce a scaled estimator with data-driven bandwidth selection and a mild truncation that improves finite-sample stability. Simulations under ARMA, ARFIMA, copula-based Markov errors, and fractional Gaussian noise, with Gaussian and heavy-tailed margins, show near-nominal coverage and competitive region volumes relative to Newey-West HAC and MAC. A winter Beijing PM2.5 application illustrates the procedure. Keywords: Random smoothing, Joint inference, Confidence regions, Dependent errors, Long memory, Regression inference

2026-05-20T02:40:35Z Mous-Abou Hamadou Martial Longla Mathias Nthiani Muia Mahmud Hasan http://arxiv.org/abs/2605.20633v1 Application of Propensity Score Models and Causal Estimators in Observational Studies under Model Misspecification 2026-05-20T02:36:46Z

Propensity score (PS) methods are widely used in observational studies to reduce confounding and estimate causal treatment effects. However, the validity of PS-based causal estimators depends heavily on correct model specification, and model misspecification may lead to substantial bias and instability. In this study, we systematically evaluate the performance of commonly used causal estimators, including response surface modeling (RSM), inverse probability weighting (IPW), and augmented inverse probability weighting (AIPW), under varying levels of PS and outcome model misspecification. We compare classical logistic regression with several machine learning approaches for PS estimation, including random forests (RF), support vector machines (SVM), and linear discriminant analysis (LDA). We conduct extensive simulation studies under multiple scenarios defined by combinations of correctly specified and misspecified PS and outcome models, varying sample sizes, and different covariate correlation structures. Estimator performance is assessed using bias, absolute bias, root mean squared error, empirical standard error, and confidence interval width. Results demonstrate that AIPW consistently provides robust and stable estimates across most scenarios due to its doubly robust property, whereas IPW is highly sensitive to PS misspecification and to the unstable PS estimates produced by flexible machine learning methods. RSM performs well only when the outcome model is correctly specified. Real-world applications using the ACTG175 clinical trial and the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset further illustrate the practical implications of estimator choice and PS modeling strategy. Overall, our findings highlight the importance of integrating flexible machine learning approaches within doubly robust frameworks to improve causal effect estimation in observational studies.
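
The IPW and AIPW estimators compared in this abstract can be written down compactly in their textbook forms. The sketch below illustrates one side of the doubly robust property: with a correct propensity score, AIPW remains close to the truth even when the outcome models are crude. The simulated data-generating process, the use of the true propensity score and the constant outcome models are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np

def ipw_ate(y, t, propensity):
    """Inverse probability weighting estimate of the average treatment effect."""
    return np.mean(t * y / propensity - (1 - t) * y / (1 - propensity))

def aipw_ate(y, t, propensity, mu1, mu0):
    """Augmented IPW: outcome-model predictions plus weighted residual corrections."""
    treated_term = mu1 + t * (y - mu1) / propensity
    control_term = mu0 + (1 - t) * (y - mu0) / (1 - propensity)
    return np.mean(treated_term - control_term)

# Simulated observational data (hypothetical): x confounds treatment and outcome
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
true_propensity = 1.0 / (1.0 + np.exp(-0.5 * x))
t = rng.binomial(1, true_propensity)
true_ate = 2.0
y = 1.0 + x + true_ate * t + rng.normal(size=n)

# Deliberately misspecified outcome models: group-wise constants that ignore x
mu1_bad = np.full(n, y[t == 1].mean())
mu0_bad = np.full(n, y[t == 0].mean())

naive_difference = y[t == 1].mean() - y[t == 0].mean()  # confounded comparison
ipw_estimate = ipw_ate(y, t, true_propensity)
aipw_estimate = aipw_ate(y, t, true_propensity, mu1_bad, mu0_bad)
```

In this setup the naive difference in means is biased upward because treated units tend to have larger x, while both weighted estimators land near the true effect of 2. Replacing `true_propensity` with a poorly fitted score would expose the IPW sensitivity that the abstract reports.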

2026-05-20T02:36:46Z 24 pages, 4 figures Apu Chandra Das Sakib Salam Md Robiul Islam Talukder Ashim Chandra Das Antar Chandra Das Rakhi Chowdhury http://arxiv.org/abs/2504.05431v3 A Generalized Tangent Approximation based Variational Inference Framework for Strongly Super-Gaussian Likelihoods 2026-05-20T02:16:55Z

Variational inference, as an alternative to Markov chain Monte Carlo sampling, has played a transformative role in enabling scalable computation for complex Bayesian models. Nevertheless, existing approaches often depend on either rigid model-specific formulations or stochastic black-box optimization routines. Tangent approximation is a principled class of structured variational methods that exploits the geometry of the underlying probability model. However, its utility has largely been confined to logistic regression and related modeling regimes. In this article, we propose a novel variational framework based on tangent transformation for a broad class of probability models characterized by strongly super-Gaussian likelihoods. Our method leverages convex duality to construct tangent minorants of the log-likelihood, thereby inducing conjugacy with Gaussian priors over model parameters in an otherwise intractable setup. Under mild assumptions on the data-generating mechanism, we establish algorithmic convergence guarantees, a contribution that stands in contrast to the limited theoretical assurances typically available for black-box variational methods. Additionally, we derive near-minimax optimal bounds for the variational risk. Superior performance of our proposed methodology is illustrated on simulated and real-data scenarios that challenge state-of-the-art variational algorithms in terms of scalability and their ability to consistently capture complex underlying data structure.

2025-04-07T18:54:05Z 135 pages, 51 figures, 13 tables, Revision Submitted Somjit Roy Pritam Dey Debdeep Pati Bani K. Mallick http://arxiv.org/abs/2605.20621v1 Changepoint Detection in Categorical Time Series with Application to Daily Total Cloud Cover in Canada 2026-05-20T02:14:24Z

Changepoints are essential for homogenizing categorical time series and analyzing their trends and variations. The original total cloud cover in Canada was recorded hourly in tenths (or eighths), exhibiting inherent seasonality and serial correlation. Lu and Wang (2012) introduced an extended cumulative logit model to detect shifts in the annual frequencies of cloud cover conditions. While annual aggregation mitigates seasonality and serial correlation, it shortens the time series and may lead to overdispersion. This article introduces a marginalized transition model to detect a single changepoint in periodic and serially correlated categorical time series. The model captures serial dependence using a first-order Markov chain and enables category-specific changepoint specification. To enhance computational efficiency, we develop a new parameter estimation procedure for obtaining maximum likelihood estimates. A maximally selected likelihood ratio test statistic is then proposed to test for sudden changes in categorical time series, and the method is illustrated using daily total cloud cover observations recorded at 9 a.m. and 3 p.m. at Fort St. John Airport, British Columbia, Canada.

2026-05-20T02:14:24Z 31 pages, 16 figures, 5 tables; includes supplementary material; R/Rcpp code available in the linked GitHub repository Mo Li QiQi Lu XiaoLan Wang