https://arxiv.org/api/tdAoeiWk0dpgRKd0rSLzpJS5Zjg
Feed updated: 2026-06-10T13:27:04Z

http://arxiv.org/abs/2511.18731v2
Can discrete-time analyses be trusted for stepped wedge trials with continuous recruitment?
Updated: 2026-06-02T19:39:05Z
In stepped wedge cluster randomized trials (SW-CRTs), interventions are sequentially rolled out to clusters over multiple periods. It is common practice to analyze data from SW-CRTs using linear mixed models that treat time as discrete. However, a recent systematic review found that 95.1% of cross-sectional SW-CRTs recruit individuals continuously over time. Despite the high prevalence of such continuous recruitment designs, there has been limited guidance on how to draw model-robust inference when analyzing such SW-CRTs. In this article, we investigate through simulations the implications of using such discrete-time linear mixed models in the case of continuous recruitment designs with a continuous outcome. Specifically, in the data-generating process, we characterize continuous recruitment using a continuous-time exponential decay correlation structure in the presence or absence of a fixed continuous period effect, addressing scenarios both with and without a random or exposure-time-dependent intervention effect. We then analyze the simulated data under three popular discrete-time working correlation structures: simple exchangeable, nested exchangeable, and discrete-time exponential decay, with a robust sandwich variance estimator. Our results demonstrate that discrete-time analysis often yields negligible bias and that the robust variance estimator with the Mancl and DeRouen correction consistently achieves nominal coverage and type I error rate. One important exception occurs when recruitment patterns vary systematically between control and intervention periods, where discrete-time analysis leads to slightly biased estimates.
Finally, we illustrate these findings by reanalyzing a completed SW-CRT.
Published: 2025-11-24T03:53:05Z
Authors: Hao Wang, Guangyu Tong, Heather Allore, Kelsey L. Grantham, Monica Taljaard, Fan Li

http://arxiv.org/abs/2604.14575v2
Generative Augmented Inference
Updated: 2026-06-02T18:59:00Z
Large language models enable inexpensive AI-generated annotations, but using them reliably for causal inference remains challenging. Naively pooling AI and human data induces bias, while existing methods such as Prediction-Powered Inference (PPI; Angelopoulos et al., 2023a) treat AI outputs as proxies of true labels -- an assumption often violated for generative model outputs in practice. We propose Generative Augmented Inference (GAI), a framework that treats AI outputs as general, potentially high-dimensional informative features for learning human labels rather than as surrogates. GAI flexibly models this relationship using nonparametric methods, enabling consistent estimation and valid inference from combined human and AI data. We establish asymptotic normality and show that, under random labeling, GAI strictly improves asymptotic efficiency over human-data-only estimation whenever AI outputs are informative for true labels. Empirical studies on real-world datasets demonstrate that GAI significantly reduces estimation error and improves confidence interval quality across diverse generative data sources relative to human-only and PPI-based estimation.
Published: 2026-04-16T03:10:37Z
Authors: Cheng Lu, Mengxin Wang, Dennis J. Zhang, Heng Zhang

http://arxiv.org/abs/2606.04134v1
Optimal Treatment Policy Estimation for Recurrent Events with a Competing Terminal Event: An Instrumented Difference-in-Differences Approach
Updated: 2026-06-02T18:45:36Z
Learning reproducible and generalizable optimal treatment policies for chronic diseases requires large, representative populations with long-term follow-up. Administrative health data provide a natural starting point, but their use is often limited by unmeasured confounding.
We address this by proposing a novel framework based on Instrumented Difference-in-Differences (iDID) to estimate optimal policies for recurrent event outcomes subject to a terminating event. The iDID design is particularly useful in this setting because it leverages policy-induced treatment variation while allowing for persistent unmeasured differences across populations, relying on assumptions that are more plausible for administrative health data than those required by conventional IV or DID approaches. A key feature of our approach is that it explicitly addresses the fundamental challenge of avoiding policies that trivially reduce recurrent adverse events by increasing mortality. We derive two distinct Inverse Probability Weighted identifications and develop a multiply robust estimator that achieves consistency if any one of several subsets of nuisance models is correctly specified. We establish the estimator's consistency and asymptotic normality through large-sample theory and demonstrate its superior finite-sample performance over existing methods via simulation. Finally, we apply this framework to a national Medicare dataset to optimize first-line Type 2 Diabetes strategies, specifically targeting the minimization of disease-related hospitalizations while accounting for survival.
Published: 2026-06-02T18:45:36Z
Authors: Ritoban Kundu, James Flory, Sean Hennessy, Ashkan Ertefaie

http://arxiv.org/abs/2606.04128v1
On prediction-powered inference for quantile regression via convolution smoothing
Updated: 2026-06-02T18:36:46Z
This paper studies quantile regression in a data-limited setting where the gold-standard outcome is available only for a limited number of observations, whereas a surrogate outcome is widely available. Such settings are becoming increasingly common with the availability of low-cost predictions from modern AI, motivating a growing line of research on "prediction-powered inference" for improved statistical inference.
Naively extending this framework to quantile regression, however, raises two challenges: computational difficulties due to the discontinuity of the subgradient, and overly conservative confidence intervals. To address these issues, we propose a convolution-based smoothing of the check-loss objective and develop two variants of the estimator. The proposed estimators are computationally tractable, and our numerical studies show that they mitigate overcoverage. As a theoretical contribution, we establish the asymptotic distributions of the proposed estimators under a possibly misspecified linear quantile regression model. We further propose an ensemble of the two estimators and illustrate the proposed methods through simulations and an application to a local housing dataset.
Published: 2026-06-02T18:36:46Z
Comment: 32 pages, 8 figures
Authors: Shota Takeishi, Jimin Ding, Xuming He

http://arxiv.org/abs/2508.05866v2
Identifiability and Inference for Generalized Latent Factor Models
Updated: 2026-06-02T18:27:02Z
Generalized latent factor analysis not only provides a useful latent embedding approach in statistics and machine learning, but also serves as a widely used tool across various scientific fields, such as psychometrics, econometrics, and social sciences. Ensuring the identifiability of latent factors and the loading matrix is essential for the model's estimability and interpretability, and various identifiability conditions have been employed by practitioners. However, fundamental statistical inference issues for latent factors and factor loadings under commonly used identifiability conditions remain largely unaddressed, especially for correlated factors and/or a non-orthogonal loading matrix. In this work, we focus on the maximum likelihood estimation for generalized factor models and establish statistical inference properties under popularly used identifiability conditions.
The developed theory is further illustrated through numerical simulations and an application to a personality assessment dataset.
Published: 2025-08-07T21:36:38Z
Comment: 36 pages, 4 figures
Authors: Chengyu Cui, Gongjun Xu

http://arxiv.org/abs/2606.04114v1
Global Warming Has Been Accelerating Since At Least 1990
Updated: 2026-06-02T18:21:38Z
We investigate acceleration in global temperature, defining acceleration as a supralinear (greater-than-linear) increase over time. We develop a statistical framework to test for supralinear trends using a linearithmic specification. Our results indicate evidence of acceleration in global temperature since at least 1990, with significance strengthening as more recent data are included. In contrast, evidence for acceleration under a quadratic specification is significant only in the longest estimation window. We also show that, if the true temperature trend is supralinear, standard break-point tests will eventually detect changes in the slope of a linear trend model, which may explain reported structural breaks in global temperature trends.
Published: 2026-06-02T18:21:38Z
Authors: J. Eduardo Vera-Valdes

http://arxiv.org/abs/2606.03961v1
A Neural Estimation Framework for Aggregated Relational Data under Intractable Likelihoods
Updated: 2026-06-02T17:49:58Z
Aggregated relational data (ARD) consists of survey responses to questions of the form "how many people do you know who $X$?" and is widely used in survey statistics for indirect inference about populations and social networks. The dominant ARD inference target is hidden-population size estimation via the Network Scale-Up Method (NSUM), but ARD is also used for personal-network-size estimation, mixing-pattern recovery, and inference about latent network structure. Bayesian inference for ARD almost universally assumes that, conditional on a respondent's degree, the counts reported for different subpopulations are independent.
There are, however, reasons to question this assumption, as homophily, latent-space clustering, and imperfect recall may all induce cross-population dependence. We develop a simulation-based neural estimation framework for ARD which requires only a simulator, so it can be applied to generative models whose likelihood cannot be written down or efficiently evaluated. The framework trains a permutation-invariant neural Bayes estimator that returns, for each marginal parameter, a posterior median and a 95% credible interval, by minimising a multi-quantile pinball loss with a cumulative-gap construction that rules out quantile crossing by design. We demonstrate the framework on three structurally distinct intractable extensions of NSUM-style ARD inference: a stochastic block model, a latent-space model, and a recall-subset model. We apply the framework to the ARD Household Survey collected in Rwanda. The framework provides inference on any new survey drawn from the training distribution, and extends the reach of ARD modelling to network-structure and cognitive-process assumptions beyond those currently accessible to likelihood-based inference.
Published: 2026-06-02T17:49:58Z
Comment: 33 pages, 3 figures, 2 tables
Authors: Rowland G Seymour, Joseph Marsh

http://arxiv.org/abs/2404.06803v3
A new way to evaluate G-Wishart normalising constants via Fourier analysis
Updated: 2026-06-02T16:52:26Z
The G-Wishart distribution is a core component for the Bayesian analysis of Gaussian graphical models as the conjugate prior for the precision matrix. Evaluating the marginal likelihood of such models usually requires computing high-dimensional integrals to determine the G-Wishart normalising constant. Closed-form results are known for decomposable or chordal graphs, while an explicit representation as a formal series expansion has been derived recently for general graphs. The nested infinite sums, however, do not lend themselves to computation and so remain of limited practical value.
Borrowing techniques from random matrix theory and Fourier analysis, we provide novel exact results well suited to the numerical evaluation of the normalising constant for classes of graphs beyond chordal graphs. We additionally develop a Monte Carlo scheme for general graphs, which can be orders of magnitude more efficient than current approaches.
Published: 2024-04-10T07:49:48Z
Authors: Ching Wong, Giusi Moffa, Jack Kuipers

http://arxiv.org/abs/2501.01324v4
Fast data inversion for high-dimensional Ornstein-Uhlenbeck processes from noisy measurements
Updated: 2026-06-02T16:49:03Z
In this work, we develop a scalable approach for a flexible latent factor model for high-dimensional dynamical systems. Each latent factor process has its own correlation and variance parameters, and the orthogonal factor loading matrix can be either fixed or estimated. We utilize an orthogonal factor loading matrix that avoids inverting the posterior covariance matrix at each time step of the Kalman filter, and derive closed-form expressions in an expectation-maximization algorithm for parameter estimation, which substantially reduces the computational complexity without approximation. Our approach has several applications, including noise filtering for high-dimensional time series, estimating nonseparable covariance structure between different time series, and estimating latent physical processes from real-world measurements. Extensive simulation studies illustrate higher accuracy and scalability of our approach compared to alternatives. Furthermore, when we apply our method to geodetic measurements from the Cascadia region to estimate slow slip events, the estimated slip agrees better with independently measured seismic data of tremor events.
The substantial acceleration from our method enables the use of massive noisy data for geological hazard quantification and other applications.
Published: 2025-01-02T16:25:57Z
Authors: Yizi Lin, Xubo Liu, Paul Segall, Mengyang Gu

http://arxiv.org/abs/2606.03880v1
Principal Components Decomposition of Fraction of Variance Explained in High Dimensional Linear Models with Strong Correlation
Updated: 2026-06-02T16:47:00Z
The fraction of variance explained (FVE) in a linear model quantifies the extent to which predictors account for outcome variability. In high-dimensional settings, where traditional FVE estimators do not apply, modern FVE estimators such as GWASH or the linear mixed-effects model estimated by restricted maximum likelihood (LMM-REML) struggle with strong correlation among predictors, often found, for example, in brain imaging data. We propose a decomposition framework that partitions the FVE into two components: a low-dimensional component capturing the strong correlation, estimable by low-dimensional methods, and a high-dimensional component with remaining weak correlation, estimable by high-dimensional methods. Simulations demonstrate that decomposing dominant principal components (PCs) and estimating the high-dimensional FVE using GWASH or LMM-REML reduces bias relative to directly applying GWASH or LMM-REML. Our method shows consistent performance asymptotically as both the number of predictors and the number of samples increase.
We illustrate the method in an analysis of the Adolescent Brain Cognitive Development (ABCD) brain imaging dataset, capturing nuanced heritability signals in the FVE of cognitive measures predicted by high-resolution brain imaging data.
Published: 2026-06-02T16:47:00Z
Authors: Man Luo, Chun Chieh Fan, David Azriel, Armin Schwartzman

http://arxiv.org/abs/2406.05242v4
Markov chain Monte Carlo without evaluating the target: an auxiliary variable approach
Updated: 2026-06-02T16:27:25Z
In sampling tasks, it is common for target distributions to be known up to a normalizing constant. However, in many situations, even evaluating the unnormalized distribution can be costly or infeasible. This issue arises in scenarios such as sampling from the Bayesian posterior for tall datasets and the 'doubly-intractable' distributions. In this paper, we begin by observing that seemingly different Markov chain Monte Carlo (MCMC) algorithms, such as the exchange algorithm, PoissonMH, and TunaMH, can be unified under a simple common procedure. We then extend this procedure into a novel framework that allows the use of auxiliary variables in both the proposal and the acceptance-rejection step. Several new MCMC algorithms that use estimated gradients to guide the proposal moves emerge from this framework. They have demonstrated significantly better performance than existing methods on both synthetic and real datasets. We also develop theory for the new framework and use it to simplify and extend results for existing algorithms. The code to reproduce the experimental results can be found at https://github.com/ywwes26/Auxiliary-MCMC.
Published: 2024-06-07T20:06:23Z
Comment: ICML 2026 oral. Code: https://github.com/ywwes26/Auxiliary-MCMC
Authors: Wei Yuan, Guanyang Wang

http://arxiv.org/abs/2606.03828v1
Network Time Series Models for Multivariate Volatility Forecasting
Updated: 2026-06-02T16:11:50Z
Realized volatility has become a standard tool for measuring latent variation in financial assets, and its forecasting is crucial for a wide range of financial applications.
We propose a network-based model for forecasting a vector of realized variance processes through the heterogeneous autoregressive (HAR) approach. The generalised network HAR (GNHAR) model incorporates cross-sectional spillovers through a directed graph inferred from Granger-causality tests or connectedness indices, yielding a parsimonious multivariate time series model specification. In an application to ten equities over tranquil and crisis regimes, the proposed GNHAR model improves upon common HAR model benchmarks under both short- and long-term forecasting. We also compare the network-based specification when the jump-continuous decomposition or node-specific option-implied variances are considered. Finally, unlike overparameterised models, our approach yields a concise set of parameters that track the strengthening or weakening of cross-market dependencies, providing a time-varying quantitative assessment of market stability.
Published: 2026-06-02T16:11:50Z
Authors: Chiara Boetti, Matthew A. Nunes

http://arxiv.org/abs/2411.15691v4
Semi-supervised inference using unlabeled summary statistics
Updated: 2026-06-02T16:02:11Z
Semi-supervised inference assumes access to a labeled dataset together with a large unlabeled dataset in which the outcome variable is missing, and it is widely used to improve statistical efficiency and support generalizability across populations. In many modern applications, however, individual-level unlabeled data may not be directly accessible due to privacy restrictions, data-sharing limits, or storage constraints, while summary statistics such as sample means and covariances from the unlabeled population are often available. In this work, we study this constrained semi-supervised setting where, in addition to labeled data with individualized observations, auxiliary information from the unlabeled population is available only through summary statistics.
We propose new semi-supervised inference methods for mean estimation under both covariate-independent and covariate-dependent labeling and show that unlabeled summaries can still improve efficiency and help correct selection bias. The proposed methods apply in high dimensions and are robust to model misspecification. Valid inference is obtained under sparsity conditions comparable to those required by semi-supervised methods that assume access to individual-level unlabeled samples. Our approach relies on a specialized cross-fitting procedure, where sample splitting is applied only to the labeled data, which removes the need for individualized unlabeled covariates. We further extend this framework to average treatment effect estimation, enabling generalizability and transportability of causal conclusions in this constrained semi-supervised setting.
Published: 2024-11-24T02:48:04Z
Authors: Facheng Yu, Zhen Qi, Yuqian Zhang

http://arxiv.org/abs/2606.03805v1
Regularization in Paired Comparison Models via Pseudo-Games and Phantom Players
Updated: 2026-06-02T15:50:47Z
Paired comparison models are useful for estimating latent abilities or preferences from binary outcomes, but maximum likelihood estimation can be unstable or fail when the comparison graph is disconnected or nearly separated. Ridge regularization addresses these difficulties by shrinking ability parameters toward a common center, but it can obscure the simple likelihood interpretation that makes Bradley-Terry and Thurstone-Mosteller models attractive to practitioners. This paper describes two data-augmentation perspectives on regularization. The first adds fractional pseudo-games between every pair of competitors. The second adds a fixed-strength phantom player and gives each real competitor a weighted pseudo-win and pseudo-loss against that player. Both approaches yield finite, shrunken estimates; the phantom-player construction also resolves the usual location nonidentifiability without an explicit linear constraint.
For the Bradley-Terry model, the two augmentations lead to transparent penalty functions that can be compared directly with ridge penalties. An application to the 2025 Major League Baseball regular season illustrates that tuned pseudo-game and phantom-player regularization can closely reproduce ridge-regularized strength estimates while retaining an intuitive augmented-data representation.
Published: 2026-06-02T15:50:47Z
Comment: 22 pages, 4 figures, 2 tables
Authors: Mark E. Glickman

http://arxiv.org/abs/2606.03786v1
Disentangling conviction and conformity: a Bayesian ideal point model of voting behaviour in online debates
Updated: 2026-06-02T15:38:36Z
Online debate platforms offer a unique window into the mechanisms driving opinion formation: they capture both explicit political preferences and the peer environment in which those preferences are expressed. In this work, I develop a Bayesian logistic regression model, inspired by ideal point models from political science, to disentangle two competing mechanisms of voting behaviour in online debates: conviction, driven by prior ideological beliefs, and conformity, driven by peer influence. I apply this framework to the Debate.org dataset, comprising approximately 341k votes across 78k debates on 48 socio-political topics. As the debate platform does not provide predefined topic labels for each debate, I infer the topic and stance from the debate text using large language models, and, with a Bayesian approach, I quantify the relative contribution of each mechanism. I find substantial heterogeneity across topics: conviction dominates on issues tied to personal freedoms and lifestyle choices, such as drug legalisation and legalised prostitution, while conformity dominates on several topics widely regarded as paradigmatic cases of moral conviction, including abortion, gun rights, and global warming. These results have implications for the stability of online political discourse and the design of deliberative platforms.
Published: 2026-06-02T15:38:36Z
Authors: Elena Candellone