https://arxiv.org/api/PGvqVOfy8oE2A0o7DYTIQsgH7v0 2026-03-20T10:53:49Z 34634 0 15 http://arxiv.org/abs/2603.19211v1 Synthetic Control Misconceptions: Recommendations for Practice 2026-03-19T17:56:42Z To estimate the causal effect of an intervention, researchers need to identify a control group that represents what might have happened to the treatment group in the absence of that intervention. This is challenging without a randomized experiment and further complicated when few units (possibly only one) are treated. Nevertheless, when data are available on units over time, synthetic control (SC) methods provide an opportunity to construct a valid comparison by differentially weighting control units that did not receive the treatment so that their resulting pre-treatment trajectory is similar to that of the treated unit. The hope is that this weighted "pseudo-counterfactual" can serve as a valid counterfactual in the post-treatment time period. Since its origin twenty years ago, SC has been used over 5,000 times in the literature (Web of Science, December 2025), leading to a proliferation of descriptions of the method and guidance on proper usage that is not always accurate and does not always align with what the original developers appear to have intended. As such, a number of accepted pieces of wisdom have arisen: (1) SC is robust to various implementations; (2) covariates are unnecessary, and (3) pre-treatment prediction error should guide model selection. We describe each in detail and conduct simulations that suggest, both for standard and alternative implementations of SC, that these purported truths are not supported by empirical evidence and thus actually represent misconceptions about best practice. Instead of relying on these misconceptions, we offer practical advice for more cautious implementation and interpretation of results. 
2026-03-19T17:56:42Z Robert Pickett Jennifer Hill Sarah Cowan http://arxiv.org/abs/2603.19160v1 PPI is the Difference Estimator: Recognizing the Survey Sampling Roots of Prediction-Powered Inference 2026-03-19T17:16:28Z Prediction-powered inference (PPI) is a rapidly growing framework for combining machine learning predictions with a small set of gold-standard labels to conduct valid statistical inference. In this article, I argue that the core estimators underlying PPI are equivalent to well-established estimators from the survey sampling literature dating back to the 1970s. Specifically, the PPI estimator for a population mean is algebraically equivalent to the difference estimator of Cassel et al. (1976), and PPI plus corresponds to the generalized regression (GREG) estimator of Sarndal et al. (2003). Recognizing this equivalence, I consider what part of PPI is inherited from a long-standing literature in statistics, what part is genuinely new, and where inferential claims require care. After introducing the two frameworks and establishing their equivalence, I break down where PPI diverges from model-assisted estimation, including differences in the mode of inference, the role of the unlabeled data pool, and the consequences of differential prediction error for subgroup estimands such as the average treatment effect. I then identify what each framework offers the other: PPI researchers can draw on the survey sampling literature's well-developed theory of calibration, optimal allocation, and design-based diagnostics, while survey sampling researchers can benefit from PPI's extensions to non-standard estimands and its accessible software ecosystem. The article closes with a call for integration between these two communities, motivated by the growing use of large language models as measurement instruments in applied research. 
2026-03-19T17:16:28Z Reagan Mozer http://arxiv.org/abs/2402.01972v2 Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts 2026-03-19T16:15:16Z We introduce efficient plug-in (EP) learning, a novel framework for the estimation of heterogeneous causal contrasts, such as the conditional average treatment effect and conditional relative risk. The EP-learning framework enjoys the same oracle efficiency as Neyman-orthogonal learning strategies, such as DR-learning and R-learning, while addressing some of their primary drawbacks: (i) their practical applicability can be hindered by non-convex loss functions; and (ii) they may suffer from poor performance and instability due to inverse probability weighting and pseudo-outcomes that violate bounds. To overcome these issues, the EP-learner leverages an efficient plug-in estimator of the population risk function for the causal contrast. In doing so, it inherits the stability of plug-in strategies such as T-learning, while improving on their efficiency. Under reasonable conditions, EP-learners based on empirical risk minimization are oracle-efficient, exhibiting asymptotic equivalence to the minimizer of an oracle-efficient one-step debiased estimator of the population risk function. In simulation experiments, we show that EP-learners of the conditional average treatment effect and conditional relative risk outperform state-of-the-art competitors, including the T-learner, R-learner, and DR-learner. Open-source implementations of the proposed methods are available in our \texttt{R} package \texttt{hte3}. 2024-02-03T00:47:50Z Lars van der Laan Marco Carone Alex Luedtke http://arxiv.org/abs/2603.12422v3 Mortgage Burnout and Selection Effects in Heterogeneous Cox Hazard Models 2026-03-19T16:12:09Z We study the aggregate hazard rate of a heterogeneous population whose individual event intensities are modeled as Cox (doubly stochastic) processes. 
In the deterministic hazard setting, the observed pool hazard is the survival-weighted mean of the individual hazards, and its time derivative equals the mean individual hazard drift minus a variance term. This yields a transparent structural explanation of burnout in mortgage pools. We extend this perspective to stochastic intensity models. The observed pool hazard remains a survival-weighted mean, but now evolves as an Ito process whose drift contains the mean drift of the individual hazards and a negative selection term driven by cross-sectional dispersion, together with a diffusion term inherited from the common factor. We formulate the general identity and discuss special cases relevant to mortgage prepayment modeling. 2026-03-12T20:09:44Z 8 pages. Added a subsection on the Cox model Andrew Lesniewski http://arxiv.org/abs/2603.19058v1 Adaptive Nonlinear Data Assimilation through P-Spline Triangular Measure Transport 2026-03-19T15:50:38Z Non-Gaussian statistics are a challenge for data assimilation. Linear methods oversimplify the problem, yet fully nonlinear methods are often too expensive to use in practice. The best solution usually lies between these extremes. Triangular measure transport offers a flexible framework for nonlinear data assimilation. Its success, however, depends on how the map is parametrized. Too much flexibility leads to overfitting; too little misses important structure. To address this balance, we develop an adaptation algorithm that selects a parsimonious parametrization automatically. Our method uses P-spline basis functions and an information criterion as a continuous measure of model complexity. This formulation enables gradient descent and allows efficient, fine-scale adaptation in high-dimensional settings. The resulting algorithm requires no hyperparameter tuning. It adjusts the transport map to the appropriate level of complexity based on the system statistics and ensemble size. 
We demonstrate its performance in nonlinear, non-Gaussian problems, including a high-dimensional distributed groundwater model. 2026-03-19T15:50:38Z 24 pages, 10 figures Berent Å. S. Lunde Maximilian Ramgraber http://arxiv.org/abs/2603.19051v1 Optimal Sample Size Calculation in Cost-Effectiveness Longitudinal Cluster Randomized Trials 2026-03-19T15:46:19Z Longitudinal cluster randomized trials (L-CRTs) are increasingly used to evaluate the cost-effectiveness of healthcare interventions across multiple assessment periods, yet design methods for powering these trials remain underdeveloped. Existing methods for cost-effectiveness analyses in cluster settings are limited to simple parallel-arm cluster randomized trials with a single follow-up assessment period. These methods cannot accommodate the complex correlation structures in L-CRTs conducted over multiple periods, which require differentiation between within-period and between-period correlations for both clinical and cost outcomes, as well as between-outcome correlations. Moreover, while substantial methodological advances have been made for the design of L-CRTs with univariate outcomes, none specifically address cost-effectiveness objectives where clinical and cost outcomes must be jointly modeled. We provide a design-stage framework for powering cost-effectiveness L-CRTs across three design variants: parallel-arm, crossover, and stepped wedge designs. We derive closed-form variance expressions for the generalized least squares estimator of the average incremental net monetary benefit under a bivariate linear mixed model. We propose a standardized ceiling ratio that adjusts willingness-to-pay for relative outcome variability to inform optimal design. We then develop local optimal designs that maximize statistical power under known correlation parameters and MaxiMin designs that ensure robust performance across parameter uncertainty for all three design variants. 
Through a real stepped wedge trial data example, we demonstrate the sample size calculation for testing intervention cost-effectiveness under local optimal and MaxiMin designs. 2026-03-19T15:46:19Z Hao Wang Jingxia Liu Drew B. Cameron Jiaqi Tong Donna Spiegelman Daniella Meeker Fan Li http://arxiv.org/abs/2603.04172v2 The Pivotal Information Criterion 2026-03-19T15:35:12Z The Bayesian and Akaike information criteria aim at finding a good balance between under- and over-fitting. They are extensively used every day by practitioners. Yet we contend they suffer from at least two afflictions: their penalty parameters, $\lambda=\log n$ and $\lambda=2$ respectively, are too small, leading to many false discoveries, and their inherent (best subset) discrete optimization is infeasible in high dimension. We alleviate these issues with the pivotal information criterion: PIC is defined as a continuous optimization problem, and the PIC penalty parameter $\lambda$ is selected at the detection boundary (under pure noise). PIC's choice of $\lambda$ is the quantile of a statistic that we prove to be (asymptotically) pivotal, provided the loss function is appropriately transformed. As a result, simulations show a phase transition in the probability of exact support recovery with PIC, a phenomenon studied with no noise in compressed sensing. Applied to real data, for similar predictive performance, PIC selects the least complex model among state-of-the-art learners. 2026-03-04T15:26:32Z Sylvain Sardy Maxime van Cutsem Sara van de Geer http://arxiv.org/abs/2507.04668v2 Forward Regression via Gram-Schmidt Orthogonalization for Ultra-High Dimensional Linear Models 2026-03-19T15:27:29Z Forward regression is a classical and effective tool for variable screening in ultra-high dimensional linear models, but its standard projection-based implementation can be computationally costly and numerically unstable when predictors are strongly collinear. 
Motivated by this limitation, we propose an orthogonalized forward regression procedure, implemented recursively through Gram-Schmidt updates, that ranks predictors according to their unique contributions after removing the effects of variables already selected. This approach preserves the interpretability of forward regression while substantially reducing the cost of repeated projections. We further develop a path-based model size selection rule using statistics computed directly from the forward sequence, thereby avoiding cross-validation and extensive tuning. The resulting method is particularly well suited to settings in which the number of predictors far exceeds the sample size and strong collinearity renders conventional forward fitting ineffective. Theoretically, we derive the optimal convergence rate for the proposed Gram-Schmidt forward regression, thereby extending existing results for projection-based forward regression, and further show that it enjoys the sure screening property and variable selection consistency under suitable conditions. Simulation studies and empirical examples demonstrate that it provides a favorable balance among computational efficiency, numerical stability, screening accuracy, and predictive performance, especially in highly correlated ultra-high dimensional settings. 2025-07-07T05:15:48Z Jialuo Chen Zhaoxing Gao Yifan Jiang Ruey S. Tsay http://arxiv.org/abs/2603.19005v1 AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science 2026-03-19T15:11:13Z Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. 
However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website here: https://agentds.org/ and open source datasets here: https://huggingface.co/datasets/lainmn/AgentDS . 2026-03-19T15:11:13Z An Luo Jin Du Xun Xian Robert Specht Fangqiao Tian Ganghua Wang Xuan Bi Charles Fleming Ashish Kundu Jayanth Srinivasa Mingyi Hong Rui Zhang Tianxi Li Galin Jones Jie Ding http://arxiv.org/abs/2603.18990v1 Distributed lag non-linear models with spatial effect modification using Laplacian P-splines 2026-03-19T14:58:49Z Distributed lag non-linear models (DLNMs) are a popular approach to flexibly model the effect of time-delayed exposures. Classical DLNMs specify a common exposure-lag-response relationship across geographical areas. However, this relationship might be altered by an effect modifier that differs between spatial units. 
Although some methods have been proposed to account for effect modification, their applicability is context-dependent. For example, a meta-analysis can account for heterogeneity between groups, but this technique requires sufficiently large study groups. This limitation is particularly relevant when working with count data, where small numbers of events are often encountered. In this paper, we review existing methods that allow for spatial effect modification for count-based outcomes and propose a Bayesian DLNM alternative method that accounts for the modifier through flexible interaction effects. Through the use of Laplacian P-splines, we provide a computationally fast estimation procedure by avoiding the use of classical Markov Chain Monte Carlo (MCMC) approaches. The performance of the different methods is evaluated through simulation studies. Moreover, the practical applicability of our proposed method is showcased through a data application, containing daily temperature and mortality count data in 87 Italian cities. 2026-03-19T14:58:49Z Sara Rutten Thomas Neyens Elisa Duarte Antonio Gasparrini Christel Faes http://arxiv.org/abs/2504.09573v2 A grid-based methodology for fast online changepoint detection 2026-03-19T14:50:17Z We propose a grid-based methodology for online changepoint detection that allows offline changepoint tests to be applied to sequentially observed data. The methodology achieves low update and storage costs by testing for changepoints over a dynamically updating grid of candidate changepoint locations. For a broad class of test statistics, including those based on empirical averages and certain likelihood ratios, we show that the resulting online procedure has update and storage costs that grow at most logarithmically with the sample size. We further show that finite-sample power guarantees for the offline test translate directly into non-asymptotic upper bounds on the detection delay, under a mild robustness assumption. 
Building upon the methodology, we construct methods for detecting changes in the mean and in the covariance matrix of multivariate data, and prove near-optimal non-asymptotic upper bounds on their detection delays. The effectiveness of the methodology is supported by a simulation study, where we compare its performance for detecting mean changes with that of state-of-the-art online methods. To illustrate its practical applicability, we use the methodology to detect structural changes in currency exchange rates in real time. 2025-04-13T13:48:49Z Per August Jarval Moen http://arxiv.org/abs/2603.18957v1 BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery 2026-03-19T14:25:10Z Recent advances in drug discovery have demonstrated that incorporating side information (e.g., chemical properties about drugs and genomic information about diseases) often greatly improves prediction performance. However, these side features can vary widely in relevance and are often noisy and high-dimensional. We propose Bayesian Variable Selection-Guided Inductive Matrix Completion (BVSIMC), a new Bayesian model that enables variable selection from side features in drug discovery. By learning sparse latent embeddings, BVSIMC improves both predictive accuracy and interpretability. We validate our method through simulation studies and two drug discovery applications: 1) prediction of drug resistance in Mycobacterium tuberculosis, and 2) prediction of new drug-disease associations in computational drug repositioning. On both synthetic and real data, BVSIMC outperforms several other state-of-the-art methods in terms of prediction. In our two real examples, BVSIMC further reveals the most clinically meaningful side features. 
2026-03-19T14:25:10Z Sijian Fan Liyan Xiong Dayuan Wang Guoshuai Cai Ray Bai http://arxiv.org/abs/2603.18870v1 Inference in Regression Discontinuity Designs with Clustered Data 2026-03-19T13:15:49Z Clustered sampling is prevalent in empirical regression discontinuity (RD) designs, but it has not received much attention in the theoretical literature. In this paper, we introduce a general model-based framework for such settings and derive high-level conditions under which the standard local linear RD estimator is asymptotically normal. We verify that our high-level assumptions hold across a wide range of empirical designs, including settings of growing cluster sizes. We further show that clustered standard errors that are currently used in practice can be either inconsistent or overly conservative in finite samples. To address these issues, we propose a novel nearest-neighbor-type variance estimator and illustrate its properties in a diverse set of empirical applications. 2026-03-19T13:15:49Z Claudia Noack Tomasz Olma Christoph Rothe http://arxiv.org/abs/2603.18833v1 Estimation of Functional Principal Components from Sparse Functional Data 2026-03-19T12:35:09Z Sparse functional data arise when measurements are observed infrequently and at irregular time points for each subject, often in the presence of measurement error. These characteristics introduce additional challenges for functional principal component analysis. In this paper, we propose a new approach for extracting functional principal components from such data by combining basis expansion with maximum likelihood estimation. Orthogonality of the estimated eigenfunctions is preserved throughout the optimization using modified Gram-Schmidt orthonormalization. An information criterion is proposed to select both the optimal number of basis functions and the rank of the covariance structure. 
Principal component scores are subsequently estimated via conditional expectation, enabling accurate reconstruction of the underlying functional trajectories across the full domain despite sparse observations. Simulation studies demonstrate the effectiveness of the proposed method and show that it performs favorably compared with existing approaches. Its practical utility is illustrated through applications to CD4 cell count data from the Multicenter AIDS Cohort Study and somatic cell count data from Irish research dairy cattle. Supplementary materials, including technical details, additional simulation results, and the R package mGSFPCA, are available online. 2026-03-19T12:35:09Z Uche Mbaka University College Dublin Jiguo Cao Simon Fraser University Michelle Carey University College Dublin http://arxiv.org/abs/2603.14431v2 Deviation Tests for a High-dimensional Mean 2026-03-19T11:08:24Z This paper investigates testing for deviation of a high-dimensional mean vector $\boldsymbol{\mu}$. In contrast to the standard one-sample significance test of the form: $H_0^\texttt{e} : \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1^\texttt{e} : \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$, we focus on testing the deviation $H_0 : \|\boldsymbol{\mu} - \boldsymbol{\mu}_0\|_2 \ge d_0$ versus $H_1 : \|\boldsymbol{\mu} - \boldsymbol{\mu}_0\|_2 < d_0$ for a prespecified length $d_0 > 0$. Constructing a valid test statistic for this problem is technically nontrivial. By applying the concept of positive and negative feedback processes from control theory, we propose a test statistic based on a two-armed bandit (TAB) process. The deviation test is also extended to the two-sample setting. Simulation experiments confirm good performance of the tests in finite samples. Finally, a real data analysis demonstrates the practical significance of the proposed deviation tests. 2026-03-15T15:15:55Z Zengjing Chen Ruihan Liu Jianfeng Yao