https://arxiv.org/api/PGvqVOfy8oE2A0o7DYTIQsgH7v0
2026-06-09T20:32:26Z
36101015

http://arxiv.org/abs/2606.09625v1
A Synthetic Control Approach to Conditional Distributional Treatment Effects
2026-06-08T15:26:21Z
This paper proposes a synthetic control (SC) framework for the estimation of conditional distributional treatment effects. Identification rests on a parallel trends condition formulated in the parameter space of the semiparametric distribution regression (DR) model, which keeps the counterfactual conditional distribution within the model class. The weights solve a least-squares problem subject to an adding-up constraint, yielding a closed-form estimator. We derive the asymptotic distribution of the counterfactual estimator, with DR estimation error and weight estimation error contributing at the same rate to the asymptotic variance. Moreover, we propose a supremum test for the null of no treatment effect, whose limit is the supremum of a Gaussian process. Simulations illustrate that conditioning on covariates can reveal effects being difficult to detect from the unconditional distribution alone. An application to the 1992 New Jersey minimum wage increase using CPS data finds effects concentrated in the minimum-wage corridor for low-education, low-experience workers.
2026-06-08T15:26:21Z
Dominik Wied

http://arxiv.org/abs/2606.04875v2
A Model Selection Criterion for Multidimensional Gaussian Processes: Application to Radial Velocities
2026-06-08T14:13:21Z
Multidimensional Gaussian Process (multi-GP) regression is widely used to disentangle stellar and planetary signals in radial velocities (RVs) by jointly modelling ancillary activity indicators. However, identifying the combination of indicators that best constrains the stellar signal in the RVs is non-trivial, as classical model comparison methods are not directly applicable when multi-GPs involve different time series combinations.
In this work, we present an information criterion to compare multi-GP models based on their ability to explain the RV component, $\mathrm{MGIC}_{\rm rv}$. This metric combines the conditional RV likelihood with an effective parameter count that accounts for the regularisation imposed by the multi-GP model on the RV component. We demonstrate that $\mathrm{MGIC}_{\rm rv}$ provides a quantitative and robust framework for multi-GP model comparison, identifying the activity indicators that most effectively constrain the RV signal. Although developed in the context of RV analysis, the proposed criterion is general and applicable to multi-GP problems in which the inference focuses on a specific observable.
2026-06-03T13:38:31Z
Accepted for publication in MNRAS letters
Oscar Barragán
10.1093/mnras/stag1054

http://arxiv.org/abs/2602.05807v3
SpARCD: A Spectral Graph Framework for Revealing Differential Functional Connectivity in fMRI Data
2026-06-08T13:19:53Z
Identifying brain regions that exhibit altered functional connectivity across cognitive or emotional states is a key problem in neuroscience. Existing methods, such as edge-wise testing, seed-based psychophysiological interaction (PPI) analysis, or correlation network comparison, typically suffer from low statistical power, arbitrary thresholding, and limited ability to capture distributed or nonlinear dependence patterns. We propose SpARCD (Spectral Analysis of Revealing Connectivity Differences), a novel statistical framework for detecting differences in brain connectivity between two experimental conditions. SpARCD leverages distance correlation, a dependence measure sensitive to both linear and nonlinear associations, to construct a weighted graph for each condition. It then constructs a differential operator via spectral filtering and uncovers connectivity changes by computing its leading eigenvectors. Inference is achieved via a permutation-based testing scheme that yields interpretable, region-level significance maps.
Extensive simulation studies demonstrate that SpARCD achieves superior power relative to conventional edge-wise or univariate approaches, particularly in the presence of complex dependency structures. Application to fMRI data from 113 early PTSD patients performing an emotional face-matching task reveals distinct networks associated with emotional reactivity and regulatory processes. Overall, SpARCD provides a statistically rigorous and computationally efficient framework for comparing high-dimensional connectivity structures, with broad applicability to neuroimaging and other network-based scientific domains.
2026-02-05T15:59:48Z
Shira Yoffe, Ziv Ben-Zion, Guy Gurevitch, Talma Hendler, Malka Gorfine, Ariel Jaffe

http://arxiv.org/abs/2602.16061v2
Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models
2026-06-08T12:56:50Z
Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical shadow-variable methods.
When predictions are sufficiently informative, the bounds collapse to a point, recovering standard identification as a special case. In finite samples, to provide valid coverage of the identified set, we propose a set-expansion estimator that achieves slower-than-$\sqrt{n}$ convergence rate in the set-identified regime and the standard $\sqrt{n}$ rate under point identification. In simulations and semi-synthetic experiments on customer-service dialogues, we find that LLM predictions are often ill-conditioned for classical shadow-variable methods yet remain highly effective in our framework. They shrink identification intervals by 75--83\% while maintaining valid coverage under realistic MNAR mechanisms.
2026-02-17T22:18:27Z
Hongyu Chen, David Simchi-Levi, Ruoxuan Xiong

http://arxiv.org/abs/2606.09391v1
Kling-Gupta linear regression
2026-06-08T12:06:14Z
Although the Kling-Gupta efficiency ($\mathrm{KGE}$) is widely adopted for model evaluation in hydrology, its properties as a statistical estimator remain unexplored. Investigating these properties is necessary because parameter estimation and forecast evaluation are inherently linked. To address this, we formalize the negatively oriented Kling-Gupta loss $L_\mathrm{KG} = (1 - \mathrm{KGE})^2$ within an extremum estimation framework (equivalent to maximizing $\mathrm{KGE}$) and analyze its behavior in multiple linear regression. We establish explicit formulas for the parameter estimates, showing that Kling-Gupta linear regression scales the ordinary least squares (OLS) coefficient vector by a variance-inflation factor governed by the sample variances and covariances of the predictors and the response. We show that Kling-Gupta linear regression predictions replicate the sample variance of the response on the training set, in contrast to the variance reduction inherent to OLS, while both estimators maintain the sample mean of the observations and achieve the same sample correlation between the predictions and the response.
We show analytically that no single estimator can simultaneously maximize both the Nash-Sutcliffe efficiency $\mathrm{NSE}$ and $\mathrm{KGE}$: the OLS estimator attains the maximum possible $\mathrm{NSE}$ but not the maximum $\mathrm{KGE}$, while the Kling-Gupta estimator maximizes $\mathrm{KGE}$ at the cost of $\mathrm{NSE}$. We prove the almost sure convergence of the Kling-Gupta estimator to well-defined population limits and express those limits algebraically. Furthermore, we evaluate the training and test set performance metrics for both estimators, demonstrating that for each estimator the metrics on the training set and on an independent test set converge asymptotically to identical limits (though the limits differ between OLS and Kling-Gupta regression).
2026-06-08T12:06:14Z
64 pages, 8 figures, 3 tables
Hristos Tyralis, Georgia Papacharalampous

http://arxiv.org/abs/2606.09351v1
In-Context Learning for the Imputation of Public Opinion Data with Large Language Models
2026-06-08T11:25:10Z
Large language models have been widely evaluated as simulators of individual survey responses. In practice, however, fully unobserved responses are rare; the dominant problem is partial non-response. Imputation aims to restore the overall structure of a survey dataset by filling in these missing values. It has its own well-defined evaluation criteria and differs fundamentally from prediction. We propose to impute missing survey data through in-context learning (ICL). We systematically evaluate ICL design choices across different missingness mechanisms (MCAR, MAR, MNAR) on 150 opinion variables spanning 15 waves of the American Trends Panel. Compared to well-established statistical methods for data imputation like MICE PMM, our ICL approach consistently reduces absolute error across all missingness mechanisms, with the largest gains under non-random missingness (MNAR).
Notably, the best-performing specification (gpt-oss-120b with 100 in-context examples) achieves near-nominal aggregate coverage (approaching the 95% level) with confidence intervals two to five times narrower than MICE PMM. We publish a Python package with an sklearn-like API to enable easy deployment of our method using local and proprietary LLMs.
2026-06-08T11:25:10Z
Tobias Holtdirk, Georg Ahnert, Joseph W Sakshaug, Anna-Carolina Haensch

http://arxiv.org/abs/2410.20169v3
Bayes-assisted Confidence Regions: Focal Point Estimator and Bounded-influence Priors
2026-06-08T10:25:14Z
The Frequentist, Assisted by Bayes (FAB) framework constructs confidence regions that leverage prior information about parameter values. FAB confidence regions (FAB-CRs) have smaller volume for values of the parameter that are likely under the prior while maintaining exact frequentist coverage. This work introduces several methodological and theoretical contributions to the FAB framework. For Gaussian likelihoods, we show that the posterior mean of the mean parameter is contained in the FAB-CR. More generally, this result extends to the posterior mean of the natural parameter for likelihoods in the natural exponential family. These results provide a natural Bayes-assisted estimator to be reported alongside the FAB-CR. Furthermore, for Gaussian likelihoods, we show that power-law tail conditions on the marginal likelihood induce robust FAB-CRs that are uniformly bounded and revert to standard frequentist confidence intervals for extreme observations. We translate this result into practice by proposing a class of shrinkage priors for the FAB framework that satisfy this condition without sacrificing analytic tractability.
The resulting FAB estimators equal prominent Bayesian shrinkage estimators, including the horseshoe estimator, thereby establishing insightful connections between robust FAB-CRs and Bayesian shrinkage methods.
2024-10-26T12:58:22Z
35 pages, 17 figures
Stefano Cortinovis, François Caron

http://arxiv.org/abs/2606.09307v1
Robust high-dimensional Bayesian regression with non-Gaussian errors under global--local shrinkage priors
2026-06-08T10:15:23Z
Multivariate regression with many correlated responses and predictors commonly violates Gaussian error assumptions due to heavy tails, outliers, and asymmetry. Gaussian procedures then lose efficiency in coefficient estimation and produce biased estimates of conditional dependence graphs. We develop a robust Bayesian framework using a scale-location mixture error distribution and horseshoe+ global-local priors on both the regression coefficients and off-diagonals of the error precision matrix, coupling sparsity in the regression map with sparsity in the residual dependence structure. Theoretical contributions include joint posterior contraction, selection consistency for both supports, a Kullback-Leibler risk bound showing the dominance of horseshoe+ over horseshoe, and bounded sensitivity, ensuring that a single large outlier has vanishing influence under t errors. Simulations across four error regimes, contamination, and varying dimensions show that our estimator matches Gaussian procedures under normality and dominates them under heavy tails and skewness.
Applications to FRED-MD macroeconomic data and S&P 500 daily returns recover interpretable sparse coefficient maps and residual dependence graphs while automatically down-weighting crisis-period observations.
2026-06-08T10:15:23Z
21 pages, 9 figures, 6 tables
Mohammad Arashi

http://arxiv.org/abs/2505.01359v2
Dual system estimation using mixed effects loglinear models
2026-06-08T09:48:31Z
In official statistics, dual system estimation (DSE) is a well-known tool to estimate the size of a population. Two sources are linked, and the number of units that are missed by both sources is estimated. Often dual system estimation is carried out in each of the levels of a stratifying variable, such as region. DSE can be considered a loglinear independence model, and, with a stratifying variable, a loglinear conditional independence model. The standard approach is to estimate parameters for each level of the stratifying variable. Thus, when the number of levels of the stratifying variable is large, the number of parameters estimated is large as well. Mixed effects loglinear models, where sets of parameters involving the stratifying variable are replaced by a distribution parameterised by its mean and a variance, have also been proposed, and we investigate their properties through simulation. In our simulation studies the mixed effects loglinear model outperforms the fixed effects loglinear model although only to a small extent in terms of mean squared error. We show how mixed effects dual system estimation can be extended to multiple system estimation.
2025-05-02T15:57:01Z
Ceejay Hammond, Paul A. Smith, Peter G. M. van der Heijden

http://arxiv.org/abs/2606.09274v1
Reverse Stress Testing for Multivariate Scenarios: A Conditional Framework for Stressed Time Series
2026-06-08T09:42:22Z
This paper develops a methodological framework for reverse stress testing (RST) in which a multivariate stress scenario, coherent with the empirical dependence structure of a market, is reconstructed from a single exogenous shock prescribed on one asset class. The problem is formulated as the maximisation of the conditional density given the imposed shock, and is solved under three progressively weaker distributional assumptions. In the parametric setting, joint Gaussianity of the returns yields a closed-form modal scenario coinciding with the conditional mean of the non-shocked components. In the semiparametric setting, the modal scenario is estimated nonparametrically through the empirical likelihood methodology and the surrounding stressed trajectories are generated via a Gaussian or Student-t local sampling scheme. In the fully nonparametric setting, stressed trajectories are obtained by inverse-distance resampling of the historical observations within a Mahalanobis neighbourhood of the estimated scenario. The three variants are validated on real market data. The simulated scenarios prove to be economically coherent and capable of reproducing the standard risk-reward asymmetry observed in stressed market regimes.
2026-06-08T09:42:22Z
26 pages, 5 figures, 2 tables
Michele Sparviero, Lorenzo Viola

http://arxiv.org/abs/2410.23786v3
Conformal inference for cell type annotation with graph-structured constraints
2026-06-08T08:19:12Z
Conformal prediction is a framework for constructing prediction sets for machine learning models, relying solely on the exchangeability of training and test data and without requiring to specify a parametric distribution. Despite its wide applicability and popularity, its application in single-cell transcriptomics remains underexplored.
This paper addresses this gap by developing an approach that leverages the rich information about cell-type relations, encoded in the graph structure of cell ontologies, to enhance the interpretability of reference-based cell-type annotation. Leveraging conformal risk control, we develop a novel conformal algorithm for graph-structured predictions and we demonstrate how incorporating graph constraints can improve the interpretation of cell-type predictions. This approach aims to generate more coherent conformal sets that align with the inherent relationships among classes, facilitating clearer and more intuitive interpretations of model predictions. Additionally, we provide a technique to address non-exchangeability, particularly when the cell-type distribution changes between training and test datasets. We implemented our method in the open-source R package scConform, available at https://bioconductor.posit.co/packages/release/bioc/html/scConform.html.
2024-10-31T10:00:40Z
Daniela Corbetta, Livio Finos, Ludwig Geistlinger, Davide Risso

http://arxiv.org/abs/2606.09153v1
The Asymptotic Distribution of Sample Canonical Directions in Gaussian Spiked High-dimensional CCA
2026-06-08T07:46:09Z
This paper studies the asymptotic behavior of sample canonical directions in a finite-rank spiked high-dimensional canonical correlation analysis model under a Gaussian population assumption. Under the asymptotic regime in which the dimensions of the two data blocks grow proportionally with the sample size, sample canonical directions are generally not consistent estimators of their population counterparts, even when the corresponding sample canonical correlations separate from the bulk spectrum. To quantify directional recovery, we investigate the squared alignment between a sample canonical direction and its associated population direction.
For each simple population spike, we first establish a deterministic first-order limit for this squared alignment, which gives an explicit measure of the population-level directional information retained by the sample direction. We then prove a central limit theorem for its fluctuations around the deterministic limit, with an explicit asymptotic variance expressed through deterministic limits of resolvent trace functionals. To make the theoretical quantities computable from data, we further construct plug-in estimators for both the limiting mean and the asymptotic variance by inverting the deterministic outlier eigenvalue map, and prove their consistency. Numerical simulations and a real-data illustration support the theoretical results and demonstrate how the proposed estimators assess the recovery quality of sample canonical directions.
2026-06-08T07:46:09Z
Zhangni Pu, Zhangxiao Zhuo, Jiang Hu

http://arxiv.org/abs/2606.09089v1
Supervised Low-Rank Structure Discovery for Developmental Epigenetic Aging in Ultra-High-Dimensional DNA Methylation Data
2026-06-08T06:36:32Z
Ultra-high-dimensional array-based CpG methylation studies require statistical frameworks that simultaneously provide supervised structure discovery, interpretability, scalable latent-dimension identification, and computational feasibility. We propose SOLAR (Supervised Orthogonal Low-rank Adaptive Regression), a supervised low-rank latent-factor framework for identifying CpG-level methylation structure associated with residualized DNAm age. SOLAR combines orthogonal low-rank regression with a penalized maximum a posteriori formulation, dimension-adaptive BIC-type penalization, and a trans-dimensional simulated-annealing strategy for automatic latent-rank selection, together with theoretical guarantees including identifiability, fixed-rank recovery, and rank-selection consistency under suitable regularity conditions.
The framework additionally incorporates computationally and memory-efficient optimization strategies demonstrating scalability up to $p=10^7$, while analyses at $p=10^6$ remain feasible on standard desktop computing environments. Simulation studies demonstrate stable rank recovery, competitive supervised signal recovery, and strong scalability across moderate-, high-, and ultra-high-dimensional regimes. Using longitudinal EPIC-array CpG methylation data from the GUSTO birth cohort, comprising $n=1051$ methylation profiles collected across infancy and early childhood with approximately 860,000 assayed CpGs per sample, SOLAR identifies heterogeneous supervised methylation structure associated with residualized DNAm age beyond chronological age alone, together with biologically coherent CpG signatures and enrichment patterns.
2026-06-08T06:36:32Z
Priyam Das, Jiyeon Song, Lathika Mohanraj, Karolina A. Aberg, Yi Li, Subharup Guha

http://arxiv.org/abs/2606.09049v1
Data augmented bootstrap: Unifying confidence interval construction by approximate invariance
2026-06-08T05:39:02Z
We propose the data augmented bootstrap (DAB), a framework for constructing confidence intervals from approximately invariant transformations of the data. As special cases, DAB recovers popular methods that rely on exact group symmetries, such as conformal prediction, wild bootstrap for Maximum Mean Discrepancy U-statistics and the recently proposed SymmPI. Meanwhile, DAB also recovers the classical bootstrap method, which exploits the dataset's approximate invariance under uniform sampling of data indices as the dataset size grows. For all DAB methods, we establish theoretical coverage results that interpolate between finite-sample and asymptotic guarantees according to the strength of the invariance, and without assuming a group structure. The approximate invariance is measured in the Kolmogorov distance and, for statistics that satisfy Gaussian universality, reduces to conditional mean and variance matching.
This allows us to incorporate data augmentation (DA), a widely used machine learning heuristic based on approximate invariances, into known statistical methods. We empirically test the performance of incorporating DA into bootstrap, wild bootstrap and conformal prediction for simulated settings as well as for image, language and scientific data.
2026-06-08T05:39:02Z
Kevin Han Huang

http://arxiv.org/abs/2605.29348v2
Efficient Inference for Incremental Causal Effects of Time to Treatment
2026-06-08T05:00:10Z
We consider continuous time to treatment initiation. This can commonly occur in preventive medicine, such as disease screening and vaccination; it can also occur with non-fatal health conditions such as HIV infection without the onset of AIDS. While traditional causal inference focused on `when to treat' and its effects, we consider the incremental causal effect when the intensity of time to treatment initiation is intervened upon. We derive the efficient influence function for this estimand and develop an estimation framework that accommodates flexible machine learning methods while achieving fast convergence rates. Valid confidence bands are obtained leveraging empirical process theory. We illustrate our approach via simulation, and apply it to cervical cancer screening data to study the incremental effect of time to subsequent HPV testing on cervical intraepithelial neoplasia detection.
2026-05-28T04:37:37Z
Zhichen Zhao, Andrew Ying, Ronghui Xu