http://arxiv.org/abs/2605.29112v1 Efficient First-Order Methods for Estimating Generalized Additive Index Models 2026-05-27T21:20:51Z

Generalized additive index models (GAIMs) offer a flexible semiparametric framework for capturing complex data relationships, balancing the interpretability of parametric models with the flexibility of nonparametric approaches. However, classical stage-wise estimation procedures for GAIMs suffer from computational inefficiencies due to their sequential nature and reliance on nonparametric smoothing. To overcome these drawbacks, we propose efficient, simultaneous estimation algorithms for GAIMs. By leveraging basis expansion, we cast the semiparametric estimation task as a finite-dimensional optimization problem solvable by first-order methods such as gradient descent (GD). Furthermore, we introduce a variational inequality (VI) estimation algorithm, extending the VI framework from generalized linear models to GAIMs. We provide a unified convergence result to a stationary point for both algorithms. Numerical experiments highlight the computational and statistical advantages of our methods over classical stage-wise procedures, and reveal the potential benefits of the VI-based approach over GD for non-canonical link functions.
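
As a rough illustration of how basis expansion turns the semiparametric problem into a finite-dimensional one, the formulation below is a sketch under assumed notation. The abstract does not give the link $g$, index vectors $w_j$, basis functions $\phi_k$, coefficients $\beta_{jk}$, loss $\ell$, or step size $\eta$, and the authors' exact objective may differ.

```latex
% Assumed GAIM form with M additive index components:
g\bigl(\mathbb{E}[y \mid x]\bigr) = \sum_{j=1}^{M} f_j\bigl(w_j^\top x\bigr),
\qquad
f_j(u) \approx \sum_{k=1}^{K} \beta_{jk}\, \phi_k(u).

% Finite-dimensional objective over \theta = (w_1,\dots,w_M,\beta):
L(\theta) = \frac{1}{n} \sum_{i=1}^{n}
  \ell\Bigl(y_i,\; g^{-1}\Bigl(\sum_{j=1}^{M}\sum_{k=1}^{K}
  \beta_{jk}\,\phi_k\bigl(w_j^\top x_i\bigr)\Bigr)\Bigr).

% Simultaneous first-order (gradient descent) update of all parameters:
\theta^{(t+1)} = \theta^{(t)} - \eta\, \nabla L\bigl(\theta^{(t)}\bigr).
```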

2026-05-27T21:20:51Z Ziyu Peng Linglingzhi Zhu Yao Xie http://arxiv.org/abs/2605.26408v2 Function-Valued Causal Influence in Nonlinear Time Series 2026-05-27T21:07:14Z

Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the true object learned by nonlinear autoregressive models: a state-dependent function whose effect varies across regimes, magnitudes, and contexts. We formalize function-valued causal influence for additive, contribution-decomposable architectures and show that scalar causal scores constitute a severe information bottleneck, conflating between-state variation with within-state residual noise. Using Neural Additive Vector Autoregression as a representative architecture, we introduce a practical framework based on Individual Conditional Expectation for estimating causal response functions directly from trained models. Through controlled synthetic experiments, we demonstrate that edges with indistinguishable scalar scores can exhibit qualitatively different functional behaviors, including monotonic, thresholded, saturating, and sign-changing effects. An applied case study on democratic development further shows that function-valued analysis reveals regime-specific and asymmetric causal structure systematically missed by score-centric approaches.
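
To make the Individual Conditional Expectation idea concrete, here is a minimal sketch in plain Python. The function `ice_curves` and the toy predictor `toy_model` are hypothetical stand-ins, not the paper's code. In the time-series setting the rows would be lagged states and the predictor a trained autoregressive model.

```python
def ice_curves(predict, rows, feature, grid):
    """For each row, vary one input over `grid` and record the model output."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)       # copy so the original row is untouched
            modified[feature] = value  # overwrite only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def toy_model(x):
    # Hypothetical fitted model: the effect of x[0] changes sign with x[1].
    return x[0] * x[1]


curves = ice_curves(toy_model, rows=[[0.0, 2.0], [0.0, -1.0]],
                    feature=0, grid=[-1.0, 0.0, 1.0])
# Averaging the curves pointwise gives a single summary curve.
mean_curve = [sum(col) / len(col) for col in zip(*curves)]
```

In this toy case the two curves have opposite slopes, so the averaged curve is much flatter than either one. This is the kind of state-dependent behavior a single scalar score cannot show.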

2026-05-26T00:34:49Z 26 pages, 6 tables, 8 figures Valentina V. Kuskova Dmitry Zaytsev Michael Coppedge http://arxiv.org/abs/2605.29081v1 Bayesian Inference of Mixing and Transmission Heterogeneity in Stratified Disease Surveillance Models 2026-05-27T20:37:02Z

When surveillance data of infectious disease incidence (e.g. weekly case counts) are disaggregated by demographic indicators, disparities in long-run health outcomes between these groups become apparent. Accurate identification of high-risk subpopulations would enable policy-makers to target interventions early in an epidemic; however, temporal models of disease incidence typically lack robust treatment of multivariate (i.e. subpopulation-level) outcomes. We propose a novel Bayesian latent-variable extension of the endemic-epidemic ("EE") modeling framework commonly used for this purpose. Specifically, we augment the EE model class with explicit representation of unobserved individual-level transmissibility; explicit separation of disease incidence and prevalence; and parametric estimation of between-demographic-groups mixing structure. The resulting model may be tailored for either rare-disease (highly-endemic) contexts or outbreak-driven (highly-epidemic) contexts, and is capable of inferring social contact mixing patterns from incidence data alone, including mixing patterns among multiply-stratified data. To demonstrate, we conduct a simulation study comparing our model to an existing doubly-stratified EE model in the intended rare-disease application regime. We then compare our inference to the competitor's for real incidence data of norovirus gastroenteritis in Berlin, 2011-2015, disaggregated by six age groups and twelve geographic regions. Finally, we report inference of our model on COVID-19 incidence recorded in Michigan during the first year of the pandemic, disaggregated by six age groups and sixty-six geographic regions.

2026-05-27T20:37:02Z Miles Moran Oregon State University Rob Trangucci Oregon State University Lisa Madsen Oregon State University http://arxiv.org/abs/2605.02574v2 Fast and accurate conditioning for large-scale and online Gaussian process prediction problems 2026-05-27T20:10:48Z

Gaussian Process (GP) models provide a flexible framework for prediction and uncertainty quantification. For most covariance functions, however, exact GP prediction with $n$ points scales as $\mathcal{O}(n^3)$, making it prohibitively expensive for large datasets or large numbers of prediction points. While nearest neighbor-based prediction can work well in certain settings, non-pathological circumstances (for example, measurement noise) can severely restrict its efficiency. This work presents a complementary approach where one conditions on carefully designed linear combinations of data, which is particularly effective in the setting of jointly predicting many values in large connected regions of the data domain. For kernel functions that are smooth away from the origin and simple prediction domains, this method can be exponentially convergent in the number of linear combinations $r$ used for conditioning, and can be machine-precision accurate for $r \approx 100$. This approach costs $\mathcal{O}(T r^2)$ work to compute where $T$ is the cost of solving a linear system with the data covariance matrix, and so in many cases can be computed in linear or near-linear cost by exploiting rank structure in well-behaved covariance matrices. At the cost of $\mathcal{O}(nr^2)$ additional precomputation work, this approach can also provide predictions at arbitrary points of a designated region in $\mathcal{O}(1)$ online work, making it particularly attractive for problems where prediction points are not known in advance.
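
Conditioning on linear combinations of the data follows from standard Gaussian conditioning. The sketch below assumes a zero-mean model and uses notation not taken from the abstract: data $y \in \mathbb{R}^n$ with covariance $\Sigma$, a matrix $A \in \mathbb{R}^{n \times r}$ whose columns define the linear combinations, and cross-covariance $k_*$ between the data and the prediction target $f_*$. How the paper designs $A$ is not described here.

```latex
% Reduced data: r linear combinations of the n observations.
z = A^\top y, \qquad z \sim \mathcal{N}\bigl(0,\; A^\top \Sigma A\bigr).

% Conditional mean and variance of f_* given z (only an r x r system is solved):
\mathbb{E}[f_* \mid z] = k_*^\top A \bigl(A^\top \Sigma A\bigr)^{-1} z,
\qquad
\operatorname{Var}[f_* \mid z] = k_{**} - k_*^\top A \bigl(A^\top \Sigma A\bigr)^{-1} A^\top k_* .
```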

2026-05-04T13:29:09Z Samanyu Arora Christopher J. Geoga http://arxiv.org/abs/2509.21707v3 SADA: Safe and Adaptive Aggregation of Multiple Black-Box Predictions in Semi-Supervised Learning 2026-05-27T19:19:05Z

Semi-supervised learning (SSL) arises in practice when labeled data are scarce or expensive to obtain, while large quantities of unlabeled data are readily available. With the growing adoption of machine learning techniques, it has become increasingly feasible to generate multiple predicted labels using a variety of models and algorithms, including deep learning, large language models, and generative AI. In this paper, we propose a novel approach that safely and adaptively aggregates multiple black-box predictions of uncertain quality for both inference and prediction tasks. Our method provides two key guarantees: (i) it never performs worse than using the labeled data alone, regardless of the quality of the predictions; and (ii) if any one of the predictions (without knowing which one) perfectly fits the ground truth, the algorithm adaptively exploits this to achieve either a faster convergence rate or the semiparametric efficiency bound. We demonstrate the effectiveness of the proposed algorithm through small-scale simulations and two real-data analyses with distinct scientific goals. A user-friendly R package, sada, is provided to facilitate practical implementation.

2025-09-26T00:02:54Z Jiawei Shan Zhifeng Chen Yiming Dong Yazhen Wang Jiwei Zhao http://arxiv.org/abs/2605.28974v1 Algorithm to check Maximum Likelihood Estimate Existence for integrated PCA 2026-05-27T18:23:10Z

Encouraged by [AKRS], which provides an amazing bridge between Statistics and Invariant Theory, and especially by [FM], where quiver semi-invariant techniques are applied to verify the existence of the MLE for a recent iPCA model, we provide an enhancement to [FM]. Our Theorem 5.2 yields necessary and sufficient conditions for the MLE to exist generically for any dimension vector. The conditions can be easily checked with our software [T], which is based on the Derksen-Weyman algorithm and simplifies the application for statistics practitioners and non-specialists in quivers. For those deep in quiver Representation Theory, Theorem 5.2 relates the MLE existence to the local semi-simplicity of representations as introduced in [Sh07]. We also hope that our elementary and short text can serve experts in both domains as a warm start in a new category.

2026-05-27T18:23:10Z 6 pages Dmitri Shmelkin http://arxiv.org/abs/2605.28785v1 Beyond Exchangeability: Distribution-Shift-Aware Integration of External Control Data in Randomized Trials 2026-05-27T17:44:53Z

Randomized controlled trials (RCTs) are the gold standard for evaluating causal effects but are often costly and difficult to scale; consequently, they are frequently augmented with auxiliary external controls in many applications. Prior approaches for borrowing such data typically rely on exchangeability, under which the external controls are readily usable for inference in the trial population. In practice, however, differences in eligibility criteria, standard of care, and data collection procedures may induce distribution shifts between the RCT and the external controls, rendering exchangeability implausible. In this paper, we propose a novel framework for integrating external controls by explicitly modeling these distribution shifts. We construct augmented estimators by adapting trial-only efficient influence functions through calibration equations that balance the trial and external populations, thereby fully exploiting the external control data even when exchangeability fails. We further develop an adaptive shrinkage estimator that preserves consistency while guaranteeing efficiency dominance over the trial-only benchmark. Synthetic experiments and a real data application demonstrate the practical advantages of the proposed approaches.

2026-05-27T17:44:53Z Jiawei Shan Yiteng Tu Guanbo Wang Chao Ying Jiwei Zhao http://arxiv.org/abs/2605.28762v1 Deep Neural Networks for Doubly Robust Estimation with Nonprobability Survey Samples 2026-05-27T17:21:50Z

Integrating probability and nonprobability survey samples is an important problem in modern survey sampling. Nonprobability samples often contain rich outcome information but may lack population representativeness, whereas probability samples provide design-based auxiliary information but may not contain the study variable. We propose a deep neural network (DNN)-assisted doubly robust framework for estimating the finite population mean from these two data sources. The proposed method models the logit sampling score for the nonprobability sample as an unknown nonparametric function and estimates it by maximizing a pseudo-likelihood that combines information from the nonprobability sample and a reference probability sample. The DNN parameters are optimized using the ADAM algorithm. The resulting DNN-estimated sampling scores are incorporated into a DNN-assisted inverse-probability weighted estimator and a deep doubly robust estimator. We establish consistency and convergence rates under regularity conditions and evaluate the finite-sample performance of the proposed estimators through simulation studies and an empirical application using Pew Research Center and Behavioral Risk Factor Surveillance System data. The results suggest that the proposed estimators can improve robustness to parametric propensity-score misspecification, especially when the true selection mechanism is nonlinear.
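
The way estimated sampling scores and outcome predictions combine can be sketched with a commonly used doubly robust form for the finite population mean. This is an illustrative assumption, not necessarily the paper's estimator. The function name and all argument names are hypothetical, and the sampling scores and outcome predictions are taken as given, whatever model produced them.

```python
def doubly_robust_mean(y_np, m_np, pi_np, m_prob, d_prob):
    """Doubly robust estimate of a finite population mean.

    y_np   : outcomes observed in the nonprobability sample
    m_np   : outcome-model predictions for the nonprobability sample
    pi_np  : estimated sampling scores for the nonprobability sample
    m_prob : outcome-model predictions for the probability sample
    d_prob : design weights of the probability sample
    """
    weights = [1.0 / p for p in pi_np]   # inverse sampling-score weights
    n_hat_np = sum(weights)              # estimated population size (nonprobability side)
    n_hat_prob = sum(d_prob)             # estimated population size (probability side)
    # Weighted residuals correct the outcome model where it is wrong.
    residual_term = sum(w * (y - m) for w, y, m in zip(weights, y_np, m_np)) / n_hat_np
    # Design-weighted predictions carry the outcome model to the population.
    prediction_term = sum(d * m for d, m in zip(d_prob, m_prob)) / n_hat_prob
    return residual_term + prediction_term


estimate = doubly_robust_mean(
    y_np=[3.0, 5.0], m_np=[2.0, 4.0], pi_np=[0.5, 0.25],
    m_prob=[2.0, 6.0], d_prob=[10.0, 10.0],
)
```

The estimate stays consistent if either the sampling-score model or the outcome model is correctly specified, which is what "doubly robust" means.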

2026-05-27T17:21:50Z 29 pages, 1 figure Yufang Dai Shihua Luo Wendy Lou Zilin Wang Xuewen Lu http://arxiv.org/abs/2605.28749v1 IV regression with distribution-valued outcomes 2026-05-27T17:07:47Z

We develop IV Fréchet regression (IVFR), an instrumental-variable (IV) method for settings where the outcome is an entire distribution. Framing the problem as an IV regression in 2-Wasserstein space, IVFR extends global Fréchet regression to the case with endogenous covariates. IVFR projects IV-weighted quantile curves onto the space of valid distributions and then recovers the corresponding regression coefficient functions. The projection provably reduces the estimation error in finite samples and guarantees valid fitted distributions. We show that the IVFR estimator converges weakly to a mean-zero Gaussian process and establish the validity of a multiplier bootstrap procedure for uniform inference. In simulations, the projection reduces the integrated mean squared error (IMSE) by up to 63% relative to existing methods. Revisiting the effects of Chinese import competition on the wage distribution within commuting zones, the proposed method produces 9-10% narrower confidence bands than existing methods. Using our novel uniform confidence bands, we find evidence that import competition reduced wages only between the 10th and 35th quantiles, and no evidence of a reduction at the very bottom of the distribution. We also revisit the effect of county food stamp programs on the county's birth weight distribution and find no significant effects.
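
For distributions on the real line, a curve is a valid quantile function only if it is nondecreasing. On an equally spaced grid of quantile levels, the L2 projection onto such curves can be computed with the pool-adjacent-violators algorithm. The sketch below illustrates that generic projection step under this assumption and is not the authors' implementation. The function name is hypothetical.

```python
def project_to_quantile_function(values):
    """L2-project a curve on an equally spaced grid onto nondecreasing curves."""
    blocks = []  # each block holds [block mean, number of grid points]
    for v in values:
        blocks.append([float(v), 1])
        # Pool adjacent blocks for as long as they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean_b, count_b = blocks.pop()
            mean_a, count_a = blocks.pop()
            total = count_a + count_b
            blocks.append([(mean_a * count_a + mean_b * count_b) / total, total])
    projected = []
    for mean, count in blocks:
        projected.extend([mean] * count)
    return projected


# A fitted "quantile curve" that crosses itself between the 2nd and 3rd grid points.
fitted = [1.0, 3.0, 2.0, 4.0]
valid = project_to_quantile_function(fitted)
```

Curves that are already monotone pass through unchanged, so the projection only alters fits that would otherwise not be valid distributions.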

2026-05-27T17:07:47Z 37 pages, 4 figures, 2 tables David Van Dijcke Kaspar Wüthrich http://arxiv.org/abs/2012.02985v5 Selecting the number of components in PCA via random signflips 2026-05-27T16:48:03Z

Principal component analysis (PCA) is a foundational tool in modern data analysis, and a crucial step in PCA is selecting the number of components to keep. However, classical selection methods (e.g., scree plots and parallel analysis) lack statistical guarantees in the increasingly common setting of large-dimensional data with heterogeneous noise, i.e., where each entry may have a different noise variance. Moreover, it turns out that these methods, which are highly effective for homogeneous noise, can fail dramatically for data with heterogeneous noise. This paper proposes a new method called signflip parallel analysis (FlipPA) for the setting of approximately symmetric noise: it compares the data singular values to those of "empirical null" matrices generated by flipping the sign of each entry randomly with probability one-half. We develop a rigorous theory for FlipPA, showing that it has nonasymptotic type I error control and that it consistently selects the correct rank for signals rising above the noise floor in the large-dimensional limit (even when the noise is heterogeneous). We also rigorously explain why classical permutation-based parallel analysis degrades under heterogeneous noise. Finally, we illustrate that FlipPA compares favorably to state-of-the-art methods via numerical simulations and an illustration on data coming from astronomy.
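
The signflip comparison described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' reference implementation. The function name `flippa_rank`, the number of trials, the quantile used as threshold, and the sequential stopping rule are all assumptions made for the example.

```python
import numpy as np


def flippa_rank(X, n_trials=20, quantile=1.0, seed=0):
    """Pick the number of components by comparing singular values of X
    with those of randomly signflipped copies of X."""
    rng = np.random.default_rng(seed)
    data_sv = np.linalg.svd(X, compute_uv=False)
    null_sv = np.empty((n_trials, data_sv.size))
    for t in range(n_trials):
        # "Empirical null": flip each entry's sign independently with probability 1/2.
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_sv[t] = np.linalg.svd(signs * X, compute_uv=False)
    # Per-component threshold taken across the null draws (the maximum when quantile=1.0).
    thresholds = np.quantile(null_sv, quantile, axis=0)
    # Keep components until the first one that fails to exceed its threshold.
    k = 0
    while k < data_sv.size and data_sv[k] > thresholds[k]:
        k += 1
    return k


# Toy data: one strong rank-one signal plus standard normal noise.
rng = np.random.default_rng(1)
n, p = 60, 40
u = np.ones(n) / np.sqrt(n)
v = np.ones(p) / np.sqrt(p)
X = 100.0 * np.outer(u, v) + rng.standard_normal((n, p))
selected = flippa_rank(X)
```

Signflipping destroys the low-rank signal but keeps each entry's magnitude, so the null matrices retain the entrywise noise profile of the data. A row or column permutation would not preserve that profile.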

2020-12-05T09:29:21Z 54 pages, 22 figures David Hong Yue Sheng Edgar Dobriban http://arxiv.org/abs/2411.18502v2 Isometry pursuit 2026-05-27T16:03:14Z

Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identify isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For problems involving coordinate selection and diversification, it offers a synergistic alternative to greedy and brute force search.

2024-11-27T16:43:13Z Samson Koelle Marina Meila http://arxiv.org/abs/2605.28653v1 Adaptive clinical trials based on design-optimal e-values with automatic curtailment: An application to single-arm trials with binary data 2026-05-27T15:54:32Z

The e-value is gaining traction as a robust alternative to p-values and Bayes factors for quantifying statistical evidence. e-values are a promising method for adaptive clinical trials due to their anytime-validity: e-values ensure type I error rate control at any stopping time, facilitating repeated interim analyses, complex stopping rules, and valid inference under protocol deviations. The e-value literature focuses mostly on asymptotic optimality; however, sample sizes in clinical trials are often limited. To this end, we investigate e-value-based designs with finite-horizon optimality for single-arm multi-stage clinical trials with binary data. This setting is relevant in early-phase cancer trials, but it also facilitates an accessible introduction to the betting interpretation of e-values, which we use to construct e-values that either (1) maximize statistical power, or (2) minimize the expected sample size, with or without constraints on the minimum power. We construct these designs through (constrained) dynamic programming based on the currently observed e-value, the maximum sample size, and the pre-specified significance level. Using exact calculations, we show that, in addition to being robust, e-value-based designs can provide competitive operating characteristics to standard (non-)adaptive designs with and without futility stopping and outperform growth-rate-optimal e-values in finite samples. In addition, small e-values automatically indicate that trial continuation is futile, e.g., an e-value of zero indicates the impossibility of an efficacy conclusion. Hence, e-value-based designs provide a viable alternative to the current state-of-the-art in single-arm binary trials, warranting extension to other adaptive clinical trial settings such as multi-arm multi-stage and response-adaptive designs.
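
The betting interpretation can be illustrated with the simplest e-value for binary responses: a running likelihood ratio of a fixed alternative response rate `p1` against the null rate `p0`. This is a textbook-style sketch, not the design-optimal construction from the paper, which uses dynamic programming. The function names and the numbers in the example are hypothetical.

```python
def e_value(outcomes, p0, p1):
    """Likelihood-ratio e-value for H0: response rate p0, betting on alternative p1."""
    e = 1.0
    for x in outcomes:
        # Each factor has expectation 1 under H0, so the product is a fair bet.
        e *= (p1 / p0) if x == 1 else ((1 - p1) / (1 - p0))
    return e


def first_rejection(outcomes, p0, p1, alpha=0.05):
    """Return the 1-based patient index at which the running e-value first
    reaches 1/alpha, or None if it never does."""
    e = 1.0
    for i, x in enumerate(outcomes, start=1):
        e *= (p1 / p0) if x == 1 else ((1 - p1) / (1 - p0))
        if e >= 1 / alpha:
            return i
    return None


# Four consecutive responders with null rate 0.2 and alternative rate 0.5.
stop_at = first_rejection([1, 1, 1, 1], p0=0.2, p1=0.5)
```

Because the running product is a nonnegative martingale under the null, rejecting the first time it reaches `1/alpha` keeps the type I error at most `alpha` at any stopping time. This is the anytime-validity property the abstract refers to.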

2026-05-27T15:54:32Z 19 pages, 4 figures, 1 table Stef Baas Judith ter Schure Joost van Rosmalen http://arxiv.org/abs/2506.08928v2 Local MDI+: Local Feature Importances for Tree-Based Models 2026-05-27T15:47:04Z

Tree-based ensembles such as random forests remain the go-to for tabular data over deep learning models due to their prediction performance and computational efficiency. These advantages have led to their widespread deployment in high-stakes domains, where interpretability is essential for ensuring trustworthy predictions. This has motivated the development of popular local feature importance methods such as LIME and TreeSHAP. However, these approaches rely on approximations that ignore the model's internal structure and instead depend on potentially unstable perturbations. These issues are addressed in the global setting by MDI+, a global feature importance method which combines tree-based and linear feature importances by exploiting an equivalence between decision trees and least squares on a transformed node basis. However, the global MDI+ scores are not able to explain predictions when faced with heterogeneous individual characteristics. To address this gap, we propose Local MDI+ (LMDI+), a novel extension of the MDI+ framework that quantifies feature importances for each particular sample. Across twelve real-world benchmark datasets, LMDI+ outperforms existing baselines at identifying instance-specific predictive features, yielding an average 10% improvement in predictive performance when using only the selected features. It further demonstrates greater stability by consistently producing similar instance-level feature importance rankings across repeated model fits with different random seeds. Ablation experiments show that each component of LMDI+ contributes to these gains, and that the improvements extend beyond random forests to gradient boosting models. Finally, we show that LMDI+ enables local interpretability use cases by identifying closely matched counterfactuals for each classification benchmark and discovering homogeneous subgroups in a housing dataset case study.

2025-06-10T15:51:27Z Zhongyuan Liang Zachary T. Rewolinski Abhineet Agarwal Tiffany M. Tang Bin Yu http://arxiv.org/abs/2505.09861v3 LiDDA: Data Driven Attribution at LinkedIn 2026-05-27T15:34:36Z

Data Driven Attribution, which assigns conversion credits to marketing interactions based on causal patterns learned from data, is the foundation of modern marketing intelligence and vital to any marketing business and advertising platform. In this paper, we introduce a unified transformer-based attribution approach that can handle member-level data, aggregate-level data, and integration of external macro factors. We detail the large scale implementation of the approach at LinkedIn, showcasing significant impact. We also share learnings and insights which are broadly applicable to the marketing and ad tech fields.

2025-05-14T23:54:57Z John Bencina Erkut Aykutlug Yue Chen Zerui Zhang Stephanie Sorenson Shao Tang Changshuai Wei http://arxiv.org/abs/2605.28559v1 Sequential generalized kernel equating: Providing comparable scores across multiple test forms with nonequivalent groups and differently measured covariates 2026-05-27T14:48:45Z

Test equating using covariates may be applied to provide comparable scores from multiple test forms when no anchor items are available. However, its performance may be compromised if some of the covariates themselves are measured using different test forms. In this work, we propose sequential generalized kernel equating to account for possible differences in the distribution of covariates used in the nonequivalent groups with covariates (NEC) design. We evaluate the proposed approach through a simulation study within the kernel equating framework. Results indicate that equating the covariate reduces bias in equated test scores, particularly when the covariate distributions differ and the correlation between the covariate and the test score is strong. A real data example from a national high school leaving examination further demonstrates the practical application.

2026-05-27T14:48:45Z Michaela Vařejková Institute of Computer Science of the Czech Academy of Sciences, Prague, Czech Republic Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic Patrícia Martinková Institute of Computer Science of the Czech Academy of Sciences, Prague, Czech Republic Faculty of Education, Charles University, Prague, Czech Republic Eva Potužníková Institute of Computer Science of the Czech Academy of Sciences, Prague, Czech Republic Faculty of Education, Charles University, Prague, Czech Republic