https://arxiv.org/api/IJTtNv65g7n5/mgMXGPDvM0d6UA 2026-06-11T01:17:02Z 36141 390 15 http://arxiv.org/abs/2501.18798v3 Targeted Data Fusion for Region-Specific Survival Effects in the AMP HIV Prevention Trials 2026-05-30T00:48:46Z

The Antibody Mediated Prevention (AMP) trials opened a new scientific frontier by showing that passively administered monoclonal broadly neutralizing antibodies (bnAbs) could prevent HIV-1 acquisition. Conducted across multiple geographic regions, including the United States, Brazil, Peru, Switzerland, and sub-Saharan Africa, the AMP trials revealed substantial regional heterogeneity in treatment efficacy. These differences, together with privacy and regulatory limits on central data pooling, call for methods that borrow strength across regions without sharing individual-level data. To estimate region- and treatment-specific survival curves under distributional heterogeneity, we develop a federated learning approach that combines site-specific estimators via an L1-regularized criterion that downweights data sources not aligned with the target. We further extend the framework to a general class of causal contrasts, including the risk difference (RD), survival ratio (SR), and restricted mean survival time (RMST) difference. Through extensive simulations and an analysis of the AMP trials under different target populations, we show that the proposed approach provides privacy-preserving, region-adaptive inference with improved precision.
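
As a hedged illustration of the three contrasts named above, the sketch below computes the RD, SR, and RMST difference from two step-function survival curves. The curve values and the helper names `surv_at` and `rmst` are hypothetical and chosen only for illustration; they are not AMP trial data and do not implement the federated estimator described in the abstract.

```python
from bisect import bisect_right

def surv_at(times, surv, t):
    """Value of a right-continuous step survival curve at time t (S = 1 before the first jump)."""
    k = bisect_right(times, t)
    return 1.0 if k == 0 else surv[k - 1]

def rmst(times, surv, tau):
    """Restricted mean survival time: area under the step curve on [0, tau]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(times, surv):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)

# Hypothetical curves: jump times and the survival probability just after each jump
times = [1.0, 2.0, 3.0]
s_treat = [0.9, 0.8, 0.7]
s_ctrl = [0.8, 0.6, 0.5]
tau = 4.0

# Risk difference: difference in cumulative incidence at tau (treatment minus control)
risk_difference = (1 - surv_at(times, s_treat, tau)) - (1 - surv_at(times, s_ctrl, tau))
# Survival ratio: ratio of survival probabilities at tau
survival_ratio = surv_at(times, s_treat, tau) / surv_at(times, s_ctrl, tau)
# RMST difference: difference in areas under the two curves up to tau
rmst_difference = rmst(times, s_treat, tau) - rmst(times, s_ctrl, tau)
```

With these made-up curves, a negative RD, an SR above one, and a positive RMST difference all point in the same direction (better survival under treatment), but each summarizes the curves on a different scale.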

2025-01-30T23:21:25Z Yi Liu Alexander W. Levis Ke Zhu Shu Yang Peter B. Gilbert Larry Han http://arxiv.org/abs/2605.20615v2 Evaluating causal indirect effects when mediators are left-censored by assay limit of quantification 2026-05-30T00:36:49Z

Causal mediation analysis is essential for disentangling the mechanisms by which investigational therapeutic and preventive agents impact clinical outcomes. However, the measurement of biological mediators is often subject to left-censoring by technical measurement limitations, most commonly an assay's limit of quantification. This form of censoring can pose severe challenges for both identification and estimation of causal mediation estimands, particularly when the censoring mechanism is deterministic and the resulting missingness is missing not at random (MNAR) or nonignorable. Motivated by the question of assessing the role of viral RNA in the action mechanism of monoclonal antibody therapies for COVID-19 in the Accelerating COVID-19 Therapeutics and Vaccine (ACTIV)-2 platform trial, we develop a semi-parametric framework for estimation of the natural direct and indirect effects when the mediator of interest is partially subject to this form of left-censoring. Our proposed strategy combines fractional imputation with a semi-parametric EM algorithm to flexibly estimate key components of the factorized data likelihood. Applying the proposed strategy to circumvent the left-censoring, we discuss both traditional plug-in and asymptotically efficient estimators of the direct and indirect effect estimands, introducing a data-adaptive $m$-out-of-$n$ bootstrap for robust inference under the imputation procedure. We demonstrate in numerical experiments that our approach significantly reduces bias and allows for reliable inference. An application to data from the ACTIV-2 platform trial confirms that monoclonal antibody therapies reduce the risk of hospitalization and death due to COVID-19, while suggesting that changes in viral RNA mediate only a modest proportion of the overall treatment effect.

2026-05-20T02:02:48Z Cong Jiang Michael D. Hughes Nima S. Hejazi http://arxiv.org/abs/2606.00436v1 Weighted Conformal Clustering 2026-05-29T23:58:56Z

Clustering is a central tool for discovering latent structure in unlabeled data; yet modern clustering pipelines often end with a hard assignment of each observation to a cluster without rigorous measures of assignment uncertainty. We propose a novel weighted conformal approach for constructing valid confidence sets for cluster labels. The key difficulty is that the labels available for calibration are not observed ground-truth labels, but synthetic labels produced by a data-dependent clustering algorithm. Our method develops a conformal inference algorithm that corrects the resulting mismatch with the latent target labels through weights by formulating conformal clustering as a conditional label-distribution shift problem. We first derive an oracle procedure that attains finite-sample marginal coverage and then develop a computationally tractable and implementable version using estimated conditional label probabilities and novel augmented calibration. We show that the coverage of the estimated-weight procedure depends on the estimator, giving an explicit bound on the loss relative to the nominal level. Empirical studies demonstrate that the proposed weighted approach offers improvements over the recently proposed split conformal clustering procedure in terms of informative confidence set size, especially in nonlinear and high-dimensional clustering applications.

2026-05-29T23:58:56Z Anirban Nath YoonHaeng Hur Genevera I. Allen http://arxiv.org/abs/2606.00425v1 Empirical Likelihood with Generative AI 2026-05-29T23:21:50Z

Moment conditions are widely used to identify parameters in models where the full likelihood is either unknown or intentionally left unspecified. Empirical likelihood methods address this problem by assigning probability weights to the observed data so that the sample moment conditions hold exactly. Building on this idea, we propose a nonparametric Bayesian framework based on exponentially tilted empirical likelihood. This Bayesian formulation is particularly appealing in settings where prior information is more naturally specified on the observables rather than on the underlying parameters. Such settings arise in the presence of auxiliary data sources or synthetic data generated by modern generative AI models. Inference proceeds by projecting posterior draws from a Dirichlet process onto the moment-restricted model, yielding a computationally efficient procedure that is naturally amenable to parallelization. We establish new Bernstein--von Mises and consistency theorems for the resulting projection posterior under both vanishing-prior and persistent-prior regimes. In an application to return prediction using overnight news headlines, we show that AI-generated auxiliary data can provide a useful source of indirect regularization when informative priors on the parameter itself are unavailable.

2026-05-29T23:21:50Z Jiguang Li Sid Kankanala Veronika Rockova http://arxiv.org/abs/2110.11074v3 A Unified Framework for Regularized Estimating Equations via Fixed-Point and Variational Inequality Problems 2026-05-29T22:53:28Z

Many statistical problems are formulated within an estimating equation framework instead of a minimization framework. However, regularized estimating equations (REE) have been much less extensively studied than regularized minimization problems. In this paper, we study an improved regularized estimating equation formulation and explore its subsequent equivalences in terms of (1) a fixed-point problem specified via the proximal operator of the corresponding regularizer, and (2) generalized variational inequality problems. Such equivalences hold under general conditions and accommodate nonconvex regularizers. Moreover, these equivalences open up new possibilities in theoretical analysis and computational algorithms when studying the REE.
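
To make the fixed-point view concrete, here is a minimal sketch under assumptions that are not taken from the paper: the regularizer is the L1 penalty (whose proximal operator is soft-thresholding), the regularized equation is written as 0 ∈ U(θ) + λ∂|θ|, and the estimating function is the toy choice U(θ) = θ − b. The function names and the step size are hypothetical.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.| (the L1 penalty), applied to a scalar."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def solve_ree_fixed_point(score, theta0, lam, step=0.5, iters=200):
    """Iterate theta <- prox_{step*lam}(theta - step * score(theta)) coordinate-wise."""
    theta = list(theta0)
    for _ in range(iters):
        u = score(theta)
        theta = [soft_threshold(th - step * ui, step * lam) for th, ui in zip(theta, u)]
    return theta

# Toy estimating function U(theta) = theta - b (an assumption for illustration only)
b = [3.0, 0.5, -2.0]

def score(theta):
    return [th - bi for th, bi in zip(theta, b)]

theta_hat = solve_ree_fixed_point(score, [0.0, 0.0, 0.0], lam=1.0)
```

For this toy U, the solution of the regularized equation is the soft-thresholded vector, so `theta_hat` approaches `[2.0, 0.0, -1.0]`; any fixed point of the iteration satisfies the regularized equation regardless of the step size, which is the equivalence the abstract refers to in its simplest convex form.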

2021-10-21T12:28:23Z Archer Y. Yang Yue Zhao Yi Lian Yuwen Gu Jun Fan http://arxiv.org/abs/2606.00402v1 A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering 2026-05-29T22:37:13Z

We propose a distribution-free statistical framework that converts arbitrary rewrite-based detectors into detectors with finite-sample FDR guarantees without retraining. Our key observation is that rewrite-based detection implicitly constructs knockoff samples, enabling LLM-generated text detection to be formulated as a multiple hypothesis testing problem with knockoff structure. This perspective separates the design of detection statistics from the control of false discoveries, allowing existing rewrite detectors to inherit finite-sample false discovery rate (FDR) guarantees through a simple calibration procedure. We demonstrate reliable FDR control with meaningful detection power across three detection models, 19 domains, and four LLMs.
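
The calibration step can be pictured with the generic knockoff+ selection rule, sketched below. This is a hedged illustration: the abstract does not specify how the statistics are built from a rewrite-based detector, so the signed statistics `w` are hypothetical numbers, with large positive values taken as evidence for a discovery, and the function name is an assumption.

```python
def knockoff_plus_select(w, q):
    """Return the knockoff+ threshold T at target FDR level q and the indices with W_j >= T."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy count of false discoveries
        positives = sum(1 for x in w if x >= t)   # number of selections at threshold t
        if (1 + negatives) / max(1, positives) <= q:
            return t, [j for j, x in enumerate(w) if x >= t]
    return float("inf"), []  # no threshold meets the target level: select nothing

# Hypothetical signed statistics, one per text sample
w = [5.0, 4.0, 3.5, 3.0, -1.0, 2.5, 2.0, -0.5, 1.5, 1.0]
threshold, selected = knockoff_plus_select(w, q=0.2)
```

The rule uses only the signs and magnitudes of the statistics, which is why the choice of detection statistic can be separated from the control of false discoveries.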

2026-05-29T22:37:13Z Yi Liu http://arxiv.org/abs/2606.00346v1 Network knockoffs: controlling false discovery in dyadic space 2026-05-29T20:36:56Z

Phenomena such as epidemiological processes, hydrologic systems, social platforms, utility services, and supply chains can be represented as topological networks. A central question about these networks concerns connectivity and the permeability of edges. Dyadic regression and related approaches have been proposed to identify network features associated with pairwise node-level differences. In high-dimensional settings, it is important to control the number of spuriously selected features. However, controlling the false discovery rate for dyadic outcomes is challenging because dependence among dyads invalidates classic asymptotic procedures and complicates standard data splitting and knockoff approaches. We propose a novel knockoff variable selection procedure that simulates synthetic features directly on the topological network prior to constructing the augmented design matrix in dyadic space. Empirically, our method controls the false discovery rate for both node- and edge-level features. The Benjamini-Hochberg, Benjamini-Yekutieli, Storey Q-value, data-splitting, and standard knockoff procedures were all anticonservative. We applied our network knockoffs to assess the impassability of over 1000 stream barriers in North Carolina for Salvelinus fontinalis. Compared to data splitting and traditional knockoff approaches, our proposed approach selected a higher proportion of barriers previously assessed to impede fish movement.

2026-05-29T20:36:56Z 20 pages, 6 figures Justin Van Ee Yoichiro Kanno Jacob Rash Mevin Hooten http://arxiv.org/abs/2606.00327v1 Cluster Analysis with Resampling for Validation and Exploration (CARVE) 2026-05-29T20:09:20Z

Clustering is widely used across the sciences as the foundation for downstream data-driven scientific discoveries. However, clustering results are highly sensitive to the choice of algorithm, preprocessing, and the number of clusters $k$, producing scientific claims that are often not reproducible. The current state of the art for validating clustering solutions consists of clustering validation indices (CVIs) such as Silhouette, Davies-Bouldin, and Calinski-Harabasz, which rely on geometric assumptions that break down on the heavy-tailed, high-dimensional, and nonlinearly structured data encountered in biomedical research. Resampling-based alternatives, grounded in the ideas of clustering stability and generalizability, have been proposed but remain scattered across specialized tools with no unified, accessible software. We fill this gap with CARVE (Cluster Analysis with Resampling for Validation and Exploration), an open-source Python and R package that jointly evaluates multiple clustering algorithms and hyperparameters, returning stability and generalizability diagnostics at the global, cluster, and sample level together with principled selection rules and consensus-based cluster labels. Across six synthetic benchmarks, CARVE consistently recovers near-optimal clusterings where classical indices degrade substantially. On experimental genomics and proteomics data sets, CARVE recovers finer biological structure when classical CVIs collapse entirely. CARVE is available with a scikit-learn-compatible Python API and an analogous R interface compatible with Seurat workflows.

2026-05-29T20:09:20Z Kai R. Wycik Tiffany M. Tang Tarek M. Zikry Genevera I. Allen http://arxiv.org/abs/2601.21696v2 Independent Component Discovery in Temporal Count Data 2026-05-29T19:37:42Z

Advances in data collection are producing growing volumes of temporal count observations, making tailored modeling increasingly necessary. In this work, we introduce a generative framework for independent component analysis of temporal count data, combining regime-adaptive dynamics with Poisson log-normal emissions. The model identifies disentangled components with regime-dependent contributions, enabling representation learning and perturbation analysis. Notably, we establish the identifiability of the model, supporting principled interpretation. To learn the parameters, we propose an efficient amortized variational inference procedure. Experiments on simulated data evaluate recovery of the mixing function and latent sources across diverse settings, while real-world applications to gut microbiome and climate datasets reveal co-variation patterns and regime shifts consistent with domain-specific knowledge.

2026-01-29T13:30:10Z 9 pages, 7 figures, Appendix provided Alexandre Chaussard Anna Bonnet Sylvain Le Corff http://arxiv.org/abs/2504.06108v3 Causal inference in connected populations with contagion 2026-05-29T19:36:27Z

Causal inference in connected populations is complicated by contagion and other real-world processes inducing dependence among outcomes. We address a gap in the literature on causal inference under contagion: while there is a growing body of work on estimating causal effects under contagion, little is known about how contagion impacts causal effects and inference. We provide insight into how contagion impacts causal effects and inference based on closed-form expressions for causal effects under contagion. These closed-form expressions reveal that the effects of interventions, spillover, and contagion are intertwined even in the simplest possible settings, and that contagion can decrease or increase causal effects. We discuss statistical implications, including asymptotic bias of model-based estimators ignoring dependence among outcomes due to contagion, violations of neighborhood exposure assumptions underlying design-based estimators by unrestricted contagion, and possible remedies.

2025-04-08T14:55:34Z Subhankar Bhadra Michael Schweinberger http://arxiv.org/abs/2501.02409v6 Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations 2026-05-29T19:29:00Z

Modern high-throughput biological datasets containing thousands of perturbations enable large-scale discovery of causal graphs that represent regulatory interactions between genes. Differentiable causal graphical models and regression-based methods have been developed to infer gene regulatory networks (GRNs) from interventional datasets. However, existing approaches fail to capture the non-linear dynamics of biological processes such as cellular differentiation. To address this limitation, we propose PerturbODE, a novel framework that employs interpretable neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the underlying causal GRN from the neural ODE parameters, enabling downstream simulation of unseen genetic interventions. The GRN is encoded via a single-hidden-layer feedforward network, implicitly grouping genes into interpretable co-regulated modules. We demonstrate PerturbODE's efficacy in GRN inference and extension to perturbation response prediction across both simulated and real overexpression datasets.

2025-01-05T01:04:23Z Zaikang Lin Sei Chang Aaron Zweig Minseo Kang Fabian J. Theis Elham Azizi David A. Knowles http://arxiv.org/abs/2606.00293v1 Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo 2026-05-29T19:24:38Z

Tuning algorithms such as stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD) for approximate sampling and uncertainty quantification remains challenging, particularly in practically relevant settings where the batch size is large or the model is misspecified. Existing theory that provides tuning guidance relies on continuous-time limits or strong statistical assumptions, which can become quantitatively inaccurate in these regimes. We address these shortcomings by proposing new discrete-time approximations to SG(L)D with and without momentum, which enable accurate predictions of the stationary covariance, iterate average covariance, and integrated autocorrelation time. Moreover, we prove quantitative, non-asymptotic error bounds showing that these estimates are sufficiently accurate for practical tuning and uncertainty quantification. Numerical experiments demonstrate that our theory yields improved tuning guidance across a range of models and data-generating distributions where existing approaches fail, including when using the $β$-divergence rather than log-loss to obtain statistically robust inferences.
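
A one-dimensional toy case shows why a continuous-time limit can be quantitatively off at non-small step sizes. The sketch below is an illustration under stated assumptions, not the paper's approximation: SGD on a quadratic with curvature `a`, step size `h`, and additive zero-mean gradient noise of variance `sigma2` that does not depend on the iterate, with 0 < h·a < 2. All names are hypothetical.

```python
def stationary_variance_discrete(h, a, sigma2):
    """Exact stationary variance of theta <- theta - h * (a * theta + noise)."""
    return h * sigma2 / (a * (2.0 - h * a))

def stationary_variance_continuous(h, a, sigma2):
    """Small-step (continuous-time, Ornstein-Uhlenbeck) approximation of the same variance."""
    return h * sigma2 / (2.0 * a)

def iterate_variance(h, a, sigma2, iters=10_000):
    """Propagate the variance recursion v <- (1 - h a)^2 v + h^2 sigma2 to its fixed point."""
    v = 0.0
    for _ in range(iters):
        v = (1.0 - h * a) ** 2 * v + h ** 2 * sigma2
    return v

h, a, sigma2 = 0.5, 1.0, 1.0
exact = stationary_variance_discrete(h, a, sigma2)
approx = stationary_variance_continuous(h, a, sigma2)
recursed = iterate_variance(h, a, sigma2)
```

Here the discrete-time value is 1/3 while the continuous-time approximation gives 1/4, a 25% underestimate; the two agree only as `h * a` tends to zero.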

2026-05-29T19:24:38Z Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026 Yu Wang Jie Ding Jonathan H. Huggins http://arxiv.org/abs/2606.00231v1 On Asymptotic Outlier Rejection in Bayesian Mixed Poisson Regression Models Under Extreme Target and Covariate Values 2026-05-29T18:06:11Z

Bayesian models are said to be fully robust against outliers if, asymptotically, observations infinitely far from the other data do not influence the posterior. Early works in robust Bayesian inference concentrated on continuous distributions and i.i.d. observations. Robustness results were then extended to linear regression in the presence of infinite residuals, either through an outlying outcome or an outlying covariate. Recently, Hamura et al. (2025, arXiv:2106.10503) presented a count regression model, with Poisson-Rescaled Beta (-RSB) target distribution and Gaussian latent variables (GLVs), which is robust against infinitely large counts and able to handle zero-inflation. We continue from the work of Hamura et al. and study the robustness properties of mixed Poisson regression models with GLVs in the presence of outlying data points arising from either corrupted covariates or corrupted target values. While in linear regression the two cases are interchangeable, as both an infinite target and infinite covariates lead to infinite residuals, we show that in count regression the case of infinite covariates is not symmetric to that of an infinite target. Specifically, we show that mixed Poisson models are not asymptotically robust to outliers resulting from infinite covariates. We then consider three alternative mixed Poissons (Poisson-Gamma, Poisson-log-t, and Poisson-RSB) as target distributions and examine, both theoretically and via simulations as well as real-world case studies, their behavior in the presence of outliers of three alternative types: large target values as well as large and small covariate values. Our results show that models robust to data points with an anomalous target are not robust to data points with anomalous covariates, calling for methodological development of models that are robust to covariate outliers.

2026-05-29T18:06:11Z 42 pages, 8 figures Ilaria Pia Jarno Vanhatalo http://arxiv.org/abs/2605.31567v1 Addressing errors in multiple variables using generalized raking and cumulative probability models 2026-05-29T17:34:03Z

Routinely collected data, such as electronic health record (EHR) data, are frequently used for biomedical research, but these data are prone to errors, which can bias study findings. Validating data in subsamples of records can reduce bias, and the efficiency of estimates can be improved by incorporating in analyses both the error-prone data available on the entire cohort and the validated data available on the subsample. One approach to incorporating both data sources is generalized raking, which calibrates validation sampling weights using error-prone data from the entire cohort. Motivated by an EHR study of maternal weight gain during pregnancy with a validation subsample, we develop and illustrate generalized raking techniques for cumulative probability models (CPMs). CPMs are robust, rank-based, and semiparametric models for continuous, ordinal, or mixed-type outcome data. We develop efficient generalized raking estimators for CPMs, evaluate their performance relative to competing methods, and demonstrate the utility and strengths of generalized raking with CPMs in a study that examines factors associated with weight gain during pregnancy.

2026-05-29T17:34:03Z Eric S. Kawaguchi Chun Li Frank E. Harrell Pamela A. Shaw Thomas Lumley Bryan E. Shepherd http://arxiv.org/abs/2603.25971v2 Design-Based Anytime-Valid Inference for Randomized Experiments with Delayed Outcomes and Staggered Entry 2026-05-29T16:38:13Z

Delayed outcomes are ubiquitous in online experimentation: treatment can affect whether an outcome occurs, when it occurs, and its realized value. To accommodate staggered entry while remaining robust to environmental nonstationarity and unit-level heterogeneity, we adopt a design-based perspective and target the sample cumulative reward in each arm as a function of calendar time. Our confidence sequences allow practitioners to continuously monitor the counterfactual incremental reward, such as revenue, that would have been realized by calendar time $t$ had all entered units been assigned to treatment rather than control. The main technical challenge is the choice of design-based filtration, complicated by the presence of asynchronous potential outcome times. We show that the IPW treatment-effect estimation error is not a martingale with respect to any filtration, while each arm-specific IPW estimation error is a martingale with respect to a carefully chosen arm-specific event-time filtration. We therefore construct a confidence sequence for the treatment effect by combining two arm-level confidence sequences with a union bound, and further demonstrate that this can outperform the traditional design-based variance upper bound. Finally, we characterize the class of augmentations for which the per-arm AIPW estimation error remains a martingale.
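
The union-bound combination step can be sketched as follows. This is a minimal illustration with hypothetical interval values and function names; it assumes each arm-level interval at a given calendar time already has miscoverage at most α/2, and it says nothing about how the arm-level confidence sequences themselves are constructed.

```python
def combine_arm_intervals(treat_ci, control_ci):
    """Interval for (treatment - control) built from two arm-level intervals.

    If each arm interval misses its target with probability at most alpha/2,
    the union bound gives miscoverage at most alpha for the combined interval.
    """
    (lt, ut), (lc, uc) = treat_ci, control_ci
    return lt - uc, ut - lc

def combine_sequences(treat_seq, control_seq):
    """Apply the combination separately at every monitoring time."""
    return [combine_arm_intervals(t, c) for t, c in zip(treat_seq, control_seq)]

# Hypothetical arm-level intervals for cumulative reward at two calendar times
treat_seq = [(10.0, 14.0), (11.0, 13.0)]
control_seq = [(7.0, 9.0), (8.0, 10.0)]
effect_seq = combine_sequences(treat_seq, control_seq)
```

The intervals are combined time by time and not intersected across times, since the target described in the abstract (cumulative reward by calendar time $t$) itself changes with $t$.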

2026-03-26T23:23:34Z Michael Lindon Nathan Kallus