https://arxiv.org/api/e/CPsdpsIfLMqlo6Z1LWQ2ozmLM
Updated: 2026-06-10T23:06:58Z

http://arxiv.org/abs/2606.00402v1
A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering
Updated: 2026-05-29T22:37:13Z
We propose a distribution-free statistical framework that converts arbitrary rewrite-based detectors into detectors with finite-sample FDR guarantees without retraining. Our key observation is that rewrite-based detection implicitly constructs knockoff samples, enabling LLM-generated text detection to be formulated as a multiple hypothesis testing problem with knockoff structure. This perspective separates the design of detection statistics from the control of false discoveries, allowing existing rewrite detectors to inherit finite-sample false discovery rate (FDR) guarantees through a simple calibration procedure. We demonstrate reliable FDR control with meaningful detection power across three detection models, 19 domains, and four LLMs.
Published: 2026-05-29T22:37:13Z
Authors: Yi Liu

http://arxiv.org/abs/2606.00346v1
Network knockoffs: controlling false discovery in dyadic space
Updated: 2026-05-29T20:36:56Z
Phenomena such as epidemiological processes, hydrologic systems, social platforms, utility services, and supply chains can be represented as topological networks. A central question about these networks concerns connectivity and the permeability of edges. Dyadic regression and related approaches have been proposed to identify network features associated with pairwise node-level differences. In high-dimensional settings, it is important to control the number of spuriously selected features. However, controlling the false discovery rate for dyadic outcomes is challenging because dependence among dyads invalidates classic asymptotic procedures and complicates standard data splitting and knockoff approaches.
We propose a novel knockoff variable selection procedure that simulates synthetic features directly on the topological network prior to constructing the augmented design matrix in dyadic space. Empirically, our method controls the false discovery rate for both node- and edge-level features. The Benjamini-Hochberg, Benjamini-Yekutieli, Storey Q-value, data-splitting, and standard knockoff procedures were all anticonservative. We applied our network knockoffs to assess the impassability of over 1000 stream barriers in North Carolina for Salvelinus fontinalis. Compared to data splitting and traditional knockoff approaches, our proposed approach selected a higher proportion of barriers previously assessed to impede fish movement.
Published: 2026-05-29T20:36:56Z
20 pages, 6 figures
Authors: Justin Van Ee, Yoichiro Kanno, Jacob Rash, Mevin Hooten

http://arxiv.org/abs/2606.00327v1
Cluster Analysis with Resampling for Validation and Exploration (CARVE)
Updated: 2026-05-29T20:09:20Z
Clustering is widely used across the sciences as the foundation for downstream data-driven scientific discoveries. However, clustering results are highly sensitive to the choice of algorithm, preprocessing, and the number of clusters $k$, producing scientific claims that are often not reproducible. The current state of the art for validating clustering solutions consists of clustering validation indices (CVIs) such as Silhouette, Davies-Bouldin, and Calinski-Harabasz, which rely on geometric assumptions that break down on the heavy-tailed, high-dimensional, and nonlinearly structured data encountered in biomedical research. Resampling-based alternatives - grounded in the ideas of clustering stability and generalizability - have been proposed but remain scattered across specialized tools with no unified, accessible software.
We fill this gap with CARVE (Cluster Analysis with Resampling for Validation and Exploration), an open-source Python and R package that jointly evaluates multiple clustering algorithms and hyperparameters, returning stability and generalizability diagnostics at the global, cluster, and sample level together with principled selection rules and consensus-based cluster labels. Across six synthetic benchmarks CARVE consistently recovers near-optimal clusterings where classical indices degrade substantially. On experimental genomics and proteomics data sets, CARVE recovers finer biological structure when classical CVIs collapse entirely. CARVE is available with a scikit-learn-compatible Python API and an analogous R interface compatible with Seurat workflows.
Published: 2026-05-29T20:09:20Z
Authors: Kai R. Wycik, Tiffany M. Tang, Tarek M. Zikry, Genevera I. Allen

http://arxiv.org/abs/2601.21696v2
Independent Component Discovery in Temporal Count Data
Updated: 2026-05-29T19:37:42Z
Advances in data collection are producing growing volumes of temporal count observations, making adapted modeling increasingly necessary. In this work, we introduce a generative framework for independent component analysis of temporal count data, combining regime-adaptive dynamics with Poisson log-normal emissions. The model identifies disentangled components with regime-dependent contributions, enabling representation learning and perturbation analysis. Notably, we establish the identifiability of the model, supporting principled interpretation. To learn the parameters, we propose an efficient amortized variational inference procedure.
Experiments on simulated data evaluate recovery of the mixing function and latent sources across diverse settings, while real-world applications to gut microbiome and climate datasets reveal co-variation patterns and regime shifts consistent with domain-specific knowledge.
Published: 2026-01-29T13:30:10Z
9 pages, 7 figures, Appendix provided
Authors: Alexandre Chaussard, Anna Bonnet, Sylvain Le Corff

http://arxiv.org/abs/2504.06108v3
Causal inference in connected populations with contagion
Updated: 2026-05-29T19:36:27Z
Causal inference in connected populations is complicated by contagion and other real-world processes inducing dependence among outcomes. We address a gap in the literature on causal inference under contagion: while there is a growing body of work on estimating causal effects under contagion, little is known about how contagion impacts causal effects and inference. We provide insight into how contagion impacts causal effects and inference based on closed-form expressions for causal effects under contagion. These closed-form expressions reveal that the effects of interventions, spillover, and contagion are intertwined even in the simplest possible settings, and that contagion can decrease or increase causal effects. We discuss statistical implications, including asymptotic bias of model-based estimators ignoring dependence among outcomes due to contagion, violations of neighborhood exposure assumptions underlying design-based estimators by unrestricted contagion, and possible remedies.
Published: 2025-04-08T14:55:34Z
Authors: Subhankar Bhadra, Michael Schweinberger

http://arxiv.org/abs/2501.02409v6
Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations
Updated: 2026-05-29T19:29:00Z
Modern high-throughput biological datasets containing thousands of perturbations enable large-scale discovery of causal graphs that represent regulatory interactions between genes.
Differentiable causal graphical models and regression-based methods have been developed to infer gene regulatory networks (GRNs) from interventional datasets. However, existing approaches fail to capture the non-linear dynamics of biological processes such as cellular differentiation. To address this limitation, we propose PerturbODE, a novel framework that employs interpretable neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the underlying causal GRN from the neural ODE parameters, enabling downstream simulation of unseen genetic interventions. The GRN is encoded via a single-hidden-layer feedforward network, implicitly grouping genes into interpretable co-regulated modules. We demonstrate PerturbODE's efficacy in GRN inference and extension to perturbation response prediction across both simulated and real overexpression datasets.
Published: 2025-01-05T01:04:23Z
Authors: Zaikang Lin, Sei Chang, Aaron Zweig, Minseo Kang, Fabian J. Theis, Elham Azizi, David A. Knowles

http://arxiv.org/abs/2606.00293v1
Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo
Updated: 2026-05-29T19:24:38Z
Tuning algorithms such as stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD) for approximate sampling and uncertainty quantification remains challenging, particularly in the practically relevant settings when the batch size is large or the model is misspecified. Existing theory that provides tuning guidance relies on continuous-time limits or strong statistical assumptions, which can become quantitatively inaccurate in these regimes. We address these shortcomings by proposing new discrete-time approximations to SG(L)D with and without momentum, which enables accurate predictions of the stationary covariance, iterate average covariance, and integrated autocorrelation time.
Moreover, we prove quantitative, non-asymptotic error bounds showing that these estimates are sufficiently accurate for practical tuning and uncertainty quantification. Numerical experiments demonstrate that our theory yields improved tuning guidance across a range of models and data-generating distributions where existing approaches fail, including when using the $β$-divergence rather than log-loss to obtain statistically robust inferences.
Published: 2026-05-29T19:24:38Z
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
Authors: Yu Wang, Jie Ding, Jonathan H. Huggins

http://arxiv.org/abs/2606.00231v1
On Asymptotic Outlier Rejection in Bayesian Mixed Poisson Regression Models Under Extreme Target and Covariate Values
Updated: 2026-05-29T18:06:11Z
Bayesian models are claimed to be fully robust against outliers if, asymptotically, observations infinitely far from the other data do not influence the posterior. Early works in robust Bayesian inference concentrated on continuous distributions and i.i.d. observations. Robustness results were then extended to linear regression in the presence of infinite residuals, either through an outlying outcome or an outlying covariate. Recently, Hamura et al. (2025, arXiv:2106.10503) presented a count regression model, with Poisson-Rescaled Beta (-RSB) target distribution and Gaussian latent variables (GLVs), which is robust against infinitely large counts and able to handle zero-inflation. We continue from the work of Hamura et al. and study the robustness properties of mixed Poisson regression models with GLVs in the presence of outlying data points arising from either corrupted covariates or corrupted target values. While in linear regression the two cases are interchangeable, as both an infinite target and infinite covariates lead to infinite residuals, we show that in count regression the case of infinite covariates is not symmetric to the case of an infinite target.
Specifically, we show that mixed Poisson models are not asymptotically robust to outliers resulting from infinite covariates. We then consider three alternative mixed Poissons (Poisson-Gamma, Poisson-log-t, and Poisson-RSB) as target distribution and examine, both theoretically and via simulations as well as real-world case studies, their behavior in the presence of outliers of three alternative types: large target value as well as large and small covariate values. Our results show that models robust to data points with an anomalous target are not robust to data points with anomalous covariates, calling for methodological development for models that are robust for covariate outliers.
Published: 2026-05-29T18:06:11Z
42 pages, 8 figures
Authors: Ilaria Pia, Jarno Vanhatalo

http://arxiv.org/abs/2605.31567v1
Addressing errors in multiple variables using generalized raking and cumulative probability models
Updated: 2026-05-29T17:34:03Z
Routinely collected data, such as electronic health record (EHR) data, are frequently used for biomedical research, but these data are prone to errors, which can bias study findings. Validating data in subsamples of records can reduce bias, and the efficiency of estimates can be improved by incorporating in analyses both the error-prone data available on the entire cohort and the validated data available on the subsample. One approach to incorporate both data sources is with generalized raking, which calibrates validation sampling weights using error-prone data from the entire cohort. Motivated by an EHR study of maternal weight gain during pregnancy with a validation subsample, we develop and illustrate generalized raking techniques for cumulative probability models (CPMs). CPMs are robust, rank-based and semiparametric models for continuous, ordinal, or mixed type outcome data.
We develop efficient generalized raking estimators for CPMs, evaluate their performance relative to competing methods, and demonstrate the utility and strengths of generalized raking with CPMs in a study that examines factors associated with weight gain during pregnancy.
Published: 2026-05-29T17:34:03Z
Authors: Eric S. Kawaguchi, Chun Li, Frank E. Harrell, Pamela A. Shaw, Thomas Lumley, Bryan E. Shepherd

http://arxiv.org/abs/2603.25971v2
Design-Based Anytime-Valid Inference for Randomized Experiments with Delayed Outcomes and Staggered Entry
Updated: 2026-05-29T16:38:13Z
Delayed outcomes are ubiquitous in online experimentation: treatment can affect whether an outcome occurs, when it occurs, and its realized value. To accommodate staggered entry while remaining robust to environmental nonstationarity and unit-level heterogeneity, we adopt a design-based perspective and target the sample cumulative reward in each arm as a function of calendar time. Our confidence sequences allow practitioners to continuously monitor the counterfactual incremental reward, such as revenue, that would have been realized by calendar time $t$ had all entered units been assigned to treatment rather than control. The main technical challenge is the choice of design-based filtration, complicated by the presence of asynchronous potential outcome times. We show that the IPW treatment-effect estimation error is not a martingale with respect to any filtration, while each arm-specific IPW estimation error is a martingale with respect to a carefully chosen arm-specific event-time filtration. We therefore construct a confidence sequence for the treatment effect by combining two arm-level confidence sequences with a union bound, and further demonstrate that this can outperform the traditional design-based variance upper bound.
Finally, we characterize the class of augmentations for which the per-arm AIPW estimation error remains a martingale.
Published: 2026-03-26T23:23:34Z
Authors: Michael Lindon, Nathan Kallus

http://arxiv.org/abs/2504.21688v4
Assessing Racial Disparities in Healthcare Expenditures via Mediator Distribution Shifts
Updated: 2026-05-29T15:54:34Z
Racial disparities in healthcare expenditures are well-documented, yet the underlying drivers remain complex. This study develops a framework to decompose such disparities through shifts in the distributions of mediating variables, rather than treating race itself as a manipulable exposure. We define disparities as differences in covariate-adjusted outcome distributions across racial groups, and decompose the total disparity into a component attributable to differences in mediator distributions, and a residual component that remains after equalizing those distributions. Using data from the Medical Expenditure Panel Survey (MEPS), we examine the extent to which expenditure disparities would persist or be reduced if mediators such as socioeconomic status (SES), insurance access, health behaviors, or health status were equalized across racial groups. To ensure valid inference, we derive asymptotically linear estimators based on influence-function techniques and flexible machine learning, including super learners and a two-part model designed for the zero-inflated, right-skewed nature of expenditure data.
Applying this framework to MEPS data from 2009 and 2016, we observed substantial disparities across all pairwise racial comparisons, with the largest gaps between non-Hispanic Whites and Hispanics in both years. Differences in SES and health status were the largest contributors to these disparities, with insurance access also playing a meaningful role, particularly for Hispanic populations, whereas health behaviors contributed minimally. Residual disparities persisted, especially in comparisons involving non-Hispanic Whites, suggesting the influence of unmeasured or structural factors.
Published: 2025-04-30T14:23:50Z
Statistics in Medicine, 45(13-14), e70606, 2026
Authors: Xiaxian Ou, Xinwei He, David Benkeser, Razieh Nabi
DOI: 10.1002/sim.70606

http://arxiv.org/abs/2508.11131v3
Estimating Effects of Longitudinal Modified Treatment Policies (LMTPs) on Rates of Change in Health Outcomes
Updated: 2026-05-29T15:44:13Z
Longitudinal data often contain outcomes measured at multiple visits, and scientific interest may lie in quantifying the effect of an intervention on an outcome's rate of change. For example, one may wish to study the progression (or trajectory) of a disease over time under different hypothetical interventions. We extend the longitudinal modified treatment policy (LMTP) methodology to estimate effects of complex, exposure-dependent interventions on rates of change in an outcome over time. We exploit the theoretical properties of a nonparametric efficient influence function (EIF)-based estimator to introduce a novel inference framework that can be used to construct simultaneous confidence intervals for a variety of causal effects of interest and to formally test relevant global and local hypotheses about rates of change. We demonstrate the utility of our framework in investigating whether a longitudinal shift intervention affects an outcome's counterfactual trajectory, as compared with no intervention.
We present results from a simulation study to illustrate the performance of our inference framework in a longitudinal setting with time-varying confounding and a continuous exposure. We also apply our inference framework to the Columbia Brain Health DataBank (CBDB) to examine the effect of shifting blood pressure on the progression of dementia.
Published: 2025-08-15T00:40:02Z
Authors: Anja Shahu, Weijie Xia, Ying Wei, Daniel Malinsky

http://arxiv.org/abs/2407.08485v3
Logistic lasso regression with nearest neighbors for gradient-based dimension reduction
Updated: 2026-05-29T15:41:49Z
This paper investigates a new approach to estimate the gradient of the conditional probability given the covariates in the binary classification framework. The proposed approach consists of fitting a localized nearest-neighbor logistic model with $\ell_1$-penalty in order to cope with possibly high-dimensional covariates. Our theoretical analysis shows that the pointwise convergence rate of the gradient estimator is optimal under very mild assumptions. Moreover, using an outer product of such gradient estimates at several points in the covariate space, we provide a new method for estimating the central subspace, a well-known object that enables dimension reduction within the covariate space. Our implementation uses cross-validation on the misclassification rate to estimate the dimension of this subspace.
We find that the proposed approach outperforms existing competitors in synthetic and real data applications.
Published: 2024-07-11T13:19:15Z
27 pages of the main document and 26 pages of supplement, 5 figures of the main document and 912 figures of the supplement, and 1 table
Authors: Touqeer Ahmad, François Portier, Gilles Stupfler

http://arxiv.org/abs/2605.31443v1
Modeling Covariate Transition for Efficient Estimation of Longitudinal Treatment Effects in Randomized Experiments
Updated: 2026-05-29T15:40:07Z
We present a regression-adjustment framework designed for the estimation of longitudinal treatment effects in randomized experiments under static regimes. While regression-adjustment methods are useful for variance reduction in randomized experiments by using pre-treatment covariates, they usually focus only on average effects, from which we cannot obtain valuable insights into when the effects appear and how long they continue. To address this issue, we consider intermediate outcomes and evolving post-treatment covariates over time, and we represent such dynamic trajectories using transition kernels. Furthermore, we establish the asymptotic normality and the semiparametric efficiency bound for our estimator, enabling more powerful statistical inference. Simulation studies and empirical analysis using A/B test data from a streaming platform in Japan show the practical advantages of our method.
Published: 2026-05-29T15:40:07Z
Accepted by ICML'26
The 43rd International Conference on Machine Learning, 2026
Authors: Naoki Chihara, Tatsushi Oka, Yasuko Matsubara, Yasushi Sakurai, Shota Yasui

http://arxiv.org/abs/2605.31440v1
Synthetic Data Generation With Incomplete Survey Data Under Informative Sampling
Updated: 2026-05-29T15:36:39Z
We propose a Bayesian framework for data synthesis and imputation in complex survey settings with informative sampling. To address variance underestimation in existing Bayesian approaches and to accommodate the missing data encountered in survey data, we introduce an adaptive weighting scheme for parameter estimation.
We show that the proposed weighting yields consistent estimators with an asymptotically valid Godambe information matrix. The framework is flexible, accommodating a broad class of Bayesian models and facilitating practical implementation. Simulation studies demonstrate that the proposed method provides accurate uncertainty quantification for both model parameters and synthetic population inference.
Published: 2026-05-29T15:36:39Z
Authors: Ayat Almomani, Won Chang, Youngdeok Hwang, Young Min Kim, Hang J. Kim
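
The last abstract above turns on informative sampling, where a unit's chance of being sampled depends on its outcome. The Python sketch below is only an illustration of that idea and is not the authors' Bayesian procedure or their adaptive weighting scheme. It simulates a population, samples it with outcome-dependent inclusion probabilities, and compares the unweighted sample mean with an inverse-probability-weighted (Hájek) mean. The function names `hajek_mean` and `simulate`, the inclusion probabilities 0.05 and 0.30, and the standard normal population are all assumptions made for the example.

```python
import random


def hajek_mean(values, inclusion_probs):
    """Weighted mean using weights 1 / inclusion probability (Hajek estimator)."""
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def simulate(seed=0, population_size=20000):
    """Return (population mean, unweighted sample mean, weighted sample mean)."""
    rng = random.Random(seed)
    # Hypothetical population: standard normal outcomes.
    population = [rng.gauss(0.0, 1.0) for _ in range(population_size)]
    # Informative sampling (assumed design): units with a positive outcome
    # are included with probability 0.30, all others with probability 0.05.
    probs = [0.30 if y > 0 else 0.05 for y in population]
    sample = [(y, p) for y, p in zip(population, probs) if rng.random() < p]
    ys = [y for y, _ in sample]
    ps = [p for _, p in sample]
    true_mean = sum(population) / population_size
    naive = sum(ys) / len(ys)      # ignores the sampling design
    weighted = hajek_mean(ys, ps)  # accounts for the sampling design
    return true_mean, naive, weighted


if __name__ == "__main__":
    true_mean, naive, weighted = simulate()
    print(f"population mean: {true_mean:.3f}")
    print(f"unweighted mean: {naive:.3f}")
    print(f"weighted mean:   {weighted:.3f}")
```

In this setup the unweighted mean is pulled upward because positive outcomes are over-represented in the sample, while the weighted mean stays close to the population mean. The sketch addresses only the bias of a point estimate. It does not cover the variance estimation or missing-data handling that the abstract is concerned with.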