https://arxiv.org/api/kNm/bJjB1vhHrYDl2Jsu2EMJJ3g 2026-03-20T23:12:25Z 34634 120 15 http://arxiv.org/abs/2603.10405v2 Surrogate-Assisted Targeted Learning for Nested Bridge Functionals under Administrative Censoring 2026-03-15T19:22:01Z Delayed primary outcomes and administratively censored follow-up create a general semiparametric estimation problem: the target causal functional depends on an endpoint observed only for a shrinking subset of units at analysis time, while earlier surrogate measurements remain widely available. In such settings, inverse-probability-weighted estimators can become unstable as observation probabilities approach the positivity boundary, and complete-case model-based analyses can be highly sensitive to outcome-model specification. We develop a surrogate-assisted targeted minimum loss estimator for this nested causal functional. Identification proceeds through a surrogate-bridge representation that integrates an observed-outcome regression over the conditional surrogate distribution, thereby avoiding inverse observation weights in the target parameter itself. We show that the estimator is asymptotically linear and doubly robust (in the sense that first-order bias vanishes when either nuisance component is consistently estimated), and we characterize two structural features of the problem: under surrogate-mediated missing at random, the censoring mechanism contributes no separate tangent-space component to the efficient influence function; and for nested bridge functionals, a one-step debiased machine-learning construction leaves a second-order cross-product remainder involving the conditional surrogate law. The proposed two-stage targeting step removes this term without requiring direct estimation of that law. 
Simulation studies demonstrate stable finite-sample performance under substantial administrative censoring, and a design-calibrated analysis based on the Washington State EPT study illustrates the method in a realistic stepped-wedge cluster-randomized setting. 2026-03-11T04:38:35Z 3 figures, 1 supplement Lin Li http://arxiv.org/abs/2603.14543v1 Gradient Boosting for Spatial Panel Models with Random and Fixed Effects 2026-03-15T18:33:56Z Due to the increase in data availability in urban and regional studies, various spatial panel models have emerged to model spatial panel data, which exhibit spatial patterns and spatial dependencies between observations across time. Although estimation is usually based on maximum likelihood or generalized method of moments, these methods may fail to yield unique solutions if researchers are faced with high-dimensional settings. This article proposes a model-based gradient boosting algorithm, which enables estimation with interpretable results that is feasible in low- and high-dimensional settings. Due to its modular nature, the flexible model-based gradient boosting algorithm is suitable for a variety of spatial panel models, which can include random and fixed effects. The general framework also enables data-driven model and variable selection as well as implicit regularization where the bias-variance trade-off is controlled for, thereby enhancing accuracy of prediction on out-of-sample spatial panel data. Monte Carlo experiments concerned with the performance of estimation and variable selection confirm proper functionality in low- and high-dimensional settings while real-world applications including non-life insurance in Italian districts, rice production in Indonesian farms and life expectancy in German districts illustrate the potential application. 
2026-03-15T18:33:56Z Michael Balzer Adhen Benlahlou http://arxiv.org/abs/2512.24413v2 Demystifying Proximal Causal Inference 2026-03-15T17:49:58Z Proximal causal inference (PCI) has emerged as a promising framework for identifying and estimating causal effects in the presence of unobserved confounders. While many traditional causal inference methods rely on the assumption of no unobserved confounding, this assumption is likely often violated. PCI addresses this challenge by relying on an alternative set of assumptions regarding the relationships between treatment, outcome, and auxiliary variables that serve as proxies for unmeasured confounders. We review existing identification results, discuss the assumptions necessary for valid causal effect estimation via PCI, and compare different PCI estimation methods. We offer practical guidance on operationalizing PCI, with a focus on selecting and evaluating proxy variables using domain knowledge, measurement error perspectives, and negative control analogies. Through conceptual examples, we demonstrate tensions in proxy selection and discuss the importance of clearly defining the unobserved confounding mechanism. By bridging formal results with applied considerations, this work aims to demystify PCI, encourage thoughtful use in practice, and identify open directions for methodological development and empirical research. 2025-12-30T18:55:09Z 33 pages, 5 figures Grace V. Ringlein Trang Quynh Nguyen Peter P. Zandi Elizabeth A. Stuart Harsh Parikh http://arxiv.org/abs/2603.14479v1 Risk-Calibrated Process Capability Approval with Finite Samples 2026-03-15T16:47:59Z Process capability indices such as $C_{pk}$ are widely used in manufacturing to support supplier qualification, pilot-build release, and production approval. In practice, approval decisions are often based on deterministic threshold rules of the form $\widehat{C}_{pk} \ge C_0$. 
Because $\widehat{C}_{pk}$ is estimated from finite samples, however, such decisions are inherently stochastic, especially when the true capability lies near the approval threshold. This paper develops a risk-calibrated decision framework for process capability approval that explicitly accounts for estimation uncertainty and asymmetric operational loss. Capability approval is formulated as a binary statistical decision problem, leading to a rule of the form $\widehat{C}_{pk} \ge C_0 + k\,SE(\widehat{C}_{pk})$, where the calibration constant $k$ is determined either by a tolerable failure probability or by a false-accept/false-reject cost ratio. The resulting formulation unifies several commonly used procedures, including deterministic thresholding, lower confidence bound rules, and probability-based approval rules, and naturally extends them to cost-sensitive decision rules derived from asymmetric operational loss. Simulation experiments and an industrial case study show that risk calibration primarily affects near-threshold decisions, improves approval stability, and can substantially reduce expected operational loss when false acceptance is more costly than false rejection. 2026-03-15T16:47:59Z 16 pages, 4 figures Fei Jiang Lei Yang http://arxiv.org/abs/2402.09698v5 Combining Evidence Across Filtrations 2026-03-15T15:35:50Z In sequential anytime-valid inference, any admissible procedure must be based on e-processes: generalizations of test martingales that quantify the accumulated evidence against a composite null hypothesis at any stopping time. This paper proposes a method for combining e-processes constructed in different filtrations but for the same null. Although e-processes in the same filtration can be combined effortlessly (by averaging), e-processes in different filtrations cannot because their validity in a coarser filtration does not translate to a finer filtration. 
This issue arises in sequential tests of randomness and independence, as well as in the evaluation of sequential forecasters. We establish that a class of functions called adjusters can lift arbitrary e-processes across filtrations. The result yields a generally applicable "adjust-then-combine" procedure, which we demonstrate on the problem of testing randomness in real-world financial data. Furthermore, we prove a characterization theorem for adjusters that formalizes a sense in which using adjusters is necessary. There are two major implications. First, if we have a powerful e-process in a coarsened filtration, then we readily have a powerful e-process in the original filtration. Second, when we coarsen the filtration to construct an e-process, there is a logarithmic cost to recovering validity in the original filtration. 2024-02-15T04:16:59Z Accepted for publication in the Journal of the Royal Statistical Society: Series B (Statistical Methodology). Code is available at https://github.com/yjchoe/CombiningEvidenceAcrossFiltrations Yo Joong Choe Aaditya Ramdas http://arxiv.org/abs/2603.14423v1 Tighter Confidence Intervals under Without Replacement Sampling via Empirical Rate Functions 2026-03-15T15:08:40Z We consider the problem of constructing confidence intervals (CIs) for the population mean of $N$ values $\{x_1, \ldots, x_N\} \subset Σ^N$ based on a random sample of size $n$, denoted by $X^n \equiv (X_1, \ldots, X_n)$, drawn uniformly without replacement (WoR). We begin by focusing on the finite alphabet ($|Σ| = k <\infty$) and moderate accuracy ($\log(1/α_N) \gg (k+1)\log N$) regime, and derive a fundamental lower bound on the width of any level-$(1-α_N)$ CI in terms of the inverse of the WoR rate functions from the theory of large deviations. Guided by this lower bound, we propose a new level-$(1-α_N)$ CI using an empirical inverse rate function, and show that in certain asymptotic regimes the width of this CI matches the lower bound up to constants. 
We also derive a dual formulation of the inverse rate function that enables efficient computation of our proposed CI. We then move beyond the finite alphabet case and use a Bernoulli coupling idea to construct an almost sure CI for $Σ= [0,1]$, and a conceptually simple nonasymptotic CI for the case of $Σ$ being a $(2,D)$ smooth Banach space. For both finite and general alphabets, our results employ classical large deviation techniques in novel ways, thus establishing new connections between estimation under WoR sampling and the theory of large deviations. 2026-03-15T15:08:40Z 39 pages, 4 figures Shubhanshu Shekhar Aaditya Ramdas http://arxiv.org/abs/2506.13646v3 Parsimonious Compactly Supported Covariance Models in the Gauss Hypergeometric Class: Identifiability, Reparameterizations, and Asymptotic Properties 2026-03-15T14:26:24Z We study covariance functions in the Gauss hypergeometric ($\mathcal{GH}$) class, a flexible family that encompasses the Generalized Wendland ($\mathcal{GW}$) and Matérn ($\mathcal{MT}$) models. We derive sharp validity conditions, providing a complete characterization of the admissible parameter space, and show that the model exhibits structural identifiability issues under both increasing- and fixed-domain asymptotics. To resolve this issue, we introduce a parsimonious compactly supported subclass selected via a maximum integral range criterion. The resulting hypergeometric model can be viewed as a structural refinement of the $\mathcal{GW}$ family and admits compact-support reparameterizations that recover the $\mathcal{MT}$ model as a limit case. We further establish strong consistency and asymptotic normality of the maximum likelihood estimator of the associated microergodic parameter under fixed-domain asymptotics. Simulation experiments and a real-data application to climate data illustrate the finite-sample behavior and practical performance of the proposed model. 
2025-06-16T16:08:25Z 25 pages, 8 figures Moreno Bevilacqua Christian Caamaño-Carrillo Tarik Faouzi Xavier Emery http://arxiv.org/abs/2603.14387v1 Label Noise Cleaning for Supervised Classification via Bernoulli Random Sampling 2026-03-15T13:54:55Z Label noise - incorrect labels assigned to observations - can substantially degrade the performance of supervised classifiers. This paper proposes a label noise cleaning method based on Bernoulli random sampling. We show that the mean label noise levels of subsets generated by Bernoulli random sampling containing a given observation are identically distributed for all clean observations, and identically distributed, with a different distribution, for all noisy observations. Although the mean label noise levels are not independent across observations, by introducing an independent coupling we further prove that they converge to a mixture of two well-separated distributions corresponding to clean and noisy observations. By establishing a linear model between cross-validated classification errors and label noise levels, we are able to approximate this mixture distribution and thereby separate clean and noisy observations without any prior label information. The proposed method is classifier-agnostic, theoretically justified, and demonstrates strong performance on both simulated and real datasets. 2026-03-15T13:54:55Z Yuxin Liu Xiong Jin Yang Han http://arxiv.org/abs/2603.14381v1 A Bayesian Critique of Rank-Based Methods for Surrogate Marker Evaluation 2026-03-15T13:39:05Z Surrogate markers are often employed in clinical trials to replace primary outcomes that may be difficult, expensive, or time-consuming to measure directly. These markers can accelerate the evaluation of new treatments, provided they reliably capture the causal relationship between treatment and true clinical benefit. Parast et al. 
(2024) recently proposed a rank-based approach for evaluating surrogate markers, characterized by its nonparametric nature and minimal assumptions. While this method is useful in small-sample model-agnostic settings, it has several limitations, including a lack of clear causal interpretation, low statistical power, and insufficient robustness to different data-generating mechanisms. In this paper, we propose a Bayesian approach that addresses these shortcomings by focusing on causal treatment effect estimands and, in doing so, improves power through covariate adjustment. We demonstrate the advantages of our proposed method through a simulation study designed to highlight gains in both accuracy and power. 2026-03-15T13:39:05Z Pietro Carlotti Layla Parast http://arxiv.org/abs/2403.19818v2 Testing common structure in high-dimensional factor models: change-point and two-sample procedures 2026-03-15T12:10:02Z This work proposes a novel procedure to test for common structures across two high-dimensional factor models. The introduced test makes it possible to uncover whether two factor models are driven by the same loading matrix up to some linear transformation. The test can be used to discover inter-individual relationships between two datasets. In addition, it can be applied to test for structural changes over time in the loading matrix of an individual factor model. The test aims to reduce the set of possible alternatives in a classical change-point setting. The theoretical results establish the asymptotic behavior of the introduced test statistic. The theory is supported by a simulation study showing promising results in empirical test size and power. Two real data applications are considered: the first investigates changes in the loadings of the celebrated US macroeconomic dataset of Stock and Watson, and the second examines similarities of the loadings of macroeconomic indicators for the US and South Korea. 
2024-03-28T20:08:02Z Marie-Christine Düker Vladas Pipiras http://arxiv.org/abs/2505.16124v2 Controlling the false discovery rate in high-dimensional linear models using model-X knockoffs and $p$-values 2026-03-15T06:48:41Z We propose a novel multiple testing methodology for controlling the false discovery rate (FDR) in high-dimensional linear models that integrates model-X knockoff techniques with debiased penalized regression estimators. At the foundation of our methodology, we construct and study two sets of naturally paired high-dimensional test statistics and the associated $p$-values for evaluating the same null hypotheses. The first set is shown to be asymptotically mutually independent, justifying the use of the Benjamini-Hochberg procedure. We further exploit the pairing structure through a two-step procedure aimed at improving power. Our theoretical results establish the key properties of the framework with respect to asymptotic FDR control and formally characterize the associated power gains of the two-step procedure. Importantly, our framework accommodates general dependence in the design matrix. Extensive simulations demonstrate that our methods outperform existing approaches -- particularly those relying on empirical FDP estimates -- in both power and FDR control accuracy, with notable gains in settings involving weaker signals, small sample sizes, or low target FDR levels. 2025-05-22T01:59:15Z Jinyuan Chang Chenlong Li Cheng Yong Tang Zhengtian Zhu http://arxiv.org/abs/2603.14233v1 Conformalized Robust Principal Component Analysis 2026-03-15T05:58:56Z Robust principal component analysis (RPCA) is a widely used technique for recovering low-rank structure from matrices with missing entries and sparse, possibly large-magnitude corruptions. Although numerous algorithms achieve accurate point estimation, they offer little guidance on the uncertainty of recovered entries, limiting their reliability in practice. 
In this paper, we propose conformal prediction-RPCA (CP-RPCA), a practical and distribution-free framework for uncertainty quantification in robust matrix recovery. Our proposed method supports both split and full conformal implementations and incorporates weighted calibration to handle heterogeneous observation probabilities. We provide theoretical guarantees for finite-sample coverage and demonstrate through extensive simulations that CP-RPCA delivers reliable uncertainty quantification under severe outliers, missing data and model misspecification. Empirical results show that CP-RPCA can produce informative intervals and remain competitive in efficiency when the RPCA model is well specified, making it a scalable and robust tool for uncertainty-aware matrix analysis. 2026-03-15T05:58:56Z Liangliang Yuan Lei Wang Quan Kong Liuhua Peng http://arxiv.org/abs/2603.14231v1 Rank-based Maxsum test for high dimensional regression coefficient 2026-03-15T05:48:25Z We study global inference for regression coefficients in high-dimensional linear models under potentially heavy-tailed errors. While sum-type tests are powerful for dense alternatives and max-type tests excel for sparse alternatives, practical applications rarely reveal the sparsity level, and many existing procedures rely on light-tail assumptions. Motivated by the Wilcoxon-score sum test of Feng et al. (2013) and the two Wilcoxon-score maximum tests of Xu and Zhou (2021), we establish under $H_0$ the asymptotic independence between the rank-based sum statistic and each max statistic. These joint limit results justify principled $p$-value aggregation, and we propose two adaptive rank-based maxsum tests via the Cauchy combination method (Liu and Xie, 2020). The proposed procedures inherit robustness from rank-based construction and adaptivity from combining dense- and sparse-sensitive components. 
Simulation studies confirm accurate size control and strong power across a wide range of error distributions and sparsity regimes. 2026-03-15T05:48:25Z 1 pages, 1 table, 2 figures Ping Zhao Liangliang Yuan http://arxiv.org/abs/2603.14169v1 Beyond Means: Topological Causal Effects under Persistent-Homology Ignorability 2026-03-15T01:03:32Z Average treatment effects (ATE) and conditional average treatment effects (CATE) are foundational causal estimands, but they target changes in expected outcomes and can miss treatment-induced changes in the shape of outcome distributions. A canonical failure mode occurs when control outcomes are unimodal, treated outcomes become bimodal, and both distributions have the same mean. In such cases mean-based causal estimands are zero even though the geometry and topology of the outcome law change substantially. This paper develops a topological causal framework based on persistent homology. We formalize a persistent-homology ignorability condition, define topological analogues of CATE and ATE, and prove that these estimands are identifiable up to an explicit error bound under approximate topological ignorability. We also clarify a subtle but important point: a marginal persistence-diagram effect is not identified from conditional topological ignorability alone because persistent homology does not in general commute with mixtures over covariates. To preserve the original intuition while ensuring scientific correctness, we retain the marginal effect as a motivating quantity, but place the mathematically sound conditional estimands at the center of the theory. A synthetic experiment with mean-preserving topology change shows that mean-based causal estimands remain near zero while the proposed topological effect increases sharply and remains recoverable after adjustment for confounding. 
2026-03-15T01:03:32Z Amir Saki Usef Faghihi http://arxiv.org/abs/2303.11786v3 Skeleton Regression: A Graph-Based Approach to Estimation with Manifold Structure 2026-03-14T22:44:15Z We introduce a new regression framework designed to deal with large-scale, complex data that lies around a low-dimensional manifold with noise. Our approach first constructs a graph representation, referred to as the skeleton, to capture the underlying geometric structure. We then define metrics on the skeleton graph and apply nonparametric regression techniques, along with feature transformations based on the graph, to estimate the regression function. We also discuss the limitations of some nonparametric regressors with respect to the general metric space such as the skeleton graph. The proposed regression framework suggests a novel way to deal with data with underlying geometric structures and provides additional advantages in handling the union of multiple manifolds, additive noise, and noisy observations. We provide statistical guarantees for the proposed method and demonstrate its effectiveness through simulations and real data examples. 2023-03-19T21:45:40Z Zeyu Wei Yen-Chi Chen