http://arxiv.org/abs/2605.24838v1 Adaptable High-Dimensional Change Point Detection via Ridge Regularization 2026-05-24T03:07:56Z

We study the problem of detecting multiple change points in the mean vectors of an independent sequence of high-dimensional observations. We propose a family of ridge-regularized CUSUM statistics built upon the adaptable ridge-regularized Hotelling's $T^2$ test of Li et al. (Ann. Statist. 48 (2020) 1815-1847). The proposed tests are designed for dense alternatives in the high-dimensional regime where the dimension is comparable to the sample size. By introducing ridge regularization, the procedure achieves a stable form of sample covariance normalization and attains adaptability with respect to the underlying population covariance structure. We derive the limiting distributions of the proposed statistics under mild conditions, both under the null hypothesis and under a class of local alternatives. We further develop a principled framework for selecting the regularization parameter by maximizing asymptotic power. Extensive simulation studies demonstrate that the proposed tests compare favorably with a wide range of existing methods across diverse settings. The performance of the proposed test procedure is illustrated through an application to a panel of daily log-returns from S&P 500 constituents spanning 2007-2025.

2026-05-24T03:07:56Z 49 pages Haoran Li Haotian Xu http://arxiv.org/abs/2605.24707v1 Shared hidden-factor information framework for multiple behavioral tasks 2026-05-23T19:20:06Z

Understanding cognitive processes in major depressive disorder (MDD) often relies on behavioral tasks, which are typically analyzed separately, overlooking potential correlations and shared latent structure. To address this limitation, we propose the Shared Hidden-factor Information Framework for Multiple Behavioral Tasks (SHIFT), a joint modeling approach that leverages shared information across tasks, allowing each task to benefit from information learned by the others. SHIFT introduces subject-specific latent factors that capture cross-task dependencies while accommodating individual heterogeneity in decision-making, response times (RTs), and strategy switching. To address computational challenges without requiring high-dimensional integration, we develop an expectation-maximization with variational approximation algorithm that preserves both temporal structure and between-task dependencies. Through extensive simulation studies, we demonstrate that SHIFT substantially improves estimation accuracy and efficiency relative to single-task analyses. We then apply SHIFT to a study of MDD to jointly model the Probabilistic Reward Task (PRT) and the Flanker Task (FT). Results indicate that MDD participants show lower engagement in the PRT and reduced focus in the FT compared with healthy controls. Moreover, when individuals are engaged and focused, they exhibit longer RTs. Although observed RTs do not predict treatment response, the shared parameters recovered by SHIFT showed suggestive treatment-modulation patterns, indicating their potential as exploratory behavioral markers for therapeutic outcomes.

2026-05-23T19:20:06Z Yuan Bian Yuanjia Wang Xingche Guo http://arxiv.org/abs/2512.18508v3 Selection-Induced Contraction of Innovation Statistics in Gated Kalman Filters 2026-05-23T16:58:13Z

Validation gating is a fundamental component of classical Kalman-based tracking systems. Only measurements whose normalized innovation squared (NIS) falls below a prescribed threshold are considered for state update. While this procedure is statistically motivated by the chi-square distribution, it implicitly replaces the unconditional innovation process with a conditionally observed one, restricted to the validation event. This paper shows that innovation statistics computed after gating converge to gate-conditioned rather than nominal quantities. Under classical linear--Gaussian assumptions, we derive exact expressions for the first- and second-order moments of the innovation conditioned on ellipsoidal gating, and show that gating induces a deterministic, dimension-dependent contraction of the innovation covariance. The analysis is extended to nearest-neighbor (NN) association, which is shown to act as an additional statistical selection operator. We prove that selecting the minimum-norm innovation among multiple in-gate measurements introduces an unavoidable energy contraction, implying that nominal innovation statistics cannot be preserved under nontrivial gating and association. Closed-form results in the two-dimensional case quantify the combined effects and illustrate their practical significance.
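The covariance contraction under NIS gating can be checked numerically in the two-dimensional case. The sketch below is an illustration, not the paper's derivation: it assumes a whitened innovation z ~ N(0, I) and uses the standard truncated-Gaussian identity that the gate-conditioned covariance equals the nominal one scaled by the chi-square CDF ratio F_{m+2}(gate) / F_m(gate), here with m = 2. The function names and the gate value are hypothetical choices made for the example.

```python
import math
import random


def gated_covariance_2d(gate, n_draws=200_000, seed=0):
    """Monte Carlo estimate of the diagonal of E[z z^T | z^T z <= gate]
    for a whitened 2-D Gaussian innovation z ~ N(0, I)."""
    rng = random.Random(seed)
    kept = 0
    sxx = 0.0
    syy = 0.0
    for _ in range(n_draws):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        if x * x + y * y <= gate:  # NIS validation test
            kept += 1
            sxx += x * x
            syy += y * y
    return sxx / kept, syy / kept


def contraction_factor_2d(gate):
    """Chi-square CDF ratio F_4(gate) / F_2(gate), using the closed forms
    F_2(g) = 1 - exp(-g/2) and F_4(g) = 1 - exp(-g/2) * (1 + g/2)."""
    e = math.exp(-gate / 2.0)
    return (1.0 - e * (1.0 + gate / 2.0)) / (1.0 - e)


gate = 9.21  # roughly the 99% chi-square gate for 2 degrees of freedom
var_x, var_y = gated_covariance_2d(gate)
factor = contraction_factor_2d(gate)
print(f"empirical variances: {var_x:.4f}, {var_y:.4f}; closed form: {factor:.4f}")
```

Both empirical variances fall below the nominal value of 1, so a consistency check that compares post-gate innovation statistics with the ungated covariance would be biased toward reporting an overly conservative filter.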

2025-12-20T20:56:21Z 9 pages, preprint Barak Or http://arxiv.org/abs/2603.16833v4 Refined Inference for Asymptotically Linear Estimators with Non-Negligible Second-Order Remainders 2026-05-23T16:02:52Z

Asymptotically linear estimators in semiparametric models are usually studied through a von Mises expansion in which first-order inference is based on the influence-function variance. This reduction is valid only when the second-order remainder is negligible not only in probability but also in variance, a requirement not implied by the usual product-rate conditions ensuring asymptotic linearity. We study the regime in which the remainder contributes variance at order $n^{-1}$, so that the total sampling variance differs from the standard influence-function approximation by a non-vanishing first-order term. We derive a finite-sample variance decomposition separating the influence-function variance, the remainder variance, and their covariance, and characterize sandwich validity through the vanishing of scaled remainder variance: under a negligible cross term, the sandwich estimator is consistent for the total sampling variance when $n\,\mathrm{Var}(R_{\mathrm{rem}})\to 0$ and materially underestimates it in the complementary near-boundary regime $n\,\mathrm{Var}(R_{\mathrm{rem}})\to c_R>0$. We then establish asymptotic validity of two refined procedures in the near-boundary regime: the leave-one-out jackknife and the pairs bootstrap. Jackknife validity is obtained through a self-normalization argument; bootstrap validity is established directly under a Mallows--2 condition. We also extend the theory to clustered data and derive an analytic expression showing how intra-cluster correlation amplifies the sandwich gap through the remainder term. Simulations illustrate the regime and confirm the predicted coverage behaviour of the competing variance estimators.
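The variance decomposition described above can be written out as follows. This is a sketch under assumed notation: i.i.d. observations $O_1,\dots,O_n$, influence function $\varphi$, and target $\theta_0$ are symbols introduced here for illustration; only $R_{\mathrm{rem}}$ and $c_R$ come from the abstract.

```latex
\begin{align*}
\hat\theta - \theta_0
  &= \frac{1}{n}\sum_{i=1}^{n}\varphi(O_i) + R_{\mathrm{rem}}, \\
\mathrm{Var}(\hat\theta)
  &= \frac{\mathrm{Var}\{\varphi(O)\}}{n}
   + \mathrm{Var}(R_{\mathrm{rem}})
   + 2\,\mathrm{Cov}\!\Big(\frac{1}{n}\sum_{i=1}^{n}\varphi(O_i),\, R_{\mathrm{rem}}\Big), \\
n\,\mathrm{Var}(\hat\theta)
  &\to \mathrm{Var}\{\varphi(O)\} + c_R
  \quad\text{when } n\,\mathrm{Var}(R_{\mathrm{rem}})\to c_R
  \text{ and the scaled cross term vanishes.}
\end{align*}
```

The sandwich estimator targets only the first term, $\mathrm{Var}\{\varphi(O)\}$, which is why it is consistent when $c_R = 0$ and falls short by $c_R$ otherwise.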

2026-03-17T17:39:24Z Lin Li http://arxiv.org/abs/2605.24601v1 Bayesian Conformal-Projective Prediction 2026-05-23T14:33:26Z

We propose a general robust prediction framework, termed conformal-projective prediction (CPP), that integrates Bayesian predictive modeling with ideas from conformal prediction. Rather than assessing conformity through residual-based scores, the CPP criterion defines conformity distributionally: a candidate value for a future response is considered conforming to the extent that its inclusion in the data leaves the leave-one-out predictive distributions of the observed responses undisturbed. The framework requires only that the leave-one-out and swapped predictive distributions are available in closed form and that the swapped predictive mean is differentiable in the candidate value. Under these conditions, we establish a general bounded-influence proposition and a general local convexity lemma, and prove that CPP dominates any plug-in predictor with unbounded influence in asymptotic variance under $ε$-contamination models. When the posterior mean is linear in the observations, as in Gaussian linear models, basis-expansion regression, and Gaussian process regression, the swapped predictive mean is affine in the candidate value, yielding closed-form or one-dimensional optimization solutions and an efficient rank-two computational update; all general theoretical results specialize to explicit corollaries in this setting. Simulation experiments and two data analyses under the Gaussian linear model illustrate the finite-sample advantages of the proposed method, confirming the theoretical predictions across contamination levels, sample sizes, and predictor dimensions.

2026-05-23T14:33:26Z Arkaprava Roy Malay Ghosh http://arxiv.org/abs/2605.24587v1 Synthetic Heterogeneous-Effects LASSO: A Fixed-effects Estimation Approach for High-dimensional Mixed-effects Models 2026-05-23T13:57:27Z

This paper studies variable selection and post-selection inference for high-dimensional clustered data using marginal-model-based procedures. We show that, when covariates are heterogeneously distributed across clusters, marginal-model LASSO may use them as sparse proxies for latent cluster effects, shifting the estimation target away from the structural fixed effects and inducing false selections. To address this problem, we propose Synthetic Heterogeneous-Effects LASSO (SHEL), a fixed-effects penalized framework that incorporates cluster-level synthetic approximations to the latent heterogeneity. We establish theoretical properties of SHEL in high-dimensional settings and develop procedures for valid post-selection inference. The finite sample performance of the proposed method is investigated through extensive simulation studies. A longitudinal bulk RNA-seq dataset of enriched blood neutrophils from hospitalized COVID-19 patients is analyzed to demonstrate the method in a real application.

2026-05-23T13:57:27Z Shangyuan Ye Cong Zhang Ying Chen Ye Liang Guanbo Wang http://arxiv.org/abs/2605.24346v1 Using the target trial framework for combining information: external comparator analyses and other applications 2026-05-23T02:13:06Z

We describe how the target trial framework can be used to plan and report analyses that attempt to answer causal questions by combining information from multiple, diverse sources. Such analyses may involve comparisons of treatments evaluated in different populations, for example when an index trial is combined with other data sources in external comparator analyses, or when extending causal inferences from a randomized trial to a new target population in generalizability and transportability analyses. When planning such analyses, the specification of the target trial supports the explicit definition of the target population with an associated sampling model. We propose this as an additional component for the target trial framework, especially relevant for analyses that combine information, because it influences the choice of eligibility criteria, the specification of the causal model, the choice of causal contrasts, and reasoning about identification strategies. Furthermore, the framework encourages careful mapping of data elements from multiple data sources to a single target trial. This mapping process can highlight potentially irreconcilable misalignments between data sources with respect to specific components of the framework -- for example, in the definitions of eligibility criteria, treatment assignment, and treatment receipt. Such misalignments can arise when attempts to specify a target trial that aligns with a specific data source introduce or worsen misalignments with other proposed data sources. The extent of such misalignments may warrant switching to other data sources, or prospectively obtaining data, to emulate the proposed target trial. We conclude that the target trial framework promotes transparent discussion about the design of and assumptions made in analyses that answer causal questions by combining information from diverse sources.

2026-05-23T02:13:06Z Lawson Ung Miguel A. Hernán Issa J. Dahabreh http://arxiv.org/abs/2603.14561v5 Variance Inference Beyond the Sandwich for Asymptotically Linear Estimators with Second-Order Remainders 2026-05-22T22:37:52Z

Semiparametric estimators admitting a von Mises expansion often reduce inference to the influence-function variance. This reduction is justified when the second-order remainder is negligible in variance, a condition that is stronger than the usual product-rate requirement guaranteeing classical asymptotic linearity. When the remainder contributes non-negligible variance, the standard sandwich can underestimate the total sampling variance and Wald intervals can undercover; we call this the \emph{near-boundary regime}. We derive a finite-sample variance decomposition separating influence-function and remainder components, give a practical characterization of when sandwich variance can fail, and show that the leave-one-out jackknife and pairs cluster bootstrap can estimate the total variance under explicit regularity conditions. For the jackknife, consistency follows from a self-normalization argument; for the bootstrap, we work under a Mallows-2 consistency condition. An analytic expression for the amplification of the sandwich gap by intra-cluster correlation is derived for clustered data. A simulation study using a surrogate-assisted targeted learning estimator in stepped-wedge cluster-randomized trials illustrates the regime: the variance ratio $\hat{V}_{\rm JK}/\hat{V}_{\rm Sand}$ is 1.14--1.38 and persistent across cluster counts, and the refined procedures substantially improve coverage.
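For readers unfamiliar with the leave-one-out jackknife mentioned above, a minimal sketch follows. It assumes independent observations and a generic plug-in estimator passed as a function; for clustered data such as the stepped-wedge setting, the analogous procedure would delete one cluster at a time instead of one observation. The function name and the toy data are hypothetical, and the sample mean stands in for the paper's targeted learning estimator.

```python
from statistics import mean, variance


def jackknife_variance(data, estimator):
    """Leave-one-out jackknife variance estimate for estimator(data).

    Recomputes the estimator n times, each time deleting one observation,
    and scales the spread of those replicates by (n - 1) / n.
    """
    n = len(data)
    replicates = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = mean(replicates)
    return (n - 1) / n * sum((t - center) ** 2 for t in replicates)


# Toy check with the sample mean, where the jackknife reproduces s^2 / n.
toy = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
v_jk = jackknife_variance(toy, mean)
print(v_jk, variance(toy) / len(toy))
```

Because every replicate re-runs the whole estimator, the jackknife picks up variability from all of its components, including any second-order remainder, whereas a sandwich formula only sees the influence-function term.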

2026-03-15T19:23:26Z 15 main text pages, 1 supplement Lin Li Pengcheng Wu http://arxiv.org/abs/2512.12398v2 Scalable Spatial Stream Network (S3N) Models 2026-05-22T21:31:25Z

Understanding how habitats shape species distributions and abundances across river networks remains a longstanding and fundamental challenge in ecology, with direct implications for effective biodiversity management and conservation. We introduce a scalable spatial stream network (S3N) model that enables estimation, inference, and prediction with greater computational efficiency than previously possible. S3Ns extend nearest-neighbor Gaussian processes (NNGPs) to include ecologically salient stream network dependence structure. Additionally, S3Ns implement more efficient preprocessing than SSNs; while the computational cost of estimation is a function of the number of observation points and not of the number of reaches, the preprocessing is a function of both. We demonstrate that S3Ns accurately recover spatial and covariance parameters 2-3 orders of magnitude faster than existing spatial stream network models. We then apply S3Ns to estimate the population sizes and geographic distributions of 285 fish species in the entire Ohio River Basin (>4,000 river km, approximately 170,000 reaches and 9,000 observation points) on a laptop. These results indicate the promise of S3Ns for mapping freshwater variables and quantifying the influence of environmental drivers across extensive, complex river networks with many observation points.

2025-12-13T17:09:22Z Jessica P. Kunke Julian D. Olden Tyler H. McCormick http://arxiv.org/abs/2605.24169v1 Post-Processing Posterior Predictive P-values 2026-05-22T19:38:28Z

This article addresses issues of model criticism and model comparison in Bayesian contexts, and focusses on the use of the so-called posterior predictive p-values (ppp values). These involve a general discrepancy or conflict measure and depend on the prior, the model, and the data. They are used in statistical practice to quantify the degree of surprise or conflict in data, and for purposes of comparing different combinations of prior and model. The distribution of such ppp values is however far from uniform, as we demonstrate for different models, making their interpretation and comparison a difficult matter. We propose a natural calibration of the ppp values, where the resulting cppp values are uniform on the unit interval under model conditions. The cppp values, which in general rely on a double simulation scheme for their computation, may then be used to assess and compare different priors and models. Our methods also make it possible to compare parametric with nonparametric model specifications, in that genuine `measures of surprise' are put on the same canonical uniform scale. Our techniques are illustrated for some applications to real data. We also present supplementing theoretical results on various properties of the ppp and cppp.

2026-05-22T19:38:28Z 35 pages, 5 figures. This is the authors' Statistical Research Report, Department of Mathematics, University of Oslo, from 2005, later accepted in modified form in Journal of the American Statistician, 2006, vol. 101, pp 1157-1174 Nils Lid Hjort Fredrik A. Dahl Gunnhildur Högnadóttir Steinbakk http://arxiv.org/abs/2605.24167v1 Modified treatment policies that depend on the natural history of treatment 2026-05-22T19:37:15Z

Longitudinal modified treatment policies (LMTPs) are a class of interventions that allow the definition, identification, and estimation of causal effects in general settings, such as with continuous or multivariate exposures or treatment regimens that require grace periods. Targeted machine learning estimators (i.e., double/debiased) have been formulated for LMTPs that assign the exposure at time $t$ as a function of the natural value of treatment at time $t$. However, important applications such as estimating the effect of a delay in the start of a treatment require formulating LMTPs that depend not only on the natural value of treatment at time $t$ but also on the \textit{history} of the natural value of treatment prior to time $t$. This paper develops targeted learning estimators for this general case. We discuss the definition of the effects, and propose estimators that use an augmented-data version of the sequential regression form of the longitudinal g-computation formula. Our estimators are based on the efficient influence function and provide $\sqrt{n}$ inference under standard doubly robust rate assumptions on the convergence of the outcome and treatment regressions. We apply the new estimators to assess the effect of delaying a risky pain treatment by one month on 12-month incidence of opioid use disorder.

2026-05-22T19:37:15Z Iván Díaz Nicholas T. Williams Paweł Morzywołek Kara E. Rudolph http://arxiv.org/abs/2405.07026v4 Selective Randomization Inference for Adaptive Experiments 2026-05-22T18:51:47Z

Adaptive experiments use preliminary analyses of the data to inform the further course of action and are commonly used in many disciplines including medical and social sciences. Because the null hypothesis and experimental design are data-dependent, it has long been recognized that statistical inference for adaptive experiments is not straightforward. Most existing methods only apply to specific adaptive designs and rely on strong assumptions. In this work, we propose selective randomization inference as a general framework for analysing adaptive experiments. In a nutshell, our approach applies conditional post-selection inference to randomization tests. By using directed acyclic graphs to describe the data generating process, we derive a selective randomization p-value that controls the selective type-I error. As inference only relies on the randomness in the treatment assignment, no modelling assumptions or independent and identically distributed data are needed. We elaborate on conditions that render the proposed p-value computable and provide rejection sampling and MCMC algorithms to find a Monte Carlo approximation. Moreover, this article shows how to estimate and construct confidence intervals for a homogeneous treatment effect. Lastly, we demonstrate our method and compare it with other randomization tests using synthetic and real-world data.
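The idea of conditioning a randomization test on the selection event can be sketched with plain rejection sampling. The code below is a simplified illustration, not the authors' algorithm: it assumes the sharp null of no treatment effect (so outcomes stay fixed across re-randomizations), a non-adaptive Bernoulli(1/2) assignment, and a toy selection rule based on the first six units. All function names, the data, and the selection rule are hypothetical.

```python
import random


def diff_in_means(z, y):
    """Treated-minus-control mean outcome; 0.0 if either arm is empty."""
    treated = [yi for zi, yi in zip(z, y) if zi == 1]
    control = [yi for zi, yi in zip(z, y) if zi == 0]
    if not treated or not control:
        return 0.0
    return sum(treated) / len(treated) - sum(control) / len(control)


def selective_randomization_pvalue(y, z_obs, statistic, selection,
                                   draw_assignment, n_accept=2000,
                                   max_draws=1_000_000, seed=0):
    """Monte Carlo p-value conditional on the observed selection outcome.

    Assignments are redrawn from the design; a draw is kept only if it
    reproduces the selection made with the observed assignment.
    """
    rng = random.Random(seed)
    s_obs = selection(z_obs, y)
    t_obs = statistic(z_obs, y)
    accepted = 0
    exceed = 0
    for _ in range(max_draws):
        if accepted >= n_accept:
            break
        z = draw_assignment(rng)
        if selection(z, y) != s_obs:  # rejection step
            continue
        accepted += 1
        if statistic(z, y) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + accepted)


# Toy data: 12 units; the "selection" is the sign of the stage-1 contrast.
y = [1.2, 0.4, 2.1, 1.7, 0.3, 1.9, 2.5, 0.8, 1.1, 2.2, 0.6, 1.4]
z_obs = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
p_value = selective_randomization_pvalue(
    y, z_obs,
    statistic=diff_in_means,
    selection=lambda z, y: diff_in_means(z[:6], y[:6]) > 0,
    draw_assignment=lambda rng: [rng.randint(0, 1) for _ in range(12)],
)
print(p_value)
```

Rejection sampling is simple but wasteful when the selection event is rare, which is one reason an MCMC alternative, as the abstract mentions, can be preferable.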

2024-05-11T14:56:27Z Tobias Freidling Qingyuan Zhao Zijun Gao http://arxiv.org/abs/2605.24118v1 PCA score regression: the art of losing power 2026-05-22T18:24:46Z

The regression of principal component scores (RPCS) on covariates is a widely used analytic approach to detect and test for associations between functional measurements and study participant characteristics. Here we show that: (1) RPCS loses power relative to Function on Scalar Regression (FoSR); (2) the amount of power loss depends on the correlation between the PCs and the true effect; (3) if not corrected for multiplicity, RPCS has an inflated $α$-level; and (4) current RPCS methods do not provide valid inference for the true effect. In contrast, we show that FoSR can avoid these problems using a particular combination of modeling tools. We validate these theoretical findings through extensive simulations and illustrate their practical implications using minute-level accelerometry data from the National Health and Nutrition Examination Survey (NHANES).

2026-05-22T18:24:46Z Yu Lu Nidhi Pai Erjia Cui Ciprian Crainiceanu http://arxiv.org/abs/2605.05428v3 Parameter estimation for kappa distributions using the EM algorithm in the superstatistical framework 2026-05-22T16:49:36Z

Kappa distributions are widely used in space plasma physics to model velocity distribution functions with heavy tails. Parameter estimation in these distributions is, however, complicated by the fact that the kappa distribution does not belong to the exponential family, so it admits no sufficient statistics and direct maximum likelihood requires numerical optimization without analytically closed-form update equations. Working within the Beck-Cohen superstatistics framework, where a gamma-distributed inverse temperature $β$ generates the kappa distribution upon marginalization, we treat $β$ as a latent variable. This hierarchical description restores the exponential family structure that the marginal kappa distribution lacks, and yields an analytically tractable implementation of the expectation-maximization (EM) algorithm whose E-step and M-step admit closed-form expressions in terms of sufficient statistics. Applied to synthetic data drawn from the model, the algorithm converges monotonically to a stationary point of the marginal kappa log-likelihood and recovers the generating parameters consistently across the explored range of $κ$. EM thus offers a tractable and transparent route to inference in superstatistical systems with local temperature fluctuations.
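The latent-variable construction can be illustrated with the textbook EM algorithm for a one-dimensional Gaussian scale mixture. This is a sketch under simplifying assumptions, not the paper's algorithm: it uses the Student-t parameterization with a known degrees-of-freedom value `nu` (the mapping between `nu` and $κ$ depends on the kappa convention and is not attempted here), and it estimates only location and scale. The latent `beta` plays the role of the gamma-distributed inverse temperature; all function names are hypothetical.

```python
import math
import random


def sample_scale_mixture(n, mu, sigma, nu, seed=0):
    """Draw x = mu + sigma * z / sqrt(beta), with z ~ N(0, 1) and
    beta ~ Gamma(shape=nu/2, scale=2/nu), i.e. a Student-t with nu dof."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        beta = rng.gammavariate(nu / 2.0, 2.0 / nu)
        out.append(mu + sigma * rng.gauss(0.0, 1.0) / math.sqrt(beta))
    return out


def em_location_scale(x, nu, n_iter=200, tol=1e-10):
    """EM for location and scale with nu held fixed."""
    n = len(x)
    mu = sorted(x)[n // 2]  # median as a robust starting point
    sigma2 = sum((xi - mu) ** 2 for xi in x) / n
    for _ in range(n_iter):
        # E-step: posterior mean of each latent beta_i given x_i
        w = [(nu + 1.0) / (nu + (xi - mu) ** 2 / sigma2) for xi in x]
        # M-step: weighted mean and weighted scale, both in closed form
        mu_new = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        sigma2_new = sum(wi * (xi - mu_new) ** 2 for wi, xi in zip(w, x)) / n
        converged = abs(mu_new - mu) + abs(sigma2_new - sigma2) < tol
        mu, sigma2 = mu_new, sigma2_new
        if converged:
            break
    return mu, math.sqrt(sigma2)


data = sample_scale_mixture(5000, mu=3.0, sigma=2.0, nu=5.0)
mu_hat, sigma_hat = em_location_scale(data, nu=5.0)
print(mu_hat, sigma_hat)
```

Conditional on `beta`, each observation is Gaussian, which is what makes both steps closed-form; observations far in the tails receive small weights and therefore have limited influence on the estimates.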

2026-05-06T20:37:35Z Leonardo Herrera-Fuenzalida Sergio Davis http://arxiv.org/abs/2605.11138v2 Field Theory of Data: Anomaly Detection via the Functional Renormalization Group. The 2D Ising Model as a Benchmark 2026-05-22T16:26:23Z

We establish a correspondence between anomaly detection in high-noise regimes and the renormalization group flow of non-equilibrium field theories. We provide a physical grounding for this framework by proving that the detection of phase transitions in interacting non-equilibrium systems maps to the study of an effective equilibrium field theory near its Gaussian fixed point, which we identify with the universal Marchenko-Pastur distribution. Applying the Functional Renormalization Group to the two-dimensional Model A, we demonstrate that the noise-to-signal ratio acts as a physical temperature, where the signal emerges as ordered domains within a thermalized background of fluctuations. Using the exact Onsager solution as a benchmark, we show that this approach identifies critical thresholds with an error below 4%, significantly outperforming standard information-theoretic metrics such as the Kullback-Leibler divergence. Our results provide a universal strategy for resolving structures in complex datasets near criticality, bridging the gap between statistical mechanics and statistical inference.

2026-05-11T18:43:14Z 15 pages, 2 appendixes; correction of typos and captions, improved clarity Riccardo Finotello Vincent Lahoche Parham Radpay Dine Ousmane Samary