http://arxiv.org/abs/2503.05632v2
A Functional Approach to Curve Alignment and Shape Analysis
Updated: 2026-05-24T13:04:23Z
In many image analysis problems, the contours of objects carry important statistical information about shape. Such contours are typically affected by deformation variables including scaling, translation, rotation, and reparametrization. Previous studies in statistical shape analysis have mainly focused on analyzing contours and shapes through discrete observations. While this approach might offer computational advantages, it overlooks the continuous nature of these objects and their underlying geometric structure. It also ignores potential dependencies between the deformation variables and their effect on the shape, which may result in a loss of statistical information and reduced interpretability. In this paper, we introduce a novel framework for analyzing shapes within the context of Functional Data Analysis (FDA). Basis expansion techniques are employed to derive analytic solutions for the estimation of deformation variables, namely scaling, translation, rotation, and reparametrization, thereby achieving curve alignment. A generative model for random contours is then developed using principal component analysis techniques. Numerical experiments on simulated data and the \textit{MPEG-7} database demonstrate that our method successfully identifies deformation parameters and captures the underlying distribution of random contours in settings where traditional FDA methods fail.
Published: 2025-03-07T17:55:14Z
Authors: Issam-Ali Moindjié, Cédric Beaulac, Marie-Hélène Descary

http://arxiv.org/abs/2605.24995v1
Information-Theoretic Reliability is Robust to Analytic Choice: A 24-Specification Multiverse on Public Cognitive Test-Retest Data
Updated: 2026-05-24T10:48:31Z
Background. 
The reliability paradox describes the empirical observation that cognitive tasks producing robust group-level effects often yield poor between-individual reliability. Existing approaches rely predominantly on the intraclass correlation coefficient (ICC), which captures only linear, second-moment dependence between test and retest.
Methods. We introduce a normalized, information-theoretic complement to ICC, NLRΔ, defined as the difference between empirically estimated mutual information and the analytic Gaussian baseline implied by the test-retest correlation. We pair NLRΔ with ICC(2,1), bias-corrected and accelerated (BCa) bootstrap intervals, Benjamini-Hochberg false discovery rate (FDR) control, and a 24-cell multiverse over the KSG nearest-neighbour parameter, correlation method, and minimum-sample threshold. The full pipeline is governed by pre-specified claim contracts, content-addressed provenance, and SHA-256-verified raw data ingestion, and is released as the MixMind Reliability Framework.
Results. Across 50 estimable primary measures from the Flanker, Stroop, Stop-Signal, Go/No-Go, and Posner task families, the median NLRΔ is -0.138 nats, with interquartile range [-0.257, -0.034]. Zero of 50 primary measures exceed the headline rule. The companion ICC(2,1) analysis recovers the classical reliability paradox pattern, and the 24-specification multiverse yields 0 of 1,200 estimable cells passing the headline rule.
Conclusions. On these two public datasets, replacing or augmenting ICC with an information-theoretic reliability measure does not rescue cognitive tasks from the reliability paradox. The robust null is invariant to the analytic choices examined here. We release the full pipeline, raw-data hashes, and contracts to enable exact replication and extension to other datasets and tasks.
Published: 2026-05-24T10:48:31Z
Comments: 12 pages, 2 figures, 3 tables; software and reproducibility materials archived at Zenodo DOI 10.5281/zenodo.20207371
Authors: Maria Westrin

http://arxiv.org/abs/2605.24858v1
Optimal Estimation of Discrete Multiview Distributions under Heteroskedastic Multinomial Sampling
Updated: 2026-05-24T04:28:38Z
Multiview latent-variable models provide a fundamental framework for discrete data analysis, with applications to latent structure models, topic models, and mixtures of product distributions. In the discrete setting, the joint distribution of the observed views can be represented as a nonnegative low-rank tensor, which we call a multiview density tensor. We study the problem of estimating this tensor from multinomial count data. A key challenge is that multinomial sampling induces heteroskedastic and dependent noise, so the difficulty of estimation depends not only on the ambient dimensions and rank, but also on how the probability mass is distributed across different locations of sample space.
We propose a general scaling framework for density tensor estimation under multinomial sampling. This framework leads to a spectral estimator for which we prove a Frobenius-norm upper bound that directly handles heteroskedasticity and negative dependence. For the original multiview model, we obtain fiber-mass-dependent Frobenius upper bounds and minimax lower bounds showing that this dependence is unavoidable. Under $\ell_1$ loss, we develop both oracle and feasible data-driven estimators based on the same scaling principle, establish minimax lower bounds, and show near-optimality for the oracle rule at fixed rank and for slice normalization under bounded slice-to-fiber imbalance. Simulations support the theory and demonstrate the robustness of the proposed methods.
Published: 2026-05-24T04:28:38Z
Authors: Runshi Tang, Julien Chhor, Olga Klopp, Alexandre B. Tsybakov, Anru R. Zhang

http://arxiv.org/abs/2605.24854v1
Deep Regression for Repeated Measurements under Covariate Shift
Updated: 2026-05-24T04:17:14Z
This paper studies nonparametric regression with repeated measurements when the response in the target domain is unobservable or costly to collect. We adopt a transfer learning framework that leverages a source domain with observable responses under covariate shift. The target regression function is estimated by correcting the distribution shift via the density ratio. We consider both known and unknown density ratio scenarios, which reflect different data available for nonparametric regression estimation. In both cases, we further address two settings: the uniformly bounded density ratio and the unbounded case with finite moment conditions. Under the unknown density ratio scenario, both the density ratio and the target regression function are estimated using rectified linear unit (ReLU) feedforward neural networks (FNNs), whereas under the known density ratio scenario, only the target regression function is estimated by ReLU FNNs. 
Theoretically, we establish non-asymptotic error bounds for the proposed estimators and prove that they achieve the minimax optimal convergence rate under the repeated measurements setting. Notably, we develop a novel approximation theory where the constants of the network parameters depend polynomially, rather than exponentially as in existing works, on the dimension, thereby mitigating the curse of dimensionality. Consequently, we derive sharper non-asymptotic bounds for the stochastic error. The finite sample performance of the proposed method is demonstrated through numerical simulations and a real data application.
Published: 2026-05-24T04:17:14Z
Comments: 59 pages, 2 figures, 2 tables, including appendix
Authors: Yingxuan Wang, Xiangyu Xing, Wangli Xu

http://arxiv.org/abs/2605.24848v1
Distributional Conformal Prediction for Markov Processes
Updated: 2026-05-24T03:41:28Z
We introduce the Markov Distributional Conformal Prediction (MDCP) method that extends the distributional conformal prediction (previously developed for regression) to the setting of a strictly stationary Markov process. Instead of relying on a specific model structure to do prediction, the idea of distributional conformal prediction interval aligns with the Model-Free (MF) Prediction Principle. In analogy to MF prediction of Markov processes, our method exploits the probability integral transform based on estimated transition distribution functions to transform the Markov data to an i.i.d.~dataset. We show a non-asymptotic error bound of MDCP's unconditional coverage rate under a $β$-mixing condition and other standard assumptions on the kernel estimators. The asymptotic validity of the conditional prediction interval is also verified. In addition, we show that our conditional prediction interval is still asymptotically valid with Markov processes being $L^p$-$m$-approximable instead of satisfying the mixing property. 
Numerical simulations and real data experiments are deployed to empirically illustrate the finite-sample performance of MDCP, and compare it with the MF bootstrap prediction method.
Published: 2026-05-24T03:41:28Z
Comments: 54 pages, 5 figures
Authors: Dehao Dai, Kejin Wu, Dimitris N. Politis

http://arxiv.org/abs/2605.24838v1
Adaptable High-Dimensional Change Point Detection via Ridge Regularization
Updated: 2026-05-24T03:07:56Z
We study the problem of detecting multiple change points in the mean vectors of an independent sequence of high-dimensional observations. We propose a family of ridge-regularized CUSUM statistics built upon the adaptable ridge-regularized Hotelling's T2 test of Li et al. (Ann. Statist. 48 (2020) 1815-1847). The proposed tests are designed for dense alternatives in the high-dimensional regime where the dimension is comparable to the sample size. By introducing ridge regularization, the procedure achieves a stable form of sample covariance normalization and attains adaptability with respect to the underlying population covariance structure. We derive the limiting distributions of the proposed statistics under mild conditions, both under the null hypothesis and under a class of local alternatives. We further develop a principled framework for selecting the regularization parameter by maximizing asymptotic power. Extensive simulation studies demonstrate that the proposed tests compare favorably with a wide range of existing methods across diverse settings. The performance of the proposed test procedure is illustrated through an application to a panel of daily log-returns from S&P 500 constituents spanning 2007-2025.
Published: 2026-05-24T03:07:56Z
Comments: 49 pages
Authors: Haoran Li, Haotian Xu

http://arxiv.org/abs/2605.24707v1
Shared hidden-factor information framework for multiple behavioral tasks
Updated: 2026-05-23T19:20:06Z
Understanding cognitive processes in major depressive disorder (MDD) often relies on behavioral tasks, which are typically analyzed separately, overlooking potential correlations and shared latent structure. 
To address this limitation, we propose the Shared Hidden-factor Information Framework for Multiple Behavioral Tasks (SHIFT), a joint modeling approach that leverages shared information across tasks, allowing each task to benefit from information learned by the others. SHIFT introduces subject-specific latent factors that capture cross-task dependencies while accommodating individual heterogeneity in decision-making, response times (RTs), and strategy switching. To address computational challenges without requiring high-dimensional integration, we develop an expectation-maximization with variational approximation algorithm that preserves both temporal structure and between-task dependencies. Through extensive simulation studies, we demonstrate that SHIFT substantially improves estimation accuracy and efficiency relative to single-task analyses. We then apply SHIFT to a study of MDD to jointly model the Probabilistic Reward Task (PRT) and the Flanker Task (FT). Results indicate that MDD participants show lower engagement in the PRT and reduced focus in the FT compared with healthy controls. Moreover, when individuals are engaged and focused, they exhibit longer RTs. Although observed RTs do not predict treatment response, the shared parameters recovered by SHIFT showed suggestive treatment-modulation patterns, indicating their potential as exploratory behavioral markers for therapeutic outcomes.
Published: 2026-05-23T19:20:06Z
Authors: Yuan Bian, Yuanjia Wang, Xingche Guo

http://arxiv.org/abs/2512.18508v3
Selection-Induced Contraction of Innovation Statistics in Gated Kalman Filters
Updated: 2026-05-23T16:58:13Z
Validation gating is a fundamental component of classical Kalman-based tracking systems. Only measurements whose normalized innovation squared (NIS) falls below a prescribed threshold are considered for state update. 
While this procedure is statistically motivated by the chi-square distribution, it implicitly replaces the unconditional innovation process with a conditionally observed one, restricted to the validation event. This paper shows that innovation statistics computed after gating converge to gate-conditioned rather than nominal quantities. Under classical linear--Gaussian assumptions, we derive exact expressions for the first- and second-order moments of the innovation conditioned on ellipsoidal gating, and show that gating induces a deterministic, dimension-dependent contraction of the innovation covariance. The analysis is extended to NN association, which is shown to act as an additional statistical selection operator. We prove that selecting the minimum-norm innovation among multiple in-gate measurements introduces an unavoidable energy contraction, implying that nominal innovation statistics cannot be preserved under nontrivial gating and association. Closed-form results in the two-dimensional case quantify the combined effects and illustrate their practical significance.
Published: 2025-12-20T20:56:21Z
Comments: 9 pages, preprint
Authors: Barak Or

http://arxiv.org/abs/2603.16833v4
Refined Inference for Asymptotically Linear Estimators with Non-Negligible Second-Order Remainders
Updated: 2026-05-23T16:02:52Z
Asymptotically linear estimators in semiparametric models are usually studied through a von Mises expansion in which first-order inference is based on the influence-function variance. This reduction is valid only when the second-order remainder is negligible not only in probability but also in variance, a requirement not implied by the usual product-rate conditions ensuring asymptotic linearity. We study the regime in which the remainder contributes variance at order $n^{-1}$, so that the total sampling variance differs from the standard influence-function approximation by a non-vanishing first-order term.
We derive a finite-sample variance decomposition separating the influence-function variance, the remainder variance, and their covariance, and characterize sandwich validity through the vanishing of scaled remainder variance: under a negligible cross term, the sandwich estimator is consistent for the total sampling variance when $n\,\mathrm{Var}(R_{\mathrm{rem}})\to 0$ and materially underestimates it in the complementary near-boundary regime $n\,\mathrm{Var}(R_{\mathrm{rem}})\to c_R>0$. We then establish asymptotic validity of two refined procedures in the near-boundary regime: the leave-one-out jackknife and the pairs bootstrap. Jackknife validity is obtained through a self-normalization argument; bootstrap validity is established directly under a Mallows--2 condition. We also extend the theory to clustered data and derive an analytic expression showing how intra-cluster correlation amplifies the sandwich gap through the remainder term. Simulations illustrate the regime and confirm the predicted coverage behaviour of the competing variance estimators.
Published: 2026-03-17T17:39:24Z
Authors: Lin Li

http://arxiv.org/abs/2605.24601v1
Bayesian Conformal-Projective Prediction
Updated: 2026-05-23T14:33:26Z
We propose a general robust prediction framework, termed conformal-projective prediction (CPP), that integrates Bayesian predictive modeling with ideas from conformal prediction. Rather than assessing conformity through residual-based scores, the CPP criterion defines conformity distributionally: a candidate value for a future response is considered conforming to the extent that its inclusion in the data leaves the leave-one-out predictive distributions of the observed responses undisturbed. The framework requires only that the leave-one-out and swapped predictive distributions are available in closed form and that the swapped predictive mean is differentiable in the candidate value. 
Under these conditions, we establish a general bounded-influence proposition and a general local convexity lemma, and prove that CPP dominates any plug-in predictor with unbounded influence in asymptotic variance under $ε$-contamination models. When the posterior mean is linear in the observations, as in Gaussian linear models, basis-expansion regression, and Gaussian process regression, the swapped predictive mean is affine in the candidate value, yielding closed-form or one-dimensional optimization solutions and an efficient rank-two computational update; all general theoretical results specialize to explicit corollaries in this setting. Simulation experiments and two data analyses under the Gaussian linear model illustrate the finite-sample advantages of the proposed method, confirming the theoretical predictions across contamination levels, sample sizes, and predictor dimensions.
Published: 2026-05-23T14:33:26Z
Authors: Arkaprava Roy, Malay Ghosh

http://arxiv.org/abs/2605.24587v1
Synthetic Heterogeneous-Effects LASSO: A Fixed-effects Estimation Approach for High-dimensional Mixed-effects Models
Updated: 2026-05-23T13:57:27Z
This paper studies variable selection and post-selection inference for high-dimensional clustered data using marginal-model-based procedures. We show that, when covariates are heterogeneously distributed across clusters, marginal-model LASSO may use them as sparse proxies for latent cluster effects, shifting the estimation target away from the structural fixed effects and inducing false selections. To address this problem, we propose Synthetic Heterogeneous-Effects LASSO (SHEL), a fixed-effects penalized framework that incorporates cluster-level synthetic approximations to the latent heterogeneity. We establish theoretical properties of SHEL in high-dimensional settings and develop procedures for valid post-selection inference. The finite sample performance of the proposed method is investigated through extensive simulation studies. 
A longitudinal bulk RNA-seq dataset of enriched blood neutrophils from hospitalized COVID-19 patients is analyzed to demonstrate the method in a real application.
Published: 2026-05-23T13:57:27Z
Authors: Shangyuan Ye, Cong Zhang, Ying Chen, Ye Liang, Guanbo Wang

http://arxiv.org/abs/2605.24346v1
Using the target trial framework for combining information: external comparator analyses and other applications
Updated: 2026-05-23T02:13:06Z
We describe how the target trial framework can be used to plan and report analyses that attempt to answer causal questions by combining information from multiple, diverse sources. Such analyses may involve comparisons of treatments evaluated in different populations, for example when an index trial is combined with other data sources in external comparator analyses, or when extending causal inferences from a randomized trial to a new target population in generalizability and transportability analyses. When planning such analyses, the specification of the target trial supports the explicit definition of the target population with an associated sampling model. We propose this as an additional component for the target trial framework, especially relevant for analyses that combine information, because it influences the choice of eligibility criteria, the specification of the causal model, the choice of causal contrasts, and reasoning about identification strategies. Furthermore, the framework encourages careful mapping of data elements from multiple data sources to a single target trial. This mapping process can highlight potentially irreconcilable misalignments between data sources with respect to specific components of the framework -- for example, in the definitions of eligibility criteria, treatment assignment, and treatment receipt. Such misalignments can arise when attempts to specify a target trial that aligns with a specific data source introduce or worsen misalignments with other proposed data sources. 
The extent of such misalignments may warrant switching to other data sources, or prospectively obtaining data, to emulate the proposed target trial. We conclude that the target trial framework promotes transparent discussion about the design of and assumptions made in analyses that answer causal questions by combining information from diverse sources.
Published: 2026-05-23T02:13:06Z
Authors: Lawson Ung, Miguel A. Hernán, Issa J. Dahabreh

http://arxiv.org/abs/2603.14561v5
Variance Inference Beyond the Sandwich for Asymptotically Linear Estimators with Second-Order Remainders
Updated: 2026-05-22T22:37:52Z
Semiparametric estimators admitting a von Mises expansion often reduce inference to the influence-function variance. This reduction is justified when the second-order remainder is negligible in variance, a condition that is stronger than the usual product-rate requirement guaranteeing classical asymptotic linearity. When the remainder contributes non-negligible variance, the standard sandwich can underestimate the total sampling variance and Wald intervals can undercover; we call this the \emph{near-boundary regime}. We derive a finite-sample variance decomposition separating influence-function and remainder components, give a practical characterization of when sandwich variance can fail, and show that the leave-one-out jackknife and pairs cluster bootstrap can estimate the total variance under explicit regularity conditions. For the jackknife, consistency follows from a self-normalization argument; for the bootstrap, we work under a Mallows-2 consistency condition. An analytic expression for the amplification of the sandwich gap by intra-cluster correlation is derived for clustered data. 
A simulation study using a surrogate-assisted targeted learning estimator in stepped-wedge cluster-randomized trials illustrates the regime: the variance ratio $\hat{V}_{\rm JK}/\hat{V}_{\rm Sand}$ is 1.14--1.38 and persistent across cluster counts, and the refined procedures substantially improve coverage.
Published: 2026-03-15T19:23:26Z
Comments: 15 main tex page, 1 supplement
Authors: Lin Li, Pengcheng Wu

http://arxiv.org/abs/2512.12398v2
Scalable Spatial Stream Network (S3N) Models
Updated: 2026-05-22T21:31:25Z
Understanding how habitats shape species distributions and abundances across river networks remains a longstanding and fundamental challenge in ecology, with direct implications for effective biodiversity management and conservation. We introduce a scalable spatial stream network (S3N) model that enables estimation, inference, and prediction with greater computational efficiency than previously possible. S3Ns extend nearest-neighbor Gaussian processes (NNGPs) to include ecologically salient stream network dependence structure. Additionally, S3Ns implement more efficient preprocessing than SSNs; while the computational cost of estimation is a function of the number of observation points and not of the number of reaches, the preprocessing is a function of both. We demonstrate that S3Ns accurately recover spatial and covariance parameters 2-3 orders of magnitude faster than existing spatial stream network models. We then apply S3Ns to estimate the population sizes and geographic distributions of 285 fish species in the entire Ohio River Basin (>4,000 river km, approximately 170,000 reaches and 9,000 observation points) on a laptop. These results indicate the promise of S3Ns for mapping freshwater variables and quantifying the influence of environmental drivers across extensive, complex river networks with many observation points.
Published: 2025-12-13T17:09:22Z
Authors: Jessica P. Kunke, Julian D. Olden, Tyler H. McCormick

http://arxiv.org/abs/2605.24169v1
Post-Processing Posterior Predictive P-values
Updated: 2026-05-22T19:38:28Z
This article addresses issues of model criticism and model comparison in Bayesian contexts, and focusses on the use of the so-called posterior predictive p-values (ppp values). These involve a general discrepancy or conflict measure and depend on the prior, the model, and the data. They are used in statistical practice to quantify the degree of surprise or conflict in data, and for purposes of comparing different combinations of prior and model. The distribution of such ppp values is however far from uniform, as we demonstrate for different models, making their interpretation and comparison a difficult matter. We propose a natural calibration of the ppp values, where the resulting cppp values are uniform on the unit interval under model conditions. The cppp values, which in general rely on a double simulation scheme for their computation, may then be used to assess and compare different priors and models. Our methods also make it possible to compare parametric with nonparametric model specifications, in that genuine `measures of surprise' are put on the same canonical uniform scale. Our techniques are illustrated for some applications to real data. We also present supplementing theoretical results on various properties of the ppp and cppp.
Published: 2026-05-22T19:38:28Z
Comments: 35 pages, 5 figures. This is the authors' Statistical Research Report, Department of Mathematics, University of Oslo, from 2005, later accepted in modified form in Journal of the American Statistician, 2006, vol. 101, pp 1157-1174
Journal reference: Journal of the American Statistician, 2006, vol. 101, pp 1157-1174
Authors: Nils Lid Hjort, Fredrik A. Dahl, Gunnhildur Högnadóttir Steinbakk