http://arxiv.org/abs/2606.05450v2 Eigenvector Spatial Filters Nuclear Norm Matrix Completion with Application to Air Quality Data 2026-06-06T19:32:05Z

Reliable reconstruction of missing observations in environmental panel datasets is essential for accurate exposure assessment and policy analysis. Traditional nuclear norm matrix completion methods effectively impute missing entries in low-rank matrices, yet often overlook the spatial dependence inherent to air quality processes. This paper introduces the Eigenvector Spatial Filters Nuclear Norm Matrix Completion (ESFNNMC) method, an extension of nuclear norm fixed-effects matrix completion that replaces unit-specific intercepts with a set of Moran-type eigenvectors capturing spatial autocorrelation in the data. To estimate the model, we propose a Block-Coordinate Descent (BCD) approach for multiconvex optimization problems, with soft-thresholded singular value decomposition and cross-validated regularization. Through comprehensive simulations varying missingness patterns, the level of spatial and temporal autocorrelation, and dimension, shape, and rank structure of the matrices, ESFNNMC demonstrates substantial improvements in imputation accuracy over the standard fixed-effects approach, while keeping the computational cost approximately unchanged. The method is applied to impute missing entries in daily PM10 measurements in 64 monitoring stations in Lombardy, Italy, during the year 2021.
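
The soft-thresholded singular value decomposition step mentioned in this abstract can be sketched as follows. This is a minimal illustration of generic nuclear-norm matrix completion (a soft-impute style iteration), not the ESFNNMC estimator itself: the eigenvector spatial filter term and the cross-validated choice of the penalty are omitted, and the function names, the toy rank-2 panel, and the value `lam=0.5` are assumptions made for the example.

```python
import numpy as np

def soft_threshold_svd(M, lam):
    """Shrink every singular value of M by lam (floored at zero) and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def soft_impute(Y, observed, lam, n_iter=200):
    """Iteratively fill unobserved entries of Y with a shrunken low-rank fit."""
    L = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        # Keep the data where observed, use the current fit elsewhere.
        filled = np.where(observed, Y, L)
        L = soft_threshold_svd(filled, lam)
    return L

# Toy panel (hypothetical data): rank-2 matrix with about 30% of entries missing.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
observed = rng.random(truth.shape) > 0.3
estimate = soft_impute(truth, observed, lam=0.5)
rmse_missing = np.sqrt(np.mean((estimate - truth)[~observed] ** 2))
```

In a block-coordinate scheme of the kind the abstract describes, a step like `soft_threshold_svd` would presumably alternate with a regression update for the spatial-filter coefficients; that alternation is not shown here.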

2026-06-03T21:11:18Z 29 pages, 5 figures, 14 tables, draft version (do not cite yet) Rodolfo Metulini http://arxiv.org/abs/2605.14610v3 Parametrically Adaptive Transition Polynomial: a Signed-Parity Continuous-alpha Extension of Kunchenko Stochastic Polynomials 2026-06-06T18:36:54Z

Kunchenko's method of polynomial maximization provides a semiparametric apparatus for parameter estimation under non-Gaussian errors, but its classical power basis relies on finite higher-order integer moments. This paper introduces the Parametrically Adaptive Transition Polynomial (PATP), a signed-parity fractional-power family controlled by a continuous parameter alpha in [0,1]. The quadratic exponent map p_i(alpha) connects the fractal regime p_i(0)=1/i, the degenerate linear point p_i(1/2)=1, and the signed-parity integer-power regime p_i(1)=i. For the degree-S=2 case we derive a closed-form variance-reduction coefficient g_2(alpha) in terms of signed and absolute fractional moments, identify the singular behavior at alpha=1/2, and state the moment and regularity conditions under which the formula is meaningful. The construction should be read as a Form-B PATP analogue within Kunchenko's generalized apparatus, not as an exact recovery of the canonical even-power PMM basis at alpha=1. Numerical illustrations on canonical distributions are used to examine the finite-sample behavior of the signed-parity estimator and to mark the boundary of applicability for extremely heavy-tailed cases such as Cauchy.
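
The three anchor values quoted in this abstract determine a degree-two polynomial in alpha uniquely, so the exponent map can be written down under that reading. The sketch below assumes that "quadratic" means an ordinary quadratic polynomial in alpha; the paper's own parameterization may be expressed differently, and the function name is hypothetical.

```python
def exponent_map(i, alpha):
    """Quadratic in alpha passing through (0, 1/i), (1/2, 1) and (1, i).

    Assumption: p_i(alpha) = a*alpha**2 + b*alpha + c, with the coefficients
    solved from the three values stated in the abstract.
    """
    a = 2.0 * i + 2.0 / i - 4.0
    b = 4.0 - i - 3.0 / i
    c = 1.0 / i
    return a * alpha ** 2 + b * alpha + c

# Exponents of the first three basis terms at a few values of alpha.
table = {alpha: [exponent_map(i, alpha) for i in (1, 2, 3)]
         for alpha in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

For i = 1 the map is constant at 1, and at alpha = 1/2 every exponent equals 1, which is the degenerate linear point the abstract refers to.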

2026-05-14T09:26:53Z 35 pages, 8 figures. Code supplement: https://github.com/SZabolotnii/Ku-PATP-code-supplement Serhii Zabolotnii http://arxiv.org/abs/2606.08289v1 Direct domain estimation via regression-tree-assisted estimators in the production of official statistics 2026-06-06T18:20:26Z

National statistical offices (NSOs) produce their estimates under a single weighting system (uni-weight approach): one set of weights, independent of the variable of interest, is used to estimate multiple parameters and multiple subpopulations (domains). In this paper we study, within the family of model-assisted estimators and from a design-based perspective of direct estimation, the use of regression trees as the assisting model for estimating totals in unplanned domains. We distinguish two strategies: (i) fitting a single tree at the population level and deriving from it uni-weight weights applicable to any domain, and (ii) fitting a domain-specific tree. We show that both estimators can be written as weighted sums with weights that do not depend on $y$, preserving the uni-weight property and additivity benchmarking with respect to the population total. Extending the classical result to trees, we argue why the estimator built from a population-level model tends to behave like the Horvitz-Thompson estimator within domains, whereas the domain-specific model can achieve substantial variance reductions. A simulation study based on microdata from the Uruguayan Continuous Household Survey (ECH) illustrates the behavior of the estimators at the population level and by department.

2026-06-06T18:20:26Z Juan Pablo Ferreira http://arxiv.org/abs/2606.08261v1 Sparse Longitudinal Functional Principal Component Analysis for Episodic Ambulatory Behavioral Assessments 2026-06-06T17:16:37Z

Accurately monitoring mental fatigue is critical for improving workplace safety and productivity. A recent study examined unobtrusively collected smartphone typing speed as a potential ambulatory proxy assessment of mental fatigue using data from the Intern Health Study (IHS). While population-level average typing speed patterns were found to be consistent with validated measures of mental fatigue, how these trajectories vary across participants and days may inform opportune moments for just-in-time interventions and remains an open question. Treating typing speed trajectories as sparsely observed functional data, we propose a novel sparse longitudinal functional principal component analysis (sparse LFPCA) method for decomposing variability and predicting individual curves. Specifically, sparse data are accommodated by casting covariance estimation as a structured penalized spline regression problem, enabling simultaneous estimation and smoothing of multiple covariance components while borrowing information across locations in the functional domain. Simulations show that sparse LFPCA (1) accurately estimates eigenfunctions and generates reasonable predictions for underlying curves, and (2) achieves similar or superior performance compared to existing alternatives. Our analysis of typing speed data collected from IHS reveals new and interpretable participant- and day-level patterns not captured by previous analyses and can be used to tailor behavioral interventions.

2026-06-06T17:16:37Z Nidhi Pai Yu Fang Srijan Sen Zhenke Wu Erjia Cui http://arxiv.org/abs/2501.04615v4 Doubly Robust and Efficient Calibration of Prediction Sets for Right-Censored Time-to-Event Outcomes 2026-06-06T17:11:41Z

Our objective is to construct well-calibrated prediction sets for a time-to-event outcome subject to right-censoring with guaranteed coverage. Inspired by modern conformal inference, our approach avoids the need for a well-specified parametric or semiparametric survival model. Unlike existing conformal methods for survival data, which assume Type-I censoring with fully observed censoring times, we consider the more common right-censoring setting in which only the censoring time or only the event time is observed, whichever comes first. Under a standard conditional independence censoring condition, we propose and analyze several lower prediction bounds for the survival time of a future observation, including inverse-probability-of-censoring weighting and its augmented version based on the semiparametric efficient influence function for the relevant marginal quantile of the outcome accounting for dependent censoring. We formally establish asymptotic coverage guarantees of the proposed methods, and demonstrate, both theoretically and through empirical experiments, that the augmented approach substantially improves efficiency over all other proposed methods. Specifically, its coverage error bound is doubly robust, and therefore of second order, thus ensuring that it is asymptotically negligible relative to the coverage error of the other methods.

2025-01-08T16:57:18Z 48 pages, 11 figures Rebecca Farina Eric J. Tchetgen Tchetgen Arun Kumar Kuchibhotla http://arxiv.org/abs/2410.20885v3 A Distributed Lag Approach to the Generalised Dynamic Factor Model 2026-06-06T16:43:28Z

We propose a simple estimator for the dynamic decomposition of the Generalized Dynamic Factor Model that avoids frequency-domain methods. First, we show that it is a reasonable approximation to assume that the dynamic common component of the Generalized Dynamic Factor Model admits a representation in terms of current and lagged statically pervasive factors. Then, assuming finite lag order, this simplification reduces estimation to a regression of the observed variables on estimated factors and their lags, where the factors are extracted via static principal components. The proposed approach naturally accommodates weak, non-pervasive factors within the dynamic common space. We establish consistency and asymptotic normality for both the dynamic and weak common components under a new asymptotic framework that allows for such weak factors. In an application to three high-dimensional time series panels of European macroeconomic data we detect a sizeable weak common component share in several key macroeconomic indicators.
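
The two-step recipe in this abstract (static principal components, then a regression on the factors and their lags) can be sketched as follows. This is an illustrative sketch only: the function names, the toy panel, and the choices `r=2` and `p=1` are assumptions, and nothing here reproduces the paper's treatment of weak factors or its asymptotic theory.

```python
import numpy as np

def static_factors(X, r):
    """First r principal-component factors of a T x n panel (columns demeaned)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :r] * s[:r]

def distributed_lag_fit(X, r, p):
    """Regress every series on current and p lagged factors; return the fitted values."""
    T = X.shape[0]
    F = static_factors(X, r)
    # Row t of Z stacks F[t], F[t-1], ..., F[t-p] for t = p, ..., T-1.
    Z = np.hstack([F[p - k: T - k] for k in range(p + 1)])
    Xc = (X - X.mean(axis=0))[p:]
    B, *_ = np.linalg.lstsq(Z, Xc, rcond=None)
    return Z @ B

# Toy panel (hypothetical): one dynamic shock u_t loaded at lags 0 and 1.
rng = np.random.default_rng(1)
T, n = 200, 10
u = rng.normal(size=T + 1)
X = (np.outer(u[1:], rng.normal(size=n))
     + np.outer(u[:-1], rng.normal(size=n))
     + 0.1 * rng.normal(size=(T, n)))
chi = distributed_lag_fit(X, r=2, p=1)   # estimated common component, rows p..T-1
```

The first `p` time points are dropped because their lagged factors are unavailable, so `chi` has `T - p` rows.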

2024-10-28T10:07:06Z Philipp Gersing http://arxiv.org/abs/2512.03983v2 Statistical hypothesis testing for differences between layers in dynamic multiplex networks 2026-06-06T15:56:43Z

With the emergence of dynamic multiplex networks, corresponding to graphs where multiple types of edges evolve over time, a key inferential task is to determine whether the layers associated with different edge types differ in their connectivity. In this work, we introduce a hypothesis testing framework, under a latent space network model, for assessing whether the layers share a common latent representation. The method we propose extends previous literature related to the problem of pairwise testing for random graphs and enables global testing of differences between layers in multiplex graphs. While we introduce the method as a test for differences between layers, it can easily be adapted to test for differences between time points. We construct a test statistic based on a spectral embedding of an unfolded representation of the graph adjacency matrices and demonstrate its ability to detect differences across layers in the asymptotic regime where the number of nodes in each graph tends to infinity. The finite-sample properties of the test are empirically demonstrated by assessing its performance on both simulated data and a biological dataset describing the neural activity of larval Drosophila.
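
A spectral embedding of the unfolded adjacency matrices, as described in this abstract, can be sketched like this. The embedding follows the usual construction (truncated SVD of the column-wise concatenation of the layers); the discrepancy measure at the end is only an illustrative stand-in and is not the test statistic proposed in the paper. All function names and the toy graph are assumptions.

```python
import numpy as np

def unfolded_embedding(adjacencies, d):
    """Embed K layers on n nodes via the SVD of the n x (K*n) matrix [A_1 | ... | A_K]."""
    n = adjacencies[0].shape[0]
    unfolded = np.hstack(adjacencies)
    _, s, Vt = np.linalg.svd(unfolded, full_matrices=False)
    right = Vt[:d].T * np.sqrt(s[:d])            # (K*n) x d right embedding
    # Rows k*n .. (k+1)*n - 1 hold the layer-specific embedding of layer k.
    return [right[k * n:(k + 1) * n] for k in range(len(adjacencies))]

def layer_discrepancy(embeddings):
    """Frobenius distance between the first two layer embeddings (illustrative only)."""
    return np.linalg.norm(embeddings[0] - embeddings[1])

def random_graph(rng, n, prob):
    """Symmetric, hollow adjacency matrix with independent edges (toy data)."""
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    return (upper | upper.T).astype(float)

rng = np.random.default_rng(0)
A = random_graph(rng, 40, 0.3)
same = layer_discrepancy(unfolded_embedding([A, A], d=2))
```

Because every layer is embedded through one shared decomposition, the layer embeddings are directly comparable without a separate alignment step, which is what makes a between-layer comparison of this kind possible.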

2025-12-03T17:14:33Z 12 pages, 3 figures Maximilian Baum Francesco Sanna Passino Axel Gandy http://arxiv.org/abs/2606.08196v1 Beyond Additivity: Causal Discovery in Location-Scale Noise Models with Hidden Variables 2026-06-06T14:25:03Z

We study causal discovery from observational data when some variables are hidden and the data-generating process follows a location-scale noise model (LSNM). Existing methods that handle hidden confounders typically assume additive noise, but in practice, causes often modulate not just the mean but also the variance of their effects. We prove that acyclic directed mixed graphs (ADMGs) satisfying a bow-free condition are identifiable under LSNM with hidden variables, establishing the first identifiability result for causally insufficient models beyond noise additivity. We further provide sufficient conditions for identifying causal direction even when the bow-free assumption is violated. Our two-stage algorithm, LSNM-UV, is sound and complete, and experiments demonstrate improved performance over additive baselines on heteroscedastic data.
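
To make the contrast between additive noise and location-scale noise concrete, the sketch below simulates both from the same cause. It assumes the standard LSNM form y = f(x) + g(x) * eps with g > 0; the particular functions `f` and `g` and the sample size are arbitrary choices for illustration, and no hidden variables or discovery algorithm are included.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)      # cause
eps = rng.normal(size=n)    # noise, independent of x

def f(x):
    """Location function: shifts the mean of the effect."""
    return np.sin(x)

def g(x):
    """Scale function (strictly positive): modulates the spread of the effect."""
    return 0.5 + np.abs(x)

y_additive = f(x) + 0.5 * eps    # additive noise: constant spread
y_lsnm = f(x) + g(x) * eps       # location-scale noise: spread depends on x

def noise_sd(y, mask):
    """Standard deviation of y - f(x) over the selected observations."""
    return np.std((y - f(x))[mask])

tail, center = np.abs(x) > 1.0, np.abs(x) < 0.5
spread = {"additive": (noise_sd(y_additive, center), noise_sd(y_additive, tail)),
          "lsnm": (noise_sd(y_lsnm, center), noise_sd(y_lsnm, tail))}
```

Under the additive model the two entries of `spread["additive"]` should be close, whereas under the LSNM the spread in the tails of x is visibly larger; this heteroscedasticity is what additive-noise methods do not model.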

2026-06-06T14:25:03Z 33 pages, 4 figures Mariyam Khan Shohei Shimizu Thong Pham http://arxiv.org/abs/2507.00260v3 Disentangled Feature Importance 2026-06-06T14:24:30Z

When predictors are statistically dependent, the appropriate definition of feature importance depends on the operational goal. Conditional-incremental measures are well-suited for feature selection, acquisition, and compression, where shared predictive information is treated as redundancy. For post-hoc interpretation, however, the goal is often to attribute predictive signals across correlated measurement channels. We introduce Disentangled Feature Importance (DFI), a population-level attribution framework for this setting. DFI maps covariates to an independent latent representation under a specified entropic optimal transport geometry, computes latent importance, and attributes it back to the original covariates through barycentric sensitivities. We show that broad conditional-incremental FI functionals target conditional incremental predictive value under squared-error loss, and therefore answer a different question from attribution of shared predictive signal under dependence. Under fixed transport cost, reference law, and regularization level, DFI defines a well-specified family of estimands. Latent scores admit a functional ANOVA interpretation, and in the Gaussian linear case, the attributed DFI recovers the classical $R^2$ decomposition for correlated regressors. We derive influence-function-based inference under nuisance-rate and smoothness conditions, and show in simulations and an HIV-1 neutralization-resistance analysis that DFI yields stable, interpretable, uncertainty-quantified attributions of shared predictive signal.

2025-06-30T20:54:48Z 29 main and 44 supplementary pages Jin-Hong Du Kathryn Roeder Larry Wasserman http://arxiv.org/abs/2505.24066v2 Adaptive Resolution for Finite-Rank Gaussian Processes 2026-06-06T13:00:34Z

Finite-rank approximations are widely used to scale Gaussian process (GP) regression, but their posterior behavior can differ from that of the corresponding parent GP prior. We study a class of finite-rank GP priors built from locally supported basis expansions with dependent Gaussian coefficients. Our framework covers finite-element approximations based on the stochastic partial differential equation (SPDE) representation of Matérn GPs and regular-grid GP interpolation schemes. We show that, with a suitable prior on the resolution parameter $N$, these finite-rank expansions inherit the same posterior contraction rate as the corresponding parent GP prior under the same bandwidth specification used for that parent prior. Consequently, the interpolation construction under a squared-exponential parent GP attains the minimax-optimal rate up to logarithmic factors under a hierarchical prior on the bandwidth parameter and on $N$, while the SPDE construction attains the same rate under a bandwidth scaling depending on the sample size and the smoothness of the true function, together with a prior on $N$. We also develop a posterior sampler for the hierarchical interpolation model that jointly updates the resolution and bandwidth parameters, and we provide numerical studies that support the theory.
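
A regular-grid interpolation construction of the kind mentioned in this abstract can be sketched with piecewise-linear basis functions and Gaussian coefficients drawn from the parent kernel at the grid points. This is a one-dimensional sketch under assumptions: the hat basis, the unit interval, the squared-exponential parent kernel, and all function names are choices made for illustration, and no prior on the resolution N is included.

```python
import numpy as np

def hat_basis(x, knots):
    """Piecewise-linear, locally supported basis on an equally spaced grid."""
    h = knots[1] - knots[0]
    return np.maximum(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0)

def sq_exp_kernel(a, b, bandwidth):
    """Squared-exponential covariance between two sets of points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / bandwidth) ** 2)

def finite_rank_cov(x, N, bandwidth):
    """Covariance of f(x) = sum_j w_j * phi_j(x) with w ~ N(0, K(knots, knots))."""
    knots = np.linspace(0.0, 1.0, N)
    Phi = hat_basis(x, knots)
    K = sq_exp_kernel(knots, knots, bandwidth)
    return Phi @ K @ Phi.T

rng = np.random.default_rng(0)
x = rng.random(25)
exact = sq_exp_kernel(x, x, 0.2)
err_coarse = np.abs(finite_rank_cov(x, 5, 0.2) - exact).max()
err_fine = np.abs(finite_rank_cov(x, 50, 0.2) - exact).max()
```

The resulting covariance has rank at most N, and its distance to the parent kernel shrinks as the grid is refined, which illustrates why the resolution N matters for how closely the finite-rank prior tracks the parent GP.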

2025-05-29T23:18:33Z 48 pages, 5 figures Jaehoan Kim Anirban Bhattacharya Debdeep Pati http://arxiv.org/abs/2402.06428v3 Smooth Transformation Models for Survival Analysis: A Tutorial Using R 2026-06-06T10:01:33Z

Over the last five decades, we have seen strong methodological advances in survival analysis, using parametric methods and, more prominently, methods based on non-/semi-parametric estimation. As the methodological landscape continues to evolve, the task of navigating through the multitude of methods and identifying available software resources is becoming increasingly challenging -- especially in more complex scenarios, such as when dealing with interval-censored or clustered survival data, non-proportional hazards, or dependent censoring. This tutorial explores the potential of using the framework of smooth transformation models for survival analysis in the R system for statistical computing. This framework provides a unified maximum-likelihood approach that covers a wide range of survival models, including well-established ones such as the Weibull model and a fully parametric version of the famous Cox proportional hazards model, and various extensions for more complex scenarios. We explore models for non-proportional/crossing hazards, dependent censoring, clustered observations and extensions towards personalised medicine within this framework. Using survival data from a two-arm randomised controlled trial on rectal cancer therapy, we demonstrate how survival analysis tasks can be seamlessly navigated in R within this framework using the implementation provided by the "tram" package and a few related packages.

2024-02-09T14:16:29Z Sandra Siegfried Bálint Tamási Torsten Hothorn 10.1177/09622802251414595 http://arxiv.org/abs/2605.10406v2 Multi-Fidelity Quantile Regression 2026-06-06T08:18:07Z

High-fidelity (HF) data are often expensive to collect and therefore scarce, making conditional quantiles difficult to estimate accurately. We propose a two-stage, model-agnostic method for multi-fidelity quantile regression. The central idea is a local quantile link: at each covariate value, the HF quantile is represented as a low-fidelity (LF) quantile evaluated at a covariate-dependent level. This reformulation reduces the problem to estimating the level function, which can be smoother than the HF quantile itself when the LF and HF conditional distributions have similar shapes. We also study the complementary regime in which this advantage weakens and introduce a correction step to improve robustness. Our theory characterizes when the proposed estimator converges faster than direct quantile regression using HF data alone and when the correction step provides further improvement. Experiments on synthetic and real data show that our method yields more accurate quantile estimates and tighter conformal prediction intervals.
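
The local quantile link can be illustrated at a single covariate value with two Gaussian conditional distributions. The sketch assumes continuous, strictly increasing distribution functions, so that the level is simply the LF distribution function evaluated at the HF quantile; the means and standard deviations are made-up numbers, and the function name is hypothetical.

```python
from statistics import NormalDist

def link_level(tau, hf, lf):
    """Level l such that the LF quantile at level l equals the HF quantile at level tau."""
    return lf.cdf(hf.inv_cdf(tau))

# Conditional distributions at one covariate value (illustrative numbers).
hf = NormalDist(mu=1.0, sigma=2.0)   # high-fidelity response
lf = NormalDist(mu=0.5, sigma=2.5)   # low-fidelity response
tau = 0.9
level = link_level(tau, hf, lf)
recovered = lf.inv_cdf(level)        # LF quantile evaluated at the linked level
```

If the HF and LF distributions were identical at this covariate value, `level` would equal `tau`; when the two shapes are similar, the level stays near `tau` across covariate values, which is the sense in which the level function can be smoother than the HF quantile itself.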

2026-05-11T11:43:38Z 69 pages, 12 figures, 3 tables Yixiang Liu Yao Zhang http://arxiv.org/abs/2604.06278v4 Predictive Volatility of Machine Learning in Micro-Samples: A Regularised Assessment of Regional Poverty 2026-06-06T07:22:29Z

Small regional datasets pose a dual statistical problem: correlated predictors inflate estimation variance, while flexible learners can become unstable because the available information per adaptive degree of freedom is limited. We examine this issue through predictive volatility, defined as the cross-sample dispersion and upper-tail behaviour of out-of-sample loss. Using simulation evidence reported for sparse linear, near-linear and heavy-tailed settings, we compare ordinary least squares, frequentist penalties, Bayesian shrinkage models, bounded-response and spatial specifications, and flexible machine-learning procedures. In the reported simulation results, regularised linear estimators generally dominate in the linear high-collinearity micro-sample settings and remain the most reliable overall, whereas tree-based methods become more competitive only when the signal is weakly nonlinear and the sample size is larger. In the empirical application to 34 Indonesian provinces, ridge yields the best leave-one-out performance, followed by elastic net and lasso. Across the Bayesian shrinkage specifications, ICT skills show the most consistent negative association with poverty, with the strongest support under horseshoe and spike-and-slab formulations. These results suggest that, in micro-sample regional modelling, the main constraint is limited information per effective degree of freedom rather than insufficient algorithmic flexibility.
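
Leave-one-out evaluation of ridge regression, as used in the empirical comparison above, does not require refitting the model n times. The sketch below uses the hat-matrix shortcut, which is exact for ridge with a fixed penalty and no intercept; the simulated data (34 rows, echoing the number of provinces) and the penalty grid are assumptions for illustration and are unrelated to the paper's dataset.

```python
import numpy as np

def ridge_loo_errors(X, y, lam):
    """Leave-one-out residuals for ridge regression via the hat-matrix identity.

    Assumes a fixed penalty lam, no intercept, and no re-standardization per fold.
    """
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    return resid / (1.0 - np.diag(H))

# Toy micro-sample (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(34, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 2.0]) + rng.normal(size=34)
loo_mse = {lam: np.mean(ridge_loo_errors(X, y, lam) ** 2) for lam in (0.1, 1.0, 10.0)}
```

Repeating this over resampled datasets and summarizing the spread and upper tail of the resulting losses would give one way to look at the predictive volatility the abstract defines.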

2026-04-07T09:41:12Z Corrections are needed A. H. Jamaluddin A. T. R. Dani N. I. Mahat V. Ratnasari S. S. M. Fauzi http://arxiv.org/abs/2606.07986v1 Inference for High-Dimensional Sparse Spectral Precision Matrices 2026-06-06T05:27:35Z

Gaussian graphical models in the spectral domain offer a principled approach for recovering conditional dependence structures in stationary high-dimensional time series. Inference on the spectral precision matrix at a fixed frequency enables tests of frequency-specific conditional associations among time series components. The problem is challenging because finite-sample discrete Fourier transforms induce truncation and smoothing biases, while the complex-valued nature of the spectral precision matrix complicates high-dimensional variance estimation, rendering methods for i.i.d. samples not directly applicable. Existing approaches do not provide full likelihood-based inference for the discrete Fourier transforms. We propose a high-dimensional inference framework for sparse spectral precision matrices using the full likelihood of neighboring discrete Fourier transforms. We construct a debiased complex graphical lasso estimator at any fixed frequency. Using asymptotic theory for quadratic forms of multivariate time series, we establish its asymptotic normality and construct entry-wise consistent covariance estimators by aggregating information across neighboring frequencies. The key theoretical contribution is the simultaneous control of regularization, finite-sample truncation, and smoothing biases, enabling valid inference. Simulation studies show reliable coverage away from zero frequency and improved detection power over the benchmark, with false discovery rates near the desired level.
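
The basic object in this abstract, a spectral precision matrix at a fixed frequency built from neighboring discrete Fourier transforms, can be sketched in its simplest low-dimensional form. This is a plain smoothed-periodogram estimate followed by matrix inversion; it contains no sparsity penalty and no debiasing, so it is not the paper's estimator. The function name, the normalization by 2*pi*T, and the sign convention of the transform are assumptions.

```python
import numpy as np

def smoothed_spectral_density(X, j, m):
    """Average the periodogram over the 2m+1 Fourier frequencies around index j."""
    T = X.shape[0]
    D = np.fft.fft(X - X.mean(axis=0), axis=0) / np.sqrt(2.0 * np.pi * T)
    idx = np.arange(j - m, j + m + 1) % T
    d = D[idx]                           # (2m+1) x p block of neighboring DFTs
    # Entry (a, b) is the average over frequencies of d[k, a] * conj(d[k, b]).
    return d.T @ d.conj() / len(idx)

# Toy series (hypothetical): three independent white-noise components.
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 3))
f_hat = smoothed_spectral_density(X, j=500, m=50)
theta_hat = np.linalg.inv(f_hat)         # spectral precision at that frequency
```

The inversion is only stable when the number of averaged frequencies exceeds the dimension; in the high-dimensional regime the abstract targets, a regularized estimator such as a complex graphical lasso takes the place of `np.linalg.inv`.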

2026-06-06T05:27:35Z 47 pages, 5 figures, 5 tables Navonil Deb Younghoon Kim Sumanta Basu http://arxiv.org/abs/2606.07981v1 Making Recursive Bayesian Inference Robust 2026-06-06T05:12:21Z

While Bayesian inference has become increasingly popular with advances in computational resources, its algorithms can be computationally prohibitive and may not scale with large datasets. This has led to growing interest in alternative algorithms, such as approximation methods and variants of Markov chain Monte Carlo. Among these approaches, prior proposal-recursive Bayesian (PP-RB) inference facilitates scalable Bayesian computation by recursively updating the posterior distribution across stages and utilizing parallel computing resources. While the well-known "degeneracy" issue in PP-RB has been studied, another limitation, namely that PP-RB can yield incorrect inferences when posterior distributions shift substantially between stages, has remained unsolved. To address this, we propose parallel-tempered prior proposal-recursive Bayesian (PPP-RB) inference, which extends PP-RB by leveraging the key idea underlying Metropolis-coupled Markov chain Monte Carlo. We show both theoretically and empirically that PPP-RB targets the true posterior distribution. We illustrate PPP-RB through numerical studies and real data analysis in application to earthquake count data and sea surface salinity in the North Atlantic region. In these applications, we compare PPP-RB with PP-RB and a standard MCMC, demonstrating that PPP-RB is more efficient in terms of effective sample size per elapsed time.
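
The key idea underlying Metropolis-coupled Markov chain Monte Carlo is the exchange of states between chains that target tempered versions of the same distribution. The sketch below shows only that generic swap rule, assuming chain k targets the density raised to the power `beta_k`; how PPP-RB embeds such swaps in the recursive scheme is not described in the abstract and is not attempted here. The function name is hypothetical.

```python
import math

def swap_accept_prob(beta_i, beta_j, logp_i, logp_j):
    """Probability of exchanging the current states of two tempered chains.

    beta_i, beta_j: inverse temperatures of the two chains.
    logp_i, logp_j: untempered log target density at each chain's current state.
    """
    log_ratio = (beta_i - beta_j) * (logp_j - logp_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

# The cold chain (beta = 1) always accepts a higher-density state from a hotter chain.
always = swap_accept_prob(1.0, 0.5, logp_i=-5.0, logp_j=-1.0)
# The reverse exchange is accepted only with probability exp(-2).
sometimes = swap_accept_prob(1.0, 0.5, logp_i=-1.0, logp_j=-5.0)
```

Hotter chains move more freely across the parameter space, and swaps let the cold chain inherit those moves, which is the mechanism one would expect to help when the posterior shifts substantially between stages.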

2026-06-06T05:12:21Z Myungsoo Yoo Daniel Würzler Barreto Mevin B. Hooten