http://arxiv.org/abs/2606.06753v1 Cluster-Aware Conformal Calibration for Spatio-Temporal Distributional Prediction 2026-06-04T22:32:40Z

DeepKriging-style models, such as Spatio-Temporal DeepKriging, improve scalability through basis-function embeddings and stochastic gradient learning; however, fixed regular-grid spatial bases remain inefficient under highly non-uniform sampling patterns, often over-allocating capacity to sparse regions while under-resolving dense clusters. To address this limitation, we propose a practical extension of DeepKriging for reliable spatio-temporal distributional forecasting, incorporating cluster-adaptive spatial bases, whose centers and scales are initialized from the spatial sampling density, to better capture heterogeneous spatial sampling, together with cluster-aware conformal calibration that determines prediction-interval widths within spatial clusters (with a global fallback when calibration samples are insufficient). The resulting calibration pipeline explicitly targets spatial heterogeneity and local miscalibration, and experiments, including simulation studies and PM$_{2.5}$ data analysis, demonstrate substantially improved coverage accuracy and tail reliability under clustered observation patterns compared with a global conformal baseline.
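As a rough illustration of the cluster-aware calibration idea in this abstract, the sketch below computes split-conformal interval half-widths separately for each spatial cluster and falls back to a single global half-width when a cluster has too few calibration points. The absolute-residual score, the `min_samples` threshold, and all function names are assumptions made for illustration; the abstract does not specify the paper's score function or its fallback rule.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration scores for a finite interval at this level.
        return math.inf
    return sorted(scores)[k - 1]


def cluster_conformal_widths(residuals, clusters, alpha=0.1, min_samples=20):
    """Return ({cluster: half-width}, global half-width).

    residuals: calibration residuals (observed minus predicted).
    clusters:  cluster label of each calibration point (hypothetical labels).
    Clusters with fewer than min_samples points reuse the global half-width.
    """
    abs_res = [abs(r) for r in residuals]
    global_q = conformal_quantile(abs_res, alpha)

    by_cluster = defaultdict(list)
    for score, label in zip(abs_res, clusters):
        by_cluster[label].append(score)

    widths = {}
    for label, scores in by_cluster.items():
        if len(scores) >= min_samples:
            widths[label] = conformal_quantile(scores, alpha)
        else:
            widths[label] = global_q  # global fallback
    return widths, global_q
```

A prediction interval for a new point in cluster `c` would then be the point prediction plus or minus `widths[c]`, using the global value for clusters not seen during calibration.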

2026-06-04T22:32:40Z Gooyoung Kim Chae Young Lim Wen-Ting Wang Hao-Yun Huang Wei-Ying Wu http://arxiv.org/abs/2606.07680v1 A Counting Process View of Relational Event Models: Practical Asymptotics 2026-06-04T21:47:55Z

Relational Event Models (REMs) provide a rigorous framework for analyzing dyadic interactions observed in continuous time, capturing history-dependent dynamics such as triadic closure and reciprocity. Framing REMs through the lens of counting processes embeds the model in a rich theoretical foundation, facilitating its mathematical analysis. While Maximum Likelihood Estimation (MLE) is standard practice for estimating these models, the underlying statistical guarantees rely on specific asymptotic regimes, namely, whether the network size (n), the observational period (T), or both approach infinity. We review the theoretical foundations of such counting-process-based models, formalizing the core assumptions required to achieve asymptotic normality across these different limits. With a specific focus on Cox-type multiplicative models, we detail the circumstances under which these assumptions hold. Supported by simulation studies, we illustrate how structural modeling choices, including temporal windowing and logarithmic transformations, affect empirical coverage and estimator convergence. We thereby derive several guiding principles for specifying such models in realistic contexts, bridging theory and practice.

2026-06-04T21:47:55Z Cornelius Fritz Alexander Fuchs-Kreiss http://arxiv.org/abs/2606.06730v1 Bayesian genome-wide clustering and variable selection of transcriptomic data via rank-based mixtures 2026-06-04T21:30:13Z

With the increasing availability of ranking data, there has been a growing demand for appropriate unsupervised rank-based inferential frameworks capable of handling high-dimensional datasets and providing uncertainty quantification for all estimates. Rank-based methods have also seen a growing popularity in -omics pipelines, as ranking continuous measurements provides a robust means of handling non-normally distributed data. The Bayesian Mallows model (BMM) has emerged as a promising choice because of its adaptability to various types of ranking data and its flexible framework, integrating cluster-wise rank aggregation with inference at the individual level. However, the scalability of BMM to ultra-high-dimensional settings, such as -omics analyses, has remained limited. The present paper addresses this issue by introducing the first rank-based model generalizing BMM to jointly handle clustering and variable selection, namely the lower-dimensional Bayesian Mallows Model Mixture (lowBM3). The proposed method provides a novel Bayesian framework that simultaneously handles heterogeneity in the sample, unsupervised parameter estimation, and model selection in a scalable manner for ultra-high-dimensional data. Additionally, a companion postprocessing framework is introduced to provide posterior summaries of the discrete posterior distributions of both the consensus ranking and the variable selector. Simulation studies are performed to assess the performance of the method. The usefulness of the method is also shown in an application to signature discovery for cancer genomics, where RNA-seq bulk gene expression data obtained from breast cancer patients are clustered genome-wide.

2026-06-04T21:30:13Z 60 pages, 25 figures Emilie Eliseussen Haakon Muggerud Luca Coraggio Ida Scheel Thomas Fleischer Valeria Vitelli http://arxiv.org/abs/2606.06705v1 Estimating Evolving Functions with Dynamic Gaussian Processes 2026-06-04T20:44:23Z

This paper develops the Dynamic Gaussian Process (DGP), a framework for estimating functions governed by integro-difference equations (IDEs). IDEs model continuous functions that evolve with discrete-time dynamics and arise naturally from time-discretization of linear partial differential equations (PDEs). The DGP extends Gaussian process regression to time-varying functions and extends Kalman filtering to infinite-dimensional states. The DGP posterior remains a Gaussian process with closed-form mean and covariance updates, and separable kernel structure reduces the problem to a finite-dimensional Kalman filter on basis function coefficients. This paper extends the DGP to vector-valued states, enabling the treatment of higher-order PDEs, and provides a stability and approximation error analysis for the basis function approximation. The functional L2 estimation error decomposes exactly into in-subspace and out-of-subspace contributions, and all approximation errors vanish as the number of basis functions grows. The framework is demonstrated on the heat equation and on the wave equation, the latter with a vector-valued state. Code is available at https://github.com/JvHulst/Dynamic_Gaussian_Processes.

2026-06-04T20:44:23Z This manuscript is a preprint submitted to a SIAM journal J. S. van Hulst W. P. M. H. Heemels D. J. Antunes http://arxiv.org/abs/2606.06699v1 Robust inference for cyclic-stress accelerated life tests under interval monitoring with lognormal lifetimes 2026-06-04T20:37:01Z

Highly reliable products are often tested under accelerated conditions to provoke failures within a feasible timeframe. For products whose service life involves repeated alternation between two stress levels, such as automotive air-conditioners, batteries, and aerospace components, cyclic-stress accelerated life testing (CyALT) provides a more realistic loading profile than conventional accelerated tests. In practice, failures are often recorded only at scheduled inspection times, leading to interval-censored counts rather than exact lifetimes. Moreover, traditional maximum likelihood estimation is sensitive to data contamination, which is a genuine concern in small-sample industrial experiments. This paper develops robust inferential procedures for CyALT models with lognormal lifetimes under interval monitoring. Robust estimators are obtained by minimizing a weighted density power divergence (WDPD), leading to the weighted minimum density power divergence estimator (WMDPDE). We establish the asymptotic distribution of the WMDPDE, derive influence function expressions to characterize the robustness, and present asymptotic and bootstrap confidence intervals for important lifetime characteristics. A simulation study confirms that the WMDPDE provides substantial protection against outliers while retaining high efficiency under clean data. The methodology is illustrated through the analysis of an air-conditioner reliability dataset, demonstrating the practical advantages of robust inference in the CyALT framework.

2026-06-04T20:37:01Z 35 pages, 7 figures, 6 tables María Jaenada Leandro Pardo Kiran Prajapat http://arxiv.org/abs/2606.07677v1 Disentangling Latent Risk Pathways via Bayesian Hypergraph Inference 2026-06-04T19:56:09Z

Electronic health records (EHR) pose large-scale multi-disease modeling problems in which many outcomes are rare and strongly influenced by shared risk factors. While modern approaches achieve strong predictive performance, they often treat diseases independently or rely on black-box architectures, offering limited insight into how risk factors organize disease risk and little principled uncertainty quantification. We introduce a Bayesian hypergraph inference framework that reframes multi-disease modeling around latent, risk-factor-modulated disease pathways. Risk factors act on hyperedges, latent disease subsets with shared risk patterns, allowing diseases to participate in multiple distinct pathways and enabling interpretable, higher-order structure beyond pairwise associations. A repulsion prior encourages parsimonious and identifiable structure, while posterior inference provides calibrated uncertainty over both disease groupings and risk-factor influence. To enable scalable inference on large EHR datasets, we develop a structured variational inference algorithm that preserves logical dependencies among hyperedge existence, disease membership, and pathway-level effects. Experiments on simulated data and UK Biobank demonstrate stable and interpretable disease pathway structure, well-calibrated uncertainty, improved estimation for rare diseases, and competitive predictive performance.

2026-06-04T19:56:09Z ICML 2026 Oral Shengxian Ding Haonan Gao Pangpang Liu Xinyuan Tian Yize Zhao http://arxiv.org/abs/2606.06483v1 Statistically and Computationally Optimal Estimation and Inference of Common Subspaces 2026-06-04T17:58:14Z

Given multiple data matrices, many problems in statistics and data science rely on estimating a common subspace that captures certain structure shared by all the data matrices. In this paper we investigate the statistical and computational limits for the common subspace model in which one observes a collection of symmetric low-rank matrices perturbed by noise, where each low-rank matrix shares the same common subspace. Our main results identify several regimes of the signal-to-noise ratio (SNR) such that estimation and inference are statistically or computationally optimal, and we refer to these regimes as weak SNR, moderate SNR, strong estimation SNR, and strong inference SNR. First, we propose an estimator based on projected gradient descent initialized via spectral sum of squares and show that it achieves the optimal $\sin\Theta$ error rate under strong estimation SNR. These results are complemented by both statistical and computational lower bounds identifying the weak and moderate estimation SNR regimes. Next, we turn to statistical inference for the $\sin\Theta$ distance itself, and we show that our estimator has an asymptotically Gaussian distribution in the strong inference SNR regime. Based on this limiting result we propose confidence intervals and show that they are adaptively minimax optimal in the strong inference SNR regime, where adaptivity is measured in terms of the SNR. Finally, we show that adaptive confidence intervals are information-theoretically impossible below the strong inference SNR regime. Consequently, our results unveil a novel phenomenon: despite the SNR being ``above'' the computational limit for estimation, adaptive statistical inference may still be information-theoretically impossible.

2026-06-04T17:58:14Z Joshua Agterberg http://arxiv.org/abs/2601.16821v3 Directional-Shift Dirichlet ARMA Models for Compositional Time Series with Structural Break Intervention 2026-06-04T17:44:22Z

Compositional time series frequently exhibit structural breaks due to external shocks, policy changes, or market disruptions. Standard methods either ignore such breaks or handle them through fixed effects that cannot extrapolate beyond the sample or through step-function dummies that impose instantaneous adjustment. We develop a Bayesian Dirichlet ARMA model augmented with a directional-shift intervention mechanism that captures structural breaks through three interpretable parameters: a direction vector specifying which components gain or lose share, an amplitude controlling redistribution magnitude, and a logistic gate governing transition timing and speed. The model preserves compositional constraints by construction, maintains DARMA dynamics for short-run dependence, and produces coherent probabilistic forecasts through and after structural breaks. The intervention trajectory corresponds to geodesic motion on the simplex and is invariant to the choice of ILR basis. A simulation study with 400 fits across 8 scenarios shows near-zero amplitude bias and nominal 80\% credible interval coverage when the shift direction is correctly identified (77.5\% of cases); supplementary studies confirm robustness across extreme transition speeds and non-monotone DGPs. Two empirical applications to COVID-era Airbnb data characterize performance relative to simpler alternatives. Where the break is monotone and ongoing, the intervention model achieves near-nominal calibration (79.6\%) while the fixed effect substantially under-covers (66.1\%). Where post-break dynamics are non-monotone, both models are acceptably calibrated and the fixed effect outperforms on point accuracy. The intervention model's advantages are thus specific to settings with roughly monotone structural transitions.
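To make the three intervention parameters in this abstract concrete, here is a minimal sketch of one way a direction vector, an amplitude, and a logistic gate could combine to move a composition smoothly along a straight line in centred log-ratio coordinates. The exact parameterization used in the paper is not given in the abstract, so the function names, the centring of the direction vector, and the softmax-style closure are all assumptions for illustration only.

```python
import math


def logistic_gate(t, t_break, speed):
    """Smooth 0-to-1 transition centred at t_break (numerically stable form)."""
    z = speed * (t - t_break)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def shifted_composition(base, direction, amplitude, t, t_break, speed):
    """Shift a baseline composition along `direction`, scaled by amplitude * gate(t).

    base:      baseline shares (positive, summing to 1).
    direction: which components gain (+) or lose (-) share; centred to sum to zero.
    """
    d_mean = sum(direction) / len(direction)
    d = [v - d_mean for v in direction]  # centred direction (assumed convention)
    g = logistic_gate(t, t_break, speed)

    # Move in log space, then renormalise so the result stays on the simplex.
    logits = [math.log(b) + amplitude * g * v for b, v in zip(base, d)]
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

Well before `t_break` the gate is near zero and the output matches `base`; well after, the full shift `amplitude * direction` has been applied, and the shares always sum to one by construction.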

2026-01-23T15:19:55Z Harrison Katz http://arxiv.org/abs/2606.06441v1 Leveraging External Controls for Treatment Switching in Randomized Controlled Trials: A Weighted Causal Inference Framework for Overall Survival 2026-06-04T17:41:40Z

In many oncology clinical trials where overall survival is a key endpoint, patients are permitted to switch from the control arm to the experimental treatment arm or other suitable therapies. Switching can occur for various reasons, including disease progression. This violates the causal guarantees of randomized treatment assignment, resulting in biased treatment effect estimates. Existing methods often require strong assumptions, complicated model specifications, or both. In this paper, we propose a general framework that incorporates external controls to account for treatment switching in randomized controlled trials. Leveraging the synthetic control method and balancing weights from observational causal inference, we propose several estimators that use multiple imputation and time-varying weights to adjust for treatment switching. We also discuss approaches to selecting the risk set of external controls to impute from. Through extensive simulation studies, we show that our proposed methods lead to meaningful statistical improvements relative to standard adjustment methods that utilize external controls in naive ways or those that do not utilize external controls at all. We then demonstrate the utility of our external control-based approaches with two phase III oncology trials.

2026-06-04T17:41:40Z Andy A. Shen Chenqi Fu Ray Lin http://arxiv.org/abs/2606.06426v1 A Robust Framework for Model Order Selection in Correlated Large-Dimensional CES Noise 2026-06-04T17:29:50Z

This paper addresses model order selection under large-dimensional, correlated, non-Gaussian noise. Sources are assumed to be embedded in additive Complex Elliptically Symmetric (CES) noise with an unknown Toeplitz-structured scatter matrix. We propose a two-stage robust framework: (i) a noise-whitening step based on a Toeplitz-rectified $M$-estimator of the scatter matrix, and (ii) signal subspace rank inference via large-dimensional Random Matrix Theory (RMT). Almost sure consistency of the proposed estimators is established, together with explicit RMT eigenvalue upper bounds separating signal from noise components, in the regime where the observation dimension $m$ and the sample size $N$ grow proportionally. Three estimation branches are derived, based respectively on the sample covariance matrix (SCM), Maronna's $M$-estimator, and the distribution-free Tyler $M$-estimator for whitening. The methodology is validated on synthetic data, real hyperspectral images, EEG recordings, and financial data, with significant gains over AIC and unwhitened methods.

2026-06-04T17:29:50Z 13 pages (Main Paper), 6 pages (Supplementary Material), 9 figures Eugénie Terreaux Emmanuelle Jay Frédéric Pascal Jean-Philippe Ovarlez http://arxiv.org/abs/2511.15427v3 Tractable Estimation of Nonlinear Panels with Interactive Fixed Effects 2026-06-04T17:25:35Z

Interactive fixed effects are routinely controlled for in linear panel models. While an analogous fixed effects (FE) estimator for nonlinear models has been available in the literature (Chen, Fernandez-Val and Weidner, 2021), it sees much more limited use in applied research because its implementation involves solving a high-dimensional non-convex problem. In this paper, we complement the theoretical analysis of Chen, Fernandez-Val and Weidner (2021) by providing a new computationally efficient estimator that is asymptotically equivalent to their estimator. Unlike the previously proposed FE estimator, our estimator avoids solving a high-dimensional non-convex optimization problem and can be feasibly computed in large nonlinear panels. Our proposed method involves two steps. In the first step, we convexify the optimization problem using nuclear norm regularization (NNR) and obtain preliminary NNR estimators of the parameters, including the fixed effects. Then, we find the global solution of the original optimization problem using a standard gradient descent method initialized at these preliminary estimates. To make our method readily applicable in practice, we also propose specific numerical algorithms for solving the involved optimization problems, establish their convergence, and provide their efficient implementation in our R package NNRPanel.

2025-11-19T13:26:48Z Andrei Zeleneev Weisheng Zhang http://arxiv.org/abs/2505.01318v3 Modeling Large Nonstationary Spatial Data with the Full-Scale Basis Graphical Lasso 2026-06-04T17:14:29Z

We propose a new approach for modeling large datasets of nonstationary spatial processes that combines a latent low rank process and a sparse covariance model. The low rank component coefficients are endowed with a flexible graphical Gaussian Markov random field model. The utilization of a low rank and compactly-supported covariance structure combines the full-scale approximation and the basis graphical lasso; we term this new approach the full-scale basis graphical lasso (FSBGL). Estimation employs a graphical lasso-penalized likelihood, which is optimized using a difference-of-convex scheme. We illustrate the proposed approach on synthetic fields as well as with a challenging high-resolution simulation dataset of the thermosphere. In a comparison against state-of-the-art spatial models, the FSBGL performs better at capturing salient features of the thermospheric temperature fields, even with limited available training data.

2025-05-02T14:46:02Z Matthew LeDuc William Kleiber Tomoko Matsuo http://arxiv.org/abs/2606.06411v1 Smooth Concordance Metrics for Survival Models 2026-06-04T17:14:24Z

Concordance indices are widely popular metrics for assessing the ability of predictive survival models to discriminate underlying risk levels. However, these statistics have also been criticized for using only the rank orderings of the model's predicted risk scores and being insensitive to important model features, such as the addition of strong predictor variables into the model. In this paper, we address these limitations by developing smooth concordance metrics that model the underlying risk discrimination probabilities as continuous functions of the predicted risk score differences, where the shapes of these functions are estimated from the observed data. As a result, these smooth concordance metrics assess model performance across the entire range of possible risk score differences, allowing one to identify specific scenarios where the candidate model performs especially well or better than other models. Simulations show that the proposed smooth concordance metrics provide more detailed information about risk discrimination performance and are much more sensitive to the addition of meaningful predictors. We apply these methods to compare predictive survival models for cancer recurrence.
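The rank-only limitation mentioned in this abstract can be seen with the standard Harrell-type concordance index, sketched below. This is the conventional baseline statistic, not the smooth metric the paper proposes; the function name and the convention that a higher score means higher risk are assumptions for illustration.

```python
def harrell_c(times, events, scores):
    """Harrell-type concordance index for right-censored data.

    times:  observed follow-up times.
    events: 1 if the event was observed, 0 if censored.
    scores: predicted risk scores (higher means higher risk, by assumption).
    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's observed time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Because only the ordering of `scores` enters the computation, any strictly increasing transformation of the scores (cubing them, for example) leaves the index unchanged, however much it alters the differences between scores.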

2026-06-04T17:14:24Z Nicholas Hartman Grace Richards http://arxiv.org/abs/2606.06384v1 Estimation of the sub-Gaussian parameter 2026-06-04T16:48:31Z

The sub-Gaussian parameter (also called the variance proxy) of a mean-zero random variable $X$ is defined as $\xi^2_* = \sup_{\lambda \in \mathbb{R}} L(\lambda)$, where $L(\lambda) = \frac{2}{\lambda^2} \log \mathbb{E} e^{\lambda X}$ is a weighted cumulant generating function. Despite the ubiquity of sub-Gaussian random variables, the estimation of $\xi^2_*$ has received little attention and is not yet well understood. In this work, we study a natural estimator of $\xi^2_*$ based on constrained maximization of the empirical analogue of $L$. We prove that the estimator is consistent and bound the rates of convergence under assumptions on $L$: if $L$ has a maximizer, then our bound is $O_p(n^{-1/2 + \varepsilon})$ for any $\varepsilon > 0$; if the argmax of $L$ is also bounded, then the bound improves to $O_p(n^{-1/2})$. We show that our assumptions on $L$ are necessary by proving that the minimax risk over all sub-Gaussian distributions is $\Omega(1)$; imposing increasingly strong assumptions on the tail growth of $L$ yields a continuum of classes whose minimax lower bound interpolates between $\Omega(1/\log n)$ and $\Omega(1)$. A root-$n$ rate is possible if we restrict to a subclass of distributions where $L$ attains its supremum in a bounded region, in which case our estimator is minimax optimal. If the underlying distribution is not sub-Gaussian, we show that our estimator goes to infinity with a divergence rate controlled by the tail of the distribution. Finally, we apply our estimator in a Gene Ontology (GO) enrichment study to construct p-values for a large-scale permutation test, showing that it can serve as a reliable alternative to the peaks-over-threshold approach, particularly in regimes where the peaks-over-threshold method is of uncertain validity.
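The definition above suggests a simple plug-in computation: replace the expectation in L with a sample average and maximize over a bounded range of the argument. The sketch below does this by grid search. The bound `lam_max`, the grid resolution, the centring of the sample, and the function names are all assumptions for illustration; the abstract does not state how the paper's constraint is chosen or how the maximization is carried out.

```python
import math


def empirical_L(xs, lam):
    """Empirical analogue of L: (2 / lam^2) * log(mean(exp(lam * x))).

    Uses the log-sum-exp trick to avoid overflow for large |lam * x|.
    """
    zs = [lam * x for x in xs]
    m = max(zs)
    log_mgf = m + math.log(sum(math.exp(z - m) for z in zs) / len(zs))
    return 2.0 * log_mgf / lam ** 2


def estimate_variance_proxy(xs, lam_max=5.0, grid_size=500):
    """Grid-search maximum of the empirical L over 0 < |lam| <= lam_max."""
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]  # enforce the mean-zero assumption in-sample
    step = lam_max / grid_size
    grid = [sign * k * step for k in range(1, grid_size + 1) for sign in (-1.0, 1.0)]
    return max(empirical_L(centered, lam) for lam in grid)
```

For a balanced sample of plus and minus ones (a Rademacher-like sample), the empirical L equals 2 log cosh(lam) / lam^2, which increases toward 1 as lam approaches zero, so the grid estimate lands just below 1, matching the known variance proxy of a Rademacher variable.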

2026-06-04T16:48:31Z 31 pages, 3 figures, and 1 table Jason Liu Min Xu Jinchuan Xing http://arxiv.org/abs/2605.07096v2 Query-efficient model evaluation using cached responses 2026-06-04T16:41:18Z

Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously-evaluated models are often cached -- creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially decreased query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.

2026-05-08T01:24:06Z Hayden Helm Ben Johnson Carey Priebe