REX-SUB: A Scalable Subsampling Strategy for Modeling Large Spatial Datasets

2026-05-15T15:35:42Z

Recent advances in data collection technologies have led to the emergence of massive spatial datasets, with measurements obtained at millions of spatial locations. Geostatistical models typically employ Gaussian processes (GPs) to capture spatial dependence, but standard GP fitting becomes prohibitive at such scales. A promising solution is optimal subsampling, where a subset of locations is selected that optimizes a criterion. In this study, we propose a randomized exchange algorithm for subsampling (REX-SUB) which efficiently selects small subsamples that minimize prediction errors in the fitted spatial GP models. To further improve computational efficiency, we embed a scalable Vecchia approximation to the GP's joint likelihood, which takes advantage of sparsity in the precision matrix to enable fast inference on the selected subsamples. Through a simulation study and an application to a remotely sensed precipitable water dataset, we show that REX-SUB yields lower mean squared prediction errors and interval scores compared to competing subsampling strategies.
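
To make the Vecchia idea concrete, the following is a minimal sketch rather than the REX-SUB implementation. It assumes a zero-mean GP with an exponential covariance, a fixed ordering of the locations, and conditioning sets made of the `m` nearest earlier points; the kernel, the parameter values, and all function names are illustrative assumptions not taken from the abstract.

```python
import numpy as np

def exp_cov(a, b, variance=1.0, length_scale=0.3):
    """Exponential covariance between two sets of 2-D locations (assumed kernel)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return variance * np.exp(-d / length_scale)

def gaussian_loglik(y, cov):
    """Exact zero-mean Gaussian log-likelihood via a dense Cholesky factor."""
    chol = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(chol, y)
    return (-0.5 * alpha @ alpha - np.log(np.diag(chol)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

def vecchia_loglik(y, locs, m, nugget=1e-8):
    """Sum of conditional Gaussian log-densities, each y[i] conditioned on
    at most m nearest points that come earlier in the ordering."""
    total = 0.0
    for i in range(len(y)):
        prev = np.arange(i)
        if i > m:
            dist = np.linalg.norm(locs[prev] - locs[i], axis=1)
            prev = prev[np.argsort(dist)[:m]]
        k_ii = exp_cov(locs[i:i + 1], locs[i:i + 1])[0, 0] + nugget
        if len(prev) == 0:
            mean, var = 0.0, k_ii
        else:
            k_pp = exp_cov(locs[prev], locs[prev]) + nugget * np.eye(len(prev))
            k_ip = exp_cov(locs[i:i + 1], locs[prev])[0]
            w = np.linalg.solve(k_pp, k_ip)
            mean, var = w @ y[prev], k_ii - w @ k_ip
        total += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return total

rng = np.random.default_rng(0)
locs = rng.uniform(size=(60, 2))
cov = exp_cov(locs, locs) + 1e-8 * np.eye(60)
y = np.linalg.cholesky(cov) @ rng.standard_normal(60)

exact = gaussian_loglik(y, cov)
full = vecchia_loglik(y, locs, m=59)    # full conditioning recovers the exact value
approx = vecchia_loglik(y, locs, m=10)  # small conditioning sets: cheaper, approximate
```

With `m` equal to `n - 1` the product of conditionals is just the chain rule, so the result matches the dense likelihood; smaller `m` trades accuracy for cost.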

Asymptotic properties of the MLE in distributional regression under random censoring

2026-05-15T15:24:27Z

Distributional regression aims to find the best candidate in a given parametric family of conditional distributions to model a given dataset. As each candidate in the distribution family can be identified by the corresponding distribution parameters, a common approach for this task is to use the maximum likelihood estimator (MLE) for the parameters. In this paper, we establish theoretical results for this estimator when the response variable is subject to random right censoring. In particular, we provide proofs of almost sure consistency and asymptotic normality of the MLE under censoring. The empirical behavior is illustrated by a simulation study and a real data example.
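
As a small illustration of a likelihood under random right censoring, the sketch below uses the simplest possible case: an exponential model with no covariates, chosen here only because its MLE has a closed form. Observed events contribute the log-density and censored observations contribute the log-survival function. The model choice, the simulated data, and the function names are assumptions for illustration, not the setting studied in the paper.

```python
import numpy as np

def censored_exp_loglik(rate, time, event):
    """Log-likelihood of an exponential model under right censoring.
    event == 1: observed failure, contributes log f(t) = log(rate) - rate * t.
    event == 0: censored, contributes log S(t) = -rate * t."""
    log_density = np.log(rate) - rate * time
    log_survival = -rate * time
    return float(np.sum(event * log_density + (1 - event) * log_survival))

def censored_exp_mle(time, event):
    """Closed-form maximizer: number of events divided by total time at risk."""
    return event.sum() / time.sum()

rng = np.random.default_rng(1)
true_rate = 2.0
latent = rng.exponential(1 / true_rate, size=5000)   # true event times
censor = rng.exponential(1.0, size=5000)             # independent censoring times
time = np.minimum(latent, censor)                    # what is actually observed
event = (latent <= censor).astype(float)

rate_hat = censored_exp_mle(time, event)
```

Ignoring the censoring indicator and treating every observed time as an event would bias the rate upward, which is why the survival term is needed.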

Asymptotic inference with flexible covariate adjustment under rerandomization and stratified rerandomization

2026-05-15T14:45:56Z

Rerandomization is an effective treatment allocation procedure to control for baseline covariate imbalance. For estimating the average treatment effect, rerandomization has been previously shown to improve the precision of the unadjusted and the linearly-adjusted estimators over simple randomization without compromising consistency. However, it remains unclear whether such results apply more generally to the class of M-estimators, including the g-computation formula with generalized linear regression and doubly-robust methods, and more broadly, to efficient estimators with data-adaptive machine learners. In this paper, we develop the asymptotic theory for a more general class of covariate-adjusted estimators under rerandomization and its stratified extension. We prove that the asymptotic linearity and the influence function remain identical for any M-estimator under simple randomization and rerandomization, but rerandomization may lead to a non-Gaussian asymptotic distribution. We further explain, drawing examples from several common M-estimators, that asymptotic normality can be achieved if rerandomization variables are appropriately adjusted for in the final estimator. These results are extended to stratified rerandomization. Finally, we study the asymptotic theory for efficient estimators based on data-adaptive machine learners, and prove their efficiency optimality under rerandomization and stratified rerandomization. Our results are demonstrated via simulations and re-analyses of a cluster-randomized experiment that used stratified rerandomization.
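
The allocation procedure itself can be sketched in a few lines. The version below assumes the common Mahalanobis-distance criterion with a fixed acceptance threshold and equal-sized arms; the threshold value, the covariates, and the function names are hypothetical, and the abstract does not specify which balance criterion the paper uses.

```python
import numpy as np

def mahalanobis_imbalance(x, treat):
    """Mahalanobis distance between treated and control covariate means."""
    n1, n0 = treat.sum(), (1 - treat).sum()
    diff = x[treat == 1].mean(axis=0) - x[treat == 0].mean(axis=0)
    cov = np.cov(x, rowvar=False) * (1 / n1 + 1 / n0)
    return float(diff @ np.linalg.solve(cov, diff))

def rerandomize(x, threshold, rng, max_draws=10_000):
    """Redraw a complete randomization until the imbalance is acceptable."""
    n = x.shape[0]
    base = np.repeat([1, 0], [n // 2, n - n // 2])
    for draw in range(1, max_draws + 1):
        treat = rng.permutation(base)
        imbalance = mahalanobis_imbalance(x, treat)
        if imbalance <= threshold:
            return treat, imbalance, draw
    raise RuntimeError("no acceptable allocation found")

rng = np.random.default_rng(2)
x = rng.standard_normal((100, 3))          # hypothetical baseline covariates
treat, imbalance, draws = rerandomize(x, threshold=1.0, rng=rng)
```

Because accepted allocations are a truncated subset of all allocations, the difference in means no longer has a plain Gaussian limit, which is the issue the abstract refers to.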

A General Framework for Optimal Group Sequential Testing via Mixed-Integer Linear Programming

2026-05-15T14:29:19Z

Sequential hypothesis tests are widely adopted as a principled way to perform multiple tests on data that arrives over time. In particular, researchers frequently use group sequential hypothesis tests (GST) to test the same hypotheses at K times, or "groups", while data arrives sequentially. In this setting, many methods have been proposed that allow researchers to control type-1 error uniformly across the K checks (often described in terms of alpha-spending budgets). Although all of these methods validly control type-1 error, it is not clear which of them is optimal when the goal is to reject the null as soon as possible. In this paper, we directly optimize the rejection criterion in the GST setting under the same constraints on type-1 and type-2 errors. We combine a sample average approximation with mixed-integer linear programming (S-MILP) to solve this problem and show that the S-MILP approach dominates classical GST procedures such as the Lan-DeMets, Pocock, and O'Brien-Fleming methods. We also find that the optimal solution typically spends the alpha budget aggressively early on, shedding light on the long-standing debate over which alpha-spending budgets are more efficient. Finally, we apply the S-MILP approach to a recent study on acute kidney injury interventions and find that it reaches the same statistically significant conclusion faster than the original study and other GST methods.
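
For readers unfamiliar with alpha-spending budgets, the sketch below evaluates the two classical Lan-DeMets spending functions, the O'Brien-Fleming type and the Pocock type, at equally spaced information fractions. It illustrates only the classical baselines named in the abstract, not the S-MILP optimization; the choice of K = 4 looks and a two-sided level of 0.05 is an assumption.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: 2 - 2 * Phi(z_{alpha/2} / sqrt(t))."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - STD_NORMAL.cdf(z / sqrt(t)))

def pocock_spending(t, alpha=0.05):
    """Pocock-type spending: alpha * log(1 + (e - 1) * t)."""
    return alpha * log(1 + (exp(1) - 1) * t)

def increments(spending, looks):
    """Alpha newly spent at each look for equally spaced information fractions."""
    cumulative = [spending(k / looks) for k in range(1, looks + 1)]
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

obf_steps = increments(obf_spending, looks=4)
pocock_steps = increments(pocock_spending, looks=4)
```

Both functions spend the full budget by the final look, but the Pocock-type function releases far more of it at the first look than the O'Brien-Fleming-type function does.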

Quasi-Bayesian Local Projection Instrumental-Variables Method: Application to Renewable Energy and Electricity Prices

2026-05-15T13:56:37Z

This paper introduces a quasi-Bayesian approach for local projection instrumental-variables (LP-IV) estimation. It builds a moment-based quasi-posterior using the generalized method of moments (GMM) objective and applies a roughness-penalty prior to smooth impulse responses over different horizons. The approach maintains the key first-order features of traditional LP-IV methods, while enhancing stability in finite samples and allowing for joint inference through simultaneous bands. Simulations indicate that this regularization decreases root mean squared error compared to standard GMM, especially at medium and longer horizons. An application to Danish electricity markets highlights the method's practical usefulness.
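
The effect of a roughness penalty on impulse responses can be illustrated with a deliberately simplified sketch. It assumes a second-difference penalty and an identity-weighted quadratic fit to a vector of raw horizon-by-horizon estimates, so that the penalized solution has a closed form; the paper's GMM-based quasi-posterior is more general, and the penalty order, the weight `lam`, and the example numbers are assumptions.

```python
import numpy as np

def second_difference_matrix(horizons):
    """Rows compute b[h] - 2 * b[h+1] + b[h+2] for consecutive horizons."""
    d = np.zeros((horizons - 2, horizons))
    for i in range(horizons - 2):
        d[i, i:i + 3] = [1.0, -2.0, 1.0]
    return d

def smooth_irf(raw, lam):
    """Minimize ||b - raw||^2 + lam * ||D b||^2, i.e. the mode under a
    Gaussian roughness prior with an identity-weighted quadratic fit."""
    d = second_difference_matrix(len(raw))
    return np.linalg.solve(np.eye(len(raw)) + lam * d.T @ d, raw)

def roughness(b):
    return float(np.sum((second_difference_matrix(len(b)) @ b) ** 2))

raw = np.array([0.0, 0.9, 0.4, 1.1, 0.5, 0.7, 0.1, 0.3])  # hypothetical noisy estimates
smoothed = smooth_irf(raw, lam=5.0)
```

A response that is exactly linear in the horizon has zero second differences, so it passes through this smoother unchanged; only curvature is penalized.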

Statistical Inference for Smoothed Support Vector Machines in High Dimensions: From Offline to Online Data

2026-05-15T12:49:18Z

High-dimensional classification problems often rely on the Lasso-penalized linear Support Vector Machines (SVMs). However, the double non-smoothness induced by the hinge loss and Lasso penalty in this model makes statistical inference challenging and impedes computational efficiency. In this paper, we propose a unified inference framework in both offline and online settings. In the offline case, by applying a convolution smoothing technique to the hinge loss, we construct a debiased estimator that eliminates the shrinkage bias, thereby building a valid confidence interval. For online streaming data, we develop a real-time estimator and inference procedure that relies only on summary statistics of historical data. Theoretically, we provide rigorous proofs for the asymptotic normality of our offline and online debiased estimators. Simulation studies and real data applications demonstrate that our methods achieve valid statistical inference and improved computational efficiency.
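
Convolution smoothing of the hinge loss has a closed form when the kernel is Gaussian, which the sketch below uses. Writing v = 1 - u for the margin deficit, the smoothed loss is v * Phi(v / h) + h * phi(v / h). The Gaussian kernel and the bandwidth values are assumptions here; the abstract does not say which kernel the paper adopts.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def hinge(u):
    """Standard hinge loss at margin u = y * f(x)."""
    return max(0.0, 1.0 - u)

def smoothed_hinge(u, h):
    """Hinge loss convolved with a Gaussian kernel of bandwidth h."""
    v = 1.0 - u
    z = v / h
    density = exp(-0.5 * z * z) / sqrt(2 * pi)
    return v * STD_NORMAL.cdf(z) + h * density

def smoothed_hinge_grad(u, h):
    """Derivative in u; smooth everywhere, unlike the hinge subgradient."""
    return -STD_NORMAL.cdf((1.0 - u) / h)

margins = [-2.0, 0.0, 0.9, 1.0, 1.1, 3.0]
values = [smoothed_hinge(u, h=0.1) for u in margins]
```

The smoothed loss is differentiable at the kink u = 1 and converges to the hinge loss as the bandwidth shrinks, which is what makes gradient-based debiasing tractable.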

Bayesian nonparametric boundary detection for multiple areal data

2026-05-15T12:31:57Z

We consider the problem of boundary detection for areal data, focusing on situations where for each areal unit multiple observations are available. We propose a Bayesian nonparametric mixture model for the area-specific population densities, with spatially dependent weights and a random number of components. Contrary to previously proposed methods for boundary detection, which consider one observation per areal unit, ours does not require external information such as area-specific covariates or dissimilarity metrics. Instead, by exploiting information from multiple samples per area, it is able to identify boundaries between areas that exhibit different densities. Crucially, the number of mixture components needs to be learned from data to obtain meaningful boundary detection, due to the non-identifiability of overfitted mixtures. We therefore treat it as random by placing a prior on it. The motivating application is the analysis of economic inequality in the greater Los Angeles region, which typically yields social inequality and unrest. Efficient posterior computation is facilitated by a transdimensional Markov chain Monte Carlo sampler which exploits the recently introduced optimal auxiliary priors to improve the mixing. The methodology is validated via extensive simulations and applied to the income data in the greater Los Angeles region. We identify several boundaries in the income distributions, which can be explained ex-post in terms of the percentage of the population without health insurance, though not in terms of the total number of crimes, showing the usefulness of such an analysis to policymakers.

Cross-Validation in Bipartite Networks

2026-05-15T11:04:11Z

Bipartite networks, which encode interactions between two distinct types of entities, arise widely in applications and exhibit inherent asymmetry across node sets. Despite a growing literature on bipartite community detection, estimating community numbers $(K_1, K_2)$, a critical issue for bipartite network analysis, remains theoretically underdeveloped without any model selection consistency established, to our knowledge. Indeed, the inherent asymmetry and the two-dimensional parameter space with possibly drastically different $K_1$ and $K_2$ pose unique challenges that differ from unipartite cases. In particular, the candidate models may simultaneously overfit one node set while underfitting the other. To address these challenges, we propose Bipartite Cross-Validation (BCV), a penalized cross-validation framework that jointly selects $(K_1,K_2)$ in a fully data-driven manner. We establish the first model selection consistency for bipartite networks, notably accommodating the regime where the numbers of communities scale with the network size, revealing the intricate interplay between sparsity and model complexity. Simulations and real-data applications demonstrate strong finite-sample performance of BCV.

Bayesian Inference for Non-Conjugate Distance Dependent Chinese Restaurant Process Models

2026-05-15T11:00:44Z

The distance dependent Chinese Restaurant Process (ddCRP) provides a flexible prior distribution for clustering observations, incorporating covariate information through pairwise distances and accommodating a rich variety of cluster structures. When cluster parameters are conjugate to the likelihood, Bayesian inference is straightforward. In the non-conjugate setting, however, inference becomes substantially more challenging due to the trans-dimensional parameter spaces that arise as cluster assignments change. We develop a reversible jump Markov chain Monte Carlo (RJMCMC) framework to address this challenge, targeting the dimension-changing nature of cluster parameter vectors when observation assignments are updated. We introduce and compare several proposal strategies for birth and death moves, including prior-based, independence, and data-driven moment-matching proposals that target regions of high posterior density. For fixed-dimensional moves, we propose a posterior resampling strategy that improves acceptance rates while maintaining computational efficiency. Through a simulation study and an application to Old Faithful eruption durations, we demonstrate that moment-matched proposals offer a principled, data-driven alternative to prior-based proposals. The resulting methodology provides a general RJMCMC framework for ddCRP models with non-conjugate likelihoods, demonstrated here on both discrete and continuous observation models.
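
The ddCRP prior can be sketched directly from its definition: each observation links to another with probability proportional to a decay function of their distance, or to itself with probability proportional to a concentration parameter, and clusters are the connected components of the resulting link graph. The sketch below draws from that prior only; it does not implement the RJMCMC sampler, and the exponential decay, the parameter values, and the function names are assumptions.

```python
import numpy as np

def sample_links(dist, alpha, decay, rng):
    """Draw one link per observation: weight decay(d_ij) for j != i, alpha for j == i."""
    n = dist.shape[0]
    links = np.empty(n, dtype=int)
    for i in range(n):
        weights = decay(dist[i]).astype(float)
        weights[i] = alpha
        links[i] = rng.choice(n, p=weights / weights.sum())
    return links

def clusters_from_links(links):
    """Label the connected components of the link graph using union-find."""
    parent = list(range(len(links)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i, j in enumerate(links):
        parent[find(i)] = find(int(j))
    roots = [find(i) for i in range(len(links))]
    relabel = {root: k for k, root in enumerate(dict.fromkeys(roots))}
    return [relabel[root] for root in roots]

rng = np.random.default_rng(3)
points = rng.uniform(size=(12, 1))                 # hypothetical 1-D covariate
dist = np.abs(points - points.T)
links = sample_links(dist, alpha=1.0, decay=lambda d: np.exp(-5.0 * d), rng=rng)
labels = clusters_from_links(links)
```

Changing a single link can merge or split clusters, so the number of cluster parameters changes with it; this is the source of the trans-dimensional moves described in the abstract.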

Unbiased likelihood estimation of the Langevin diffusion for animal movement modelling

2026-05-15T10:21:23Z

An ongoing challenge in animal ecology is developing movement models that account for the autocorrelation, and often temporal irregularity, in telemetry data. Continuous-time Langevin diffusion models have been proposed to model temporally autocorrelated and irregularly sampled data. However, current estimation techniques obtain increasingly biased parameter estimates as the time between observations increases. In this paper, we propose using Brownian bridges in an importance sampling scheme to improve the likelihood approximation of the Langevin diffusion model. In a series of simulation studies, we showed that our approach effectively removed the bias under various scenarios. We found that the precision of the estimated habitat coefficients increased for data spanning a longer duration at a lower frequency than for shorter, more frequently sampled tracks. This suggests that the model may be well suited for modelling tracking data sampled at a coarser resolution, as is common in datasets collected with older generations of animal tags. We illustrated the application of our model using tracking data from Steller sea lions, Eumetopias jubatus. We found that the coefficient estimates converged to values significantly different from those estimated in previous studies, suggesting that bias in conventional estimation methods may meaningfully affect ecological conclusions about habitat preference. Together, these improvements broaden the applicability of Langevin diffusion models, thereby improving ecological insight into habitat selection.
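
The building block of such a scheme, a Brownian bridge between two consecutive observed locations, can be simulated as follows. This sketch only generates candidate paths that start and end at the observed positions; the importance weights that would correct for the Langevin drift are not shown. The speed parameter `sigma`, the number of sub-steps, and the coordinates are hypothetical.

```python
import numpy as np

def brownian_bridge(start, end, duration, n_steps, sigma, rng):
    """Simulate a Brownian bridge from start (time 0) to end (time duration).
    Each step is drawn from the Gaussian law of the next point given the
    current point and the fixed endpoint."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    times = np.linspace(0.0, duration, n_steps + 1)
    path = np.empty((n_steps + 1, start.size))
    path[0] = start
    for k in range(n_steps):
        dt = times[k + 1] - times[k]
        remaining = duration - times[k]
        mean = path[k] + (dt / remaining) * (end - path[k])
        var = sigma ** 2 * dt * (remaining - dt) / remaining
        noise = rng.standard_normal(start.size)
        path[k + 1] = mean + np.sqrt(max(var, 0.0)) * noise
    return times, path

rng = np.random.default_rng(4)
times, path = brownian_bridge(
    start=[0.0, 0.0], end=[3.0, -1.0], duration=2.0, n_steps=50, sigma=0.8, rng=rng
)
```

Because every simulated path is pinned to both observations, no proposals are wasted on trajectories that miss the next fix, which matters most when the gap between observations is long.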

Mean-field Variational Bayes for Sparse Probit Regression

2026-05-15T09:56:48Z

We consider Bayesian variable selection for binary outcomes under a probit link with a spike-and-slab prior on the regression coefficients. Motivated by the computational challenges encountered by Markov chain Monte Carlo (MCMC) samplers in high-dimensional regimes, we develop a mean-field variational Bayes approximation in which all variational factors admit closed-form updates, and the evidence lower bound is available in closed form. This, in turn, allows the development of an efficient coordinate ascent variational inference algorithm to find the optimal values of the variational parameters. The approach produces posterior inclusion probabilities and parameter estimates, enabling interpretable selection and prediction within a single framework. As shown in both simulated and real data applications, the proposed method successfully identifies the important variables and is orders of magnitude faster than MCMC, while maintaining comparable accuracy.

Generalized raking and stabilized weights for regression modeling in two-phase samples

2026-05-15T09:56:34Z

In regression models fitted to data from complex survey designs, sampling weights often incorporate non-essential variation, inflating variance estimates. Stabilized weights mitigate this issue by adjusting sampling weights to account for variation explained by covariates. In the context of two-phase sampling, we evaluate the performance of optimal stabilized weights and propose combining the stabilized weight estimator with generalized raking, a class of efficient design-based estimators. This combination improves efficiency by reducing unnecessary weight variation and leveraging information from auxiliary variables. We show this combination can be implemented using the standard statistical package that handles two-phase samples and generalized raking. Simulation studies demonstrate that the proposed estimator enhances precision under realistic two-phase designs, though efficiency gains may be limited in highly informative designs. The developed methods were applied to a large multinational two-phase study of Kaposi sarcoma among people living with HIV.
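
Calibration of phase-two weights to phase-one totals can be illustrated with the linear member of the generalized raking family, which has a closed-form solution. The sketch below assumes simple random phase-two sampling and a linear distance function; raking proper uses an exponential distance and an iterative solve, and the stabilized-weight step is not shown. All names and the simulated data are assumptions.

```python
import numpy as np

def linear_calibration(design_w, aux, totals):
    """Adjust design weights so weighted auxiliary totals match known totals.
    Solves for lam in w_i = d_i * (1 + x_i' lam), the linear-distance case."""
    weighted = aux * design_w[:, None]
    gap = totals - weighted.sum(axis=0)
    lam = np.linalg.solve(aux.T @ weighted, gap)
    return design_w * (1.0 + aux @ lam)

rng = np.random.default_rng(5)
phase1_size, phase2_size = 2000, 300

# Auxiliary variables observed for everyone in phase one (intercept plus one covariate).
aux_all = np.column_stack([np.ones(phase1_size), rng.normal(size=phase1_size)])
totals = aux_all.sum(axis=0)

# Phase two: simple random subsample with equal design weights.
chosen = rng.choice(phase1_size, size=phase2_size, replace=False)
design_w = np.full(phase2_size, phase1_size / phase2_size)

calibrated_w = linear_calibration(design_w, aux_all[chosen], totals)
```

After calibration the phase-two sample reproduces the phase-one totals of the auxiliary variables exactly, which is where the efficiency gain comes from when those variables predict the quantities of interest.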

NMF-FFB: Non-negative matrix factorization with feedforward-feedback structure

2026-05-15T09:03:18Z

Non-negative matrix factorization (NMF) approximates a non-negative endogenous data matrix as $Y_1 \approx XB$, with non-negative latent components $X$ and coefficients $B$. Standard covariate-aware NMF is feedforward: $B$ depends only on exogenous variables $Y_2$, with no latent feedback among endogenous variables. We propose NMF-FFB (NMF with feedforward-feedback structure), an exploratory data-fitting framework that embeds the simultaneous equation $B = \Theta_1 Y_1 + \Theta_2 Y_2$ in NMF, where $\Theta_1$ captures non-negative latent feedback and $\Theta_2$ captures non-negative exogenous pathways. NMF-FFB is positioned within data-fitting structural equation modeling (SEM): it fits $Y_1$ directly rather than a model-implied covariance, and is not a confirmatory measurement model or a replacement for maximum-likelihood SEM under standard confirmatory factor analysis assumptions. When $\rho(X\Theta_1)<1$, the reduced form $Y_1 \approx (I-X\Theta_1)^{-1} X\Theta_2 Y_2$ defines a latent Leontief inverse separating direct from cumulative feedback-amplified effects. Estimation uses regularized multiplicative updates with orthogonality and sparsity penalties; an $X$-fixed bootstrap summarizes uncertainty for the feedback spectral radius, the amplification ratio, and path coefficients. Unlike conventional SEM, NMF-FFB requires only the latent rank $Q$ and lets $X$ group endogenous indicators into latent factors. This suits non-negative additive data, automatic loading discovery, Leontief-type cumulative effects, and small samples where covariance-based maximum-likelihood fitting is ill-conditioned. Applications to Holzinger-Swineford, Los Angeles pollution-mortality, and Mississippi county-level health data demonstrate interpretable parts-based representations across distinct latent-feedback regimes.
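
The reduced form follows from substituting the simultaneous equation into the factorization: Y1 = X(Theta1 Y1 + Theta2 Y2) rearranges to (I - X Theta1) Y1 = X Theta2 Y2. The sketch below checks this numerically with random non-negative matrices; the dimensions and values are arbitrary assumptions, and no estimation (multiplicative updates, bootstrap) is attempted.

```python
import numpy as np

def leontief_reduced_form(x, theta1, theta2, y2):
    """Return the feedback spectral radius, the direct effect X Theta2 Y2,
    and the cumulative effect (I - X Theta1)^{-1} X Theta2 Y2."""
    feedback = x @ theta1
    radius = float(np.max(np.abs(np.linalg.eigvals(feedback))))
    if radius >= 1.0:
        raise ValueError("spectral radius must be below 1 for the reduced form")
    direct = x @ theta2 @ y2
    cumulative = np.linalg.solve(np.eye(feedback.shape[0]) - feedback, direct)
    return radius, direct, cumulative

rng = np.random.default_rng(6)
P, Q, R, N = 5, 2, 3, 8             # indicators, latent rank, exogenous vars, samples

X = rng.uniform(size=(P, Q))
X /= X.sum(axis=0)                   # columns sum to one
theta1 = 0.3 * rng.uniform(size=(Q, P))   # small feedback keeps the radius below 1
theta2 = rng.uniform(size=(Q, R))
Y2 = rng.uniform(size=(R, N))

radius, direct, Y1 = leontief_reduced_form(X, theta1, theta2, Y2)
```

Since all matrices are non-negative and the radius is below one, the inverse equals the series I + A + A^2 + ... with A = X Theta1, so the cumulative effect is never smaller than the direct effect.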

Bayesian inference for the learning rate in Generalised Bayesian inference

2026-05-15T08:45:58Z

In Generalised Bayesian Inference (GBI), the learning rate and hyperparameters of the loss must be estimated. These inference-hyperparameters cannot be estimated from the data jointly with the other parameters simply by giving them a prior. However, in some settings there exist unknown "true" hyperparameter values about which it is meaningful to hold prior beliefs. It is then possible to use Bayesian inference with held-out data to obtain hyperparameter posteriors. We define two hyperparameter posteriors, one based on an ELPPD-utility and one aiming to cover the pseudo-true parameter. The new framework supports estimation and uncertainty quantification for multiple hyperparameters jointly. Experiments show that the resulting GBI posteriors outperform Bayesian inference on simulated test data and select optimal or near-optimal hyperparameter values in a large real-world text analysis problem. Generalised Bayesian inference is particularly useful for combining multiple data sets, and most of our examples belong to that setting. We also give asymptotic results for some of the special "multi-modular" Generalised Bayes posteriors which we use in our examples.
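
The role of the learning rate is easiest to see in a conjugate toy case. The sketch below assumes the usual tempered form, in which the likelihood is raised to a power `eta`, and a normal mean model with known variance, so that the generalised posterior is available in closed form. The model, the numbers, and the function name are illustrative assumptions; the held-out-data posteriors for `eta` proposed in the paper are not implemented.

```python
def tempered_normal_posterior(y_mean, n, sigma2, prior_mean, prior_var, eta):
    """Posterior for a normal mean when the likelihood is raised to the power eta.
    eta = 1 is standard Bayes; eta = 0 returns the prior."""
    precision = 1.0 / prior_var + eta * n / sigma2
    mean = (prior_mean / prior_var + eta * n * y_mean / sigma2) / precision
    return mean, 1.0 / precision

settings = dict(y_mean=1.5, n=20, sigma2=4.0, prior_mean=0.0, prior_var=1.0)
posteriors = {eta: tempered_normal_posterior(eta=eta, **settings)
              for eta in (0.0, 0.25, 1.0)}
```

A smaller learning rate downweights the data, pulling the posterior mean toward the prior and widening the posterior; the data alone cannot pick `eta` in this construction, which is why additional information such as held-out data is needed.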

Re-examining and calibrating weighted survival analysis for causal inference

2026-05-15T07:47:33Z

Causal inference with time-to-event outcomes is fundamental in various scientific studies. In a static setup with fitted propensity scores, weighted Kaplan-Meier estimation for survival probabilities and weighted Breslow-Peto estimation for hazard ratios have been widely used, but their statistical properties have been overlooked or studied only to a limited extent. We re-examine the weighted Kaplan-Meier method by formally linking it with the general framework of augmented inverse probability weighted estimation including both point and variance estimation. Furthermore, to address limitations of existing weighted methods for survival analysis, we develop new methods and associated theory through calibrated estimation in both low-dimensional and high-dimensional settings. We present a simulation study and an empirical application on the effectiveness of adjunctive psychotropic treatments for patients with schizophrenia. The calibrated methods yield coverage proportions closer to target ones in the simulation study, and produce shorter confidence intervals in both simulation and empirical studies.
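
A weighted Kaplan-Meier estimator replaces the counts of subjects at risk and of events with sums of weights. The sketch below is a minimal version that takes the weights as given, for example inverse propensity scores; propensity-score fitting, variance estimation, and the calibrated estimators from the paper are not shown, and the toy data are hypothetical.

```python
import numpy as np

def weighted_kaplan_meier(time, event, weight):
    """Weighted product-limit estimator.
    Returns the distinct event times and the survival estimate just after each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    weight = np.asarray(weight, dtype=float)
    event_times = np.unique(time[event == 1])
    survival, current = [], 1.0
    for t in event_times:
        at_risk = weight[time >= t].sum()
        failures = weight[(time == t) & (event == 1)].sum()
        current *= 1.0 - failures / at_risk
        survival.append(current)
    return event_times, np.array(survival)

time = [1.0, 2.0, 3.0]
event = [1, 0, 1]                     # the second subject is censored at time 2

_, unweighted = weighted_kaplan_meier(time, event, weight=[1.0, 1.0, 1.0])
_, reweighted = weighted_kaplan_meier(time, event, weight=[2.0, 1.0, 1.0])
```

With unit weights this reduces to the ordinary Kaplan-Meier estimator; upweighting the first subject makes the early event count for more and lowers the estimated survival at time 1.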