https://arxiv.org/api/SU1tktm1aRcAYAj2bFbeHxP8ELk
Updated: 2026-03-31T10:03:23Z

http://arxiv.org/abs/2603.22408v1
Title: Spline Quantile Regression with Cubic and Linear Smoothing Splines
Updated: 2026-03-23T18:00:15Z
Abstract: Spline quantile regression (SQR) is a method introduced recently by Li and Megiddo (2026) for linear quantile regression where the regression coefficients are treated as smooth functions of the quantile level. With the coefficients represented by cubic splines with fixed knots on a given set of quantiles, the SQR method produces an estimate for the functional coefficients by solving a penalized quantile regression problem. The $\ell_1$-norm of the second derivatives of the coefficients is employed as the penalty for regulating the roughness of the functional coefficients. This work extends the SQR method by introducing additional pairings of the functional representation for the regression coefficients and the penalty for their roughness. The resulting cubic and linear SQR solutions are shown to be smoothing splines which are optimal in a functional space larger than the respective spline space with fixed knots. It is shown that the cubic SQR can be reformulated and solved as a quadratic program and the linear SQR as a linear program. A simulation study demonstrates that the SQR solutions not only offer a concise functional representation of the regression coefficients with distinct smoothness characteristics, but also provide a capability of producing more accurate estimates of the regression coefficients when the underlying functions are suitably smooth. Application of the SQR solutions is demonstrated by real-data examples, including a Granger causality analysis of stock market indices.
Published: 2026-03-23T18:00:15Z
Authors: Ta-Hsin Li

http://arxiv.org/abs/2603.22215v1
Title: Multiview Graph Fusion with Covariates
Updated: 2026-03-23T17:12:48Z
Abstract: Joint modeling of multiview graphs with a common set of nodes between views and auxiliary predictors is an essential, yet less explored, area in statistical methodology.
Traditional approaches often treat graphs in different views as independent or fail to adequately incorporate predictors, potentially missing complex dependencies within and across graph views and leading to reduced inferential accuracy. Motivated by such methodological shortcomings, we introduce an integrative Bayesian approach for joint learning of a multiview graph with vector-valued predictors. Our modeling framework assumes a common set of nodes for each graph view while allowing for diverse interconnections or edge weights between nodes across graph views, accommodating both binary and continuous valued edge weights. By adopting a hierarchical Bayesian modeling approach, our framework seamlessly integrates information from diverse graphs through carefully designed prior distributions on model parameters. This approach enables the estimation of crucial model parameters defining the relationship between these graph views and predictors, as well as offers predictive inference of the graph views. Crucially, the approach provides uncertainty quantification in all such inferences. Theoretical analysis establishes that the posterior predictive density for our model asymptotically converges to the true data-generating density, under mild assumptions on the true data-generating density and the growth of the number of graph nodes relative to the sample size. Simulation studies validate the inferential advantages of our approach over predictor-dependent tensor learning and independent learning of different graph views with predictors. 
We further illustrate model utility by analyzing functional connectivity graphs in neuroscience under cognitive control tasks, relating task-related brain connectivity with phenotypic measures.
Published: 2026-03-23T17:12:48Z
Comment: 46 pages
Authors: Sharmistha Guha, Jose Rodriguez-Acosta, Ivo Dinov

http://arxiv.org/abs/2206.02088v3
Title: LOCO Feature Importance Inference without Data Splitting via Minipatch Ensembles
Updated: 2026-03-23T16:56:50Z
Abstract: Feature importance inference is critical for the interpretability and reliability of machine learning models. There has been increasing interest in developing model-agnostic approaches to interpret any predictive model, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing methods typically make limiting distributional assumptions, modeling assumptions, and require data splitting. In this work, we develop a novel, mostly model-agnostic, and distribution-free inference framework for feature importance in regression or classification tasks that does not require data splitting. Our approach leverages a form of random observation and feature subsampling called minipatch ensembles; it utilizes the trained ensembles for inference and requires no model-refitting or held-out test data after training. We show that our approach enjoys both computational and statistical efficiency as well as circumvents interpretational challenges with data splitting. Further, despite using the same data for training and inference, we show the asymptotic validity of our confidence intervals under mild assumptions. Additionally, we propose theory-supported solutions to critical practical issues including vanishing variance for null features and inference after data-driven tuning for hyperparameters. We demonstrate the advantages of our approach over existing methods on a series of synthetic and real data examples.
Published: 2022-06-05T03:14:48Z
Authors: Luqin Gan, Lili Zheng, Genevera I. Allen

http://arxiv.org/abs/2502.04122v2
Title: How many unseen species are in multiple areas?
Updated: 2026-03-23T16:06:27Z
Abstract: In ecology, the description of species composition and biodiversity calls for statistical methods that involve estimating features of interest in unobserved samples based on an observed one. In the last decade, the Bayesian nonparametrics literature has thoroughly investigated the case where data arise from a homogeneous population. In this work, we propose a novel framework to address heterogeneous populations, specifically dealing with scenarios where data arise from two areas. This setting significantly increases the mathematical complexity of the problem and, as a consequence, it has received limited attention in the literature. While early approaches leverage computational methods, we provide a distributional theory for the in-sample analysis of any observed sample and enable out-of-sample prediction for the number of unseen distinct and shared species in additional samples of arbitrary sizes. The latter also extends the frequentist estimators, which solely deal with one-step-ahead prediction. Furthermore, our results can be applied to address sample size determination in sampling problems aimed at detecting distinct and shared species. Our results are illustrated in a real-world dataset concerning a population of ants in the city of Trieste.
Published: 2025-02-06T14:54:26Z
Authors: Alessandro Colombi, Raffaele Argiento, Federico Camerlenghi, Lucia Paci

http://arxiv.org/abs/2603.22071v1
Title: Detecting change regions on spheres
Updated: 2026-03-23T15:04:04Z
Abstract: While change point detection in time series data has been extensively studied, little attention has been given to its generalisation to data observed on spheres or other manifolds, where changes may occur within spatially complex regions with irregular boundaries, posing significant challenges.
We propose a new class of estimators, namely, Change Region Identification and SeParation (CRISP), to locate changes in the mean function of a signal-plus-noise model defined on $d$-dimensional spheres. The CRISP estimator applies to scenarios with a single change region, and is extended to multiple change regions via a newly developed generic scheme. The convergence rate of the CRISP estimator is shown to depend on the VC dimension of the hypothesis class that characterises the change regions in general. We also carefully study the case where change regions have the geometry of spherical caps. Simulations confirm the promising finite-sample performance of this approach. The CRISP estimator's practical applicability is further demonstrated through two real data sets on global temperature and ozone hole.
Published: 2026-03-23T15:04:04Z
Authors: Di Su, Yining Chen, Tengyao Wang

http://arxiv.org/abs/2603.22024v1
Title: Cost-Aware Optimized Front-Door Experimental Design
Updated: 2026-03-23T14:31:18Z
Abstract: Causal effect estimation often succeeds cost-constrained sequential data collection. This work considers multivariate linear front-door models with arbitrary unobserved confounding on treatment and response. We optimize the experimental design by balancing the statistical efficiency and measurement costs through partial data. The full-data efficient influence function for the causal effect is derived, together with the geometry of all observed-data influence functions. This characterization yields a closed-form optimal sampling policy and an estimator to minimize the asymptotic variance of regular asymptotically linear (RAL) estimators within a class of augmented full-data influence functions. The resulting design also covers back-door estimation.
In simulations and applications to biological, medical, and industrial datasets, the optimized designs achieve substantial efficiency gains ($5.3\%$ to $31.9\%$) over naive full-sampling strategies.
Published: 2026-03-23T14:31:18Z
Comment: This article will be published in the proceedings of CLeaR 2026
Authors: Leopold Mareis, Mathias Drton

http://arxiv.org/abs/2603.22006v1
Title: A plug-and-play approach with fast uncertainty quantification for weak lensing mass mapping
Updated: 2026-03-23T14:12:48Z
Abstract: Upcoming stage-IV surveys such as Euclid and Rubin will deliver vast amounts of high-precision data, opening new opportunities to constrain cosmological models with unprecedented accuracy. A key step in this process is the reconstruction of the dark matter distribution from noisy weak lensing shear measurements.
Current deep learning-based mass mapping methods achieve high reconstruction accuracy, but either require retraining a model for each new observed sky region (limiting practicality) or rely on slow MCMC sampling. Efficient exploitation of future survey data therefore calls for a new method that is accurate, flexible, and fast at inference. In addition, uncertainty quantification with coverage guarantees is essential for reliable cosmological parameter estimation.
We introduce PnPMass, a plug-and-play approach for weak lensing mass mapping. The algorithm produces point estimates by alternating between a gradient descent step with a carefully chosen data fidelity term, and a denoising step implemented with a single deep learning model trained on simulated data corrupted by Gaussian white noise. We also propose a fast, sampling-free uncertainty quantification scheme based on moment networks, with calibrated error bars obtained through conformal prediction to ensure coverage guarantees. Finally, we benchmark PnPMass against both model-driven and data-driven mass mapping techniques.
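The alternation described above (a gradient step on a data-fidelity term, then a denoising step) can be sketched on a toy 1-D problem. This is only an illustration under strong assumptions: the forward operator is the identity rather than a lensing operator, the noise covariance is diagonal, and a moving average stands in for the trained deep denoiser. None of the function names below come from PnPMass.

```python
import numpy as np

def moving_average(x, width=9):
    # Hypothetical stand-in for the learned denoiser (assumption, not PnPMass).
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def pnp_reconstruct(y, noise_var, n_iter=50):
    # Data fidelity assumed here: 0.5 * (x - y)^T Sigma^{-1} (x - y),
    # with Sigma = diag(noise_var) and an identity forward operator.
    step = noise_var.min()  # keeps step / noise_var <= 1 elementwise
    x = np.zeros_like(y)
    for _ in range(n_iter):
        grad = (x - y) / noise_var           # gradient of the data-fidelity term
        x = moving_average(x - step * grad)  # denoising step
    return x

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
truth = np.sin(2 * np.pi * t)
noise_sd = np.where(t < 0.5, 0.3, 0.6)  # heteroscedastic noise levels
y = truth + noise_sd * rng.standard_normal(t.size)
x_hat = pnp_reconstruct(y, noise_sd**2)

def rmse(a):
    return float(np.sqrt(np.mean((a - truth) ** 2)))

rmse_noisy, rmse_pnp = rmse(y), rmse(x_hat)
print(rmse_noisy, rmse_pnp)
```

The noise covariance enters only through the gradient step, so in this sketch the same denoiser is reused unchanged when `noise_sd` changes.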
PnPMass achieves performance close to that of state-of-the-art deep-learning methods while offering fast inference (converging in just a few iterations) and requiring only a single training phase, independently of the noise covariance of the observations. It therefore combines flexibility, efficiency, and reconstruction accuracy, while delivering tighter error bars than existing approaches, making it well suited for upcoming weak lensing surveys.
Published: 2026-03-23T14:12:48Z
Authors: Hubert Leterme, Andreas Tersenov, Jalal Fadili, Jean-Luc Starck

http://arxiv.org/abs/2504.10881v3
Title: A Nonparametric Bayesian Local-Global Model for Enhanced Adverse Event Signal Detection in Spontaneous Reporting System Data
Updated: 2026-03-23T14:07:16Z
Abstract: Spontaneous reporting system databases are key resources for post-marketing surveillance, providing real-world evidence (RWE) on the adverse events (AEs) of regulated drugs or other medical products. Various statistical methods have been proposed for AE signal detection in these databases, flagging drug-specific AEs with disproportionately high observed counts compared to expected counts under independence. However, signal detection remains challenging for rare AEs or newer drugs, which receive small observed and expected counts and thus suffer from reduced statistical power. Principled information sharing on signal strengths across drugs/AEs is crucial in such cases to enhance signal detection. However, existing methods typically ignore complex between-drug associations on AE signal strengths, limiting their ability to detect signals. We propose novel local-global mixture Dirichlet process (DP) prior-based nonparametric Bayesian models to capture these associations, enabling principled information sharing between drugs while balancing flexibility and shrinkage for each drug, thereby enhancing statistical power.
We develop efficient Markov chain Monte Carlo algorithms for implementation and employ a false discovery rate (FDR)-controlled, false negative rate (FNR)-optimized hypothesis testing framework for AE signal detection. Extensive simulations demonstrate our methods' superior sensitivity, often surpassing existing approaches by a twofold or greater margin, while strictly controlling the FDR. An application to FDA FAERS data on statin drugs further highlights our methods' effectiveness in real-world AE signal detection. Software implementing our methods is provided as supplementary material.
Published: 2025-04-15T05:22:01Z
Authors: Xin-Wei Huang, Saptarshi Chakraborty

http://arxiv.org/abs/2603.17866v2
Title: Bayesian multilevel step-and-turn models for evaluating player movement in American football
Updated: 2026-03-23T14:05:21Z
Abstract: In sports analytics, player tracking data have driven significant advancements in the task of player evaluation. We present a novel generative framework for evaluating the observed frame-by-frame player positioning against a distribution of hypothetical alternatives. We illustrate our approach by modeling the within-play movement of an individual ball carrier in the National Football League (NFL). Specifically, we develop Bayesian multilevel models for frame-level player movement based on two components: step length (distance between successive locations) and turn angle (change in direction between successive steps). Using the step-and-turn models, we perform posterior predictive simulation to generate hypothetical ball carrier steps at each frame during a play. This enables comparison of the observed player movement with a distribution of simulated alternatives using common valuation measures in American football.
We apply our framework to tracking data from the first nine weeks of the 2022 NFL season and derive novel player performance metrics based on hypothetical evaluation.
Published: 2026-03-18T15:54:28Z
Authors: Quang Nguyen, Ronald Yurko

http://arxiv.org/abs/2603.21992v1
Title: Pair-based estimators of infection and removal rates for stochastic epidemic models
Updated: 2026-03-23T13:59:12Z
Abstract: Stochastic epidemic models can estimate infection and removal rates, and derived quantities such as the basic reproductive number ($R_0$), when both infection and removal times are observed. In practice, however, removal times are often available while infection times are not, and existing methods that rely only on removal times can become unstable or biased. We study inference for stochastic SIR/SEIR models in a partial-observation setting. We develop imputation-based estimators that use a small calibration sample of fully observed infectious periods, derive closed-form expressions for the pairwise exposure terms they require, and use a studentized parametric bootstrap for bias correction and uncertainty quantification. In simulations, removal-time-only methods performed poorly in moderate to large $R_0$ scenarios, while observing even tens of complete infectious periods substantially improved the estimation of the infection rate. A reanalysis of the 1861 Hagelloch measles outbreak under simulated missingness recovered stable qualitative differences in transmission between school classes. Based on our results, we advocate for the targeted collection of a modest number of complete infectious periods as a means of improving surveillance in the early stages of an epidemic.
Published: 2026-03-23T13:59:12Z
Authors: Seth D. Temple, Jonathan Terhorst

http://arxiv.org/abs/2603.21967v1
Title: Unified implementation and comparison of Bayesian shrinkage methods for treatment effect estimation in subgroups
Updated: 2026-03-23T13:32:32Z
Abstract: Evaluating treatment effect heterogeneity across patient subgroups is a fundamental aspect of clinical trial analysis.
Yet, these analyses have inherent limitations due to small sample sizes and the substantial number of subgroups investigated. Statisticians in regulatory agencies and pharmaceutical companies have begun considering shrinkage methods grounded in Bayesian statistical theory. These methods incorporate priors on treatment effect heterogeneity, which operationally shrink raw subgroup treatment effect estimates towards the overall treatment effect. Various shrinkage estimators and priors have been proposed, yet it remains unclear which methods perform best. This work provides a unified presentation, software implementation (in the R package bonsaiforest2), and simulation comparison of one-way and global shrinkage methods for continuous, binary, count, and time-to-event endpoints. One-way models fit a separate shrinkage model for each subgrouping variable, whereas global models fit a model including all subgroup indicators at once. Both can derive standardized subgroup-specific treatment effects. Across all simulation scenarios, shrinkage methods outperformed the standard subgroup estimator without shrinkage in terms of mean squared error. They were also more efficient in identifying a non-efficacious subgroup. Global shrinkage models tended to have smaller mean squared error and less dependence on hyperprior parameters than one-way models, but also exhibited slightly larger bias and worse frequentist coverage of associated credible intervals. For both models, hyperprior choices anchored in trial assumptions about the anticipated size of the overall treatment effect performed well. 
We conclude that some degree of shrinkage is preferable to none and advocate for the routine inclusion of shrunken estimates in clinical forest plots to facilitate more robust decision-making.
Published: 2026-03-23T13:32:32Z
Comment: 26 pages (23 main, 3 supplementary), 5 figures (4 main, 1 supplementary), 8 tables (4 main, 4 supplementary)
Authors: Marcel Wolbers, Miriam Pedrera Gómez, Alex Ocampo, Isaac Gravestock

http://arxiv.org/abs/2603.21952v1
Title: Parsimonious Subset Selection for Generalized Linear Models with Biomedical Applications
Updated: 2026-03-23T13:09:57Z
Abstract: High-dimensional biomedical studies require models that are simultaneously accurate, sparse, and interpretable, yet exact best subset selection for generalized linear models is computationally intractable. We develop a scalable method that combines a continuous Boolean relaxation of the subset problem with a Frank--Wolfe algorithm driven by envelope gradients. The resulting method, which we refer to as COMBSS-GLM, is simple to implement, requires one penalized generalized linear model fit per iteration, and produces sparse models along a model-size path. Theoretically, we identify a curvature-based parameter regime in which the relaxed objective is concave in the selection weights, implying that global minimizers occur at binary corners. Empirically, in logistic and multinomial simulations across low- and high-dimensional correlated settings, the proposed method consistently improves variable-selection quality relative to established penalized likelihood competitors while maintaining strong predictive performance. In biomedical applications, it recovers established loci in a binary-outcome rice genome-wide association study and achieves perfect multiclass test accuracy on the Khan SRBCT cancer dataset using a small subset of genes.
Open-source implementations are available in R at https://github.com/benoit-liquet/COMBSS-GLM-R and in Python at https://github.com/saratmoka/COMBSS-GLM-Python.
Published: 2026-03-23T13:09:57Z
Authors: Anant Mathur, Benoit Liquet, Samuel Muller, Sarat Moka

http://arxiv.org/abs/2402.01491v3
Title: Moving Aggregate Modified Autoregressive Copula-Based Time Series Models (MAGMAR-Copulas)
Updated: 2026-03-23T12:28:17Z
Abstract: Copula-based time series models can model univariate and stationary time series in a flexible way by decomposing the joint distribution of consecutive observations into a copula and the stationary distribution. Implicitly this approach assumes a finite Markov order. In reality a time series may not follow the Markov property. We modify the copula-based time series models by introducing a moving aggregate (MAG) part into the model updating equation. The functional form of the MAG-part is given as the conditional quantile function corresponding to a copula. The resulting MAG-modified Autoregressive Copula-Based Time Series model (MAGMAR-Copula) is discussed in detail and distributional properties are derived in a D-vine framework. We show that the stationary distribution implied by the model is not standard-uniform. Hence we propose an adjustment transformation that recovers the desired standard-uniformity. The model nests the classical ARMA model and can be interpreted as a non-linear generalization of the ARMA model. The modeling performance is evaluated by modeling US inflation. Our model is competitive with benchmark models in terms of information criteria.
Published: 2024-02-02T15:18:45Z
Authors: Sven Pappert

http://arxiv.org/abs/2603.21844v1
Title: On the Number of Conditional Independence Tests in Constraint-based Causal Discovery
Updated: 2026-03-23T11:33:43Z
Abstract: Learning causal relations from observational data is a fundamental problem with wide-ranging applications across many fields. Constraint-based methods infer the underlying causal structure by performing conditional independence tests.
However, existing algorithms such as the prominent PC algorithm need to perform a large number of independence tests, which in the worst case is exponential in the maximum degree of the causal graph. Despite extensive research, it remains unclear if there exist algorithms with better complexity without additional assumptions. Here, we establish an algorithm that achieves a better complexity of $p^{\mathcal{O}(s)}$ tests, where $p$ is the number of nodes in the graph and $s$ denotes the maximum undirected clique size of the underlying essential graph. Complementing this result, we prove that any constraint-based algorithm must perform at least $2^{\Omega(s)}$ conditional independence tests, establishing that our proposed algorithm achieves exponent-optimality up to a logarithmic factor in terms of the number of conditional independence tests needed. Finally, we validate our theoretical findings through simulations, on semi-synthetic gene-expression data, and real-world data, demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.
Published: 2026-03-23T11:33:43Z
Authors: Marc Franquesa Monés, Jiaqi Zhang, Caroline Uhler

http://arxiv.org/abs/2506.12771v2
Title: Machine-Learning-Powered Specification Testing in Linear Instrumental Variable Models
Updated: 2026-03-23T10:48:10Z
Abstract: The linear instrumental variable (IV) model is widely used in observational studies, yet its validity hinges on strong assumptions. Classical specification tests such as the Sargan-Hansen J test are limited to overidentified settings and are therefore not applicable in the common just-identified case, where the number of instruments is equal to the number of endogenous variables. We propose a novel test for the well-specification of the linear IV model under the assumption that the structural error is mean independent of the instruments. This assumption enables specification testing even in the just-identified setting.
Our approach uses the idea of residual prediction: if the two-stage least squares residuals can be predicted from the instruments better than chance, this indicates misspecification. The resulting test employs sample splitting and a user-chosen machine learning method, and we show asymptotic type I error control and consistency against a broad class of alternatives. We further show how the proposed testing principle can be adapted to settings with weak or many instruments via an Anderson-Rubin-type inversion, thereby substantially extending the applicability. The tests accommodate heteroskedasticity- and cluster-robust inference and are implemented in the R package RPIV and the ivmodels software package for Python.
Published: 2025-06-15T08:42:48Z
Authors: Cyrill Scheidegger, Malte Londschien, Peter Bühlmann
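
The residual-prediction idea in the last abstract can be illustrated with a simplified sketch. It is not the authors' test statistic and does not use the RPIV or ivmodels APIs: the simulated data, the polynomial least-squares learner (a stand-in for the user-chosen machine learning method), and the naive t-type statistic are all assumptions made for illustration, and the adjustments needed for a formal type I error guarantee are omitted.

```python
import numpy as np

def two_sls(y, x, z):
    # Just-identified 2SLS with an intercept: solve (Z'X) b = Z'y.
    Z = np.column_stack([np.ones_like(z), z])
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Z.T @ X, Z.T @ y)

def poly_features(z):
    # Hypothetical stand-in for the user-chosen machine learning method.
    return np.column_stack([np.ones_like(z), z, z**2, z**3])

def residual_prediction_stat(y, x, z, rng):
    idx = rng.permutation(len(y))
    train, test = idx[: len(y) // 2], idx[len(y) // 2:]
    # On one half, learn to predict the 2SLS residuals from the instrument.
    a, b = two_sls(y[train], x[train], z[train])
    resid_train = y[train] - a - b * x[train]
    coef, *_ = np.linalg.lstsq(poly_features(z[train]), resid_train, rcond=None)
    # On the other half, check whether those predictions align with fresh residuals.
    a, b = two_sls(y[test], x[test], z[test])
    prod = (y[test] - a - b * x[test]) * (poly_features(z[test]) @ coef)
    return float(np.sqrt(prod.size) * prod.mean() / prod.std(ddof=1))

def simulate(n, direct_effect, rng):
    z = rng.standard_normal(n)               # instrument
    u = rng.standard_normal(n)               # unobserved confounder
    x = z + u                                # endogenous regressor
    eps = 0.5 * u + rng.standard_normal(n)   # structural error, independent of z
    y = x + direct_effect * z**2 + eps       # direct_effect != 0 breaks the linear IV model
    return y, x, z

rng = np.random.default_rng(1)
stat_ok = residual_prediction_stat(*simulate(10_000, 0.0, rng), rng)
stat_bad = residual_prediction_stat(*simulate(10_000, 1.0, rng), rng)
print(stat_ok, stat_bad)
```

In this toy setup the instrument affects the outcome directly through `z**2` when `direct_effect` is nonzero, so the residuals become predictable from `z` and the statistic grows large, whereas it stays moderate when the linear IV model holds.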