http://arxiv.org/abs/2605.30485v1 M-estimation with e-statistics 2026-05-28T19:04:55Z

We present a theory of point estimation with e-statistics (e-values and e-processes) by introducing the "ME-estimator": the parameter that minimizes the corresponding e-statistic, or the evidence against it. Our approach is based on the intuitive idea of e-statistics as a measure of evidence and betting pay-off, and naturally generalizes the classical method of maximum likelihood estimation. First, we establish the consistency as well as the almost sure convergence rate for ME-estimators relating to the high-probability bounds on the size of the confidence set derived from thresholding the e-statistics, an approach that sets ME-estimators apart from traditional M-estimators. Second, we conduct classical M-estimator-style analysis on the consistency and asymptotic normality of ME-estimators in the bounded mean estimation setting, discussing the notion of efficiency (or lack thereof) from various choices of betting strategy. Our work brings e-statistics, a fundamental tool for inference and uncertainty quantification, to the space of estimation.
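
To make the idea of "the parameter that minimizes the e-statistic" concrete, here is a minimal sketch for the bounded mean setting, with observations in [0, 1]. It assumes a fixed betting fraction `lam` and an equal-weight mixture of an upward and a downward bet; both choices, and the names `two_sided_e_value` and `me_estimate`, are illustrative assumptions and not necessarily the betting strategies analyzed in the paper.

```python
def two_sided_e_value(xs, m, lam=0.5):
    """E-value against the hypothesis 'mean == m' for data in [0, 1].

    Averages two betting products: one bets that the mean lies above m,
    the other that it lies below m. lam in (0, 1) is an assumed fixed
    betting fraction, which keeps every factor strictly positive.
    """
    up = 1.0
    down = 1.0
    for x in xs:
        up *= 1.0 + lam * (x - m)
        down *= 1.0 - lam * (x - m)
    return 0.5 * (up + down)


def me_estimate(xs, lam=0.5, grid_size=1001):
    """Grid-search the candidate mean that carries the least evidence against it."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda m: two_sided_e_value(xs, m, lam))


data = [0.2, 0.4, 0.5, 0.6, 0.8]  # hypothetical observations in [0, 1]
m_hat = me_estimate(data)
```

For this symmetric toy sample the minimizer coincides with the sample mean; with other data or other betting schemes the two need not agree exactly.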

2026-05-28T19:04:55Z Hongjian Wang Aaditya Ramdas http://arxiv.org/abs/2605.30471v1 Multidimensional Item Response Theory under General Latent Distributions 2026-05-28T18:41:56Z

Multidimensional item response theory (MIRT) provides an important psychometric framework for modeling how multiple latent traits jointly influence observed item responses. In most existing estimation procedures, the latent trait distribution is assumed to be Gaussian. Although computationally convenient, this assumption can be restrictive in many applications where the latent distribution exhibits skewness, heavy tails, or multimodality. More importantly, misspecifying the latent distribution may bias the estimation of item parameters and latent traits. To address this limitation, we propose a data-driven flow-based framework for MIRT models that can capture a broad class of non-Gaussian latent distributions. The proposed approach represents the latent distribution as an invertible transformation of a simple base distribution. For efficient estimation, we further introduce a conditional flow as a function of both the observed response and the noise to approximate the posterior distribution. Under this framework, the item parameters, latent distribution, and posterior approximation can be learned jointly. Comprehensive simulation studies show that the proposed method improves item-parameter and latent-trait recovery when the true latent distribution is non-normal. An application to a personality dataset further illustrates the practical utility of the proposed framework for modeling complex latent trait distributions in large-scale data.

2026-05-28T18:41:56Z Chengyu Cui Taoyi Chen Chun Wang Gongjun Xu http://arxiv.org/abs/2605.30287v1 MoSAIC: Multi-Resolution Spatial Regression Analysis of Cellular Colocalizations in Cancer Imaging 2026-05-28T17:40:35Z

Hierarchical multiplex imaging approaches generate spatially resolved single-cell measurements across multiple, spatially organized fields of view (FOVs) within patient tumor specimens, thereby enabling systematic investigation of how the organization of the tumor microenvironment varies along biologically meaningful intratumoral gradients. Existing approaches fail to jointly address this multi-resolution data structure needed to recover true biological signals. We propose MoSAIC: multi-resolution spatial regression analysis of cell colocalizations, a hierarchical Bayesian spatial regression model designed for multi-resolution spatial data. MoSAIC decomposes the joint variation into three model components: (i) global tumor-gradient effects, (ii) patient-specific effects to capture inter-patient variability, and (iii) Gaussian process models to account for spatial dependence between FOVs within each patient tumor tissue. Simulations demonstrate MoSAIC has improved prediction and model fit compared to existing spatial and non-spatial model alternatives. Our method is motivated by and applied to a renal cell carcinoma multiplex imaging cohort to investigate immune-tumor colocalization patterns across the epithelial-to-mesenchymal transition (EMT) gradient. MoSAIC identifies increased macrophage-tumor colocalization and decreased cytotoxic T-tumor colocalization progressing across the increasing EMT gradient, consistent with EMT-associated immune suppression and spatially varying immune engagement. Overall, MoSAIC provides an interpretable, multi-resolution framework for quantifying spatial tumor-gradient effects in cancer imaging studies. Software is available on GitHub at jcaldous/MoSAIC.

2026-05-28T17:40:35Z 45 pages (30 before supplement), 6 figures, submitted to ISBA and JSM Jessica Aldous Michele Peruzzi Maria Masotti Aaron Udager Allison May Evan Keller Veerabhadran Baladandayuthapani http://arxiv.org/abs/2511.12732v3 Scalable and Communication-Efficient Varying Coefficient Mixed Effect Models: Methodology, Theory, and Applications 2026-05-28T16:32:54Z

Human migration exhibits complex spatiotemporal dependence driven by environmental and socioeconomic forces. Modeling such patterns at scale requires methods that accommodate many random effects while remaining feasible when raw data or large design matrices cannot be freely shared across distributed nodes. We develop a communication-efficient inference framework for Varying Coefficient Mixed Models (VCMMs) with flexible mean structures and large correlated random-effect components. Using a Bayesian hierarchical representation of penalized splines, we derive sufficient statistics that preserve each node's likelihood contribution and recover the estimator from the full data under unrestricted communication. Under communication constraints, these statistics support a one-step communication-efficient estimator with first-order efficiency. An SVD-enhanced implementation stabilizes large or ill-conditioned random-effect covariance operators. Theory establishes likelihood preservation, convergence, asymptotic efficiency, and finite-sample concentration. Simulations and U.S. migration-flow data demonstrate accuracy, scalability, and recovery of dynamic spatial patterns.
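
The claim that node-level sufficient statistics can recover the full-data estimator is easiest to see in a much simpler model. The sketch below uses ordinary least squares for a straight line, not the VCMM with penalized splines from the abstract: each node shares only five sums, and the combined sums reproduce the pooled fit. All function names and the data are hypothetical.

```python
def node_summary(xs, ys):
    """Sufficient statistics one node would share: n, sum x, sum y, sum x^2, sum xy."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )


def combine(summaries):
    """Add the per-node statistics component by component."""
    return tuple(sum(parts) for parts in zip(*summaries))


def fit_from_summary(summary):
    """Least-squares intercept and slope computed from the aggregated sums only."""
    n, sx, sy, sxx, sxy = summary
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Two hypothetical nodes whose raw data never leave the node.
node_a = ([0, 1, 2], [1, 3, 5])
node_b = ([3, 4], [7, 9])

distributed_fit = fit_from_summary(
    combine([node_summary(*node_a), node_summary(*node_b)])
)
pooled_fit = fit_from_summary(
    node_summary(node_a[0] + node_b[0], node_a[1] + node_b[1])
)
```

Here `distributed_fit` and `pooled_fit` agree because the sums are additive across nodes; the paper's contribution concerns the much harder case with large correlated random effects.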

2025-11-16T18:58:31Z 3 Figures Lida Chalangar Jalili Dehkharghani Li-Hsiang Lin http://arxiv.org/abs/2605.30178v1 Cellwise Robust Discriminant Analysis 2026-05-28T16:25:00Z

Classical discriminant analysis (DA) is based on the mean and empirical covariance matrix of each class, both of which are sensitive to outliers in the data. In the past the focus was on casewise outliers, that is, datapoints that lie far away. But nowadays there is increasing interest in cellwise outliers, which are unexpected entries in the data matrix. Removing an entire case because it has one or a few outlying cells would lose much information. Cellwise robust methods aim to detect the outlying cells and to preserve the information in the other cells. We propose a DA method that is trained by estimating the location and covariance of each class by cellwise and casewise robust estimators that can also handle NAs. The main novelty of our approach is in the prediction on test data, which may themselves contain outlying cells and NAs. The new robust discriminant function is derived from a novel statistical model by penalized maximum likelihood. We focus on quadratic DA, but also cover the setting of linear DA. The new cellQDA and cellLDA methods perform well in simulation. The approach is illustrated on real data, and the results are interpreted with the help of graphical displays.
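
For reference, the classical quadratic discriminant rule that robust variants start from can be written as follows. The symbols are assumptions introduced here, not notation from the abstract: $\mu_k$, $\Sigma_k$, and $\pi_k$ denote the mean, covariance matrix, and prior probability of class $k$. The robust discriminant function proposed in the paper is not reproduced.

```latex
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
              - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
              + \log\pi_k,
\qquad
\hat{k}(x) = \arg\max_k \, \delta_k(x).
```

Linear DA is the special case in which all classes share one covariance matrix, $\Sigma_k = \Sigma$. Because $\mu_k$ and $\Sigma_k$ enter the score directly, outlying values in the training data or in the test point $x$ propagate into the classification.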

2026-05-28T16:25:00Z Fabio Centofanti Can Hakan Dagidir Mia Hubert Peter J. Rousseeuw http://arxiv.org/abs/2605.30158v1 High-Dimensional Data with Measurement Error 2026-05-28T16:17:14Z

In many important statistical analyses, the number of covariates $p$ often exceeds the data size $n$, a regime commonly referred to as high-dimensional. While considerable progress has been made in high-dimensional regression under the assumption of error-free covariates, real-world data frequently involve noisy or corrupted measurements. When left unaddressed, measurement errors can silently distort the analysis and mislead the conclusions. This paper reviews and evaluates some advisable statistical inference methods for high-dimensional regression in the presence of mismeasured covariates. We discuss four penalized regression methods -- ridge, lasso, Dantzig selector, and Elastic-net -- alongside their measurement-error-corrected variants, and conduct a comparative study under linear additive and uncorrelated measurement error models. Through simulation studies and a real application to high-dimensional medical genetic data, we illustrate the methods studied, show that the choice of correction procedure is problem-specific, and provide practical recommendations to help practitioners navigate this methodological landscape.
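
The distortion caused by additive measurement error, and the basic moment correction behind many corrected estimators, can be sketched in one dimension. The example assumes the error variance `sigma_u2` is known and uses no penalty, so it illustrates the correction idea only, not the corrected ridge, lasso, Dantzig selector, or Elastic-net procedures compared in the paper. The data are constructed by hand so that the arithmetic is exact.

```python
def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance with divisor n (moment version)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def naive_slope(w, y):
    """Regress y on the error-prone covariate w, ignoring the error."""
    return cov(w, y) / cov(w, w)


def corrected_slope(w, y, sigma_u2):
    """Subtract the known error variance from the observed variance of w."""
    return cov(w, y) / (cov(w, w) - sigma_u2)


x = [-1.0, -1.0, 1.0, 1.0]             # true covariate (unobserved in practice)
u = [-0.5, 0.5, -0.5, 0.5]             # additive error, uncorrelated with x
w = [xi + ui for xi, ui in zip(x, u)]  # observed covariate
y = [2.0 * xi for xi in x]             # outcome with true slope 2

naive = naive_slope(w, y)
corrected = corrected_slope(w, y, sigma_u2=cov(u, u))
```

The naive slope is attenuated toward zero (1.6 here), while the corrected slope recovers 2. In several dimensions the analogous step subtracts the error covariance matrix from the Gram matrix of the observed covariates.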

2026-05-28T16:17:14Z 21 pages, 0 figures Herman Tesso Georges Nguefack-Tsague http://arxiv.org/abs/2510.27663v3 Bayesian model selection and misspecification testing in imaging inverse problems only from noisy and partial measurements 2026-05-28T16:08:35Z

Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian cross-validation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.

2025-10-31T17:32:11Z Tom Sprunck Marcelo Pereyra Tobias Liaudat http://arxiv.org/abs/2509.24100v2 SpeedCP: Fast Kernel-based Conditional Conformal Prediction 2026-05-28T16:07:33Z

Conformal prediction provides distribution-free prediction sets with finite-sample conditional guarantees. We build upon the RKHS-based framework of Gibbs et al. (2023), which leverages families of covariate shifts to provide approximate conditional conformal prediction intervals, an approach with strong theoretical promise, but with prohibitive computational cost. To bridge this gap, we develop a stable and efficient algorithm that computes the full solution path of the regularized RKHS conformal optimization problem, at essentially the same cost as a single kernel quantile fit. Our path-tracing framework simultaneously tunes hyperparameters, providing smoothness control and data-adaptive calibration. To extend the method to high-dimensional settings, we further integrate our approach with low-rank latent embeddings that capture conditional validity in a data-driven latent space. Empirically, our method provides reliable conditional coverage across a variety of modern black-box predictors, improving the interval length of Gibbs et al. (2023) by 30%, while achieving a 40-fold speedup.
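
As background for the conformal machinery, here is a minimal sketch of standard split conformal prediction with absolute-residual scores. It is the basic procedure with marginal coverage, not the RKHS-based conditional method or the path-tracing algorithm described above; the predictor and calibration data are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: the interval is unbounded.
        return math.inf
    return sorted(scores)[k - 1]


def split_conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Symmetric prediction interval around predict(x_new)."""
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q


def predict(x):
    """Hypothetical fitted model."""
    return 2 * x


calib_x = list(range(20))
calib_y = [2 * x + (x % 5) - 2 for x in calib_x]  # residuals cycle through -2..2
interval = split_conformal_interval(predict, calib_x, calib_y, x_new=10)
```

The interval has the same width for every `x_new`, which is the limitation that conditional methods aim to address.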

2025-09-28T22:38:33Z Yating Liu Yeo Jin Jung Zixuan Wu So Won Jeong Claire Donnat http://arxiv.org/abs/2212.12435v2 Second-level global sensitivity analysis of numerical simulators with application to an accident scenario in a sodium-cooled fast reactor 2026-05-28T15:36:02Z

Numerical simulators are widely used to model physical phenomena and global sensitivity analysis (GSA) aims at studying the global impact of the input uncertainties on the simulator output. To perform GSA, statistical tools based on inputs/output dependence measures are commonly used. We focus here on the Hilbert-Schmidt independence criterion (HSIC). Sometimes, the probability distributions modeling the uncertainty of inputs may be themselves uncertain and it is important to quantify their impact on GSA results. We call it here the second-level global sensitivity analysis (GSA2). However, GSA2, when performed with a Monte Carlo double-loop, requires a large number of model evaluations, which is intractable with CPU-time-expensive simulators. To cope with this limitation, we propose a new statistical methodology based on a Monte Carlo single-loop with a limited calculation budget. First, we build a unique sample of inputs and simulator outputs, from a well-chosen probability distribution of inputs. From this sample, we perform GSA for various assumed probability distributions of inputs by using weighted HSIC measures estimators. Statistical properties of these weighted estimators are demonstrated. Subsequently, we define second-level HSIC-based measures between the distributions of inputs and GSA results, which constitute GSA2 indices. The efficiency of our GSA2 methodology is illustrated on an analytical example, thereby comparing several technical options. Finally, an application to a test case simulating a severe accident scenario in a nuclear reactor is provided.

2022-12-21T13:58:57Z This work was intended as a replacement of arXiv:1902.07030 and any subsequent updates will appear there Anouar Meynaoui INSA Toulouse, IMT Amandine Marrel IMT Béatrice Laurent INSA Toulouse, IMT http://arxiv.org/abs/2605.30072v1 Credible rectangles for high-dimensional posterior comparison 2026-05-28T15:21:07Z

We propose a Bayesian framework for uncertainty quantification and comparison in brain connectivity graph analysis. Standard graph-based approaches typically rely on point estimates of correlation matrices, overlooking the uncertainty induced by high-dimensional estimation from limited data. Our methodology constructs and compares credible hyperrectangles derived from posterior distributions, providing interpretable tools for subject-level inference and longitudinal monitoring. We develop scalable algorithms for estimating these regions in high dimensions and establish theoretical guarantees in the inverse-Wishart model for resting-state fMRI data, including a Bernstein--von Mises theorem for correlation matrices and control of a Bayesian family-wise error rate. The proposed framework enables principled detection of significant connectivity differences both globally and locally while preserving joint dependency structures. While demonstrating competitive performance against multiple-testing procedures on synthetic datasets, our approach also facilitates the direct comparison of two distinct scans from a single patient, a capability currently absent from the literature. We leverage this novelty on real datasets to improve interpretability. Beyond fMRI data, the approach provides a general framework for comparison problems in high-dimensional dependent settings.

2026-05-28T15:21:07Z 35 pages, 4 figures Alice Chevaux Julyan Arbel Guillaume Kon Kam King Sophie Achard http://arxiv.org/abs/2605.26964v2 Semiparametric Inference for Causal Effects on Functional Outcomes 2026-05-28T14:21:32Z

Difference-in-differences (DiD) is a cornerstone of causal inference, yet extending it to functional outcomes is not a routine scalar generalization; rather, it entails three fundamental challenges in identification, inference, and observation. This paper develops a comprehensive semiparametric inference framework for functional DiD with discretely observed data. First, we define the functional average treatment effect under parallel trends and derive its efficient influence function (EIF), thereby establishing the semiparametric efficiency bound. Second, leveraging Neyman orthogonality and cross-fitting, we construct a debiased estimator that effectively mitigates regularization bias arising from nonparametric reconstruction. Third, we establish weak convergence of the estimator and propose an asymptotically valid uniform confidence band, enabling a rigorous transition from pointwise to curve-level inference. Finally, we demonstrate that reconstruction error under discrete sampling is asymptotically negligible for semiparametric inference, ensuring practical feasibility. Simulations and empirical applications confirm that the proposed method achieves superior coverage and testing power in finite samples, providing a theoretically grounded and computationally tractable foundation for causal evaluation with functional data.
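
The target quantity can be illustrated with a plain plug-in estimate. The sketch assumes every curve is observed on the same grid and applies the usual two-group, two-period DiD contrast at each grid point under parallel trends. It omits the curve reconstruction, cross-fitting, debiasing, and uniform confidence band that the paper develops, and all names and data are hypothetical.

```python
def mean_curve(curves):
    """Pointwise average of curves sampled on a common grid."""
    n = len(curves)
    return [sum(c[j] for c in curves) / n for j in range(len(curves[0]))]


def functional_did(treated_pre, treated_post, control_pre, control_post):
    """Pointwise (treated change) minus (control change) on the grid."""
    t_pre, t_post = mean_curve(treated_pre), mean_curve(treated_post)
    c_pre, c_post = mean_curve(control_pre), mean_curve(control_post)
    return [
        (tb - ta) - (cb - ca)
        for ta, tb, ca, cb in zip(t_pre, t_post, c_pre, c_post)
    ]


# Hypothetical curves on a three-point grid, two units per group and period.
treated_pre = [[1, 2, 3], [3, 4, 5]]
treated_post = [[4, 6, 8], [6, 8, 10]]
control_pre = [[0, 0, 0], [2, 2, 2]]
control_post = [[1, 1, 1], [3, 3, 3]]

effect_curve = functional_did(treated_pre, treated_post, control_pre, control_post)
```

The result is itself a curve, which is why inference calls for a band that holds uniformly over the grid instead of separate pointwise intervals.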

2026-05-26T12:52:11Z Junzhu Nie Chengxiu Ling Mengfei Ran http://arxiv.org/abs/2305.16842v8 Accounting statement analysis at industry level. A gentle introduction to the compositional approach 2026-05-28T14:19:41Z

Compositional data are contemporarily defined as positive vectors, the ratios among whose elements are of interest to the researcher. Financial statement analysis by means of accounting ratios a.k.a. financial ratios fulfils this definition to the letter. Compositional data analysis solves the major problems in statistical analysis of standard financial ratios at industry level, such as skewness, non-normality, non-linearity, outliers, and dependence of the results on the choice of which accounting figure goes to the numerator and to the denominator of the ratio. Despite this, compositional applications to financial statement analysis are still rare. In this article, we present some transformations within compositional data analysis that are particularly useful for financial statement analysis. We show how to compute industry or sub-industry means of standard financial ratios from a compositional perspective by means of geometric means. We show how to visualise firms in an industry with a compositional principal-component-analysis biplot; how to classify them into homogeneous financial performance profiles with compositional cluster analysis; and how to introduce financial ratios as variables in a statistical model, for instance to relate financial performance and firm characteristics with compositional regression models. We show an application to the accounting statements of Spanish wineries using the decomposition of return on equity by means of DuPont analysis, and a step-by-step tutorial to the compositional freeware CoDaPack.
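
A short sketch of two building blocks mentioned above: the geometric mean as the industry average of a ratio, and a log-ratio transformation of the accounting figures. The centered log-ratio (`clr`) is a standard compositional transformation; whether it is among those presented in the article is an assumption, and the figures are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def clr(parts):
    """Centered log-ratio: log of each part relative to the geometric mean."""
    g = geometric_mean(parts)
    return [math.log(p / g) for p in parts]


# Hypothetical net income and equity for three firms in one industry.
net_income = [10.0, 20.0, 40.0]
equity = [100.0, 100.0, 100.0]

roe = [ni / eq for ni, eq in zip(net_income, equity)]
inverse_roe = [eq / ni for ni, eq in zip(net_income, equity)]

industry_roe = geometric_mean(roe)
industry_inverse_roe = geometric_mean(inverse_roe)

firm_profile = clr([net_income[0], equity[0]])
```

The geometric mean of the inverted ratio is exactly the reciprocal of the geometric mean of the ratio, so the industry average no longer depends on which figure goes in the numerator. The arithmetic mean does not have this property.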

2023-05-26T11:47:29Z Germà Coenders University of Girona Núria Arimany Serrat University of Vic - Central University of Catalonia http://arxiv.org/abs/2510.05991v2 Robust Inference for Convex Pairwise Difference Estimators 2026-05-28T14:10:03Z

This paper develops distribution theory and bootstrap-based inference methods for a broad class of convex pairwise difference estimators. These estimators minimize a kernel-weighted convex-in-parameter function over observation pairs with similar covariates, where the similarity is governed by a localization (bandwidth) parameter. While classical results establish asymptotic normality under restrictive bandwidth conditions, we show that valid Gaussian and bootstrap-based inference remains possible under substantially weaker assumptions. First, we extend the theory of small bandwidth asymptotics to convex pairwise difference estimation settings, deriving robust Gaussian approximations even when a smaller than standard bandwidth is used. Second, we employ a debiasing procedure based on generalized jackknifing to enable inference with larger bandwidths, while preserving convexity of the objective function. Third, we construct a novel bootstrap method that adjusts for bandwidth-induced variance distortions, yielding valid inference across a wide range of bandwidth choices. Our proposed inference method enjoys demonstrably greater robustness, while retaining the practical appeal of convex pairwise difference estimators.

2025-10-07T14:51:05Z Matias D. Cattaneo Michael Jansson Kenichi Nagasawa http://arxiv.org/abs/2605.29961v1 Modifying causal models to distinguish between transient and lasting causal effects 2026-05-28T14:02:08Z

This paper considers how to classify the effects of interventions in causal models for outcomes and exposures observed over time. First, we demonstrate the limitations of the most common uses of potential outcomes and causal directed acyclic graphs for capturing all possible interventions in a time-varying framework, particularly in problems where the key question concerns interventions to maintain or change equilibrium behaviour. Second, we adopt a system and state based approach rather than a measurement-based approach to identify the causal parameters. In particular, we discuss how assumptions about the system's equilibrium and the effects of interventions on that equilibrium can allow for more specific causal interpretations and clarify the goals of design and analysis. Third, we show how the ability to identify the causal parameters of a time-varying system depends on the selection of timepoints for measuring the system's states. We address this by proposing a novel version of the null effect, which is designed to distinguish between transient and lasting causal effects.

2026-05-28T14:02:08Z 18 pages, 7 figures Russell Steele Naftali Weinberger Tess Baker Ian Shrier http://arxiv.org/abs/2605.27265v2 Quantifying Social Inflation in Liability Insurance with Advanced Statistical Methods 2026-05-28T13:57:04Z

Social inflation, which is the rise in liability claim costs beyond general economic inflation, has become a major concern for insurers and reinsurers, yet it is difficult to quantify because litigation outcomes are heavy-tailed and the mix of cases reaching verdict versus settlement changes over time. Using a large database of US jury verdicts and settlements, we develop case-mix-adjusted social inflation measures through multiple channels that matter to reinsurers: plaintiff win rates (a frequency-type channel), settlement propensity (a frequency-type channel), and verdict/settlement severity. The approach combines rolling-window logistic regression for probabilities and quantile (value-at-risk) regression for severities, with uncertainty quantified via a random-weighted bootstrap. We find statistically significant relative increases in plaintiff win probability of approximately 20%-30% from 2009 to 2024, alongside a statistically significant relative decline in settlement probability of more than 10% over the same period. The dominant channel is verdict severity: Even after controlling for explanatory variables, verdict awards show a sharp rise after 2020, increasing by more than 100% from 2020 to 2024, whereas settlement amounts show limited and often statistically insignificant inflation. Therefore, inflation in total amounts payable to plaintiffs closely tracks verdict severity. Social inflation is more pronounced in corporate-defendant and uninsured-defendant cases and in states without tort caps or third-party litigation funding regulation. In addition, we find that social inflation has impacts not only on "nuclear verdicts" but also, in a similar manner, on moderate losses.

2026-05-26T16:42:07Z Tsz Chai Fung Lie Ma Liang Peng Fang Yang