http://arxiv.org/abs/2601.10252v2 Asymptotic Theory of Tail Dependence Measures for Checkerboard Copula and the Validity of Multiplier Bootstrap 2026-05-19T13:22:53Z

In this paper, we develop a comprehensive asymptotic and bootstrap theory for checkerboard-based estimation of lower and upper tail copulas under unknown marginal distributions. The estimator is constructed via local bilinear (checkerboard) interpolation of the empirical copula and extended to the tail region to obtain nonparametric estimators of extremal dependence. We first establish almost sure uniform consistency of the checkerboard-smoothed copula estimator by decomposing the error into a stochastic empirical process term and a deterministic approximation bias induced by the checkerboard projection. Under mild growth conditions on the grid size, the estimator is shown to be strongly consistent. Next, we derive weak convergence of the centered and scaled checkerboard copula process in $\ell^\infty([0,1]^2)$, showing that the smoothing does not affect the first-order limit. The resulting Gaussian process coincides with that of the empirical copula, augmented by terms arising from marginal estimation. These results extend to the lower and upper tail copula processes, yielding functional central limit theorems and asymptotic normality of the tail dependence coefficient. Since the limiting covariance depends on unknown tail features and partial derivatives, which renders direct inference infeasible, we propose a direct multiplier bootstrap adapted to the checkerboard structure. We prove conditional weak convergence of the bootstrap process to the same limit, ensuring valid inference for smooth functionals. Finally, we illustrate the bootstrap methodology through simulations and statistical applications, including goodness-of-fit testing and inference on tail dependence under a range of dependence structures, demonstrating accurate finite-sample performance.

2026-01-15T10:20:07Z Mayukh Choudhury Debraj Das Sujit Ghosh http://arxiv.org/abs/2605.19807v1 Reliable model selection in the presence of parameter non-identifiability 2026-05-19T13:06:23Z

Mathematical models are invaluable for understanding and predicting how biological systems behave, although their construction requires specifying mechanisms and relationships that are often not perfectly known. In the presence of multiple competing models, model uncertainty should be accounted for when performing inference based on available data. Bayesian model selection is a framework for testing mechanistic hypotheses and generating predictions under model uncertainty, which generally requires computation of the model evidence. In this work, we investigate the reliability of evidence computation methods when parameter non-identifiability -- the inability to distinguish between parameter values given available data -- is present, and find that deterministic evidence approximations can produce misleading model selection results because their underlying assumptions are violated. We propose a novel implementation of adaptive multiple importance sampling for evidence estimation, and demonstrate its robustness against non-identifiability. We use ecological case studies to demonstrate how simple model selection methods fail to produce accurate results, whereas our method yields model selection results that are comparable to those obtained by Markov chain Monte Carlo methods at substantially lower computational cost. Given the pervasiveness of parameter non-identifiability in mathematical biology, this work provides a practical approach to reliable model selection in the presence of poorly identified parameters.

2026-05-19T13:06:23Z 33 pages, 8 figures Yong See Foo Torkel E. Loman Alexander P. Browning Ivo Siekmann Ruth E. Baker Jennifer A. Flegg http://arxiv.org/abs/2512.01823v2 The partial K function 2026-05-19T12:38:05Z

The K function and its related statistics have been an enduring tool in the analysis of spatial point processes, providing an easy to compute and interpret summary statistic for characterising the interactions between points of one type, or between two different types of points. In this paper, we introduce a partial K function, enabling us to account for some of the effects of the other point types when analysing point-point interactions. The partial K function we introduce reduces to the usual K function when the other points are independent of the points of interest and has a similar interpretation. Using examples, we demonstrate how the partial K function can unpick dependence between point types that would otherwise be hidden in the usual K function. We also discuss important bias correction steps and hyperparameter selection. In addition, we introduce an extension to account for other spatial covariates, and demonstrate the methodology on the Lansing Woods dataset.
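For orientation, the usual K function that the abstract takes as its starting point can be estimated with a simple pair count. The sketch below is a minimal illustration of that standard estimator only, not of the partial K function proposed in the paper. It assumes a planar point pattern and applies no edge correction, so it omits the bias correction steps the abstract mentions. The function name `naive_k_function` and its parameters are hypothetical.

```python
import math


def naive_k_function(points, r, area):
    """Naive estimate of Ripley's K function at distance r.

    Assumptions (not taken from the paper): points are (x, y) tuples observed
    in a window of the given area, and no edge correction is applied.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Count ordered pairs (i, j), i != j, whose distance is at most r.
    close_pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= r:
                close_pairs += 1
    # Scale by area / (n * (n - 1)), the usual normalisation.
    return area * close_pairs / (n * (n - 1))


if __name__ == "__main__":
    corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    print(naive_k_function(corners, r=1.0, area=4.0))
```

Under complete spatial randomness the K function equals pi * r^2 in the plane, so values above that curve are usually read as clustering and values below it as inhibition.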

2025-12-01T16:03:51Z Jake P. Grainger Tuomas A. Rajala David J. Murrell Sofia C. Olhede http://arxiv.org/abs/2605.19693v1 Causal treatment effect decompositions with time-to-event outcomes under competing events 2026-05-19T11:26:13Z

Inference about treatment effects for time-to-event outcomes is often obscured by the presence of competing events. A particularly complex situation arises when the treatment influences the occurrence of the competing event. A comprehensive assessment should then account for different mechanisms by which the treatment and the competing event together produce the apparent treatment effect. Here, we propose a decomposition of the treatment's effect on the event of interest (target), characterising how it arises due to four distinct mechanisms involving both the target and competing events. Based on a causal model, the decomposition relies on cross-world estimands reflecting counterfactual scenarios in which the treatment affects the two events as if set to conflicting levels. We specify exchangeability and consistency assumptions under which the decomposition can be estimated from observed data. We discuss how the new decomposition reveals the role of the competing event and serves as a basis for defining causally interpretable estimands in the presence of competing events. Finally, we demonstrate the use of the four-way decomposition with datasets from two randomised trials.

2026-05-19T11:26:13Z 18 pages, 3 figures Mikko Valtanen Tommi Härkänen Jenni Lehtisalo Tiia Ngandu Miia Kivipelto Kari Auranen http://arxiv.org/abs/2510.20035v3 Throwing Vines at the Wall: Structure Learning via Random Search 2026-05-19T11:02:03Z

Vine copulas offer flexible multivariate dependence modeling and have become widely used in machine learning. Yet, structure learning remains a key challenge. Early heuristics, such as Dissmann's greedy algorithm, are still considered the gold standard but are often suboptimal. We propose random search algorithms and a statistical framework based on model confidence sets, which together improve structure selection, provide theoretical guarantees on selection probabilities and excess risk, and serve as a foundation for ensembling. Empirical results on real-world data sets show that our methods consistently outperform state-of-the-art approaches.

2025-10-22T21:26:18Z Thibault Vatter Thomas Nagler http://arxiv.org/abs/2203.15890v5 Testing the identification of causal effects in observational data 2026-05-19T10:14:25Z

This study demonstrates the existence of a testable condition for the identification of the causal effect of a treatment on an outcome in observational data, which relies on two sets of variables: observed covariates to be controlled for and a suspected instrument. Under a causal structure commonly found in empirical applications, the testable conditional independence of the suspected instrument and the outcome given the treatment and the covariates has two implications. First, the instrument is valid, i.e. it does not directly affect the outcome (other than through the treatment) and is unconfounded conditional on the covariates. Second, the treatment is unconfounded conditional on the covariates such that the treatment effect is identified. We suggest tests of this conditional independence based on machine learning methods that account for covariates in a data-driven way and investigate their asymptotic behavior and finite sample performance in a simulation study. We also apply our testing approach to evaluating the impact of fertility on female labor supply when using the sibling sex ratio of the first two children as supposed instrument, which by and large points to a violation of our testable implication for the moderate set of socio-economic covariates considered.

2022-03-29T20:45:11Z Martin Huber Jannis Kueck http://arxiv.org/abs/2605.19618v1 A Family of Divergence Measures for Evaluating the Reconstruction Quality of Explainable Ensemble Trees 2026-05-19T09:56:04Z

Validating interpretable surrogate models for ensemble learners requires measuring agreement between the ensemble's internal representation and its surrogate approximation, rather than mere association. Correlation-based approaches are scale-invariant and fail to detect systematic discrepancies in co-occurrence structure. We propose a statistical framework grounded in the agreement-association distinction, centered on the normalized Loss of Interpretability (nLoI). Rooted in the Cressie-Read power divergence family with lambda equal to 2, the nLoI admits a closed-form decomposition into within-node and between-node components, providing a unique diagnostic capability to identify precisely where and why reconstruction fails. The framework incorporates four complementary measures capturing distinct structural facets of approximation quality. A unified permutation testing procedure delivers valid inference for all measures within a single resampling pass. Theoretical properties, including boundedness and symmetry, are established for each metric. Monte Carlo simulations and empirical evaluations confirm exact Type I error control and demonstrate that these measures detect reconstruction fidelity gradients invisible to correlation-based alternatives. The framework is developed and illustrated in the context of Explainable Ensemble Trees (E2Tree), and empirical evaluation on three benchmark datasets illustrates the practical utility of the framework.

2026-05-19T09:56:04Z Massimo Aria Agostino Gnasso Carmela Iorio http://arxiv.org/abs/2407.17200v3 Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems 2026-05-19T09:54:54Z

Many real-world decision problems require solving, again and again, combinatorial optimization instances drawn from a common distribution. A recent line of structured learning methods exploits this regularity by learning policies that pair a statistical model with a tractable combinatorial oracle, instead of solving each instance independently. Training such policies is notoriously difficult, however: the resulting empirical risk is piecewise constant in the model parameters, which hinders gradient-based optimization, and only a few theoretical guarantees have been provided so far. We address this issue by analyzing smoothed (perturbed) policies: adding controlled random perturbations to the direction used by the linear oracle yields a differentiable surrogate risk and improves generalization. Our main contribution is a generalization bound that decomposes the excess risk into $(\mathit{i})$ perturbation bias, $(\mathit{ii})$ statistical estimation error, and $(\mathit{iii})$ optimization error. The perturbation bias is controlled by the \emph{fan-crossing probability}, a new geometric quantity measuring the likelihood that a perturbation changes the oracle solution. We introduce two complementary conditions to bound it--the \emph{Uniformly Bounded Density} (UBD) property, yielding a sharp ${O}(λ)$ bias, and the weaker \emph{Uniform Weak moment} (UW) property, yielding a sub-linear bound--both capturing the geometric interaction between the statistical model and the normal fan of the feasible polytope. The statistical estimation error is controlled via a uniform deviation bound over the policy class, with rate ${O}(1/(λ\sqrt{n}))$ that scales inversely in the smoothing parameter. Concerning the optimization error, we exploit kernel Sum-of-Squares methods to mitigate the curse of dimensionality of global optimization.

2024-07-24T12:00:30Z 29 pages main document, 9 pages supplement Pierre-Cyril Aubin-Frankowski Yohann De Castro Axel Parmentier Alessandro Rudi http://arxiv.org/abs/2604.19265v2 From design of experiments to analysis of variance of multivariate data: a tutorial review on ANOVA simultaneous component analysis 2026-05-19T09:52:26Z

ANOVA Simultaneous Component Analysis (ASCA) is the current state-of-the-art chemometric tool for analyzing and interpreting high-dimensional experimental data from a Design of Experiment (DoE). Being a multivariate extension of the ANOVA, ASCA makes a perfect tandem with DoE. This tutorial review recommends best practices for using ASCA, building upon the long-established combination of ANOVA and DoE theory developed over the last century. These recommendations are grounded in a comprehensive literature review and illustrated through a guiding example.

2026-04-21T09:28:45Z Journal of Chemometrics, 2026 José Camacho Jokin Ezenarro Daniel Schorn-García Johan A. Westerhuis 10.1002/cem.70151 http://arxiv.org/abs/2605.19591v1 Uncertainty-Aware Ideal Point Estimation via Variational EM 2026-05-19T09:35:00Z

Roll-call data analysis aims to estimate legislators' ideal points and quantify the associated uncertainty. Existing approaches either rely on Bayesian methods implemented via Markov chain Monte Carlo sampling or focus primarily on point estimation, with uncertainty typically assessed through resampling procedures such as the bootstrap. Consequently, the computational burden of these approaches can become substantial when applied to large roll-call datasets. To address this challenge, we propose a computationally efficient likelihood method for estimating ideal points and their standard errors. Leveraging the Pólya--Gamma identity, we develop a variational expectation--maximization algorithm for estimating ideal points and introduce a variational Louis' method to approximate the observed Fisher information for standard error estimation. Numerical studies and applications to U.S. congressional roll-call data demonstrate that the proposed method produces accurate ideal point estimates and reliable standard errors while being substantially more computationally efficient than existing approaches.

2026-05-19T09:35:00Z Kwangok Seo Youngjo Lee Jong Hee Park Xinlei Wang Johan Lim http://arxiv.org/abs/2605.19519v1 Inference for Fréchet Regression 2026-05-19T08:18:58Z

Linear regression is widely used to model relationships between responses and predictors. In modern applications, one encounters data where the responses are non-Euclidean random objects situated in a metric space, paired with Euclidean predictors. Global Fréchet regression generalizes linear regression to such general settings; however, statistical inference has remained largely unexplored. We develop a significance test for the null hypothesis that the Fréchet regression function does not depend on the predictors, addressing the challenge of an absence of linear operations in metric spaces. We also develop a test for the partial effect of a subset of the predictors in analogy to, but quite different from, the partial F-tests commonly used in classical linear regression under Gaussian assumptions. Key ideas are to employ random multipliers to obtain non-degenerate null distributions for the proposed test statistics and the Cauchy combination method. We obtain consistency and convergence results under the null hypothesis and contiguous alternatives and demonstrate the finite sample performance of the proposed tests through simulations on network data represented by graph Laplacians and spherical data with geodesic distances. We further illustrate our method using transport networks arising from New York City taxi trip data and U.S. energy source compositional data.
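The Cauchy combination method named in the abstract merges several p-values into one by mapping each to a standard Cauchy variate, averaging, and reading off the Cauchy tail probability. The sketch below shows that generic rule only. The abstract does not say how the authors weight or construct the individual p-values, so the equal default weights and the function name `cauchy_combination` are assumptions made for illustration.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the generic Cauchy combination rule.

    Assumptions (not taken from the paper): every p-value lies strictly
    between 0 and 1, and weights are nonnegative and sum to one
    (equal weights by default).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0 / k] * k
    # tan((0.5 - p) * pi) is standard Cauchy when p is uniform on (0, 1).
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    )
    # Upper-tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    print(cauchy_combination([0.01, 0.9]))
```

A single p-value is returned unchanged by this rule, which gives a quick sanity check on the transform and its inverse.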

2026-05-19T08:18:58Z 35 pages, 6 figures Wookyeong Song Paromita Dubey Hans-Georg Müller Alexander Petersen http://arxiv.org/abs/2509.12884v2 Modeling nonstationary spatial processes with normalizing flows 2026-05-19T04:32:33Z

Nonstationary spatial processes can often be represented as stationary processes on a warped spatial domain. Selecting an appropriate spatial warping function for a given application is often difficult and, as a result of this, warping methods have largely been limited to two-dimensional spatial domains. In this paper, we introduce a novel approach to modeling nonstationary, anisotropic spatial processes using neural autoregressive flows (NAFs), a class of invertible mappings capable of generating complex, high-dimensional warpings. Through simulation studies we demonstrate that a NAF-based model has greater representational capacity than other commonly used spatial process models. We apply our proposed modeling framework to a subset of the 3D Argo Floats dataset, highlighting the utility of our framework in real-world applications.

2025-09-16T09:37:18Z Pratik Nag Andrew Zammit-Mangion Ying Sun http://arxiv.org/abs/2605.19313v1 A Unified Framework for Structure-Aware Clustering and Heterogeneous Causal Graph Learning 2026-05-19T03:48:54Z

In complex multivariate systems, interactions among variables are defined by dependency structures, often encoded as directed acyclic graphs ($\text{DAGs}$). However, dependency structures can vary across subjects, and ignoring this structural heterogeneity introduces bias and obscures subpopulation-specific dependencies. To address this, we propose Directed Acyclic Graph-based Dependency Clustering via Alternating Direction Method of Multipliers (DAG-DC-ADMM), a unified framework built upon Structural Equation Modeling (SEM) that jointly learns cluster assignments and cluster-specific dependency structures. We encode acyclicity via a smooth constraint and integrate a groupwise truncated Lasso fusion penalty (gTLP) to cluster subjects based on their structural similarity. This yields a nonconvex optimization problem that incorporates sparsity, acyclicity, and structural consensus constraints. We address the nonconvexity by using the augmented Lagrangian method and solve it with an adapted version of the Alternating Direction Method of Multipliers (ADMM) for difference-of-convex programs. For certain graph structures, such as upper triangular adjacency matrices, our algorithm is guaranteed to converge to a Karush-Kuhn-Tucker (KKT) point. Experiments demonstrate that our method recovers cluster-specific causal dependency structures with a high true positive rate and a low false discovery rate. This capability enables the robust discovery of heterogeneous dependencies across subjects where the subpopulation label is unknown.

2026-05-19T03:48:54Z Honglin Du Muxuan Liang Xiang Zhong http://arxiv.org/abs/2601.18178v2 Asymptotic properties of the multivariate Szász-Mirakyan estimator for cumulative distribution functions on the nonnegative orthant 2026-05-19T02:02:50Z

The asymptotic properties of multivariate Szász-Mirakyan estimators for cumulative distribution functions (cdf) supported on the nonnegative orthant are investigated. Explicit bias and variance expansions are derived on compact subsets of the interior, yielding sharp mean squared error characterizations and optimal smoothing rates. The analysis shows that the proposed Poisson smoothing yields a non-negligible variance reduction relative to the empirical cdf, leading to asymptotic efficiency gains that can be quantified through local and global deficiency measures. The behavior of the estimator near the boundary of its support is examined separately. Under a boundary-layer scaling that preserves nondegenerate Poisson smoothing as the evaluation point approaches the boundary of $[0,\infty)^d$, bias and variance expansions are obtained that differ fundamentally from those in the interior region. In particular, the variance reduction mechanism disappears at leading order, implying that no asymptotically optimal smoothing parameter exists in the boundary regime. Central limit theorems and almost sure uniform consistency are also established. Together, these results provide a unified asymptotic theory for multivariate Szász-Mirakyan cdf estimation and clarify the distinct roles of smoothing in the interior and boundary regions.

2026-01-26T05:56:30Z 40 pages, 3 figures, 3 tables Guanjie Lyu Frédéric Ouimet Cindy Feng http://arxiv.org/abs/2603.07018v2 TEA-Time: Transporting Effects Across Time 2026-05-19T00:17:24Z

Treatment effects estimated from a randomized controlled trial are local not only to the study population but also to the time at which the trial was conducted. The literature on generalizing experimental findings to new populations is extensive, yet transporting effects across time has received far less attention, and even defining the target estimand is nonobvious. We formalize the transported average treatment effect under a separable temporal effects assumption, derive two identification strategies (replicated trials and common arm), and develop doubly robust, semiparametrically efficient estimators for each. Applied to a large archive of headline A/B tests, the common arm strategy is substantially more precise but exhibits systematic bias when the temporal factor depends on the gap between intervention and measurement rather than on measurement time alone, while the replicated trials strategy, which allows this dependence, tracks the ground truth more faithfully. Simulation studies investigate when each strategy is reliable and when it silently fails.

2026-03-07T03:34:13Z Harsh Parikh Gabriel Levin-Konigsberg Dominique Perrault-Joncas Alexander Volfovsky