http://arxiv.org/abs/2406.03296v2 Multi-relational Network Autoregression Model with Latent Group Structures 2026-06-06T04:13:35Z

Multi-relational networks among entities are frequently observed in the era of big data. Quantifying the effects of multiple networks has attracted significant research interest recently. In this work, we model multiple network effects through an autoregressive framework for tensor-valued time series. To characterize the potential heterogeneity of the networks and handle the high dimensionality of the time series data simultaneously, we assume a separate group structure for entities in each network and estimate all group memberships in a data-driven fashion. Specifically, we propose a group tensor network autoregression (GTNAR) model, which assumes that within each network, entities in the same group share the same set of model parameters, and the parameters differ across networks. An iterative algorithm is developed to estimate the model parameters and the latent group memberships simultaneously. Theoretically, we show that the group-wise parameters and group memberships can be consistently estimated when the group numbers are correctly specified or possibly over-specified. An information criterion is also provided to consistently select the group number of each network. Lastly, we implement the method on a Yelp dataset to illustrate its usefulness.

2024-06-05T14:04:18Z arXiv admin note: text overlap with arXiv:2212.02107 Yimeng Ren Xuening Zhu Ganggang Xu Yanyuan Ma http://arxiv.org/abs/2508.10331v3 Synthesizing Evidence: Data-Pooling as a Tool for Treatment Selection in Online Experiments 2026-06-06T02:45:12Z

Randomized experiments are the gold standard for causal inference but face significant challenges in business applications, including limited traffic allocation, the need for heterogeneous treatment effect estimation, and the complexity of managing overlapping experiments. These factors lead to high variability in treatment effect estimates, making data-driven policy roll-out difficult. To address these issues, we introduce the data pooling treatment roll-out (DPTR) framework, which enhances policy roll-out by pooling data across experiments rather than focusing narrowly on individual ones. DPTR can effectively accommodate both overlapping and non-overlapping traffic scenarios, regardless of linear or nonlinear model specifications. We demonstrate the framework's robustness through a three-pronged validation: (a) theoretical analysis shows that DPTR surpasses the traditional difference-in-means and ordinary least squares methods under non-overlapping experiments, particularly when the number of experiments is large; (b) synthetic simulations confirm its adaptability in complex scenarios with overlapping traffic, rich covariates and nonlinear specifications; and (c) empirical applications to two experimental datasets from real-world platforms demonstrate its effectiveness in guiding customized policy roll-outs for subgroups within a single experiment, as well as in coordinating policy deployments across multiple experiments with overlapping scenarios. By reducing estimation variability to improve decision-making effectiveness, DPTR provides a scalable, practical solution for online platforms to better leverage their experimental data in today's increasingly complex business environments.

2025-08-14T04:11:09Z Zhenkang Peng Chengzhang Li Ying Rong Renyu Zhang http://arxiv.org/abs/2606.09906v1 An information-geometric framework for mapping maximum potential biodiversity 2026-06-06T02:34:16Z

Biodiversity measures are often used descriptively: one computes a diversity index from an observed or estimated community composition and maps the resulting values across space. Conservation planning, however, also requires a site-specific benchmark against which the observed community can be compared. This chapter develops an information-geometric framework for such \emph{potential diversity} and the associated \emph{diversity gap}. The central object is a pair of probability vectors on the species simplex: an observed or realized composition $p^{\mathrm{obs}}$, and a potential composition $p^{\mathrm{pot}}$ obtained by a constrained variational principle. The gap is then defined by comparing a diversity functional at these two compositions. The framework is developed for both Hill-type diversity, which measures abundance and evenness, and Rao's quadratic entropy, which incorporates trait, phylogenetic, or ecological dissimilarities among species. A spatial point-process interpretation clarifies how local ecological capacities can be defined before passing to the simplex. Escort constraints, capacity constraints, and divergence projections then provide a unified way to define nontrivial benchmarks beyond the uniform distribution. The resulting formulation separates two distinct questions: how diverse a community is, and how far it is from a locally admissible potential benchmark. It also connects the ecological idea of dark diversity with a continuous, abundance-weighted comparison on the probability simplex. We also outline a dynamic extension in which capacities, species migration, and climate-driven shifts vary over time. Empirical implementation with large-scale citizen-science biodiversity data and trait databases is left for future work.
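As a rough illustration of the two diversity functionals named in this abstract, the sketch below computes Hill numbers and Rao's quadratic entropy for a composition on the simplex. The function names are hypothetical, and taking the gap as a plain difference of the functional at the two compositions is an assumption; the abstract only says the two values are compared, and it does not specify how the potential composition is obtained.

```python
import math

def hill_number(p, q):
    """Hill number (effective number of species) of order q for composition p."""
    probs = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in probs))
    return sum(x ** q for x in probs) ** (1.0 / (1.0 - q))

def rao_quadratic_entropy(p, d):
    """Rao's quadratic entropy: sum over i, j of d[i][j] * p[i] * p[j]."""
    n = len(p)
    return sum(d[i][j] * p[i] * p[j] for i in range(n) for j in range(n))

def diversity_gap(p_obs, p_pot, functional):
    """Assumed gap: functional at the potential minus functional at the observed."""
    return functional(p_pot) - functional(p_obs)

# Illustrative compositions only; p_pot is set to uniform here for simplicity,
# whereas the abstract describes nontrivial benchmarks beyond the uniform one.
p_obs = [0.7, 0.2, 0.1]
p_pot = [1 / 3, 1 / 3, 1 / 3]
gap = diversity_gap(p_obs, p_pot, lambda p: hill_number(p, 2))
```

The dissimilarity matrix `d` is where trait or phylogenetic distances would enter; with `d[i][j] = 1` for distinct species, Rao's entropy reduces to the Gini-Simpson index.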

2026-06-06T02:34:16Z 22 pages, 1 figure Shinto Eguchi http://arxiv.org/abs/2606.07947v1 Bayesian Global Fréchet Regression via Weak Conditional Expectations 2026-06-06T02:34:09Z

Fréchet regression provides a versatile framework for modeling responses in metric spaces with Euclidean predictors, yet current methodologies rely almost exclusively on frequentist approaches. We propose a Bayesian framework for Fréchet regression that offers a principled way of incorporating prior information into nonlinear global Fréchet regression. By targeting a novel Fréchet Bayes rule, we reduce the object-valued regression problem to a collection of tractable scalar regression tasks. Our approach allows for a controlled interpolation between the prior and the data-driven frequentist estimate, facilitating effective shrinkage toward informed values. While initially derived under Gaussian assumptions, we demonstrate that our framework is robust to model misspecification by establishing its validity under moment conditions via weak conditional expectations. The numerical properties of the proposed methodology are demonstrated in simulation studies and an application to microbiome compositional data, where we show that leveraging an auxiliary cohort to inform the prior significantly enhances predictive performance in a targeted, small-scale study.

2026-06-06T02:34:09Z 34 pages, 4 figures Simon Fontaine Bing Li Lingzhou Xue http://arxiv.org/abs/2506.00149v2 Generalizing causal effects with noncompliance: Application to deep canvassing experiments 2026-06-06T02:25:12Z

Standard approaches in generalizability often focus on generalizing the intent-to-treat (ITT). However, in practice, a more policy-relevant quantity is the generalized impact of an intervention across compliers. While instrumental variable (IV) methods are commonly used to estimate the complier average causal effect (CACE) within samples, standard approaches cannot be applied to a target population with a different distribution from the study sample. This paper makes several key contributions. First, we introduce a new set of identifying assumptions in the form of a population-level exclusion restriction that allows for identification of the target complier average causal effect (T-CACE) in both randomized experiments and observational studies. This allows researchers to identify the T-CACE without relying on standard principal ignorability assumptions. Second, we propose a class of inverse-weighted estimators for the T-CACE and derive their asymptotic properties. We provide extensions for settings in which researchers have access to auxiliary compliance information across the target population. Finally, we introduce a sensitivity analysis for researchers to evaluate the robustness of the estimators in the presence of unmeasured confounding and extend existing tests to evaluate instrument validity in this context. We illustrate our proposed method through extensive simulations and a study evaluating the impact of deep canvassing on reducing exclusionary attitudes.
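For orientation, the within-sample CACE mentioned in this abstract is commonly estimated with the standard IV (Wald) ratio: the ITT effect on the outcome divided by the ITT effect on treatment uptake. The sketch below shows only that textbook in-sample estimator; it is not the paper's T-CACE estimator, and it does no reweighting toward a target population. The function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def wald_cace(z, d, y):
    """In-sample Wald estimate of the complier average causal effect.

    z: assignment indicators (0/1), d: treatment received (0/1), y: outcomes.
    """
    y1 = [yi for zi, yi in zip(z, y) if zi == 1]
    y0 = [yi for zi, yi in zip(z, y) if zi == 0]
    d1 = [di for zi, di in zip(z, d) if zi == 1]
    d0 = [di for zi, di in zip(z, d) if zi == 0]
    itt_outcome = mean(y1) - mean(y0)  # intent-to-treat effect on the outcome
    itt_uptake = mean(d1) - mean(d0)   # estimated share of compliers
    if itt_uptake == 0:
        raise ValueError("no difference in uptake; the ratio is undefined")
    return itt_outcome / itt_uptake

# Toy data: half of the assigned units comply; compliers gain 2 units.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 0, 0, 0, 0, 0, 0]
y = [3, 3, 1, 1, 1, 1, 1, 1]
estimate = wald_cace(z, d, y)
```

The ratio is only interpretable as a complier effect under the usual IV assumptions (exclusion restriction, monotonicity, and a nonzero first stage).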

2025-05-30T18:41:22Z Zhongren Chen Melody Huang http://arxiv.org/abs/2406.09195v6 On the statistical analysis of grouped data: when Pearson $χ^2$ and other divisible statistics are not goodness-of-fit tests 2026-06-06T01:37:09Z

Thousands of experiments are analyzed, and papers are published each year involving the statistical analysis of grouped data. While this area of statistics is often perceived -- somewhat naively -- as saturated, several misconceptions still affect everyday practice, and new frontiers have so far remained unexplored. Researchers must be aware of the limitations affecting their analyses and what new possibilities are at their hands. The article introduces a unifying approach to the analysis of divisible statistics -- that includes Pearson's $χ^2$, the likelihood ratio, and spectral statistics, as special cases -- when a statistician deals with a large number of bins/groups, thus leading to a large number of small or moderate frequencies. Performance of the tests is analyzed against the class of contiguous (local) alternatives. Perhaps the most surprising result here is that, in this `sparse' regime, most of the tests proposed in the literature can be modified to produce more powerful tests, and no single test based on a divisible statistic leads to a goodness-of-fit test. Distribution-free goodness-of-fit tests are also constructed.
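To make the term concrete, the sketch below computes two of the statistics named in this abstract, Pearson's chi-square and the likelihood ratio statistic. Each is a sum over bins of a function of that bin's observed count and expected count, which is the sense in which such statistics are called divisible. The function names are hypothetical, and the code does not implement the modified or distribution-free tests the article constructs.

```python
import math

def pearson_chi2(counts, probs):
    """Pearson statistic: sum over bins of (observed - expected)^2 / expected."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def likelihood_ratio(counts, probs):
    """Likelihood ratio statistic: 2 * sum of observed * log(observed / expected).

    Empty bins contribute zero to the sum.
    """
    n = sum(counts)
    return 2.0 * sum(c * math.log(c / (n * p))
                     for c, p in zip(counts, probs) if c > 0)

counts = [10, 20, 30]
probs = [1 / 3, 1 / 3, 1 / 3]
x2 = pearson_chi2(counts, probs)
g2 = likelihood_ratio(counts, probs)
```

The classical chi-square reference distribution for both statistics assumes a fixed number of bins with large expected counts, which is exactly the assumption that fails in the sparse regime the abstract discusses.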

2024-06-13T14:55:02Z Sara Algeri Estate V. Khmaladze http://arxiv.org/abs/2601.01830v3 Confounder-robust causal discovery and inference in Perturb-seq using proxy and instrumental variables 2026-06-05T23:08:33Z

Emerging single-cell technologies that combine CRISPR-based genetic perturbations with single-cell RNA sequencing, such as Perturb-seq, offer unprecedented opportunities to uncover cause-and-effect relationships among genes. Nonetheless, Perturb-seq experiments are subject to unobserved factors that, if not properly handled, can severely bias the inferred causal relationships between genes. These latent factors may arise not only from intrinsic molecular features of the regulatory elements, but also from unmeasured genes omitted due to cost-constrained experimental designs. Although methods for analyzing large-scale Perturb-seq data are rapidly maturing, approaches that explicitly account for such unobserved confounders when inferring causal gene networks are still lacking. Here, we propose a novel approach to accurately reconstruct causal gene networks from Perturb-seq data even when important confounders are missing. Our framework leverages proxy and instrumental variable strategies to exploit the rich information embedded in the perturbations, enabling unbiased estimation of the underlying directed acyclic graph (DAG) of gene expression. Applications to both comprehensive synthetic data and real CRISPR interference experiments in K562 cells demonstrate that our method outperforms baseline approaches that lack principled adjustments for unmeasured confounding, yielding more accurate and biologically relevant recovery of the true causal DAGs.

2026-01-05T06:50:07Z Kwangmoon Park Hongzhe Li http://arxiv.org/abs/2407.01765v2 A General Framework for Design-Based Treatment Effect Estimation in Paired Cluster-Randomized Experiments 2026-06-05T21:43:05Z

Paired cluster-randomized experiments (pCRTs) are common in education program impact evaluation trials. Although common, there is surprisingly no clear consensus regarding how to analyze this randomization design to estimate average treatment effects. Variance estimation is also complicated due to the dependency created through pairing clusters. Therefore, we aim to provide an intuitive and practical comparison between different estimation strategies for pCRTs to inform practitioners' choice of strategy. To this end, we present a general framework for design-based estimation of an average individual effect in pCRTs. This framework offers a novel and intuitive view on the bias-variance trade-off between point estimators and emphasizes the benefits of covariate adjustment for estimation with pCRTs. In addition to providing a general framework for estimation with pCRTs, the point and variance estimators we present support fixed-sample unbiased estimation with similar precision to a common regression model and conservative variance estimation. Through simulation studies based on an educational efficacy trial, we compare the performance of the point and variance estimators reviewed. Our analysis and simulation studies inform the choice of point and variance estimators for analyzing pCRTs in practice.

2024-07-01T19:57:31Z Charlotte Z. Mann Adam C. Sales Johann A. Gagnon-Bartsch http://arxiv.org/abs/2605.27237v2 Feasibility Determination for Subjective Probability Constraints 2026-06-05T20:09:59Z

We consider the problem of determining feasible systems from a finite set of simulated alternatives with respect to probability constraints, where the observations from stochastic simulations are Bernoulli distributed. Most statistically valid procedures for feasibility determination focus on constraints on the means of normally distributed observations. Although these procedures can be adapted to Bernoulli-distributed data by treating batch means as basic observations, achieving approximate normality often requires a large batch size, potentially leading to the unnecessary waste of observations in reaching a decision. This paper proposes a procedure that utilizes the Bernoulli-distributed observations directly to determine feasibility. In addition, we incorporate subjective constraints, allowing for multiple thresholds for each constraint. We demonstrate that our proposed procedure is statistically valid and that it outperforms an existing feasibility determination procedure for subjective constraints originally developed for normally distributed observations. Furthermore, we propose two heuristic feasibility check approaches for thresholds that are sequentially added by decision makers, allowing thresholds to be tightened when many systems are feasible or relaxed when no feasible system exists. We show by experiments that the proposed procedures can efficiently provide feasibility decisions to systems with respect to all thresholds considered.

2026-05-26T16:17:17Z Taehoon Kim Sigrun Andradottir Seong-Hee Kim Yuwei Zhou http://arxiv.org/abs/2606.07816v1 High Dimensional Change Point Models for Two-Directional Data 2026-06-05T19:53:40Z

We develop methodology for the recovery of change points in data observed on more than one temporal index, where changes may occur simultaneously in both indices and the spatial component may be high dimensional. The work is motivated by climate monitoring problems where long series of data are available, e.g., daily observations (index 1) over several years (index 2). Such data may be evolving over the annual time scale, along with dynamic seasonal changes in the shorter time scale. We model this as a high dimensional mean process observed on a two dimensional grid with change points. Asymptotic estimation and inference results are developed under a single change point setup, including rates of convergence of the proposed method as well as the resulting limiting distributions. The method is extended to the case of multiple changes. Theoretical results are supported numerically with Monte Carlo simulations. We implement our work on a large-scale climate dataset for the Pacific Northwest region of the United States.
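As a much simplified point of reference, the sketch below estimates a single change point in the mean of a scalar sequence by least squares. This is the classical one-index, one-dimensional case, not the two-directional high-dimensional procedure the abstract describes; the function name is hypothetical.

```python
def estimate_change_point(x):
    """Least-squares estimate of a single mean change in a scalar sequence.

    Returns tau such that x[:tau] and x[tau:] are the two fitted segments.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(x)
    total_sq = sum(v * v for v in x)
    best_tau, best_sse = None, float("inf")
    left_sum = 0.0
    left_sq = 0.0
    for tau in range(1, n):
        v = x[tau - 1]
        left_sum += v
        left_sq += v * v
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Residual sum of squares after fitting a separate mean to each segment.
        sse = (left_sq - left_sum ** 2 / tau) + (right_sq - right_sum ** 2 / (n - tau))
        if sse < best_sse:
            best_tau, best_sse = tau, sse
    return best_tau

series = [0.0] * 5 + [3.0] * 5
tau_hat = estimate_change_point(series)
```

Running prefix sums keep the search linear in the sequence length; a two-index version would need to search over a pair of candidate locations on the grid.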

2026-06-05T19:53:40Z arXiv admin note: text overlap with arXiv:2105.10017 Abhishek Kaul Dipesh Baral Stergios B. Fotopoulos Venkata K. Jandhyala Rebecca Killick http://arxiv.org/abs/2606.07809v1 Sensitivity Analysis White Paper 2026-06-05T19:37:27Z

Sensitivity analysis is an important component of simulation-based decision support because it helps analysts determine which inputs most strongly influence model outcomes under uncertainty. This paper organizes the broad sensitivity analysis literature into a coherent framework for use in complex simulation settings, with particular attention to military applications. We review major classes of methods, including local and global approaches, variance-based techniques, screening methods, derivative-based methods, and uncertainty quantification tools, and relate them to common analytical objectives such as factor prioritization, factor fixing, variance reduction, and factor mapping. The paper also discusses sensitivity auditing as a complementary perspective that emphasizes transparency, assumption tracking, and responsible use of models in decision-relevant settings.
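Of the method classes listed in this abstract, local derivative-based sensitivity is the simplest to sketch: perturb one input at a time around a nominal point and estimate the partial derivative by central differences. The function name, the step rule, and the toy model below are all assumptions for illustration and are not taken from the paper.

```python
def local_sensitivities(model, x0, rel_step=1e-4):
    """Central-difference estimate of d(model)/d(x_i) at the nominal point x0."""
    sensitivities = []
    for i, xi in enumerate(x0):
        h = rel_step * max(abs(xi), 1.0)  # step scaled to the input's magnitude
        up = list(x0)
        down = list(x0)
        up[i] = xi + h
        down[i] = xi - h
        sensitivities.append((model(up) - model(down)) / (2.0 * h))
    return sensitivities

def toy_model(x):
    # Hypothetical stand-in for a simulation output.
    return 3.0 * x[0] + x[1] ** 2

gradient = local_sensitivities(toy_model, [1.0, 2.0])
```

Local derivatives describe behavior only near the nominal point; global approaches such as variance-based indices instead average over the whole input distribution, which matters when the model is nonlinear or inputs interact.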

2026-06-05T19:37:27Z 12 pages, Nate Bade Lindsay Erickson http://arxiv.org/abs/2605.18741v2 Robust Simulation Based Inference Through Robust Optimal Transport 2026-06-05T19:13:14Z

When a statistical model $\{P_θ : θ\in Θ\}$ lacks analytically tractable likelihoods, parametric statistical inference based on data generated from an unknown underlying distribution $P$ can still be performed as long as simulations from the model are possible. This approach is called Simulation Based Inference (SBI). Statistical models are rarely exactly correct (that is, $P \notin \{P_θ: θ\in Θ\}$), and Robust SBI focuses on inferring a reasonable parameter even under model mis-specification. We focus on the setting where $P$ possesses potentially both geometric and Total Variation type discrepancies from $P_{θ^*}$. For this problem, we use a Kullback-Leibler informed robust Optimal Transport divergence, motivated by Empirical Likelihood considerations. We introduce a stochastic sub-gradient ascent algorithm with a convergence guarantee for estimating the semi-discrete version of this robust Optimal Transport divergence, and design a parallelized SBI algorithm which employs the regular bootstrap on top of minimum semi-discrete robust Optimal Transport for parameter uncertainty quantification. We demonstrate mathematically why the divergence is robust under a joint geometric plus Total Variation type contamination and then illustrate the robustness of inferences on a complex benchmark SBI task.

2026-05-18T17:57:07Z Peter Matthew Jacobs Lekha Patel Anirban Bhattacharya Debdeep Pati http://arxiv.org/abs/2606.07466v1 Covariance-Adaptive Residualization and Stagewise Calibration for Dependent Multiple Testing 2026-06-05T17:18:13Z

In this paper, we study simultaneous hypothesis testing for multivariate Gaussian means under arbitrary covariance dependence. Building on the Maximum Residual Down (MRD) procedure of Cohen et al. (2009), we investigate a new calibration strategy based on the generalized step-down critical constants of Gavrilov et al. (2009). The resulting procedure retains the covariance-adaptive residualization mechanism of MRD while replacing the original model-dependent threshold specification with a simple stagewise calibration rule. Since the proposed procedure belongs to the class of monotone residual-based step-down procedures studied by Ghosh and Chakrabarti (2026), its admissibility follows directly from their theory. We also derive alternative representations of the MRD residual statistics that express all active residuals through a single active precision matrix, substantially reducing computational complexity. Simulation studies across a broad range of dependence structures show that the proposed methodology often achieves a lower normalized misclassification risk than several widely used marginal testing procedures. Under several structured dependence models, the procedure also exhibits strong signal-recovery behavior, attaining false discovery rates near the nominal level, extremely small false non-discovery rates, powers approaching one, and average numbers of rejections close to the expected number of true signals. These findings provide empirical evidence that covariance-adaptive residualization and stagewise calibration may interact in a highly favorable manner for dependent multiple testing.

2026-06-05T17:18:13Z Prasenjit Ghosh Arijit Chakrabarti http://arxiv.org/abs/2602.21132v2 Robust and Sparse Generalized Linear Models for High-Dimensional Data via Maximum Mean Discrepancy 2026-06-05T17:11:04Z

High-dimensional datasets are frequently subject to contamination by outliers and heavy-tailed noise, which can severely bias standard regularized estimators like the Lasso. While Maximum Mean Discrepancy (MMD) has recently been introduced as a ``universal'' framework for robust regression, its application to high-dimensional Generalized Linear Models (GLMs) remains largely unexplored, particularly regarding variable selection. In this paper, we propose a penalized MMD framework for robust estimation and feature selection in GLMs. We introduce an $\ell_1$-penalized MMD objective and develop two versions of the estimator: a full $O(n^2)$ version and a computationally efficient $O(n)$ approximation. To solve the resulting non-convex optimization problem, we employ an algorithm based on the Alternating Direction Method of Multipliers (ADMM) combined with AdaGrad. Through extensive simulation studies involving Gaussian linear regression and binary logistic regression, we demonstrate that our proposed methods are highly competitive with classical penalized GLMs and existing robust benchmarks. Our approach shows particular resilience in maintaining a balance between estimation accuracy and variable selection across diverse contamination scenarios, especially in handling high-leverage points and heavy-tailed error distributions where traditional methods may fluctuate in performance.
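To illustrate the cost contrast between a quadratic-time and a linear-time MMD computation, the sketch below gives two standard estimators of the squared MMD between scalar samples with a Gaussian kernel: the all-pairs V-statistic and the paired linear-time estimator. Whether the paper's $O(n)$ approximation is this paired estimator is not stated in the abstract, so treat that correspondence as an assumption; the function names and bandwidth are hypothetical, and the $\ell_1$ penalty and ADMM solver are not shown.

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mmd2_quadratic(x, y, bandwidth=1.0):
    """Squared MMD (biased V-statistic) using all pairs: O(n^2) kernel calls."""
    kxx = sum(gaussian_kernel(a, b, bandwidth) for a in x for b in x) / len(x) ** 2
    kyy = sum(gaussian_kernel(a, b, bandwidth) for a in y for b in y) / len(y) ** 2
    kxy = sum(gaussian_kernel(a, b, bandwidth) for a in x for b in y) / (len(x) * len(y))
    return kxx + kyy - 2.0 * kxy

def mmd2_linear(x, y, bandwidth=1.0):
    """Squared MMD from disjoint consecutive pairs: O(n) kernel calls."""
    m = min(len(x), len(y)) // 2
    if m == 0:
        raise ValueError("need at least two points in each sample")
    total = 0.0
    for i in range(m):
        x1, x2 = x[2 * i], x[2 * i + 1]
        y1, y2 = y[2 * i], y[2 * i + 1]
        total += (gaussian_kernel(x1, x2, bandwidth) + gaussian_kernel(y1, y2, bandwidth)
                  - gaussian_kernel(x1, y2, bandwidth) - gaussian_kernel(x2, y1, bandwidth))
    return total / m

near = mmd2_quadratic([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
far = mmd2_quadratic([0.0, 0.1], [5.0, 5.1])
```

The linear-time version trades statistical efficiency for speed: it uses each observation once, so its variance is higher than that of the all-pairs estimate at the same sample size.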

2026-02-24T17:28:12Z 22 pages, 5 tables, 2 figures Xiaoning Kang Lulu Kang http://arxiv.org/abs/2606.07447v1 Community Detection on a Randomly Growing Network 2026-06-05T16:48:38Z

We study community detection on Markovian random networks outside of the Stochastic Block Model (SBM) framework. Specifically, we consider a random network growth process which generates $K$ separate preferential attachment trees and connects them with Erdős--Rényi edges, so that each tree represents a community and each node inherits the label of the tree to which it belongs. This model is able to produce many features of real-world networks that are improbable under SBM, such as power-law degree distributions and the existence of chains and hubs. Given only the final graph, without any knowledge of the growth process, we seek to recover the unobserved community membership of the nodes. We first prove that it is impossible for any algorithm to consistently recover the community label of all the nodes. However, we design algorithms which are provably able to recover the community labels of subsets of central nodes, for several different notions of node centrality such as arrival time or degree. Our procedure consists of two stages where, in the first stage, we classify high degree nodes and then, in the second stage, extend the community assignments to the remaining vertices. Numerical experiments and a real data application on a coauthorship network demonstrate the effectiveness of our proposed approach.
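The growth process described in this abstract can be sketched as a small simulation. Several details below are assumptions, since the abstract does not give them: nodes are assigned to the $K$ trees in round-robin order, attachment within a tree is proportional to degree, and Erdős--Rényi edges are added only between nodes in different trees, each with the same probability. The function and parameter names are hypothetical.

```python
import random

def grow_network(n_nodes, k, p_between, seed=0):
    """Simulate K preferential attachment trees joined by Erdos-Renyi edges.

    Returns (labels, tree_edges, cross_edges); labels[v] is the tree of node v.
    """
    rng = random.Random(seed)
    labels = []
    tree_edges = set()
    members = [[] for _ in range(k)]
    # Per tree: each node appears once per incident edge, so a uniform draw
    # from this list selects a parent with probability proportional to degree.
    degree_lists = [[] for _ in range(k)]
    for node in range(n_nodes):
        c = node % k  # assumption: round-robin assignment of arrivals to trees
        labels.append(c)
        if members[c]:
            if degree_lists[c]:
                parent = rng.choice(degree_lists[c])
            else:
                parent = members[c][0]  # second arrival attaches to the root
            tree_edges.add((parent, node))
            degree_lists[c].extend([parent, node])
        members[c].append(node)
    cross_edges = set()
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            # assumption: random edges are placed only across communities
            if labels[u] != labels[v] and rng.random() < p_between:
                cross_edges.add((u, v))
    return labels, tree_edges, cross_edges

labels, tree_edges, cross_edges = grow_network(30, 3, 0.05, seed=1)
```

A detection algorithm would see only the union of `tree_edges` and `cross_edges`, without arrival order or labels, which is what makes recovery of late-arriving, low-degree nodes hard.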

2026-06-05T16:48:38Z 69 pages, 16 figures, 7 tables Jianxiang Wang Min Xu