https://arxiv.org/api/hBUa75OIMBqBsp1aa6CKAXjeXHQ
2026-06-14T00:41:52Z
3617170515

http://arxiv.org/abs/2605.21846v1
Causal Discovery in Structural VAR Models Under Equal Noise Variance
2026-05-21T00:41:11Z
Causal discovery from multivariate time series is challenging when causal effects may occur both across time and within the same sampling interval. This issue is especially important in applications such as neuroscience, where the sampling rate may be coarse relative to the underlying dynamics and contemporaneous effects need not form an acyclic graph. We study causal discovery in linear Gaussian structural VAR models under an equal noise variance assumption, meaning that the structural noise terms have a common variance. Unlike the DAG-based cross-sectional equal noise variance setting, the time-series setting considered here does not generally yield point identification of a unique causal graph. Instead, multiple structural VAR parameterizations can induce the same stationary observed process law. We introduce a notion of observational equivalence tailored to this setting and show that the corresponding equivalence class is characterized by orthogonal transformations of the structural equations together with a global positive scale. This characterization leads to an equivalence-aware model discrepancy, the observational alignment discrepancy, which compares structural models modulo transformations that preserve the observed law. Building on this theory, we propose ENVAR, a sparsity-based procedure that searches over the induced observational equivalence class for a sparse normalized structural representative.
We evaluate the proposed methodology on synthetic structural VAR data and on an fMRI dataset.
2026-05-21T00:41:11Z
SeyedSina Seyedi HasanAbadi, Fahimeh Arab, Erfan Nozari, AmirEmad Ghassami

http://arxiv.org/abs/2401.00139v3
Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning
2026-05-21T00:09:51Z
This paper introduces a causal attribution model to enhance the interpretability of large language models (LLMs) and improve their causal reasoning abilities via precise fine-tuning. Despite LLMs' proficiency in diverse tasks, their reasoning processes often remain a black box, which restricts targeted enhancement. We propose a novel causal attribution model that utilizes "do-operators" for constructing interventional scenarios, allowing us to systematically quantify the contribution of different components to LLMs' causal reasoning process. By assessing the proposed attribution scores through causal discovery tasks across various domains, we demonstrate that LLMs' effectiveness in causal discovery heavily relies on provided context and domain-specific knowledge but can also utilize numerical data with limited calculations in correlation, not causation. This motivates the proposed fine-tuned LLM for pairwise causal discovery, effectively and correctly leveraging both knowledge and numerical information.
2023-12-30T04:51:46Z
A Python implementation of our proposed method is available at https://github.com/ncsulsj/Causal_LLM
Hengrui Cai, Shengjie Liu, Rui Song

http://arxiv.org/abs/2605.21813v1
Symbolic Density Estimation for Discrete Distributions
2026-05-20T23:22:21Z
Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations.
We introduce symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions by composing elementary analytic operations within a structured search space. Our method integrates domain-specific structural priors with evolutionary search and a validity-aware inference stage, and it extends to richer distribution families such as zero inflation and finite mixtures. To support systematic evaluation and future research, we contribute a benchmark dataset spanning a broad collection of commonly used discrete distributions. The proposed algorithm recovers all benchmark families with accurate parameter estimates. A real data application shows that it identifies concise and interpretable mixture models that improve goodness-of-fit over standard models.
2026-05-20T23:22:21Z
28 pages, 5 figures, 22 tables
Ziwen Liu, Meng Li

http://arxiv.org/abs/2602.07252v2
Beyond Euclidean Summaries: Online Change Point Detection for Distribution-Valued Data
2026-05-20T23:05:01Z
Existing online change-point detection (CPD) methods rely on fixed-dimensional Euclidean summaries, implicitly assuming that distributional changes are well captured by moment-based or feature-based representations. They can obscure important changes in distributional shape or geometry. We propose an intrinsic distribution-valued CPD framework that treats streaming batch data as a stochastic process on the 2-Wasserstein space. Our method detects changes in the law of this process by mapping each empirical distribution to a tangent space relative to a pre-change Fréchet barycenter, yielding a reference-centered local linearization of 2-Wasserstein space. This representation enables sequential detectors by adapting classical multivariate monitoring statistics to tangent fields.
We provide theoretical guarantees and demonstrate, via synthetic and real-world experiments, that our approach detects complex distributional shifts with reduced detection delay at matched $\mathrm{ARL}_0$ compared with moments-based and model-free baselines. The code is available at https://github.com/yyzeng43/IDD-icml .
2026-02-06T23:04:37Z
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. C
Yingyan Zeng, Yujing Huang, Xiaoyu Chen

http://arxiv.org/abs/2605.21793v1
Targeted maximum likelihood estimation of vaccine effectiveness and immune correlates in test-negative design studies with missing data
2026-05-20T22:38:26Z
The test-negative design (TND) is a resource-efficient observational study design that can assess vaccine effectiveness and exposure-proximal immune correlates of disease. The TND enrolls symptomatic individuals seeking diagnostic testing and compares case status by an exposure variable, such as vaccination status or immune marker level, that is measured at testing. While the TND reduces confounding by healthcare-seeking behavior, other sources of confounding may remain. TND studies may also have missing data in the exposure variable due to incomplete records or two-phase sampling designs. We present a targeted maximum likelihood estimation approach involving a semiparametric logistic regression model that targets a causal conditional risk ratio of symptomatic disease in the healthcare-seeking population. Under causal and missing at random assumptions, our method produces an efficient, asymptotically linear estimator that provides flexible, data-driven confounding control and valid causal inference when analyzing TND studies with missing exposure variable data. We evaluate our method's finite sample properties using plasmode simulations of a two-phase TND immune correlates study.
We also apply our method to assess COVID-19 vaccine effectiveness and antibody marker correlates of COVID-19 from TND study cohorts derived from the Moderna Coronavirus Efficacy phase 3 trial.
2026-05-20T22:38:26Z
52 pages, 14 figures
Leah I. B. Andrews, Lars van der Laan, Peter B. Gilbert

http://arxiv.org/abs/2605.21782v1
A Scalable Parametric Item Calibration Engine (SPICE) for Explanatory IRT with Sparse Data
2026-05-20T22:22:06Z
We describe a Bayesian multidimensional explanatory IRT model, and an associated Markov Chain Monte Carlo (MCMC) estimation procedure and the corresponding development of calibration software, designed for psychometric analyses of large numbers of sparsely-linked persons and items. Such data structures can arise, for example, from adaptive assessments using large banks of automatically generated items with individual test takers receiving a very small proportion of the entire bank. We discuss how our choices for model specification, data structures, and algorithm implementation combine to create a scalable method for explanatory IRT that can support a variety of psychometric operations with sparse data.
2026-05-20T22:22:06Z
Steven W. Nydick, Manqian Liao, J. R. Lockwood

http://arxiv.org/abs/2307.05732v5
From Isotonic to Lipschitz Regression: A New Interpolative Perspective on Shape-restricted Estimation
2026-05-20T21:49:19Z
This manuscript bridges nonparametric smoothness-based and shape-restricted estimation, which may appear as two disjoint paradigms in the field. The proposed approach is motivated by a conceptually simple observation: every Lipschitz function is a sum of a monotonic and a linear function. This principle is further generalized to the higher-order monotonicity and multivariate settings. A family of estimators is proposed based on a sample-splitting procedure, inheriting desirable methodological, theoretical, and computational properties of shape-restricted estimators.
The theoretical analysis provides convergence guarantees of the estimator under heteroscedastic and heavy-tailed errors, as well as adaptivity to the unknown ``complexity" of the true regression function. The generality of the proposed decomposition framework is demonstrated through new approximation results and numerical studies.
2023-07-11T18:59:27Z
Kenta Takatsu, Tianyu Zhang, Arun Kumar Kuchibhotla

http://arxiv.org/abs/2605.21757v1
Substantive-Model-Compatible Multiple Imputation for Cox Regression with a Diverging Number of Covariates
2026-05-20T21:34:13Z
Modern biomedical survival studies with high-dimensional genomic and clinical predictors are challenged by missing covariates. Existing methods conduct inference through penalization and debiasing when the number of covariates diverges with sample size, but they are typically developed with fully observed covariates. Conversely, substantive-model-compatible multiple imputation methods, particularly substantive-model-compatible fully conditional specification (SMC-FCS), provide principled handling of missing covariates while preserving compatibility with the Cox model, yet current methodology and theory remain largely restricted to fixed-dimensional settings. To address these limitations, we propose a semiparametric multiple imputation framework for inference in Cox regression with missing covariates of a diverging dimension. Missing covariates are imputed through a high-dimensional SMC-FCS procedure driven by Cox-model likelihood contributions, with rejection sampling used to enforce substantive-model compatibility and ridge-regularized posterior draws used to stabilize the imputation models. The algorithm stabilizes the Cox estimator through an imputation-regularized optimization iteration and then generates multiply imputed datasets from a stabilized chain.
Inference for low-dimensional linear functionals or contrasts, $c^\top \beta$, is obtained by combining debiased estimators and within-imputation variance estimates through Rubin's rules. We establish consistency and asymptotic normality of the resulting pooled estimator under a diverging-dimensional regime. Simulation studies demonstrate favorable finite-sample performance, and an application to the Boston Lung Cancer Survival Cohort illustrates the practical utility of the proposed method for high-dimensional survival studies with incomplete covariates.
2026-05-20T21:34:13Z
Zhilin Zhang, Yi Li

http://arxiv.org/abs/2501.07772v4
Honest Inference for Stochastic Optimization
2026-05-20T21:23:52Z
This manuscript studies a general approach to constructing confidence sets for the solution of a stochastic optimization problem, with empirical risk minimization as a special case. Statistical inference for stochastic optimization poses significant challenges due to the non-standard limiting behaviors of the corresponding estimator, which arise in settings with increasing dimension of parameters, non-smooth objectives, or constraints. We propose a simple and unified method that guarantees validity in both regular and irregular cases. We provide a unified treatment of validity, conservativeness, and the size of the resulting confidence sets. In particular, the presented width analysis demonstrates the adaptive behavior of the confidence set to the unknown degree of instance-specific regularity. We apply the proposed method to several high-dimensional and irregular statistical problems.
Numerical results for all statistical applications are provided.
2025-01-14T01:07:30Z
Kenta Takatsu, Arun Kumar Kuchibhotla

http://arxiv.org/abs/2605.16108v2
Estimating Association Between Paired Outcomes in Clustered Data with Informative Subgroup Size
2026-05-20T20:43:50Z
Informative cluster size (ICS) and informative subgroup size (ISS) can distort marginal association estimates when the number of observed units, or their distribution across outcome-defined categories, is related to the outcomes under study. This issue is especially relevant for paired outcomes, where the observed association can depend on cluster size, paired-category composition, and the process by which units become available for analysis. We propose three weighted estimating approaches for marginal association between paired outcomes in clustered data. The weights are derived from within-cluster resampling arguments and extend inverse cluster-size and subgroup-size weighting to paired outcome categories. We also modify an existing ISS testing procedure by utilizing Stouffer's method to reduce computational burden. To evaluate the methods, we develop a simulator for clustered paired outcomes that separates unit-level association, latent cluster-level association, and outcome-dependent retention. Simulations show that pair-based weighting can reduce bias when association arises through unit-level dependence and subgroup composition is informative, but can attenuate association carried by latent cluster-level structure. Typical inverse-cluster weighting remains more stable when the association is primarily cluster-level. Application to NHANES oral-health data shows small positive periodontal and caries associations overall, with filled-surface outcomes showing stronger ISS evidence and greater sensitivity to pair-based weighting than decayed-surface outcomes.
These results indicate that marginal association under ICS and ISS should be interpreted in relation to the source of association, observed-unit structure, and assumptions used to choose the weighting scheme.
2026-05-15T15:56:03Z
Owen Visser, Somnath Datta

http://arxiv.org/abs/2605.21651v1
Similarity-Driven Proposals for MCMC Algorithms on Discrete Spaces
2026-05-20T19:06:33Z
Recent research has led to the development of MCMC algorithms with likelihood-informed proposals when targeting posterior distributions supported on discrete state spaces. Our work is placed within this field and puts forward a new MCMC methodology based upon similarity-driven proposals. Such proposals sway transitions towards states favored by the posterior via use of a data-driven measure of discrepancy between observations and the proposed model. Our approach can naturally cover classes of hierarchical models that involve both discrete variables and additional latent ones, without a requirement of integrating out the latter, in contrast to previous works in this field. The new algorithms are illustrated in simulation settings and in an involved real data scenario with a Dirichlet-Multinomial regression model.
2026-05-20T19:06:33Z
Luca Aiello, Raffaele Argiento, Alexandros Beskos, Maria De Iorio

http://arxiv.org/abs/2605.21627v1
Distribution-free root cause analysis
2026-05-20T18:40:04Z
We study distribution-free root cause analysis in multi-stream data, where an evolving underlying system is observed through multiple data streams that may each undergo distributional changes at unknown timepoints. In such settings, the stream exhibiting the earliest change provides a natural starting point for investigating the underlying cause, which we refer to as the root-cause index.
Leveraging conformal $p$-values, we propose a novel framework, Conformal Root Cause Analysis (CROC), which constructs finite-sample valid confidence sets for the root-cause index under minimal assumptions: the data streams are independent, and within each stream the pre- and post-change observations are sampled exchangeably from arbitrary and unknown distributions. We further establish a universality property, showing that any distribution-free method for root cause localization can be represented within the CROC framework. In addition, under mild regularity conditions and principled score design, our method yields asymptotically sharp confidence sets that efficiently isolate the root cause. We further extend CROC to efficiently handle cross-stream dependence when present. Extensive simulations demonstrate accurate localization of the root stream, supporting our theoretical guarantees.
2026-05-20T18:40:04Z
34 pages, 4 figures
Rohan Hore, Aaditya Ramdas

http://arxiv.org/abs/2605.21458v1
Mind the Sim-to-Real Gap & Think Like a Scientist
2026-05-20T17:48:14Z
Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes the simulator's value error into a calibration--deployment shift that randomization can identify and a parametric residual that no further interaction can reduce. Second, the value gap between the simulator-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not. The reachability component stays bounded away from zero at any horizon under purely passive learning.
Third, we propose Fisher-SEP, a simulation-aided experimental policy (SEP) that minimizes the posterior predictive variance of a target policy's value, with reward-only and transition-only specializations. Two case studies illustrate the regimes. In a vending-machine supply chain, front-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot. In an HIV mobile-testing example with a corridor that separates a well-surveilled region from a poorly-surveilled one, only designed exploration reaches the poorly-surveilled region.
2026-05-20T17:48:14Z
Harsh Parikh, Gabriel Levin-Konigsberg, Dominique Perrault-Joncas, Alexander Volfovsky

http://arxiv.org/abs/2605.21416v1
Data driven extreme value distribution estimation: Derivation of the Mean Integrated Squared Error, optimal bandwidth selection and stability conditions
2026-05-20T17:12:33Z
We introduce the data driven extreme value distribution (DDEVD) estimator, a kernel-based method for estimating extreme value distributions from data. We derive its mean integrated squared error (MISE) in detail, use it to compute the optimal bandwidth and establish stability conditions for the bandwidth optimization procedure.
2026-05-20T17:12:33Z
37 pages, 5 figures
Michael Sandbichler, Tobias Hell

http://arxiv.org/abs/2605.21408v1
TCARD: Nearly Balanced Two-Level Designs with Treatment Cardinality Constraints with an Application to LLM Prompt Engineering
2026-05-20T17:06:34Z
Modern experimental designs often face the so-called treatment cardinality constraint, which is the constraint on the number of included factors in each treatment. Experiments with such constraints are commonly encountered in engineering simulation, AI system tuning, and large-scale system verification. This calls for the development of adequate designs to enable statistical efficiency for modeling and analysis within feasible constraints.
In this work, we study two-level designs under this $k$-treatment cardinality constraint (TCARD), where the design matrix $\mathbf{X} \in \{0,1\}^{n \times p}$ has constant row sums equal to $k$. Although TCARDs are closely related to balanced incomplete block designs (BIBDs), exact BIBD structure is unavailable for many practical $(n,p,k)$ combinations. This leads to the notion of nearly balanced TCARDs, which we prove minimize the first two components of the generalized word-length pattern. We also show that good projection behavior in this setting is governed by two count-based regularities: balanced factor replications and uniform pairwise concurrences. Motivated by this characterization, we then propose the Balanced Concurrence Deviation ($\Phi_{\mathrm{BCD}}$), a model-free objective that jointly penalizes replication imbalance and concurrence dispersion. We further show that this criterion is closely connected to classical optimality principles, including $(M,S)$-optimality, centered $\mathrm{UE}(s^2)$ criterion, and Bayesian $D$-optimality. To construct designs minimizing $\Phi_{\mathrm{BCD}}$, we develop a coordinate-exchange (CE) algorithm with efficient incremental updates, together with a simulation-based procedure for calibrating the criterion weights to the intended downstream task. Numerical experiments confirm that the proposed method compares favorably with existing alternatives across a range of problem sizes and constraint strengths.
2026-05-20T17:06:34Z
Kexin Xie, Ryan Lekivetz, Xinwei Deng
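
The two count-based regularities named in the TCARD abstract, factor replications and pairwise concurrences, can be computed directly from a 0/1 design matrix with constant row sums. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the two sums of squared deviations it reports are simple stand-ins, not the paper's $\Phi_{\mathrm{BCD}}$, whose exact form and weights the abstract does not give. The balanced example is the classical (7,3,1) BIBD, the Fano plane.

```python
from itertools import combinations


def design_from_blocks(blocks, p):
    """Build a 0/1 design matrix (list of rows) from sets of included factors."""
    return [[1 if j in block else 0 for j in range(p)] for block in blocks]


def replications(X):
    """Number of runs in which each factor is included (column sums)."""
    return [sum(col) for col in zip(*X)]


def concurrences(X):
    """Number of runs in which each pair of factors is included together."""
    p = len(X[0])
    return {
        (i, j): sum(row[i] * row[j] for row in X)
        for i, j in combinations(range(p), 2)
    }


def squared_deviation(values):
    """Sum of squared deviations from the mean (zero iff all values are equal)."""
    values = list(values)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def balance_summary(X, k):
    """Return (replication imbalance, concurrence dispersion) for a design.

    Illustrative stand-ins only; NOT the paper's Phi_BCD criterion.
    """
    if any(sum(row) != k for row in X):
        raise ValueError("every row must include exactly k factors")
    return (
        squared_deviation(replications(X)),
        squared_deviation(concurrences(X).values()),
    )


# Fano plane: 7 runs, 7 factors, k = 3; every pair of factors concurs once.
FANO_BLOCKS = [{0, 1, 2}, {0, 3, 4}, {0, 5, 6}, {1, 3, 5},
               {1, 4, 6}, {2, 3, 6}, {2, 4, 5}]
fano = design_from_blocks(FANO_BLOCKS, p=7)

# A deliberately unbalanced design: factor 0 appears in every run.
unbalanced = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]

if __name__ == "__main__":
    print(balance_summary(fano, k=3))
    print(balance_summary(unbalanced, k=2))
```

Both summaries are zero for the Fano design because it is exactly balanced, and both are positive for the unbalanced design. A nearly balanced TCARD in the abstract's sense would sit as close to the exactly balanced case as the given $(n,p,k)$ allows.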