https://arxiv.org/api/5LvVRL9RauW9OaFjVnYn+dQ2+BU
2026-06-18T18:07:55Z

http://arxiv.org/abs/2605.20828v2
Adaptive Test for Jump
2026-05-21T04:05:33Z
We develop an adaptive jump test for discretely observed high-frequency semimartingales by combining the Aït-Sahalia--Jacod ratio statistic (Aït-Sahalia and Jacod, 2009) and the Lee--Mykland extreme-return statistic (Lee and Mykland, 2008) with the Cauchy combination rule. Allowing stochastic Itô drift, volatility, and leverage, we show asymptotic independence under the continuous-path null and dense local alternatives, yielding an analytically calibrated test with closed-form power; under finite-activity jumps, the test is consistent. We also extend the method to additive microstructure noise. Simulations show that the combined procedure performs well under both dense and sparse alternatives and is typically best overall.
2026-05-20T07:21:06Z
Huifang Ma, Long Feng

http://arxiv.org/abs/2605.21928v1
CausalGuard: Conformal Inference under Graph Uncertainty
2026-05-21T02:56:46Z
Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome.
CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.
2026-05-21T02:56:46Z
Vikash Singh, Weicong Chen, Debargha Ganguly, Yanyan Zhang, Nengbo Wang, Sreehari Sankar, Mohsen Hariri, Alexander Nemecek, Chaoda Song, Shouren Wang, Biyao Zhang, Van Yang, Erman Ayday, Jing Ma, Vipin Chaudhary

http://arxiv.org/abs/2605.21884v1
Trend and seasonality estimation for point-process time series
2026-05-21T01:44:48Z
This article introduces estimators of trend and seasonality for time series of point processes. We assume the point processes follow a temporal or spatial doubly-stochastic Poisson model with log-Gaussian intensity functions. The proposed estimators are computationally simple M-estimators. Their asymptotic distribution is derived, and their finite-sample performance is studied by simulation. As an example of real-data application, we study the patterns of bike demand in the Divvy bike-sharing system of the city of Chicago.
2026-05-21T01:44:48Z
Daniel Gervini, Simon A. Kopischke

http://arxiv.org/abs/2605.21848v1
Block-Independent Likelihood Ratio Testing for High-Dimensional Mean Vectors with Applications to Matrix-Variate Data
2026-05-21T00:44:47Z
Testing the equality of two high-dimensional mean vectors is a fundamental problem in multivariate analysis.
While the classical Hotelling's $T^2$ test is optimal in low-dimensional settings, it fails when the dimension $p$ is comparable to or exceeds the sample size $n$. Several extensions, including the Diagonal Likelihood Ratio Test (DLRT), have been proposed under the working independence assumption among variables. However, such an assumption can lead to a substantial loss of power when correlations are present. In this paper, we propose a new test, the Block Independent Likelihood Ratio Test (BILT), which generalizes DLRT by relaxing the working independence assumption to a block independence assumption. We establish the asymptotic normality of the null distribution of the BILT statistic for 'increasing $p$ with small $n$' under mild regularity conditions. We further analyze the asymptotic power of BILT under local alternatives. Extensive simulation studies show that BILT maintains Type I error control and achieves substantially higher power than DLRT across a wide range of covariance structures. An application to the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset further demonstrates the use of BILT for testing mean differences between two matrix-variate populations.
2026-05-21T00:44:47Z
Minsub Shin, Kwangok Seo, Sang Han Lee, Johan Lim

http://arxiv.org/abs/2605.21846v1
Causal Discovery in Structural VAR Models Under Equal Noise Variance
2026-05-21T00:41:11Z
Causal discovery from multivariate time series is challenging when causal effects may occur both across time and within the same sampling interval. This issue is especially important in applications such as neuroscience, where the sampling rate may be coarse relative to the underlying dynamics and contemporaneous effects need not form an acyclic graph. We study causal discovery in linear Gaussian structural VAR models under an equal noise variance assumption, meaning that the structural noise terms have a common variance.
Unlike the DAG-based cross-sectional equal noise variance setting, the time-series setting considered here does not generally yield point identification of a unique causal graph. Instead, multiple structural VAR parameterizations can induce the same stationary observed process law. We introduce a notion of observational equivalence tailored to this setting and show that the corresponding equivalence class is characterized by orthogonal transformations of the structural equations together with a global positive scale. This characterization leads to an equivalence-aware model discrepancy, the observational alignment discrepancy, which compares structural models modulo transformations that preserve the observed law. Building on this theory, we propose ENVAR, a sparsity-based procedure that searches over the induced observational equivalence class for a sparse normalized structural representative. We evaluate the proposed methodology on synthetic structural VAR data and on an fMRI dataset.
2026-05-21T00:41:11Z
SeyedSina Seyedi HasanAbadi, Fahimeh Arab, Erfan Nozari, AmirEmad Ghassami

http://arxiv.org/abs/2401.00139v3
Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning
2026-05-21T00:09:51Z
This paper introduces a causal attribution model to enhance the interpretability of large language models (LLMs) and improve their causal reasoning abilities via precise fine-tuning. Despite LLMs' proficiency in diverse tasks, their reasoning processes often remain a black box, which restricts targeted enhancement. We propose a novel causal attribution model that utilizes "do-operators" for constructing interventional scenarios, allowing us to systematically quantify the contribution of different components to LLMs' causal reasoning process.
By assessing the proposed attribution scores through causal discovery tasks across various domains, we demonstrate that LLMs' effectiveness in causal discovery heavily relies on provided context and domain-specific knowledge but can also utilize numerical data with limited calculations in correlation, not causation. This motivates the proposed fine-tuned LLM for pairwise causal discovery, effectively and correctly leveraging both knowledge and numerical information.
2023-12-30T04:51:46Z
A Python implementation of our proposed method is available at https://github.com/ncsulsj/Causal_LLM
Hengrui Cai, Shengjie Liu, Rui Song

http://arxiv.org/abs/2605.21813v1
Symbolic Density Estimation for Discrete Distributions
2026-05-20T23:22:21Z
Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations. We introduce symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions by composing elementary analytic operations within a structured search space. Our method integrates domain-specific structural priors with evolutionary search and a validity-aware inference stage, and it extends to richer distribution families such as zero inflation and finite mixtures. To support systematic evaluation and future research, we contribute a benchmark dataset spanning a broad collection of commonly used discrete distributions. The proposed algorithm recovers all benchmark families with accurate parameter estimates.
A real-data application shows that it identifies concise and interpretable mixture models that improve goodness-of-fit over standard models.
2026-05-20T23:22:21Z
28 pages, 5 figures, 22 tables
Ziwen Liu, Meng Li

http://arxiv.org/abs/2602.07252v2
Beyond Euclidean Summaries: Online Change Point Detection for Distribution-Valued Data
2026-05-20T23:05:01Z
Existing online change-point detection (CPD) methods rely on fixed-dimensional Euclidean summaries, implicitly assuming that distributional changes are well captured by moment-based or feature-based representations. Such summaries can obscure important changes in distributional shape or geometry. We propose an intrinsic distribution-valued CPD framework that treats streaming batch data as a stochastic process on the 2-Wasserstein space. Our method detects changes in the law of this process by mapping each empirical distribution to a tangent space relative to a pre-change Fréchet barycenter, yielding a reference-centered local linearization of 2-Wasserstein space. This representation enables sequential detectors by adapting classical multivariate monitoring statistics to tangent fields. We provide theoretical guarantees and demonstrate, via synthetic and real-world experiments, that our approach detects complex distributional shifts with reduced detection delay at matched $\mathrm{ARL}_0$ compared with moment-based and model-free baselines. The code is available at https://github.com/yyzeng43/IDD-icml .
2026-02-06T23:04:37Z
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. C
Yingyan Zeng, Yujing Huang, Xiaoyu Chen

http://arxiv.org/abs/2605.21793v1
Targeted maximum likelihood estimation of vaccine effectiveness and immune correlates in test-negative design studies with missing data
2026-05-20T22:38:26Z
The test-negative design (TND) is a resource-efficient observational study design that can assess vaccine effectiveness and exposure-proximal immune correlates of disease.
The TND enrolls symptomatic individuals seeking diagnostic testing and compares case status by an exposure variable, such as vaccination status or immune marker level, that is measured at testing. While the TND reduces confounding by healthcare-seeking behavior, other sources of confounding may remain. TND studies may also have missing data in the exposure variable due to incomplete records or two-phase sampling designs. We present a targeted maximum likelihood estimation approach involving a semiparametric logistic regression model that targets a causal conditional risk ratio of symptomatic disease in the healthcare-seeking population. Under causal and missing at random assumptions, our method produces an efficient, asymptotically linear estimator that provides flexible, data-driven confounding control and valid causal inference when analyzing TND studies with missing exposure variable data. We evaluate our method's finite sample properties using plasmode simulations of a two-phase TND immune correlates study. We also apply our method to assess COVID-19 vaccine effectiveness and antibody marker correlates of COVID-19 from TND study cohorts derived from the Moderna Coronavirus Efficacy phase 3 trial.
2026-05-20T22:38:26Z
52 pages, 14 figures
Leah I. B. Andrews, Lars van der Laan, Peter B. Gilbert

http://arxiv.org/abs/2605.21782v1
A Scalable Parametric Item Calibration Engine (SPICE) for Explanatory IRT with Sparse Data
2026-05-20T22:22:06Z
We describe a Bayesian multidimensional explanatory IRT model, an associated Markov chain Monte Carlo (MCMC) estimation procedure, and the corresponding calibration software, all designed for psychometric analyses of large numbers of sparsely linked persons and items. Such data structures can arise, for example, from adaptive assessments using large banks of automatically generated items with individual test takers receiving a very small proportion of the entire bank.
We discuss how our choices for model specification, data structures, and algorithm implementation combine to create a scalable method for explanatory IRT that can support a variety of psychometric operations with sparse data.
2026-05-20T22:22:06Z
Steven W. Nydick, Manqian Liao, J. R. Lockwood

http://arxiv.org/abs/2307.05732v5
From Isotonic to Lipschitz Regression: A New Interpolative Perspective on Shape-restricted Estimation
2026-05-20T21:49:19Z
This manuscript bridges nonparametric smoothness-based and shape-restricted estimation, which may appear as two disjoint paradigms in the field. The proposed approach is motivated by a conceptually simple observation: every Lipschitz function is a sum of a monotonic and a linear function. This principle is further generalized to higher-order monotonicity and multivariate settings. A family of estimators is proposed based on a sample-splitting procedure, inheriting desirable methodological, theoretical, and computational properties of shape-restricted estimators. The theoretical analysis provides convergence guarantees of the estimator under heteroscedastic and heavy-tailed errors, as well as adaptivity to the unknown "complexity" of the true regression function. The generality of the proposed decomposition framework is demonstrated through new approximation results and numerical studies.
2023-07-11T18:59:27Z
Kenta Takatsu, Tianyu Zhang, Arun Kumar Kuchibhotla

http://arxiv.org/abs/2605.21757v1
Substantive-Model-Compatible Multiple Imputation for Cox Regression with a Diverging Number of Covariates
2026-05-20T21:34:13Z
Modern biomedical survival studies with high-dimensional genomic and clinical predictors are challenged by missing covariates. Existing methods conduct inference through penalization and debiasing when the number of covariates diverges with sample size, but they are typically developed with fully observed covariates.
Conversely, substantive-model-compatible multiple imputation methods, particularly substantive-model-compatible fully conditional specification (SMC-FCS), provide principled handling of missing covariates while preserving compatibility with the Cox model, yet current methodology and theory remain largely restricted to fixed-dimensional settings. To address these limitations, we propose a semiparametric multiple imputation framework for inference in Cox regression with missing covariates of a diverging dimension. Missing covariates are imputed through a high-dimensional SMC-FCS procedure driven by Cox-model likelihood contributions, with rejection sampling used to enforce substantive-model compatibility and ridge-regularized posterior draws used to stabilize the imputation models. The algorithm stabilizes the Cox estimator through an imputation-regularized optimization iteration and then generates multiply imputed datasets from a stabilized chain. Inference for low-dimensional linear functionals or contrasts, $c^\top \beta$, is obtained by combining debiased estimators and within-imputation variance estimates through Rubin's rules. We establish consistency and asymptotic normality of the resulting pooled estimator under a diverging-dimensional regime. Simulation studies demonstrate favorable finite-sample performance, and an application to the Boston Lung Cancer Survival Cohort illustrates the practical utility of the proposed method for high-dimensional survival studies with incomplete covariates.
2026-05-20T21:34:13Z
Zhilin Zhang, Yi Li

http://arxiv.org/abs/2501.07772v4
Honest Inference for Stochastic Optimization
2026-05-20T21:23:52Z
This manuscript studies a general approach to constructing confidence sets for the solution of a stochastic optimization problem, with empirical risk minimization as a special case.
Statistical inference for stochastic optimization poses significant challenges due to the non-standard limiting behaviors of the corresponding estimator, which arise in settings with increasing dimension of parameters, non-smooth objectives, or constraints. We propose a simple and unified method that guarantees validity in both regular and irregular cases. We provide a unified treatment of validity, conservativeness, and the size of the resulting confidence sets. In particular, the presented width analysis demonstrates the adaptive behavior of the confidence set to the unknown degree of instance-specific regularity. We apply the proposed method to several high-dimensional and irregular statistical problems. Numerical results for all statistical applications are provided.
2025-01-14T01:07:30Z
Kenta Takatsu, Arun Kumar Kuchibhotla

http://arxiv.org/abs/2605.16108v2
Estimating Association Between Paired Outcomes in Clustered Data with Informative Subgroup Size
2026-05-20T20:43:50Z
Informative cluster size (ICS) and informative subgroup size (ISS) can distort marginal association estimates when the number of observed units, or their distribution across outcome-defined categories, is related to the outcomes under study. This issue is especially relevant for paired outcomes, where the observed association can depend on cluster size, paired-category composition, and the process by which units become available for analysis. We propose three weighted estimating approaches for marginal association between paired outcomes in clustered data. The weights are derived from within-cluster resampling arguments and extend inverse cluster-size and subgroup-size weighting to paired outcome categories. We also modify an existing ISS testing procedure by utilizing Stouffer's method to reduce computational burden. To evaluate the methods, we develop a simulator for clustered paired outcomes that separates unit-level association, latent cluster-level association, and outcome-dependent retention.
Simulations show that pair-based weighting can reduce bias when association arises through unit-level dependence and subgroup composition is informative, but can attenuate association carried by latent cluster-level structure. Typical inverse-cluster weighting remains more stable when the association is primarily cluster-level. Application to NHANES oral-health data shows small positive periodontal and caries associations overall, with filled-surface outcomes showing stronger ISS evidence and greater sensitivity to pair-based weighting than decayed-surface outcomes. These results indicate that marginal association under ICS and ISS should be interpreted in relation to the source of association, observed-unit structure, and assumptions used to choose the weighting scheme.
2026-05-15T15:56:03Z
Owen Visser, Somnath Datta

http://arxiv.org/abs/2605.21651v1
Similarity-Driven Proposals for MCMC Algorithms on Discrete Spaces
2026-05-20T19:06:33Z
Recent research has led to the development of MCMC algorithms with likelihood-informed proposals when targeting posterior distributions supported on discrete state spaces. Our work is placed within this field and puts forward a new MCMC methodology based upon similarity-driven proposals. Such proposals sway transitions towards states favored by the posterior via use of a data-driven measure of discrepancy between observations and the proposed model. Our approach can naturally cover classes of hierarchical models that involve both discrete variables and additional latent ones, without a requirement of integrating out the latter, in contrast to previous works in this field. The new algorithms are illustrated in simulation settings and in an involved real-data scenario with a Dirichlet-Multinomial regression model.
2026-05-20T19:06:33Z
Luca Aiello, Raffaele Argiento, Alexandros Beskos, Maria De Iorio
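
As a rough illustration of the likelihood-informed proposals that the final abstract refers to, the Python sketch below runs a Metropolis-Hastings sampler on binary vectors in which each single-coordinate flip is proposed with weight equal to the square root of the target ratio (a locally balanced choice). This is a generic informed proposal, not the similarity-driven method of the paper, whose details the abstract does not give. The target `log_target`, its weights, and all function names are hypothetical and chosen only for the example.

```python
import math
import random


def log_target(x):
    """Hypothetical unnormalized log-posterior on {0,1}^4 (illustration only)."""
    weights = [1.5, -0.5, 0.8, -1.2]
    s = sum(w * xi for w, xi in zip(weights, x))
    # Pairwise coupling between neighbours so the coordinates are dependent.
    s += 0.7 * sum(x[i] * x[i + 1] for i in range(len(x) - 1))
    return s


def flip(x, i):
    """Return a copy of x with coordinate i flipped."""
    y = list(x)
    y[i] = 1 - y[i]
    return tuple(y)


def proposal_probs(x):
    """Probability of proposing each single-coordinate flip from state x.

    Each neighbour y is weighted by g(pi(y) / pi(x)) with g(t) = sqrt(t),
    so flips towards higher target mass are proposed more often.
    """
    lx = log_target(x)
    w = [math.exp(0.5 * (log_target(flip(x, i)) - lx)) for i in range(len(x))]
    total = sum(w)
    return [wi / total for wi in w]


def accept_prob(x, i):
    """Metropolis-Hastings acceptance probability for flipping coordinate i."""
    y = flip(x, i)
    q_xy = proposal_probs(x)[i]  # probability of proposing y from x
    q_yx = proposal_probs(y)[i]  # probability of proposing x from y
    ratio = math.exp(log_target(y) - log_target(x)) * q_yx / q_xy
    return min(1.0, ratio)


def informed_mh(x0, n_steps, rng):
    """Run the informed sampler from x0 and return the visited states."""
    x = tuple(x0)
    samples = []
    for _ in range(n_steps):
        probs = proposal_probs(x)
        i = rng.choices(range(len(x)), weights=probs)[0]
        if rng.random() < accept_prob(x, i):
            x = flip(x, i)
        samples.append(x)
    return samples


if __name__ == "__main__":
    chain = informed_mh((0, 0, 0, 0), 5000, random.Random(0))
    print("fraction of steps with x[0] == 1:", sum(s[0] for s in chain) / len(chain))
```

The acceptance step corrects for the asymmetry of the informed proposal, so the chain still targets the stated distribution; the proposal only changes how quickly high-probability states are reached.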