http://arxiv.org/abs/2601.11845v2 Reevaluating Causal Estimation Methods with Data from a Product Release 2026-05-21T15:43:45Z

Recent developments in causal machine learning methods have made it easier to estimate flexible relationships between confounders, treatments and outcomes, making unconfoundedness assumptions in causal analysis more palatable. How successful are these approaches in recovering ground truth baselines? In this paper we analyze a new data sample including an experimental rollout of a new feature at a large technology company and a simultaneous sample of users who endogenously opted into the feature. We find that recovering ground truth causal effects is feasible -- but only with careful modeling choices. Our results build on the observational causal literature beginning with LaLonde (1986), offering best practices for more credible treatment effect estimation in modern, high-dimensional datasets.

2026-01-17T00:25:03Z 53 pages Justin Young Eleanor Wiske Dillon http://arxiv.org/abs/2509.05443v3 Multidimensional constructs and moderated linear and nonlinear factor analysis 2026-05-21T15:30:00Z

Multidimensional factor models with moderations on all model parameters have so far been limited to single-factor and two-factor models. This does not align well with existing psychological measures, which are commonly intended to assess 3-5 dimensions of a latent construct. In this paper, I introduce a multidimensional moderated nonlinear factor analysis (MNLFA) model that permits the moderation of item intercepts, loadings, residual variances, factor means, variances, and correlations across three or more latent factors. I describe efforts to implement the model using Bayesian methods through Stan and penalized maximum likelihood approaches to stabilize estimation and detect partial measurement non-invariance while preserving model interpretability. I provide closed-form analytic gradients of the likelihood, eliminating the need for costly numerical or MCMC-based approximations. I conclude by discussing the theoretical implications of penalization for measurement invariance, computational considerations, and future directions for extending the framework to categorical indicators, longitudinal data, and applied research contexts.

2025-09-05T18:49:54Z 22 pages, 2 figures R. Noah Padgett http://arxiv.org/abs/2605.22595v1 A new class of functional conditional autoregressive models 2026-05-21T15:12:35Z

We introduce a new class of conditional autoregressive models for spatially dependent functional data, formulated through conditional means given neighboring functional observations and characterized by a covariance operator and a spatial dependence parameter. Our estimation strategy consists of three components: (i) estimating the covariance operator using conditionally centered data, (ii) estimating the spatial dependence parameter by maximizing the likelihood of projected observations, and (iii) applying a novel profile-based approach to obtain the final estimators. Under an expanding lattice framework, we establish two key theoretical results. First, we show that the proposed covariance estimator is consistent, which is not attainable using naive methods based on marginally centered data. Second, we prove that the spatial dependence parameter estimator is superconsistent and asymptotically normal; asymptotic normality enables statistical inference for spatial dependence in functional data, a contribution new to the literature. Numerical studies support the theoretical results and demonstrate the computational efficiency of our method. Finally, we illustrate its practical utility by analyzing weekly PM$_{2.5}$ concentration trajectories in 2019 across counties in the Midwestern United States.

2026-05-21T15:12:35Z Sooran Kim http://arxiv.org/abs/2505.18391v4 Bayesian Estimation of Cohort-Time-Stratum Specific Effects in Staggered Difference-in-Differences 2026-05-21T15:10:12Z

Difference-in-Differences designs with staggered treatment adoption are widely used to study heterogeneous treatment effects across cohorts and time periods. We develop a probabilistic framework for estimating potentially high-dimensional ATT arrays that vary across cohorts, periods, and strata defined by baseline covariates. The framework jointly estimates subgroup-specific treatment effects through a unified likelihood-based model, stabilizing inference in sparse cohort-by-time-by-stratum settings. We establish a Bernstein-von Mises theorem for the ATT array, implying asymptotically valid frequentist coverage of posterior credible intervals. Simulations and an application to minimum wage increases and teen employment demonstrate meaningful finite-sample improvements and important subgroup heterogeneity.
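
For orientation, one cell of an ATT array can be illustrated with the textbook 2x2 difference-in-differences contrast. The sketch below is not the Bayesian likelihood-based model of the abstract; it assumes never-treated units as controls and the period just before adoption as the base period. The function name `att_gt`, the panel layout, and the toy numbers are all hypothetical.

```python
from statistics import mean


def att_gt(panel, cohort, period):
    """2x2 DiD estimate for units first treated in `cohort`, evaluated at `period`.

    Assumptions (illustrative only): never-treated units (cohort None) are the
    controls, and the base period is cohort - 1.
    """
    base = cohort - 1

    def change(units):
        # Average outcome change between the base period and the target period.
        return mean(u["y"][period] - u["y"][base] for u in units)

    treated = [u for u in panel if u["cohort"] == cohort]
    controls = [u for u in panel if u["cohort"] is None]
    return change(treated) - change(controls)


# Toy panel: common trend of +1 per period, treatment effect of +2 from period 2 on.
panel = [
    {"cohort": 2, "y": {1: 10.0, 2: 13.0, 3: 14.0}},
    {"cohort": 2, "y": {1: 8.0, 2: 11.0, 3: 12.0}},
    {"cohort": None, "y": {1: 5.0, 2: 6.0, 3: 7.0}},
    {"cohort": None, "y": {1: 7.0, 2: 8.0, 3: 9.0}},
]

att_22 = att_gt(panel, cohort=2, period=2)
att_23 = att_gt(panel, cohort=2, period=3)
```

Restricting `panel` to units in one covariate stratum before calling `att_gt` would give a cohort-time-stratum cell. With few units per cell these raw contrasts become noisy, which is the sparsity problem the abstract's joint model is meant to stabilize.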

2025-05-23T21:38:32Z Siddhartha Chib Kenichi Shimizu http://arxiv.org/abs/2402.13472v3 Generalized linear models with spatial dependence and a functional covariate 2026-05-21T14:17:35Z

We extend generalized functional linear models under independence to a situation in which a functional covariate is related to a scalar response variable that exhibits spatial dependence, a complex yet prevalent phenomenon. For estimation, we apply basis expansion and truncation for dimension reduction of the covariate process followed by a composite likelihood estimating equation to handle the spatial dependency. We establish asymptotic results for the proposed model under a repeating lattice asymptotic context, allowing us to construct a confidence interval for the spatial dependence parameter and a confidence band for the regression parameter function. A binary conditionals model with functional covariates is presented as a concrete illustration and is used in simulation studies to verify the applicability of the asymptotic inferential results. We apply the proposed model to a problem in which the objective is to relate annual corn yield in counties of states in the Midwestern United States to daily maximum temperatures from April to September in those same geographic regions. The extension to an expanding lattice context is further discussed in the supplement.

2024-02-21T02:02:11Z Sooran Kim Mark S. Kaiser Xiongtao Dai http://arxiv.org/abs/2508.12085v2 Unified Conformalized Multiple Testing with Full Data Efficiency 2026-05-21T12:23:48Z

Conformalized multiple testing offers a model-free way to control predictive uncertainty in decision-making. Existing methods typically use only part of the available data to build score functions tailored to specific settings. We propose a unified framework that puts data utilisation at the centre: it uses all available data (null, alternative, and unlabelled) to construct scores and calibrate p-values through a full permutation strategy. This unified use of all available data significantly improves power by enhancing non-conformity score quality and maximising calibration set size while rigorously controlling the false discovery rate. Crucially, our framework provides a systematic design principle for conformal testing and enables automatic selection of the best conformal procedure among candidates without extra data splitting. Extensive numerical experiments demonstrate that our enhanced methods deliver superior efficiency and adaptability across diverse scenarios.
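
As background for the abstract above, the usual baseline is split-conformal p-values followed by the Benjamini-Hochberg step-up rule. The sketch below shows only that baseline, with calibration on null data alone; it is not the full permutation strategy the authors propose. Function names and the toy scores are hypothetical, and a larger score is assumed to mean "less null-like".

```python
def conformal_p_values(calib_scores, test_scores):
    """Split-conformal p-value: rank of each test score among null calibration scores.

    Assumes a larger score is more non-conforming (less consistent with the null).
    """
    n = len(calib_scores)
    return [(1 + sum(c >= s for c in calib_scores)) / (n + 1) for s in test_scores]


def benjamini_hochberg(p_values, alpha=0.1):
    """Return the sorted indices rejected by the BH step-up rule at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under the BH line.
        if p_values[i] <= alpha * rank / m:
            n_reject = rank
    return sorted(order[:n_reject])


# Made-up non-conformity scores: nine null calibration points, four test points.
calib = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
test = [0.05, 0.95, 0.99, 0.45]

p_vals = conformal_p_values(calib, test)
rejected = benjamini_hochberg(p_vals, alpha=0.25)
```

The smallest attainable p-value here is 1/(n+1), so the calibration set size limits how many discoveries are possible. This is one reason enlarging the calibration set, as the abstract describes, can improve power.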

2025-08-16T15:45:29Z Yuyang Huo Xiaoyang Wu Changliang Zou Haojie Ren http://arxiv.org/abs/2605.22354v1 From Volterra Series to Kunchenko Stochastic Polynomials: Half a Century of Non-Gaussian Estimation Methodology 2026-05-21T11:42:32Z

This paper reconstructs the half-century evolution of the scientific school founded by Yuriy P. Kunchenko (1939--2006) as the development of a semiparametric methodology for non-Gaussian estimation. Starting with Kunchenko's 1972/1973 dissertation applying Volterra series to estimate parameters of random processes, the trajectory is followed through 2006--2026. Kunchenko stochastic polynomials are presented as a coherent family of moment-cumulant procedures: the polynomial maximization method (PMM) for parameter estimation, polynomial criteria for hypothesis testing, and decomposition in spaces with a generating element. The paper details the school's structure: a verified genealogy of 15 defended dissertations, collaborations in Poland, Slovakia, and Germany, and the R package EstemPMM. A recent 2026 paper on Volterra-based signal processing is analyzed, showing how Kunchenko's nonlinear formulation reappears in applied radio engineering. We build a formal bridge between finite Volterra models and generalized Kunchenko polynomials, while separating the MMSE/L2 criterion from PMM: the former is a covariance projection for kernel adaptation, whereas PMM is a parameter-dependent moment procedure. PMM efficiency claims are stated conditionally: gains require that moments exist, the centered correlant matrix is nondegenerate, and the variance reduction coefficient is below one. The concluding research program operationalizes the historical reconstruction into testable statistical and signal-processing tasks.

2026-05-21T11:42:32Z Bilingual submission: English followed by Ukrainian translation Serhii Zabolotnii http://arxiv.org/abs/2605.22352v1 Spatiotemporal dynamics and ecological risk factors of highly pathogenic avian influenza A(H5N1) in Canadian wildlife: A One Health surveillance analysis 2026-05-21T11:41:20Z

Highly pathogenic avian influenza A(H5N1) has expanded geographically and ecologically, affecting wild birds, mammalian wildlife, domestic animals, and humans. Wildlife surveillance provides critical early warning for One Health preparedness, yet national-scale analyses integrating host ecology, spatial patterns, seasonality, viral lineage, and risk factors remain limited. This study analysed Canadian wildlife HPAI A(H5N1) surveillance records from 2022 to 2026 to characterise spatiotemporal dynamics and identify factors associated with detection counts. A retrospective analysis of 2,657 detections across 13 provinces and territories was conducted using descriptive epidemiology, spatial clustering methods, and Negative Binomial mixed models. Detections were predominantly avian, with waterfowl and raptors as the major host groups, while mammals accounted for a smaller but epidemiologically important proportion. Detection burden was highest in 2022, with increased activity in autumn and spring. Ontario, Alberta, and British Columbia were identified as major hotspots, with evidence of local clustering in parts of the Prairie region. Reassortant Eurasian-North American lineages dominated detections and were strongly associated with higher detection counts. Modelling results identified year, season, and lineage as key predictors. These findings support risk-based One Health surveillance prioritising high-burden regions, migration-associated periods, key avian host groups, reassortant viral lineages, and continued monitoring of mammalian wildlife.

2026-05-21T11:41:20Z Hammed Olawale Fatoyinbo Hoyeon Jeong http://arxiv.org/abs/2605.22301v1 Chained Markov melding using divide and conquer sequential Monte Carlo 2026-05-21T10:47:41Z

Specifying a full Bayesian model that integrates multiple data sources can be challenging. One natural approach is to specify each individual model separately and join them afterwards. This is the approach adopted in Markov melding. However, when adjacent submodels share common quantities, as in chained Markov melding, posterior inference can be challenging for existing MCMC-based approaches. In this paper, we propose a new multi-stage sampler for chained Markov models involving an arbitrary number of submodels. The proposed sampler adopts a divide-and-conquer sequential Monte Carlo approach for the tree-structured model that fits naturally with the structure of chained Markov melding. The resulting multi-stage sampler provides a flexible alternative for sampling from complex joint models, as its separate sampling scheme for different submodels avoids the need for directly sampling from the full model. We demonstrate applications of the sampler through two examples. The first is a toy example involving 11 submodels of various types. The second example considers an ecologically integrated population model that combines multiple datasets to estimate immigration and reproduction rates.

2026-05-21T10:47:41Z Yixuan Liu Robert J. B. Goudie http://arxiv.org/abs/2605.22253v1 Bayesian Nonparametrics: Principles and Practice 2026-05-21T09:59:21Z

This extended preface [to the book `Bayesian Nonparametrics', Cambridge University Press, 2010, by NL Hjort, CC Holmes, P Mueller, SG Walker] is meant to explain why you are right to be curious about Bayesian nonparametrics -- why you may actually need it and how you can manage to understand it and use it. The preface also serves as an introductory chapter, giving an overview of the aims and contents of the book. We also explain the background for how the book came into existence, delve briefly into the history of the still relatively young field of Bayesian nonparametrics, and offer some concluding remarks pertaining to various challenges and likely future developments of the area.

2026-05-21T09:59:21Z 16 pages, no figures. This is the authors' extended preface to the book Bayesian Nonparametrics, Cambridge University Press, 2010, where it was published in modified form; it sketches the history of Bayesian Nonparametrics and points to developments and application domains Nils Lid Hjort Chris Holmes Peter Mueller Stephen G. Walker http://arxiv.org/abs/2605.22110v1 Two-stage Ensemble Clustering of Functional Data Using Random Projections 2026-05-21T07:43:03Z

We propose a computationally simple framework for clustering functional data based on Gaussian-process-generated random projections. In this approach, each curve is first projected onto a large collection of independent Gaussian process realizations. The resulting high-dimensional representations are clustered using the Mean Absolute Difference of Distances (MADD), a dissimilarity measure well suited for high-dimensional settings. A population-level analysis of this dissimilarity provides insight into how random projections help capture distributional differences between functional populations. We introduce a second stage of clustering to additionally leverage data-driven projection directions. Thus, in Stage I, an initial clustering is obtained using a set of prespecified projection families. In Stage II, this partition is refined by constructing Gaussian random projections based on an estimated covariance operator that uses the first-stage cluster labels. Finally, a normalized cost function is used to select the optimal clustering among candidate solutions. The proposed clustering algorithm is broadly applicable to diverse functional data regimes including irregular and partially observed data. Through extensive simulations and real-data applications, we show that the proposed method achieves a high degree of accuracy and outperforms many of the state-of-the-art methods across a wide range of functional data settings.
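
The projection-plus-MADD step can be sketched roughly as follows. The sketch makes several assumptions: Brownian motion stands in for the Gaussian process family, inner products are Riemann sums on an equispaced grid, and MADD is taken as the average over third points of the absolute difference in dimension-scaled Euclidean distances. It covers only the dissimilarity, not the clustering, Stage II refinement, or cost-based selection. All function names and the toy curves are hypothetical.

```python
import math
import random


def brownian_path(n_grid, rng):
    """One Brownian-motion realization on an equispaced grid of (0, 1]."""
    dt = 1.0 / n_grid
    path, value = [], 0.0
    for _ in range(n_grid):
        value += rng.gauss(0.0, math.sqrt(dt))
        path.append(value)
    return path


def project(curve, directions):
    """Riemann-sum inner products of one discretized curve with each direction."""
    dt = 1.0 / len(curve)
    return [sum(c * g for c, g in zip(curve, d)) * dt for d in directions]


def scaled_distance(u, v):
    """Euclidean distance divided by the square root of the dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


def madd(points, i, j):
    """Mean absolute difference of distances from points i and j to every other point."""
    others = [k for k in range(len(points)) if k not in (i, j)]
    return sum(
        abs(scaled_distance(points[i], points[k]) - scaled_distance(points[j], points[k]))
        for k in others
    ) / len(others)


rng = random.Random(1)
n_grid, n_directions = 50, 100
grid = [(k + 1) / n_grid for k in range(n_grid)]

# Toy data: curves 0-2 follow sin(2*pi*t); curves 3-5 are shifted up by 2.
curves = [
    [math.sin(2 * math.pi * t) + shift + rng.gauss(0.0, 0.1) for t in grid]
    for shift in (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
]
directions = [brownian_path(n_grid, rng) for _ in range(n_directions)]
features = [project(curve, directions) for curve in curves]

within = madd(features, 0, 1)    # two curves from the same group
between = madd(features, 0, 3)   # curves from different groups
```

A matrix of `madd` values over all pairs could then be passed to any dissimilarity-based clustering routine.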

2026-05-21T07:43:03Z 32 pages, 6 figures, 7 tables Sourav Chakrabarty Anirvan Chakraborty Shyamal K. De http://arxiv.org/abs/2605.22038v1 A Mixed Self-Exciting Process to Model Epileptic Seizures 2026-05-21T06:21:35Z

Epilepsy is a neurological disorder characterized by recurrent seizures affecting more than 70 million people worldwide. Often, an individual with epilepsy is more likely to experience subsequent seizures following an initial seizure, a process we call seizure clustering. Motivated by seizure diary data collected over three years from 407 individuals newly diagnosed with focal epilepsy in the Human Epilepsy Project (HEP), we propose a Bayesian mixed Hawkes process model that addresses seizure clustering and heterogeneity between individuals. In the Hawkes process, the intensity is accelerated each time an event occurs, through the composition of background and excitation intensity functions. The proposed model incorporates a Weibull baseline intensity to model a trend in background seizure rates over time, while the excitation process accounts for seizure clustering within individuals. We model heterogeneity among individuals by including covariates and random effects in both the background and excitation intensities. In the HEP study, the average time between primary and secondary seizures within an individual is 1.57 (95\% CrI: 1.43, 1.70) days, with an average of 2.20 (1.96, 2.47) seizures per cluster. We demonstrate that omitting random effects in the presence of heterogeneity leads to underestimation of the background intensity and overestimation of excitation rates.
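
The intensity structure described above can be illustrated with a small sketch. It assumes an exponential excitation kernel, which the abstract does not specify, and it omits the covariates and random effects. The parameter names (`shape`, `scale`, `alpha`, `beta`) and the diary times are made up for illustration.

```python
import math


def weibull_baseline(t, shape, scale):
    """Weibull hazard used as a time-varying background seizure rate (t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def hawkes_intensity(t, event_times, shape, scale, alpha, beta):
    """Conditional intensity: background(t) plus a decaying contribution per past event.

    Assumed kernel: alpha * beta * exp(-beta * (t - t_i)), which integrates to alpha,
    so alpha is the expected number of seizures directly triggered by one seizure.
    """
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - t_i)) for t_i in event_times if t_i < t
    )
    return weibull_baseline(t, shape, scale) + excitation


# Hypothetical seizure diary (days since enrollment).
seizure_days = [10.0, 11.5, 40.0]
params = dict(shape=1.0, scale=30.0, alpha=0.5, beta=0.6)

quiet = hawkes_intensity(9.0, seizure_days, **params)    # before any seizure
after = hawkes_intensity(12.0, seizure_days, **params)   # shortly after two seizures
```

With `shape=1.0` the background rate is constant at `1/scale`; values below or above 1 give a decreasing or increasing background trend.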

2026-05-21T06:21:35Z 35 pages, 5 figures, 33 pages supplementary material Karen Kanaster Giovani L. Silva Peter Mueller Jacob Pellinen Elizabeth Juarez-Colunga http://arxiv.org/abs/2605.22025v1 Testing for Serial Independence via Auto Hilbert-Schmidt Independence Criterion 2026-05-21T05:50:26Z

We develop a Hilbert--Schmidt independence criterion (HSIC)-based framework for testing serial independence in strictly stationary time series. The proposed auto Hilbert--Schmidt independence criterion (AutoHSIC) measures dependence between an observation and its lagged counterpart, providing a kernel-based approach to detecting nonlinear serial dependence. The empirical AutoHSIC statistic is a lagged U-statistic constructed from overlapping observations, and hence inherits temporal dependence even under the i.i.d. null. Its asymptotic analysis therefore differs from standard i.i.d. HSIC theory and must account for degeneracy under the null. We establish the limiting behaviour of the resulting single-lag and portmanteau tests under the null and under fixed alternatives. Since the limiting null distribution is non-pivotal, we develop a wild bootstrap procedure for critical value approximation and prove its asymptotic validity. The framework is further extended to residual-based model diagnostics, where parameter estimation affects the null distribution. Simulations and empirical applications illustrate its ability to detect nonlinear serial dependence in multivariate, functional and matrix time series.
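
To make the lagged statistic concrete, the sketch below computes a biased (V-statistic) HSIC estimate between a scalar series and its lagged copy using a Gaussian kernel. It is a simplification under stated assumptions: the abstract describes a U-statistic, and neither the portmanteau test nor the wild bootstrap is shown. The fixed bandwidth and all function names are hypothetical choices.

```python
import math
import random


def gaussian_gram(xs, bandwidth):
    """Gram matrix of the Gaussian kernel on scalar observations."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in xs] for a in xs]


def double_center(gram):
    """Return H K H with H = I - 11'/n (K symmetric, so row means equal column means)."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) HSIC estimate: trace(K H L H) / n^2."""
    n = len(xs)
    k_centered = double_center(gaussian_gram(xs, bandwidth))
    l_gram = gaussian_gram(ys, bandwidth)
    return sum(k_centered[i][j] * l_gram[i][j] for i in range(n) for j in range(n)) / n ** 2


def auto_hsic(series, lag=1, bandwidth=1.0):
    """HSIC between X_t and X_{t+lag}, built from overlapping pairs (lag >= 1)."""
    return hsic(series[:-lag], series[lag:], bandwidth)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(150)]   # i.i.d. series
ar1 = [0.0]
for _ in range(149):                                 # strongly dependent AR(1) series
    ar1.append(0.9 * ar1[-1] + rng.gauss(0.0, 1.0))

stat_noise = auto_hsic(noise)
stat_ar1 = auto_hsic(ar1)
```

The raw statistic is not a test by itself: as the abstract notes, the null distribution is non-pivotal, so a critical value would still have to come from a resampling scheme such as the wild bootstrap.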

2026-05-21T05:50:26Z Muyi Li Yuqing Xu Zhou Zhou http://arxiv.org/abs/2605.22004v1 Selecting Informative Conformal Prediction Sets with an Optimized FCR-Controlled Approach 2026-05-21T05:02:35Z

Conformal methods provide prediction sets for outcomes with confidence guarantees. We study their use in a selective inference setting, where inference is performed only when the prediction set is informative. The analyst may consider as informative, for example, cases with prediction sets that are sufficiently small, exclude null values, or satisfy other appropriate monotone constraints. Because inference is typically restricted to informative cases in practical applications, accounting for the resulting selection bias is crucial to maintaining false coverage rate (FCR) control. A general framework for constructing such informative conformal prediction sets while controlling the FCR on the selected sample was suggested in Gazin et al. (2025). In this work we focus on oracle-guided procedures. We derive the optimal decision policy under a suitable power objective in the oracle setting where the probability of belonging to each prediction set can be computed. In practice, of course, only estimated probabilities are available. We therefore introduce a calibration procedure that adjusts the oracle policy to maintain finite sample FCR control. We show that this approach can achieve substantially higher power than available alternatives. We demonstrate the effectiveness of our new methods for classification outcomes on both real and simulated data.

2026-05-21T05:02:35Z Israela Solomon Etienne Roquain Saharon Rosset Ruth Heller http://arxiv.org/abs/2606.02589v1 Rashomon-Seeded Annealing for Robust Bayesian Inference in Factorial Designs 2026-05-21T05:01:39Z

Integrating over model uncertainty in factorial designs via Bayesian model averaging is hindered by the combinatorial explosion of interpretable interaction effects, often yielding a multimodal posterior, where standard Markov chain Monte Carlo algorithms encounter significant convergence issues. We propose a general computational framework that repurposes Rashomon sets, collections of high-performing models traditionally valued for prediction and interpretability, as a strategic "warm start" for estimating the full posterior. Our method, Rashomon-seeded annealing, initializes annealed importance sampling (AIS) by anchoring the starting density within these pre-identified, high-evidence regions while preserving global support over the entire model space. Rather than restricting inference to the Rashomon set and understating uncertainty, the AIS correction restores full posterior inference, turning the Rashomon certificate from an inferential truncation into a proposal mechanism. We demonstrate this approach using Rashomon Partition Sets (RPS) as a rigorous, certified seed constructor for factorial designs. The resulting algorithm yields consistent self-normalized posterior summaries, such as model-averaged cell means, credible intervals, and uncertainty summaries without exhaustive enumeration of the complete model space. This bridges the gap between high-evidence model discovery and rigorous Bayesian inference, and outlines a general strategy in which any high-posterior seed set can provide computational leverage for AIS-based model averaging.

2026-05-21T05:01:39Z 28 pages, 8 figures Yiyang Fan Soumyakanti Pan Tyler H. McCormick