http://arxiv.org/abs/2110.10296v2 A Bayesian Approach for the Variance of Fine Stratification 2026-03-05T21:55:07Z

Fine stratification is a popular design because it permits stratification to be carried out to the fullest possible extent. Examples include the Current Population Survey and the National Crime Victimization Survey, both conducted by the U.S. Census Bureau, and the National Survey of Family Growth, conducted by the University of Michigan's Institute for Social Research. Clearly, fine stratification surveys have proved useful in many applications, as the point estimator is unbiased and efficient. A common practice for estimating the variance in this context is to collapse adjacent strata into pseudo-strata and then estimate the variance, but the resulting variance estimator is not design-unbiased, and its bias increases as the population means of the pseudo-strata vary more. Additionally, the estimator may suffer from a large mean squared error (MSE). In this paper, we propose a hierarchical Bayesian estimator for the variance of collapsed strata and compare it with a nonparametric Bayes variance estimator. We also make comparisons with a kernel-based variance estimator recently proposed by Breidt et al. (2016). We show that our proposed estimator outperforms the alternatives in the literature in that it has a smaller frequentist MSE and bias. We verify this through multiple simulation studies and analyses of data from the 2007-8 National Health and Nutrition Examination Survey and the 1998 Survey of Mental Health Organizations.

2021-10-19T22:29:33Z Please see the final version (arXiv:2603.03569) Sepideh Mosaferi http://arxiv.org/abs/2603.06715v1 Understanding and Managing Frogeye Leaf Spot through Network-Based Modeling in Soybean 2026-03-05T21:11:40Z

Frogeye Leaf Spot (FLS), caused by Cercospora sojina, poses a significant threat to soybean production, with yield losses of 30-60%. Traditional mass-action models assume homogeneous mixing, which rarely holds in real fields and limits their ability to inform FLS management. To address this, we developed a network-based model that incorporates real-field structure to improve FLS management in soybeans. Using approximate Bayesian computation, we estimated key epidemiological parameters and found that infection origin can shift the balance between transmission routes. Data analyses indicated that tillage and non-tillage plots did not differ significantly in fungal spread, decay, or disease severity. Finally, we show that early, targeted roguing is more effective than delayed or random removal. Together, these findings offer science-based guidance for FLS management and highlight the value of network-based models to inform agricultural disease control.

2026-03-05T21:11:40Z 22 pages, 7 figures, 3 tables Chinthaka Weerarathna Thien-Minh Le Jin Wang http://arxiv.org/abs/2512.12988v2 A Bayesian approach to learning mixtures of nonparametric components 2026-03-04T22:17:28Z

Mixture models are widely used in modeling heterogeneous data populations. A standard approach to mixture modeling assumes that each mixture component takes a parametric kernel form. In many applications, making parametric assumptions on the latent subpopulation distributions may be unrealistic, which motivates the need for nonparametric modeling of the mixture components themselves. In this paper, we study finite mixtures with nonparametric mixture components, using a Bayesian nonparametric modeling approach. In particular, it is assumed that the data population is generated according to a finite mixture of latent component distributions, where each component is endowed with a Bayesian nonparametric prior such as the Dirichlet process mixture. We present conditions under which each individual mixture component's distribution can be identified, and establish posterior contraction behavior for the data population's density, as well as for the densities of the latent mixture components. We develop an efficient MCMC algorithm for posterior inference and demonstrate via simulation studies and real-world data illustrations that it is possible to efficiently learn complex forms of probability distribution for the latent subpopulations. In theory, the posterior contraction rate of the component densities is nearly polynomial, which is a significant improvement over the logarithmic convergence rates of estimating mixing measures via deconvolution.

2025-12-15T05:27:01Z 80 pages, 9 figures Yilei Zhang Yun Wei Aritra Guha XuanLong Nguyen http://arxiv.org/abs/2603.04632v1 Least trimmed squares regression with missing values and cellwise outliers 2026-03-04T21:46:37Z

Regression is the workhorse of statistics, and is often faced with real data that contain outliers. When these are casewise outliers, that is, cases that are entirely wrong or belong to a different population, the issue can be remedied by existing casewise robust regression methods. It is another matter when cellwise outliers occur, that is, suspicious individual entries in the data matrix containing the regressors and the response. We propose a new regression method that is robust to both casewise and cellwise outliers, and handles missing values as well. Its construction allows for skewed distributions. We show that it obeys the first breakdown result for cellwise robust regression. It is also the first such method that is geared to making robust out-of-sample predictions. Its performance is studied by simulation, and it is illustrated on a substantial real dataset.

2026-03-04T21:46:37Z Jakob Raymaekers Peter J. Rousseeuw http://arxiv.org/abs/2601.20888v2 Latent-IMH: Efficient Bayesian Inference for Inverse Problems with Approximate Operators 2026-03-04T20:28:32Z

We study sampling from posterior distributions in Bayesian linear inverse problems where $A$, the parameter-to-observable operator, is computationally expensive. In many applications, $A$ can be factored in a manner that facilitates the construction of a cost-effective approximation $\tilde{A}$. In this framework, we introduce Latent-IMH, a sampling method based on the Metropolis-Hastings independence (IMH) sampler. Latent-IMH first generates intermediate latent variables using the approximate $\tilde{A}$, and then refines them using the exact $A$. Its primary benefit is that it shifts the computational cost to an offline phase. We theoretically analyze the performance of Latent-IMH using KL divergence and mixing time bounds. Using numerical experiments on several model problems, we show that, under reasonable assumptions, it outperforms state-of-the-art methods such as the No-U-Turn sampler (NUTS) in computational efficiency. In some cases, Latent-IMH can be orders of magnitude faster than existing schemes.
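
For orientation, the plain independence Metropolis-Hastings sampler that Latent-IMH builds on can be sketched as follows. This is only the generic building block, not the Latent-IMH method itself; the function name `independence_mh`, the standard normal target, and the wider normal proposal are assumptions chosen for illustration.

```python
import math
import random

def independence_mh(log_target, sample_proposal, log_proposal, n_samples, rng):
    """Generic independence Metropolis-Hastings chain (illustrative sketch).

    Proposals are drawn independently of the current state and accepted
    with probability min(1, w(y) / w(x)), where w = target / proposal.
    Both log densities may omit their normalizing constants.
    """
    x = sample_proposal(rng)
    log_w_x = log_target(x) - log_proposal(x)
    chain = []
    for _ in range(n_samples):
        y = sample_proposal(rng)
        log_w_y = log_target(y) - log_proposal(y)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_w_y - log_w_x:
            x, log_w_x = y, log_w_y
        chain.append(x)
    return chain

# Hypothetical toy problem: N(0, 1) target, N(0, 2^2) proposal.
rng = random.Random(0)
chain = independence_mh(
    log_target=lambda x: -0.5 * x * x,
    sample_proposal=lambda r: r.gauss(0.0, 2.0),
    log_proposal=lambda x: -0.5 * (x / 2.0) ** 2,
    n_samples=20000,
    rng=rng,
)
chain_mean = sum(chain) / len(chain)
chain_var = sum((x - chain_mean) ** 2 for x in chain) / len(chain)
```

The proposal here is deliberately wider than the target so that the importance weights stay bounded; a proposal with lighter tails than the target would make the chain mix poorly.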

2026-01-28T03:44:01Z Youguang Chen George Biros http://arxiv.org/abs/2603.04306v1 Theory Discovery in Social Networks: Automating ERGM Specification with Large Language Models 2026-03-04T17:23:02Z

Understanding how social networks form, whether through reciprocity, shared attributes, or triadic closure, is central to computational social science. Exponential Random Graph Models (ERGMs) offer a principled framework for testing such formation theories, but translating qualitative social hypotheses into stable statistical specifications remains a significant barrier, requiring expertise in both network theory and model estimation. We present Forge (Formation-Oriented Reasoning with Guarded ERGMs), a framework that uses large language models to automate this translation. Given a network and an informal description of the social context, Forge proposes candidate formation mechanisms, validates them against feasibility and stability constraints, and iteratively refines specifications using goodness-of-fit diagnostics. Evaluation across twelve benchmark networks spanning schools, organizations, and online communication shows that Forge converges in 10 of 12 cases, and conditional on convergence it achieves the best likelihood-based fit in 9 of 10 while meeting adequacy thresholds. By combining LLM-based proposals with statistical guardrails, Forge reduces the manual effort required for ERGM specification.

2026-03-04T17:23:02Z Yidan Sun Mayank Kejriwal http://arxiv.org/abs/2511.07340v2 Smoothing Out Sticking Points: Sampling from Discrete-Continuous Mixtures with Dynamical Monte Carlo by Mapping Discrete Mass into a Latent Universe 2026-03-04T16:58:44Z

Combining a continuous "slab" density with discrete "spike" mass at zero, spike-and-slab priors provide important tools for inducing sparsity and carrying out variable selection in Bayesian models. However, the presence of discrete mass makes posterior inference challenging. "Sticky" extensions to piecewise-deterministic Markov process samplers have shown promising performance, where sampling from the spike is achieved by the process sticking there for an exponentially distributed duration. As it turns out, the sampler remains valid when the exponential sticking time is replaced with its expectation. We justify this by mapping the spike to a continuous density over a latent universe, allowing the sampler to be reinterpreted as traversing this universe while being stuck in the original space. This perspective opens up an array of possibilities to carry out posterior computation under spike-and-slab type priors. Notably, it enables us to construct sticky samplers using other dynamics-based paradigms such as Hamiltonian Monte Carlo; in fact, the original sticky process can be established as a partial position-momentum refreshment limit of our Hamiltonian sticky sampler. Our theoretical and empirical findings suggest these alternatives to be at least as efficient as the original sticky approach.

2025-11-10T17:40:12Z Andrew Chin Akihiko Nishimura http://arxiv.org/abs/2603.04003v1 Efficient Bayesian Estimation of Dynamic Structural Equation Models via State Space Marginalization 2026-03-04T12:47:12Z

Dynamic structural equation models (DSEMs) combine time-series modeling of within-person processes with hierarchical modeling of between-person differences and differences between timepoints, and have become very popular for the analysis of intensive longitudinal data in the social sciences. An important computational bottleneck has, however, still not been resolved: whenever the underlying process is assumed to be latent and measured by one or more indicators per timepoint, currently published algorithms rely on inefficient brute-force Markov chain Monte Carlo sampling which scales poorly as the number of timepoints and participants increases and results in highly correlated samples. The main result of this paper shows that the within-level part of any DSEM can be reformulated as a linear Gaussian state space model. Consequently, the latent states can be analytically marginalized using a Kalman filter, allowing for highly efficient estimation via Hamiltonian Monte Carlo. This makes estimation of DSEMs computationally tractable for much larger datasets -- both in terms of timepoints and participants -- than what has been previously possible. We demonstrate the proposed algorithm in several simulation experiments, showing it can be orders of magnitude more efficient than standard Metropolis-within-Gibbs approaches.
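
As a rough illustration of the marginalization idea, the sketch below computes the marginal log-likelihood of a scalar AR(1)-plus-noise model with a Kalman filter, so the latent states never need to be sampled. The toy model, the function name `kalman_loglik`, and the parameter names `phi`, `q`, and `r` are assumptions made for this example; the DSEM reformulation described in the abstract is more general.

```python
import math

def kalman_loglik(ys, phi, q, r):
    """Marginal log-likelihood of y_1..y_T with latent states integrated out.

    Assumed toy model (not the paper's DSEM):
        x_t = phi * x_{t-1} + w_t,  w_t ~ N(0, q)
        y_t = x_t + v_t,            v_t ~ N(0, r)
    with |phi| < 1 and x_1 drawn from the stationary distribution.
    """
    if abs(phi) >= 1.0:
        raise ValueError("phi must satisfy |phi| < 1 for a stationary start")
    mean = 0.0                       # predicted state mean
    var = q / (1.0 - phi * phi)      # predicted state variance
    loglik = 0.0
    for y in ys:
        innovation = y - mean
        s = var + r                  # predictive variance of y_t
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + innovation ** 2 / s)
        gain = var / s               # Kalman gain
        filt_mean = mean + gain * innovation
        filt_var = (1.0 - gain) * var
        mean = phi * filt_mean       # one-step-ahead prediction
        var = phi * phi * filt_var + q
    return loglik
```

In a Hamiltonian Monte Carlo setting, a function of this kind would be evaluated inside the log-posterior so that only `phi`, `q`, and `r` are sampled, which is the general reason marginalization reduces the dimension of the sampling problem.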

2026-03-04T12:47:12Z Øystein Sørensen http://arxiv.org/abs/2504.11279v2 Simulation-based inference for stochastic nonlinear mixed-effects models with applications in systems biology 2026-03-04T12:41:46Z

The analysis of data from multiple experiments, such as observations of several individuals, is commonly approached using mixed-effects models, which account for variation between individuals through hierarchical representations. This makes mixed-effects models widely applied in fields such as biology, pharmacokinetics, and sociology. In this work, we propose a novel methodology for scalable Bayesian inference in hierarchical mixed-effects models. Our framework first constructs amortized approximations of the likelihood and the posterior distribution, which are then rapidly refined for each individual dataset, to ultimately approximate the parameters' posterior across many individuals. The framework is easily trainable, as it uses mixtures of experts but without neural networks, leading to parsimonious yet expressive surrogate models of the likelihood and the posterior. We demonstrate the effectiveness of our methodology using challenging stochastic models, such as mixed-effects stochastic differential equations emerging in systems biology-driven problems. However, the approach is broadly applicable and can accommodate both stochastic and deterministic models. We show that our approach can seamlessly handle inference for many parameters. Additionally, we applied our method to a real-data case study of mRNA transfection. When compared to exact pseudomarginal Bayesian inference, our approach proved to be both fast and competitive in terms of statistical accuracy.

2025-04-15T15:18:58Z 42 pages, 23 figures Stat. Comput. 36, 99 (2026) Henrik Häggström Sebastian Persson Marija Cvijovic Umberto Picchini 10.1007/s11222-026-10850-8 http://arxiv.org/abs/2603.03997v1 Bandwidth Selection for Spatial HAC Standard Errors 2026-03-04T12:40:43Z

Spatial autocorrelation in regression models can lead to downward biased standard errors and thus incorrect inference. The most common correction in applied economics is the spatial heteroskedasticity and autocorrelation consistent (HAC) standard error estimator introduced by Conley (1999). A critical input is the kernel bandwidth: the distance within which residuals are allowed to be correlated. However, this is still an unresolved problem and there is no formal guidance in the literature. In this paper, I first document that the relationship between the bandwidth and the magnitude of spatial HAC standard errors is inverse-U shaped. This implies that both too narrow and too wide bandwidths lead to underestimated standard errors, contradicting the conventional wisdom that wider bandwidths yield more conservative inference. I then propose a simple, non-parametric, data-driven bandwidth selector based on the empirical covariogram of regression residuals. In extensive Monte Carlo experiments calibrated to empirically relevant spatial correlation structures across the contiguous United States, I show that the proposed method controls the false positive rate at or near the nominal 5% level across a wide range of spatial correlation intensities and sample configurations. I compare six kernel functions and find that the Bartlett and Epanechnikov kernels deliver the best size control. An empirical application using U.S. county-level data illustrates the practical relevance of the method. The R package SpatialInference implements the proposed bandwidth selection method.
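
To make the role of the bandwidth concrete, here is a minimal sketch of a Conley-type spatial HAC variance for the slope of a single-regressor, no-intercept OLS fit with a Bartlett kernel. The single-regressor simplification, the planar Euclidean distance, and the names `bartlett` and `conley_variance` are assumptions for illustration; this is not the API of the SpatialInference package, and it does not implement the proposed bandwidth selector.

```python
import math

def bartlett(u):
    """Bartlett (triangular) kernel: weight falls linearly to zero at u = 1."""
    return max(0.0, 1.0 - u)

def conley_variance(xs, residuals, coords, bandwidth):
    """Spatial HAC variance of the slope in y = beta * x + e (no intercept).

    Residual products of units i and j enter with weight K(d_ij / bandwidth),
    so only pairs closer than the bandwidth are allowed to be correlated.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    sxx = sum(x * x for x in xs)
    meat = 0.0
    n = len(xs)
    for i in range(n):
        for j in range(n):
            d = math.dist(coords[i], coords[j])   # Euclidean distance (assumed)
            w = bartlett(d / bandwidth)
            meat += w * xs[i] * residuals[i] * xs[j] * residuals[j]
    return meat / (sxx * sxx)

# Hypothetical toy data: three units on a line, one unit of distance apart.
xs = [1.0, 2.0, 3.0]
residuals = [0.5, -1.0, 0.5]
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
narrow = conley_variance(xs, residuals, coords, bandwidth=0.1)
wide = conley_variance(xs, residuals, coords, bandwidth=2.0)
```

With a bandwidth smaller than the closest pair of units, only the `i == j` terms survive and the result coincides with the heteroskedasticity-robust (HC0) variance; widening the bandwidth brings in cross-products between neighbours.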

2026-03-04T12:40:43Z Alexander Lehner http://arxiv.org/abs/2603.03845v1 Steady State Distribution and Stability Analysis of Random Differential Equations with Uncertainties and Superpositions: Application to a Predator Prey Model 2026-03-04T08:50:28Z

We present a computational framework to investigate steady state distributions and perform stability analysis for random ordinary differential equations driven by parameter uncertainty. Using the nonlinear Rosenzweig-MacArthur predator-prey model as a case study, we characterize the non-trivial equilibrium steady state of the system and investigate its complex distribution when the parameter probability densities are multi-modal mixture models with partially overlapping or separated components. Consequently, this application includes both uncertainties and superpositions of the system parameters. In addition, we present the stability analysis of steady states based on the eigenvalue distribution of the system's Jacobian matrix in this stochastic regime. The steady state posterior density and stability metrics are computed with a recently published Monte Carlo based numerical scheme specifically designed for random equation systems (Hoegele, 2026). In particular, we demonstrate the simplicity of this stochastic extension of dynamic systems combined with a broadly applicable computational approach. Numerical experiments show the emergence of multi-modal steady state distributions of the predator-prey model, and we calculate their stability regions, illustrating the method's applicability to uncertainty quantification in dynamical systems.
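
The idea of pushing a multi-modal parameter density through to a steady state distribution can be sketched with plain Monte Carlo. The example below assumes a common textbook form of the model with a Holling type II functional response; the parameterization, all numerical values, and the function names are hypothetical and are not taken from the paper or from the cited numerical scheme.

```python
import random

def rm_rhs(n, p, r, k, a, h, e, m):
    """Right-hand side of an assumed Rosenzweig-MacArthur form.

    dn/dt = r*n*(1 - n/k) - f(n)*p,  dp/dt = e*f(n)*p - m*p,
    with f(n) = a*n / (1 + a*h*n).
    """
    feeding = a * n / (1.0 + a * h * n)
    return r * n * (1.0 - n / k) - feeding * p, e * feeding * p - m * p

def rm_equilibrium(r, k, a, h, e, m):
    """Non-trivial (coexistence) equilibrium of the system above."""
    n_star = m / (a * (e - m * h))
    p_star = (r / a) * (1.0 - n_star / k) * (1.0 + a * h * n_star)
    return n_star, p_star

def sample_equilibria(n_draws, rng):
    """Draw predator mortality m from a two-component mixture and map each
    draw to its equilibrium. All parameter values are hypothetical."""
    draws = []
    for _ in range(n_draws):
        centre = 0.3 if rng.random() < 0.5 else 0.6
        # Clamp so that the equilibrium stays positive and below k.
        m = min(0.8, max(0.05, rng.gauss(centre, 0.03)))
        draws.append(rm_equilibrium(r=1.0, k=10.0, a=1.0, h=0.5, e=0.5, m=m))
    return draws

equilibria = sample_equilibria(200, random.Random(1))
```

Because the equilibrium prey density is a nonlinear function of `m`, the two mixture components map to two well-separated clusters of steady states, which is a simple instance of the multi-modal steady state distributions the abstract refers to. Stability would additionally require the eigenvalues of the Jacobian at each draw, which this sketch omits.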

2026-03-04T08:50:28Z Wolfgang Hoegele http://arxiv.org/abs/2603.03569v1 Bayesian Estimation of Variance under Fine Stratification via Mean-Variance Smoothing 2026-03-03T22:56:27Z

A fine stratification survey is useful in many applications because its point estimator is unbiased, but the variance estimator under the design cannot be easily obtained, particularly when the sample size per stratum is as small as one unit. One common practice to overcome this difficulty is to collapse strata in pairs to create pseudo-strata and then estimate the variance. The resulting variance estimator is not design-unbiased, and its positive bias increases as the population means of the paired pseudo-strata vary more. The resulting confidence intervals can be unnecessarily wide. In this paper, we propose a new Bayesian variance estimator that, unlike previous methods in the literature, does not rely on collapsing strata. We employ the penalized spline method to smooth the mean and variance jointly in a nonparametric way. Furthermore, we make comparisons with the earlier work of Breidt et al. (2016). Through multiple simulation studies and an illustration using data from the National Survey of Family Growth (NSFG), we demonstrate the favorable performance of our methodology.
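
The collapsing practice mentioned above can be illustrated with the classical collapsed-strata variance estimator for an estimated population total. The sketch assumes one sampled unit per stratum, strata already ordered so that neighbours are similar, and equal-sized groups; the function name and the input numbers are hypothetical, and this is the baseline being criticized, not the proposed Bayesian estimator.

```python
def collapsed_strata_variance(stratum_totals, group_size=2):
    """Collapsed-strata variance estimate for an estimated population total.

    stratum_totals: estimated totals, one per stratum (one sampled unit each),
    ordered so that adjacent strata are similar. Adjacent strata are grouped
    into pseudo-strata of `group_size`, and the within-group spread of the
    stratum totals is used as a stand-in for the sampling variance.
    """
    if group_size < 2 or len(stratum_totals) % group_size != 0:
        raise ValueError("strata must split evenly into groups of size >= 2")
    variance = 0.0
    for start in range(0, len(stratum_totals), group_size):
        group = stratum_totals[start:start + group_size]
        group_mean = sum(group) / group_size
        sum_sq = sum((t - group_mean) ** 2 for t in group)
        variance += group_size / (group_size - 1) * sum_sq
    return variance

# Hypothetical estimated stratum totals, collapsed in pairs.
totals = [10.0, 12.0, 20.0, 26.0]
print(collapsed_strata_variance(totals))  # 40.0
```

For pairs, each group contributes the squared difference of its two stratum totals. Under independent sampling with unbiased stratum estimates, the expectation of that squared difference equals the sum of the two stratum variances plus the squared difference of the true stratum totals, which is where the positive bias comes from.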

2026-03-03T22:56:27Z arXiv admin note: text overlap with arXiv:2110.10296 Sepideh Mosaferi Shonosuke Sugasawa http://arxiv.org/abs/2603.03154v1 Extending the saemix package for R to fit non Gaussian outcomes 2026-03-03T16:51:06Z

Background and Objectives: Longitudinal data are increasingly collected in clinical trials to provide information on treatment action and disease evolution. The trajectory of continuous biomarkers such as target hormone concentrations or viral loads can then be modelled in relationship to the occurrence of events such as recovery or hospitalisation. Other studies may include repeated measurements of discrete pain scores, number of episodes (count) or occurrence of events (survival). Non-linear mixed-effect models (NLMEM) can handle individual differences in trajectories while modelling the underlying population evolution and are the natural choice for their analysis. The saemix package for R is one of the few open-source solutions and the most flexible. In this paper, we extend it to accommodate a variety of models for non-Gaussian data. Methods: The saemix package estimates parameters through the Stochastic Approximation Expectation-Maximisation (SAEM) algorithm. Within the package, non-Gaussian models are specified by their log-likelihood functions, affording maximal control over model formulation. We extend estimation algorithms as well as exploratory and diagnostic plots for non-Gaussian data. Bootstrap approaches were implemented to estimate parameter uncertainty. To evaluate the performance of saemix, we performed a simulation study based on the toenail dataset, containing repeated binary data from a randomised clinical trial. Results: saemix performed well in recovering the true parameter values in the simulation study, and was stable across different starting values for the parameters. An algorithm that jointly searches for the covariate and interindividual variability models was also implemented to build the covariate model and applied to categorical and survival-type data.

2026-03-03T16:51:06Z Main text: 24 pages, 6 figures, 6 tables Emmanuelle Comets Maud Delattre Belhal Karimi http://arxiv.org/abs/2603.03004v1 eTFCE: Exact Threshold-Free Cluster Enhancement via Fast Cluster Retrieval 2026-03-03T13:56:57Z

Threshold-free cluster enhancement (TFCE) is a popular method for cluster extent inference but is computationally intensive. Existing TFCE implementations often rely on a discretized approximation that introduces numerical errors. We also identified a long-standing scaling error in the FSL implementation of TFCE (version 6.0.7.19 and earlier). As an alternative implementation, we present eTFCE, an efficient framework that computes exact TFCE scores using an optimized cluster retrieval algorithm, which, though exact, reduces computation time by approximately 50% compared to standard approximated implementations. In addition, the proposed framework enables simultaneous computation of TFCE and generalized cluster statistics, formulated similarly to TFCE, within a single nonparametric run, with negligible additional computational cost. This, in turn, facilitates systematic method comparisons, and enables a more complete characterization of spatial activation patterns. As a result, eTFCE establishes a mathematically exact and computationally efficient framework for comprehensive and informative nonparametric inference in neuroimaging.
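
The discretized approximation mentioned above can be sketched in one dimension. The TFCE score of a location is the integral, over thresholds h up to its own height, of extent(h)^E * h^H, and the usual shortcut replaces the integral with a sum over a grid of thresholds. The sketch below assumes a 1D signal with adjacency between neighbouring indices and the commonly used exponents E = 0.5 and H = 2; the function name and step size are illustrative, and this is the approximate scheme, not the eTFCE algorithm.

```python
def tfce_1d(values, dh=0.1, extent_power=0.5, height_power=2.0):
    """Discretized TFCE on a 1D signal (illustrative sketch).

    For every threshold h = dh, 2*dh, ..., each supra-threshold run of
    neighbouring indices adds extent**E * h**H * dh to all of its members.
    """
    scores = [0.0] * len(values)
    peak = max(values, default=0.0)
    n_steps = int(peak / dh + 1e-9)
    tol = 1e-12  # guards against k * dh landing just above a data value
    for k in range(1, n_steps + 1):
        h = k * dh
        i = 0
        while i < len(values):
            if values[i] >= h - tol:
                j = i
                while j < len(values) and values[j] >= h - tol:
                    j += 1
                extent = j - i
                contribution = (extent ** extent_power) * (h ** height_power) * dh
                for idx in range(i, j):
                    scores[idx] += contribution
                i = j
            else:
                i += 1
    return scores

coarse = tfce_1d([0.0, 1.0, 0.0], dh=0.5)
```

For this single isolated peak of height 1 the exact integral is 1/3, whereas the coarse grid with `dh=0.5` gives 0.625 at the peak. Shrinking `dh` moves the sum toward the exact value at the price of more threshold passes, which is the trade-off an exact method avoids.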

2026-03-03T13:56:57Z Xu Chen Wouter Weeda Thomas E. Nichols Jelle J. Goeman http://arxiv.org/abs/2603.02928v1 LOO-PIT predictive model checking 2026-03-03T12:34:52Z

We consider predictive checking for Bayesian model assessment using the leave-one-out probability integral transform (LOO-PIT). LOO-PIT values are conditional cumulative predictive probabilities given LOO predictive distributions and the corresponding left-out observations. For a well-calibrated model, LOO-PIT values should be nearly uniformly distributed, but in the finite sample case they are not independent, because the LOO predictive distributions are determined by nearly the same data (all but one observation). We prove that this dependency is non-negligible in the finite case and depends on model complexity. We propose three testing procedures that can be used for continuous and discrete dependent uniform values. We also propose an automated graphical method for visualizing local departures from the null. Extensive numerical experiments on simulated and real datasets demonstrate that the proposed tests achieve competitive performance overall and have much higher power than standard uniformity tests based on the independence assumption, which inevitably lead to a lower-than-expected rejection rate.
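
A small worked case may help make LOO-PIT values concrete. The sketch below assumes a toy model in which observations are normal with known standard deviation and the mean has a flat prior, so each leave-one-out predictive distribution is available in closed form. The model choice and the function names are assumptions for illustration; the testing procedures proposed in the paper are not implemented here.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loo_pit_normal_known_sigma(ys, sigma=1.0):
    """LOO-PIT values for y_i ~ N(mu, sigma^2) with a flat prior on mu.

    Leaving out y_i, the posterior for mu is N(mean_-i, sigma^2 / (n - 1)),
    so the LOO predictive for y_i is N(mean_-i, sigma^2 * (1 + 1 / (n - 1))).
    The PIT value is that predictive CDF evaluated at the held-out y_i.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(ys)
    loo_sd = sigma * math.sqrt(1.0 + 1.0 / (n - 1))
    pits = []
    for y in ys:
        loo_mean = (total - y) / (n - 1)
        pits.append(normal_cdf((y - loo_mean) / loo_sd))
    return pits

pits = loo_pit_normal_known_sigma([-1.0, 0.0, 1.0])
```

In this toy case every leave-one-out mean shares all but one of its observations with every other, so the resulting PIT values move together rather than behaving like independent uniform draws.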

2026-03-03T12:34:52Z 30 pages Herman Tesso Aki Vehtari