Causal Inference of Blood Pressure Reduction and Coronary Heart Disease Risk in the Framingham Study

2026-05-07T04:44:51Z

Standard cardiovascular risk calculators, including the Framingham Risk Score and the ACC/AHA Pooled Cohort Equations, estimate the conditional probability P(CHD | SysBP = s) rather than the interventional quantity P(CHD | do(SysBP = s)). When confounding is present, this distinction has direct clinical consequences: observational estimates may systematically overstate the absolute benefit of antihypertensive treatment. We applied Pearl's do-calculus to the Framingham Heart Study Offspring Cohort (n = 4,240; primary analysis on 3,776 complete cases; 574 ten-year coronary heart disease events). A structurally corrected directed acyclic graph (DAG) was specified and evaluated using conditional independence testing. The average causal effect (ACE) of a 20 mmHg systolic blood pressure reduction was estimated by g-computation with bootstrap confidence intervals, corroborated by propensity score matching and inverse probability weighting. G-computation yielded an ACE of 3.40 percent absolute risk reduction (95 percent CI: 2.64 to 4.14), compared with a naive observational estimate of 4.14 percent, corresponding to an approximate 21.8 percent relative overestimation. Conditional average treatment effects were estimated using R-Learner and T-Learner metalearners. These findings suggest that observational cardiovascular risk tools may overestimate the absolute benefit of blood pressure reduction, with implications for clinical risk stratification and prescribing thresholds.
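
The gap between the conditional and the interventional estimate can be illustrated with a stratified form of the g-formula. The sketch below is a minimal illustration under stated assumptions, not the authors' analysis: the counts are hypothetical (not Framingham data), the exposure is reduced to a binary "high" versus "lower" systolic pressure, and a single discrete confounder (age stratum) stands in for the full adjustment set. The helper names are ours. The last function reproduces the relative overestimation implied by the two figures quoted in the abstract.

```python
# Hypothetical counts, NOT Framingham data: (stratum, exposure) -> (n, events).
COUNTS = {
    ("younger", "high_sbp"): (200, 20),
    ("younger", "lower_sbp"): (800, 40),
    ("older", "high_sbp"): (800, 240),
    ("older", "lower_sbp"): (200, 40),
}


def naive_difference(counts):
    """Crude risk difference P(Y | high) - P(Y | lower), ignoring the confounder."""
    totals = {}
    for (_, exposure), (n, events) in counts.items():
        tot_n, tot_e = totals.get(exposure, (0, 0))
        totals[exposure] = (tot_n + n, tot_e + events)
    risk = {a: e / n for a, (n, e) in totals.items()}
    return risk["high_sbp"] - risk["lower_sbp"]


def standardized_difference(counts):
    """Stratified g-formula: sum over z of [P(Y|high,z) - P(Y|lower,z)] * P(z)."""
    total_n = sum(n for n, _ in counts.values())
    effect = 0.0
    for z in sorted({stratum for stratum, _ in counts}):
        n_high, e_high = counts[(z, "high_sbp")]
        n_low, e_low = counts[(z, "lower_sbp")]
        weight = (n_high + n_low) / total_n  # empirical P(z)
        effect += (e_high / n_high - e_low / n_low) * weight
    return effect


def relative_overestimation(naive, adjusted):
    """How much larger the naive estimate is, relative to the adjusted one."""
    return (naive - adjusted) / adjusted


print(naive_difference(COUNTS), standardized_difference(COUNTS))
print(relative_overestimation(4.14, 3.40))  # figures quoted in the abstract
```

In this toy table the older stratum is both more often exposed and at higher baseline risk, so the crude difference is larger than the standardized one, which is the direction of bias the abstract describes.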

Robust Causal Inference for EHR-based Studies of Point Exposures with Missingness in Eligibility Criteria

2026-05-06T23:59:21Z

Missingness in variables that define study eligibility criteria is a seldom addressed challenge in electronic health record (EHR)-based settings. Patients with incomplete eligibility information are typically excluded from analysis without consideration of the (implicit) assumptions being made, leaving study conclusions subject to potential selection bias. In an effort to ascertain eligibility for more patients, researchers may look back further in time prior to study baseline and, in using outdated values of eligibility-defining covariates, may inappropriately include individuals who, unbeknownst to the researcher, fail to meet eligibility at baseline. To the best of our knowledge, however, very little work has been done to mitigate these concerns. We propose a robust and efficient estimator of the causal average treatment effect on the treated, defined in the study eligible population, in cohort studies where eligibility-defining covariates are missing at random. The approach facilitates the use of flexible machine-learning strategies for component nuisance functions while maintaining appropriate convergence rates for valid asymptotic inference. This method is directly motivated by, and applied throughout to, EHR data from Kaiser Permanente to analyze differences between two common bariatric surgical interventions for long-term weight and glycemic outcomes among a cohort of severely obese patients with type II diabetes mellitus.

Rate-optimal and computationally efficient nonparametric estimation on the circle and the sphere

2026-05-06T20:45:45Z

We investigate the problem of density estimation on the unit circle and the unit sphere from a computational perspective. Our primary goal is to develop new density estimators that are both rate-optimal and computationally efficient for direct implementation. After establishing these estimators, we derive closed-form expressions for probability estimates over regions of the circle and the sphere. Then, the proposed theories are supported by extensive simulation studies. The considered settings naturally arise when analyzing phenomena on the Earth's surface or in the sky (sphere), as well as directional or periodic phenomena (circle). The proposed approaches are broadly applicable, and we illustrate their usefulness through case studies in zoology, climatology, geophysics, and astronomy, which may be of independent interest. The methodologies developed here can be readily applied across a wide range of scientific domains.

Social Determinants of Health and Fentanyl Overdose Mortality Across US Counties: An XGBoost and SHAP Analysis Identifying Silent Risk Counties and Treatment Deserts

2026-05-06T20:28:33Z

Background: Fentanyl overdose deaths are still increasing across the U.S. We do not fully understand which county-level social and structural conditions lead to higher overdose death rates. Social determinants of health, including disability, treatment access, and behavioral health issues, may help identify vulnerable counties before deaths become severe. No earlier study has used explainable machine learning with SHAP attribution on 2022 CDC WONDER data to study treatment access gaps and silent risk counties. Methods: We combined data from four government sources for 975 U.S. counties, including CDC WONDER (2022) overdose mortality data, CDC Social Vulnerability Index (SVI), CDC PLACES health behavior data, and Area Health Resources Files. An XGBoost model was used to predict overdose mortality risk using Standardized Mortality Ratio (SMR). Five-fold cross-validation was used to test model accuracy, and SHAP values were used to show which factors increase or decrease risk. Results: XGBoost outperformed all tested models (Spearman rho=0.67, R2=0.457, MAE=0.409, high-risk recall=71.1%). Top predictors were disability rate, hypertension, smoking, and lack of vehicle access. Treatment desert counties had 52.6% higher overdose mortality (SMR 1.786 vs 1.170; p<0.0001). K-means identified 143 silent risk counties. Overdose deaths were spatially clustered (Moran's I=0.505, p=0.001) with 75 hotspots and 136 coldspots. Suppressed counties were 58.2% of WONDER counties, mostly rural (72%) and treatment deserts (65%). Conclusions: County-level SDOH factors predict overdose deaths, especially disability, treatment access, and behavioral health burden. MOUD expansion should prioritize treatment desert counties, and silent risk counties need early intervention before mortality worsens.
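
The reported 52.6% figure follows from the two SMR values in the Results. The sketch below assumes the usual definition of a standardized mortality ratio (observed deaths divided by expected deaths); the abstract does not spell out how expected deaths were computed, and the function names are ours.

```python
def smr(observed_deaths, expected_deaths):
    """Standardized mortality ratio, assuming the usual observed/expected form."""
    if expected_deaths <= 0:
        raise ValueError("expected_deaths must be positive")
    return observed_deaths / expected_deaths


def percent_higher(smr_a, smr_b):
    """Relative excess of smr_a over smr_b, in percent."""
    return (smr_a / smr_b - 1.0) * 100.0


# SMR values quoted in the abstract: treatment deserts vs other counties.
print(round(percent_higher(1.786, 1.170), 1))
```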

Bayesian Region Selection and Prediction in Poisson Regression with Spatially Dependent Global-Local Shrinkage Prior

2026-05-06T19:30:18Z

High-dimensional spatially correlated covariates are common in regression models encountered in environmental sciences and other fields. In such models, the regression coefficients often exhibit a sparse structure with spatial dependence. Although standard variable selection approaches can help detect the sparse structure, incorporating the dependence into variable selection helps recover spatially contiguous signals and improves prediction accuracy. Motivated by a real-world challenge in hurricane count prediction, we propose a novel neighborhood-structured global-local shrinkage prior for prediction and region selection in Poisson regression with spatial covariates. The proposed prior combines the Conditional Auto-Regressive (CAR) prior with a Super Heavy-tailed prior to introduce spatial dependence among the coefficients while ensuring appropriate shrinkage effects for covariate selection. We develop an efficient Metropolis-within-Gibbs sampler for computation that accommodates the count data. Extensive simulation studies demonstrate that the proposed model excels when signals are weak and adjacent and the spatial dependence in covariates is strong. In the application to hurricane prediction in the North Atlantic, our method outperforms traditional regression-based approaches and rivals the benchmark oracle model.

Improving Minority Population Sampling with BISG Probabilities: Evidence from a Survey of Jewish Americans

2026-05-06T19:08:55Z

Sampling geographically dispersed minority populations poses substantial challenges when individual group membership cannot be directly observed. Although stratified sampling can offer efficiency gains, these gains are typically modest unless the minority population is highly concentrated within a small number of strata. In this paper, we propose using Bayesian Improved Surname Geocoding (BISG) to enhance the efficiency of minority population sampling. BISG generates individual-level probabilities of minority group membership based on names and residential addresses. We incorporate these probabilities into a stratified Poisson probability sampling design. Applying the proposed approach to a national survey of Jewish Americans, we find that our estimates closely align with those from a large-scale Pew Research Center survey of the same population, which relied on a substantially more expensive sampling strategy involving geographic stratification and screening. At a fraction of the cost, our survey reproduces nearly identical patterns observed by Pew, including estimates of religious denominations and participation in specific religious activities.

Dynamic SIR/SEIR-like models comprising a time-dependent transmission rate: Hamiltonian Monte Carlo approach with applications to COVID-19

2026-05-06T16:47:27Z

A study of changes in the transmission of a disease, in particular a new disease like COVID-19, requires very flexible models which can capture, among other things, the effects of non-pharmacological and pharmacological measures, changes in population behaviour and random events. We favour data-driven approaches over a priori and ad-hoc methods and introduce a generalised family of epidemiologically informed mechanistic models, guided by Ordinary Differential Equations and embedded in a probabilistic model. The mechanistic models SIKR and SEMIKR, which divide the population into disjoint compartments for individuals Susceptible to infection, Infectious (K sub-compartments), Exposed (M sub-compartments), and Removed from the pool of susceptibles, are enriched with a time-dependent transmission rate, parameterised using Bayesian P-splines. Such a parameterisation enables extensive flexibility in the transmission dynamics, without resorting to ad-hoc specifications. Our probabilistic model relies on the solutions of a mechanistic model and benefits from access to the information about under-reporting of new infected cases, a crucial property when studying diseases with a large fraction of asymptomatic infections. Such a model can be differentiated efficiently, which makes Hamiltonian-based Monte Carlo sampling feasible after a careful initialisation and tuning strategy. This is particularly important in the present setting with weakly identified directions and challenging posterior geometries. Furthermore, we apply our methodology to study the transmission dynamics of COVID-19 in the Basque Country (Spain) from mid February 2020 to the end of January 2021, showing how the framework can recover plausible temporal patterns in transmission while making explicit the dependence of the results on modelling choices and convergence diagnostics.
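
To make the role of a time-dependent transmission rate concrete, the sketch below integrates a plain SIR model with a user-supplied rate function. It is a simplification under stated assumptions, not the SIKR/SEMIKR models of the abstract: there are no K or M sub-compartments, the rate is a hypothetical step function rather than a Bayesian P-spline, the integrator is forward Euler, and nothing is inferred from data. All parameter values are illustrative.

```python
def beta(t):
    """Hypothetical transmission rate: drops at day 30 to mimic an intervention."""
    return 0.4 if t < 30.0 else 0.15


def simulate_sir(beta_fn, gamma=0.1, n=1_000_000.0, i0=10.0, days=120, dt=0.1):
    """Forward-Euler integration of
        dS/dt = -beta(t) S I / N,
        dI/dt =  beta(t) S I / N - gamma I,
        dR/dt =  gamma I.
    Returns a list of (t, S, I, R) tuples."""
    s, i, r = n - i0, i0, 0.0
    path = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(steps):
        t = k * dt
        new_infections = beta_fn(t) * s * i / n * dt
        new_removals = gamma * i * dt
        s = s - new_infections
        i = i + new_infections - new_removals
        r = r + new_removals
        path.append(((k + 1) * dt, s, i, r))
    return path


path = simulate_sir(beta)
peak_time, _, peak_infectious, _ = max(path, key=lambda row: row[2])
print(peak_time, peak_infectious)
```

Replacing `beta` with a smoother function changes the epidemic curve without touching the integrator, which is the separation that a spline parameterisation of the rate exploits.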

Building informative materials datasets beyond targeted objectives

2026-05-06T16:39:01Z

Materials science data collection can be expensive, making the reuse and long-term utility of datasets critically important for future discovery campaigns. In practice, researchers prioritize a subset of properties due to research interests. However, ignoring a subset of outcomes in data collection campaigns potentially generates datasets poorly suited for future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10%. For targeted properties, performance can degrade with respect to random sampling by up to 12.5% without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties, but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across considered and unconsidered outcomes, ensuring unbiased quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.

Randompack: Cross-Platform Reproducible Random Number Generation and Distribution Sampling

2026-05-06T16:35:08Z

A C library for random number generation, Randompack, is presented. The library implements several modern random number generators (engines), including xoshiro256, PCG64, Philox, ranlux++, and sfc64; 14 continuous distributions including uniform, normal, exponential, gamma, beta, and multivariate normal; raw bit streams, bounded integers, permutations, and sampling without replacement. The engine and the distribution layers are separated so any engine can be used with any distribution. Benchmarks show that Randompack is faster overall than competing libraries, with speedup factors ranging from about 1 to 15 depending on engine, distribution, interface, and platform. A distinguishing feature is reproducibility: with the same seeds Randompack gives compatible results across programming languages, computers, CPU architectures, and compilers. The library includes comprehensive support for parallel simulation. It is accompanied by a comprehensive test suite, benchmarking programs, and example programs. Interfaces to Fortran, Python, Julia, and R have been implemented; their benchmark results are included, although their design and implementation are otherwise outside the scope of the article. Unlike other available C libraries with comparable scope, Randompack is permissively licensed under the MIT license, and it is open source and publicly available through GitHub and conda-forge.

ADAPTS: Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms

2026-05-06T14:46:49Z

Modeling latent clinical constructs from unconstrained clinical interactions is a unique challenge in affective computing. We present ADAPTS (Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms), a framework for automated rating of depression and anxiety severity using a mixture-of-agents LLM architecture. This approach decomposes long-form clinical interviews into symptom-specific reasoning tasks, producing auditable justifications while preserving temporal and speaker alignment. Generalization was evaluated across two independent datasets ($N=204$) with distinct interview structures. On high-discrepancy interviews, automated ratings approximated expert benchmarks ($\text{absolute error}=22$) more closely than original human ratings ($\text{absolute error}=26$). Implementing an ``extended'' protocol that incorporates qualitative clinical conventions significantly stabilized ratings, with absolute agreement reaching $\text{ICC(2,1)} = 0.877$. These findings suggest that the ADAPTS framework enables promising evaluations of psychiatric severity. While the current implementation is purely text-based, the underlying architecture is readily extensible to multimodal inputs, including acoustic and visual features. By approximating expert-level precision in a protocol-agnostic manner, this framework provides a foundation for objective and scalable psychiatric assessment, especially in resource-limited settings.

A Convolution Process for Sea Surface Temperature Hot-Spot Identification in the Mediterranean Sea

2026-05-06T13:49:12Z

Sea surface temperature (SST) is a fundamental determinant of global climate dynamics and economic activity. Reliable projections of future SST patterns depend critically on a rigorous characterization of the underlying spatial random field. In this study, we introduce a novel convolution-based covariance framework tailored to geostatistical domains constrained by physical barriers and influenced by vector-driven flows. By discretizing the continuous marine domain into a directed linear network that preserves the orientation of ocean currents, we construct a moving-average stochastic process whose dynamics are encoded via a Markovian transition-probability matrix on the network's vertices. The induced covariance structure emerges as a weighted combination of a spatial kernel and flow-dependent weights, giving rise to a complex estimation problem. To stabilize inference, we propose a penalized estimator that regularizes covariance parameters while enforcing consistency with known hydrodynamic properties. We then embed this covariance model into a Monte Carlo simulation framework to refine RCP-based SST projections and to identify thermal 'hot spots' of heightened ecological risk. Our approach delivers a statistically principled framework that prevents physical inconsistencies -- such as correlations across land barriers -- providing a robust basis for quantifying uncertainty in future SST forecasts and for guiding targeted environmental assessments.

Forecasting Oncology Demand Trends with Boosting-Based Bayesian Conjugate Models

2026-05-06T12:55:29Z

Accurate trend forecasting in healthcare time series is essential for planning and resource allocation. This paper proposes a Bayesian framework for predicting oncology demand trends, modeling weekly appointments as a Poisson process with a Gamma prior on the demand rate. To enhance adaptability and capture persistent directional patterns, we incorporate a residual-based boosting mechanism grounded in a Gamma-Log-Normal conjugate structure. This boosting approach allows the model to track both short- and long-term trend shifts while maintaining the analytical tractability of conjugate Bayesian updating. The methodology was evaluated on real oncology service data from Cariri, Ceara, Brazil, and compared against established baselines, including linear regression, ARIMA, naive forecasting, LSTM neural networks, and XGBoost. Results showed that the proposed model outperforms competing methods in trend detection accuracy, with gains in the percentage of correctly predicted directions of 38.25% relative to the second-best approach in some cases.
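
The Poisson model with a Gamma prior on the rate has a closed-form posterior, which is the tractability the abstract refers to. The sketch below shows only that standard conjugate update, using the shape-rate parameterisation; the residual-based boosting step and the Gamma-Log-Normal structure are not reproduced here, and the prior values and weekly counts are hypothetical.

```python
def update_gamma_poisson(shape, rate, counts):
    """Gamma(shape, rate) prior + Poisson counts -> Gamma posterior parameters.

    The posterior shape adds the total count; the posterior rate adds the
    number of observation periods (weeks, in this illustration).
    """
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    return shape + sum(counts), rate + len(counts)


def posterior_mean(shape, rate):
    """Mean of a Gamma(shape, rate) distribution: the point forecast of demand."""
    return shape / rate


# Hypothetical prior and weekly appointment counts, for illustration only.
prior_shape, prior_rate = 2.0, 1.0
weekly_counts = [3, 5, 4]

post_shape, post_rate = update_gamma_poisson(prior_shape, prior_rate, weekly_counts)
print(post_shape, post_rate, posterior_mean(post_shape, post_rate))
```

Because the posterior is again a Gamma distribution, the same function can be applied week by week, feeding each posterior back in as the next prior.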

Confirmation of Binary Clustering in Gamma-Ray Bursts through an Integrated $p$-value from Multiple Nonparametric Tests of Hypotheses

2026-05-06T10:35:26Z

The paper applies a new nonparametric, interpoint distance-based measure to confirm the inherent groups prevailing in the brightest source of light in the universe: gamma-ray bursts. Our effective metric, in association with clustering methods such as Gaussian mixture model-based and $K$-means algorithms, resolves the debate over whether the gamma-ray burst population contains more than two clusters. Here we carry out multiple nonparametric statistical tests of hypotheses, as many as the number of bursts available from the `BATSE' catalog. An integrated $p$-value obtained from these dependent tests confirms two groups, of short and long bursts.

Multi-site modelling and reconstruction of past extreme skew surges along the French Atlantic coast

2026-05-06T07:07:54Z

Appropriate modelling of extreme skew surges is crucial, particularly for coastal risk management. Our study focuses on modelling extreme skew surges along the French Atlantic coast, with a particular emphasis on investigating the extremal dependence structure between stations. We employ the peak-over-threshold framework, where a multivariate extreme event is defined whenever at least one location records a large value, though not necessarily all stations simultaneously. A novel method for determining an appropriate level (threshold) above which observations can be classified as extreme is proposed. Two complementary approaches are explored. First, the multivariate generalized Pareto distribution is employed to model extremes, leveraging its properties to derive a generative model that predicts extreme skew surges at one station based on observed extremes at nearby stations. Second, a novel extreme regression framework is assessed for point predictions. This specific regression framework enables accurate point predictions using only the 'angle' of input variables, i.e., input variables divided by their norms. The ultimate objective is to reconstruct historical skew surge time series at stations with limited data. This is achieved by integrating extreme skew surge data from stations with longer records, such as Brest and Saint-Nazaire, which provide over 150 years of observations.
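
The 'angle' used by the extreme regression framework is the input vector divided by its norm. The sketch below assumes the Euclidean norm, which the abstract does not specify, and uses hypothetical surge values; the function name is ours.

```python
import math


def angular_decomposition(x):
    """Split a vector into (radius, angle), with angle = x / ||x||.

    The Euclidean norm is an assumption; the abstract only says 'norms'.
    """
    radius = math.sqrt(sum(v * v for v in x))
    if radius == 0.0:
        raise ValueError("the angle of the zero vector is undefined")
    return radius, [v / radius for v in x]


# Hypothetical skew surges (in metres) at two neighbouring stations.
radius, angle = angular_decomposition([3.0, 4.0])
print(radius, angle)
```

The angle always has unit norm, so it records how an extreme event is distributed across stations while the radius records its overall magnitude.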

From Video-to-PDE: Data-Driven Discovery of Nonlinear Dye Plume Dynamics

2026-05-06T06:23:59Z

Inferring continuum models directly from video is hampered by two facts: the recorded field is uncalibrated image intensity rather than a physical state, and direct numerical differentiation of noisy frames is unstable. We develop a video-to-PDE pipeline that converts grayscale recordings of an ink plume into a normalised scalar field $u(x,y,t)$, isolates a bulk drift $\mathbf{v}(t)$ from intrinsic spreading via the intensity-weighted centroid, and identifies an effective transport law by weak-form sparse regression. Conditioning, threshold-sweep and random-centre diagnostics show that overcomplete libraries are strongly collinear; the search is therefore restricted to compact gradient-based libraries. Coefficients are refined by an inverse physics-informed network and recalibrated against forward rollouts, with a chronological block bootstrap quantifying uncertainty. The selected reduced model $u_t+\mathbf v(t)\!\cdot\!\nabla u = 9.005\,|\nabla u|^{2}+0.666\,\Delta u$ outperforms advection--diffusion baselines on held-out frames, retains a positive Laplacian coefficient, and admits a Cole--Hopf reduction to a linear advection--diffusion equation. The framework demonstrates that uncalibrated visual data can yield compact, predictive and structurally interpretable continuum models when discovery, calibration and uncertainty are treated as distinct stages.
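
The Cole--Hopf reduction mentioned above can be checked directly. The derivation below is a sketch of the standard substitution, not the authors' presentation: the symbols $a$, $b$ and $w$ are our notation, with $a = 9.005$ and $b = 0.666$ taken from the reduced model, and it relies on $b \neq 0$ and on $\mathbf{v}(t)$ being independent of position.

```latex
% Selected model: u_t + v(t) . grad u = a |grad u|^2 + b Lap u,
% with a = 9.005 and b = 0.666 (values from the abstract).
\begin{align*}
w &= \exp\!\left(\tfrac{a}{b}\,u\right),
  \qquad \tfrac{a}{b} = \tfrac{9.005}{0.666} \approx 13.52, \\
w_t &= \tfrac{a}{b}\,w\,u_t,
  \qquad \nabla w = \tfrac{a}{b}\,w\,\nabla u, \\
\Delta w &= \tfrac{a}{b}\,w\,\Delta u + \tfrac{a^{2}}{b^{2}}\,w\,|\nabla u|^{2}, \\
w_t + \mathbf{v}(t)\cdot\nabla w
  &= \tfrac{a}{b}\,w\,\bigl(u_t + \mathbf{v}(t)\cdot\nabla u\bigr)
   = \tfrac{a}{b}\,w\,\bigl(a\,|\nabla u|^{2} + b\,\Delta u\bigr)
   = b\,\Delta w .
\end{align*}
```

The transformed field $w$ therefore satisfies the linear advection--diffusion equation $w_t + \mathbf{v}(t)\cdot\nabla w = 0.666\,\Delta w$, and $u$ is recovered as $u = \tfrac{b}{a}\ln w$.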