Trustworthy AI/ML Regression and Unbiased Causal Inference for Real-World Data

2026-05-31T05:19:25Z

Real-World Data (RWD), with its large sample sizes and rich clinical detail, offers a compelling alternative to randomized controlled trials (RCTs) for studying treatment effects in diverse and complex patient populations. However, its observational nature introduces confounding that prevents straightforward comparative effectiveness research. Target trial emulation leverages RWD to estimate average treatment effects (ATE) at the population scale and diversity that RCTs cannot achieve, yet its validity depends critically on unbiased ATE estimation under high-dimensional confounding. Many causal inference pipelines address high-dimensional confounding through machine learning and artificial intelligence (ML/AI) outcome regression. However, commonly used ML/AI regression models exhibit systematic prediction bias, with predicted outcomes shrinking toward the marginal outcome mean. This structural bias propagates into ATE estimation and cannot be corrected by cross-fitting, ensemble methods, or any standard ML practice. In this work, we first quantitatively characterize how systematic prediction bias in ML/AI outcome regression leads to biased ATE estimates in causal inference models. We further propose an unbiased ML/AI regression-based causal inference framework to ensure unbiased ATE estimation for observational studies. We demonstrate our approach by studying the effects of opioids on cardiovascular health in patients with chronic pain using UK Biobank data.
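
The link between shrunken outcome predictions and a biased ATE can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not from the paper: a single confounder, a linear outcome model fitted separately in each arm, and a ridge penalty standing in for the shrinkage of ML/AI predictions toward the outcome mean. The function names `fit_ridge_1d` and `g_computation_ate` and all simulation settings are hypothetical.

```python
import numpy as np

def fit_ridge_1d(x, y, lam):
    """Ridge fit of y = a + b*x with an unpenalized intercept."""
    xc = x - x.mean()
    slope = (xc @ (y - y.mean())) / (xc @ xc + lam)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

def g_computation_ate(x, t, y, lam):
    """Fit one outcome model per arm, then average the predicted contrast."""
    a1, b1 = fit_ridge_1d(x[t == 1], y[t == 1], lam)
    a0, b0 = fit_ridge_1d(x[t == 0], y[t == 0], lam)
    return float(np.mean((a1 + b1 * x) - (a0 + b0 * x)))

# Simulated observational data (illustrative values only).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                                # confounder
propensity = 1.0 / (1.0 + np.exp(-1.5 * x))           # treatment depends on x
t = (rng.random(n) < propensity).astype(int)
y = 1.0 * t + 2.0 * x + rng.normal(size=n)            # true ATE is 1.0

ate_unshrunk = g_computation_ate(x, t, y, lam=0.0)    # no shrinkage
ate_shrunk = g_computation_ate(x, t, y, lam=1e6)      # predictions pulled to arm means

print(f"unshrunk ATE estimate: {ate_unshrunk:.3f}")
print(f"shrunk ATE estimate:   {ate_shrunk:.3f}")
```

With a very large penalty the per-arm predictions collapse to the arm means, so the estimate approaches the confounded difference in means rather than the true effect.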

Efficient Synthetic Network Generation via Latent Embedding Reconstruction

2026-05-31T00:01:13Z

Network data are ubiquitous across the social sciences, biology, and information systems. Generating realistic synthetic network data has broad applications from network simulation to scientific discovery. However, many existing black-box approaches for network generation tend to overfit observed data while overlooking characteristic network structure, and incur substantial computational overhead at scale. These practical challenges call for synthetic network generation methods that are both efficient and capable of capturing structural properties of networks. In this paper, we introduce Synthetic Network Generation via Latent Embedding Reconstruction (SyNGLER), a general and efficient framework for synthetic network generation that builds on latent space network models. Given an observed network, SyNGLER first learns low-dimensional latent node embeddings via a latent space network model and then reconstructs the latent space by building a distribution-free generator over these embeddings. For generation, SyNGLER first samples (or resamples) node embeddings from the generator in the latent space and then produces synthetic networks using the latent space network model. Through the latent space framework, SyNGLER preserves unique characteristics in networks such as sparsity and node degree heterogeneity, while allowing for efficient training with lower computational cost than many existing deep architectures. We provide theoretical guarantees by developing consistency results on the distance between the true and synthetic edge distributions. Empirical studies further demonstrate the effectiveness of SyNGLER, which efficiently produces networks that better preserve key network characteristics such as network moments and degree distributions compared with existing approaches. Code is available at https://github.com/FeifanJiang/syngler.
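
The embed, resample, and generate steps described above can be sketched as follows. This is not the SyNGLER implementation: it assumes a random dot product graph as the latent space model, an adjacency spectral embedding as the estimator, and a plain bootstrap of the embedded nodes as a stand-in for the distribution-free generator. The helper names `spectral_embedding` and `sample_network` are hypothetical.

```python
import numpy as np

def spectral_embedding(adj, dim):
    """Embed nodes using the eigenvectors of the `dim` largest eigenvalues."""
    vals, vecs = np.linalg.eigh(adj)          # eigenvalues in ascending order
    top_vals = np.clip(vals[-dim:], 0.0, None)
    return vecs[:, -dim:] * np.sqrt(top_vals)

def sample_network(emb, rng):
    """Draw an undirected, loop-free graph with edge probabilities emb @ emb.T."""
    probs = np.clip(emb @ emb.T, 0.0, 1.0)
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(0)
n = 60

# Toy "observed" network: a two-block stochastic block model.
labels = np.repeat([0, 1], n // 2)
block_probs = np.array([[0.5, 0.1], [0.1, 0.4]])
p_true = block_probs[labels][:, labels]
upper = np.triu(rng.random((n, n)) < p_true, k=1)
observed = (upper | upper.T).astype(float)

# Step 1: learn latent node embeddings from the observed network.
emb = spectral_embedding(observed, dim=2)
# Step 2: resample embeddings (bootstrap stand-in for the generator).
resampled = emb[rng.integers(0, n, size=n)]
# Step 3: generate a synthetic network from the resampled embeddings.
synthetic = sample_network(resampled, rng)

print("observed density: ", observed.sum() / (n * (n - 1)))
print("synthetic density:", synthetic.sum() / (n * (n - 1)))
```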

Evaluating the Impact of COVID-19 Vaccination in the United Kingdom: A Gaussian Process Approach

2026-05-30T22:54:23Z

The rapid rollout of COVID-19 vaccines in the United Kingdom in early 2021 differed markedly from that of many other European countries, providing a natural setting to assess the impact of vaccination speed on public health outcomes. We evaluate the impact of the accelerated UK vaccination rollout and associated policy transition on COVID-19 mortality and transmission dynamics by constructing a probabilistic reference trajectory for the UK under a slower vaccination and reopening trajectory. The proposed framework combines ideas from interrupted time series analysis and synthetic control methods with flexible probabilistic modelling based on multi-output Gaussian processes. These models capture non-linear and heterogeneous dependence structures across countries and over time, while providing uncertainty quantification through predictive distributions. A central feature of the methodology is a design-consistent validation strategy based on predictive performance in held-out pre-intervention periods, which is used both to guide model specification and to assess the plausibility of the reconstructed reference trajectory. The empirical results indicate a substantial reduction in COVID-19 mortality associated with the accelerated vaccination-policy transition, with little evidence of an effect on transmission rates. More generally, the framework illustrates how flexible probabilistic models and predictive validation can support causal and policy evaluation in complex time series settings.

Multi-source land-use emissions reveal rising airborne fraction

2026-05-30T21:25:07Z

The airborne fraction is the share of anthropogenic carbon dioxide emissions that remains in the atmosphere and is a key indicator of carbon-cycle response and remaining carbon budgets under continued emissions. Whether this share is rising remains debated because inference is sensitive to uncertainty in land-use and land-cover change (LULC) emissions. Here we use all available LULC measurement series from Global Carbon Budget 2025 and estimate airborne-fraction trends with a mixed-effects model with random intercepts and slopes by LULC series. We find that the airborne fraction increased over 1959-2024, from about 0.40 to about 0.47, and that this conclusion is robust to excluding the final year and to alternative specifications that explicitly propagate denominator uncertainty. These results clarify why earlier studies reported weak or inconclusive trend evidence and strengthen support for the view that an increasing share of emitted carbon dioxide is accumulating in the atmosphere rather than being taken up by land and ocean sinks, with implications for carbon-budget assessment and near-term mitigation requirements.

Hybrid Probabilistic Forecasting of Under-Five Malaria Admissions in Ghana: A Gaussian Process Regression with Holt-Winters Smoothing

2026-05-30T18:18:36Z

Accurate malaria forecasting remains a major challenge in sub-Saharan Africa, where strong seasonality, reporting uncertainty, and non-stationary transmission dynamics reduce the reliability of conventional models. In Ghana, district-level malaria surveillance requires forecasting frameworks that are probabilistically rigorous and robust under limited data. This study proposes a hybrid framework integrating Gaussian Process Regression (GPR) with Holt-Winters exponential smoothing for modelling monthly under-five malaria admissions. GPR captures non-linear behaviour and predictive uncertainty, while Holt-Winters stabilises long-horizon forecasts and preserves seasonal structure. Using ten years of district-level data (2014-2023), performance was evaluated via rolling-origin expanding-window validation. The hybrid model achieved $R^2 = 0.9906$ versus $0.8213$ for Holt-Winters alone, with $94.2\%$ of residuals within $\pm 2\sigma$ bounds. Forecasts for 2024-2028 project average monthly admissions from approximately 8,000 to 12,200 cases. Spatio-temporal analysis revealed pronounced ecological heterogeneity: northern high-burden districts exhibited stable relative patterns despite large absolute fluctuations. The framework provides a scalable probabilistic approach for malaria early warning and operational planning in endemic settings, supporting Ghana's national malaria control strategy.
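
Rolling-origin expanding-window validation, as named above, can be sketched as a generator of train/test index ranges. The function name `expanding_window_splits` and the fold settings (60 initial months, 12-month horizon, 12-month step) are assumptions for illustration; the abstract does not state the study's actual fold configuration.

```python
def expanding_window_splits(n_obs, initial_train, horizon, step=1):
    """Yield (train, test) index ranges for rolling-origin evaluation.

    The training window always starts at index 0 and grows by `step`
    observations per fold; the test window is the next `horizon` points.
    """
    origin = initial_train
    while origin + horizon <= n_obs:
        yield range(0, origin), range(origin, origin + horizon)
        origin += step

# Ten years of monthly data gives 120 observations.
for train, test in expanding_window_splits(120, initial_train=60, horizon=12, step=12):
    print(f"train months 0..{train[-1]}, test months {test[0]}..{test[-1]}")
```

Each fold refits the model on all data up to the forecast origin, so no future observation leaks into training.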

Robust inference for risk heterogeneity under group imbalance

2026-05-30T16:35:45Z

Population-level heterogeneity is ubiquitous in biomedical data, where differences across demographic or clinical subgroups can substantially alter risk patterns. For example, in intensive care unit (ICU) studies, the mortality risk associated with specific admission diagnoses can vary across ethnic groups. Existing approaches for detecting risk heterogeneity are often sensitive to baseline model misspecification and regularization bias, both of which commonly arise in practice. In this paper, we propose a robust framework for inferring risk heterogeneity between two populations using Neyman orthogonality, which yields estimators that are locally insensitive to nuisance parameter estimation error. The proposed estimator is consistent and asymptotically normal, and simulation studies demonstrate that in finite samples our method substantially reduces bias and improves inferential stability compared with standard likelihood-based approaches. In an application to the eICU Collaborative Research Database, our method reveals clinically meaningful ethnicity-specific heterogeneity in admission diagnoses for in-hospital mortality that standard likelihood-based methods fail to detect.

Bayesian Inference of Nonlinear Malaria Dynamics in Ghana via an Ensemble Markov Chain Monte Carlo Sampler

2026-05-30T16:02:56Z

Reliable quantification of malaria dynamics in sub-Saharan Africa is hindered by short, noisy, and spatially heterogeneous surveillance records. In Ghana, health-facility data from 2014 to 2023 reveal non-linear and age-specific fluctuations in hospital admissions, yet existing approaches struggle to capture stochastic variability or provide credible uncertainty bounds. This study develops a Bayesian nonlinear inference framework that integrates a cubic baseline with a damped oscillatory kernel, estimated via an affine-invariant ensemble Markov Chain Monte Carlo sampler. The framework accommodates limited data, models parameter uncertainty, and generates probabilistic forecasts for children under five years and individuals aged five years or more. Results show strong empirical adequacy ($R^2 = 0.9958$ for $<5$ years; $R^2 = 0.9956$ for $\geq 5$ years) with residual errors below $2\%$ and well-mixed posteriors confirming convergence. District-level analysis reveals pronounced spatial heterogeneity, with coefficients of variation ranging from $<0.07$ in urban centres such as Kumasi to $>3.3$ in peripheral districts such as Mpohor and Bia East. Forecasts for 2024-2026 indicate a gradual resurgence: from 137,000 to 149,000 cases among children under five years and from 348,000 to 375,000 cases among older individuals, with uncertainty widening over time. By producing probabilistic forecasts, this Bayesian framework provides a principled tool for anticipating malaria fluctuations and strengthening data-driven decision-making in Ghana's national malaria control strategy.
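
An affine-invariant ensemble sampler of the kind named above can be sketched with the Goodman-Weare stretch move. This is a generic illustration, not the study's code: the function name `stretch_move_sampler`, the walker count, and the standard-normal target are assumptions, and the cubic-plus-damped-oscillation likelihood is not reproduced because the abstract does not give its form.

```python
import numpy as np

def stretch_move_sampler(log_prob, walkers, n_steps, a=2.0, seed=0):
    """Serial Goodman-Weare stretch-move ensemble sampler.

    Returns an array of shape (n_steps, n_walkers, dim).
    """
    rng = np.random.default_rng(seed)
    walkers = np.array(walkers, dtype=float)
    n_walkers, dim = walkers.shape
    logp = np.array([log_prob(w) for w in walkers])
    chain = np.empty((n_steps, n_walkers, dim))
    for step in range(n_steps):
        for k in range(n_walkers):
            # Pick a different walker j uniformly at random.
            j = rng.integers(n_walkers - 1)
            if j >= k:
                j += 1
            # Draw z with density proportional to 1/sqrt(z) on [1/a, a].
            z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
            proposal = walkers[j] + z * (walkers[k] - walkers[j])
            logp_new = log_prob(proposal)
            # Acceptance ratio includes the z^(dim-1) volume factor.
            log_accept = (dim - 1) * np.log(z) + logp_new - logp[k]
            if np.log(rng.random()) < log_accept:
                walkers[k] = proposal
                logp[k] = logp_new
        chain[step] = walkers
    return chain

def log_standard_normal(theta):
    """Toy target: unnormalized log density of a standard normal."""
    return -0.5 * float(theta @ theta)

init = np.random.default_rng(1).normal(size=(16, 2))
chain = stretch_move_sampler(log_standard_normal, init, n_steps=1500)
samples = chain[300:].reshape(-1, 2)   # discard the first 300 steps as burn-in
print("posterior mean estimate:", samples.mean(axis=0))
```

In practice `log_prob` would be the log posterior of the model parameters given the admissions data.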

Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery

2026-05-30T15:21:58Z

Modern Machine Learning (ML) and Artificial Intelligence (AI) models, especially large language models (LLMs), are increasingly used to generate scientific hypotheses and mechanistic explanations from observational data. This position paper argues that in the high-dimensional proxy regimes where modern ML excels, mechanistic learning is generically underdetermined: many incompatible mechanisms induce essentially the same observational relationships on the support of the data, so predictive success and coherent explanations are insufficient evidence of mechanism discovery. This underdetermination becomes uniquely hazardous with LLMs, which tend to collapse large equivalence classes of explanations into a single fluent narrative. This paper proposes concrete standards for "mechanistic ML" and argues that these norms are necessary if LLM-centered workflows are to support science rather than merely simulate it.

Bayesian estimation of spectral parameters of the 6.7-GHz methanol maser G339.884-1.259 from GRAO observations

2026-05-30T15:19:31Z

Accurate decomposition of methanol maser spectra is essential for understanding high-mass star-forming regions, especially in complex blended spectra where small differences alter physical interpretation. Conventional Gaussian fitting often fails to capture non-Gaussian structure and lacks uncertainty quantification. We develop a Bayesian spectral decomposition framework using Gaussian, Lorentzian, and Voigt profiles with Markov Chain Monte Carlo sampling, enabling model comparison and uncertainty estimation. Applied to the 6.7\,GHz methanol maser G339.884$-$1.259 observed with the Ghana Radio Astronomy Observatory, our method reveals seven velocity-coherent components. The Voigt model is statistically preferred, yielding the lowest AIC and BIC ($\approx 1.98 \times 10^{4}$ and $1.99 \times 10^{4}$), the smallest RMSE ($\approx 11.1$ Jy), and the highest $R^{2}$ (0.985). Purely Gaussian or Lorentzian models leave systematic residuals. Elevated reduced $\chi^{2}_{\nu}$ values indicate unresolved substructure and non-ideal noise. Bayesian inference provides a robust framework for maser spectral analysis, extendable to other molecular lines and combinable with high-resolution interferometry.

RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

2026-05-30T04:52:59Z

Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one aggregate accuracy number from a held-out test set. That number says nothing about input reliability under encoding shifts, subgroup gaps, threshold sensitivity, or operational feasibility. We present RISED, a pre-deployment evaluation framework operationalising five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through BCa bootstrap 95% confidence intervals, literature-grounded thresholds, and Holm-Bonferroni-corrected PASS / FAIL / INCONCLUSIVE verdicts; Equity is a proxy-dependence diagnostic rather than a gating test. Applied to seven cohorts spanning 35 years (n from 303 to 99,492), RISED surfaces failures invisible to AUROC: on Diabetes 130, Reliability passes by three orders of magnitude (PSS = 0.0004) while Inclusivity (AUC parity gap = 0.262) and Sensitivity (max threshold-flip rate 49.1%) fail decisively; both NHIS cohorts reproduce this. NHANES 2021-2023, with a complete feature profile, achieves INCONCLUSIVE verdicts; BRFSS 2024 produces the suite's most severe Sensitivity failure (max threshold-flip rate 64.2%) after instrument rotation removed hypertension and cholesterol. The pattern recurs on credit- and income-prediction cohorts, confirming domain-agnosticity; a multi-model check shows the failures are data-driven, not model-specific. RISED ships as an open-source Python package complementing TRIPOD+AI, FUTURE-AI, and Fairlearn with the structured numerical evidence those standards require but do not prescribe.
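
The Holm-Bonferroni correction mentioned above is a step-down procedure over sorted p-values. The sketch below shows only that correction; the function name `holm_bonferroni` and the example p-values are hypothetical, and how RISED maps the resulting decisions to PASS / FAIL / INCONCLUSIVE verdicts is not specified in the abstract.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one reject/keep decision per hypothesis (Holm step-down)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on; stop at the first non-rejection.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

# Illustrative p-values for four tests.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
print(decisions)
```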

Stochastic Analysis of Cybersecurity Defense Strategies Under Single Attack Scenario

2026-05-30T02:22:22Z

This research presents a novel stochastic framework for proactive cybersecurity defense timing under a single attack scenario. The approach models the defense process as a continuous observation mechanism in which the defense instant and the subsequent observation slot follow independent exponential distributions. Laplace-Carson transforms combined with first-excess theory yield the joint detection function that brackets the attack moment. Marginalization under Markovian Poisson arrivals then produces the probability density of the defense moment and conditional expectations of pre-attack and post-attack observation times. These closed-form results enable quantitative assessment of defense timing sensitivity to threat intensity and support precise calibration of observation parameters for low-latency proactive measures. Major contributions include the explicit derivation of marginal distributions and expected values, visualization of defense moment density, and the bridging of stochastic duel methodology with practical cybersecurity applications.

Revisiting Marked Galaxy Clustering from a Joint Point Process Perspective

2026-05-29T23:18:32Z

Marked correlation functions, in which galaxy properties such as luminosity or stellar mass are treated as marks, are widely used to test models of galaxy formation. In astronomy, however, these statistics are typically implemented as summary measures that do not preserve the joint structure of mark pairs conditioned on separation. In this work, we formulate galaxies as points $(x,m)$ on the product space $\mathbb{R}^3\times\mathcal{M}$, where $x$ denotes position and $m$ a mark, and introduce the joint pair correlation function $g(r;m_1,m_2)$ as the fundamental quantity describing mark-dependent clustering. We further define a diagnostic quantity $\Delta_{\mathrm{ind}}(r;m_1,m_2)$ that locally quantifies deviations from the independence hypothesis relative to spatial clustering alone, thereby providing a projection-free description of which mark pairs are over- or underrepresented at a given separation scale. Within this framework, commonly used diagnostics such as the inhomogeneous cross-$J$ function are naturally interpreted as summary statistics obtained through averaging over mark sets and geometric-event-based reductions of the joint structure. This perspective clarifies that previously discussed marked effects, including assembly bias, correspond to projections of an underlying joint dependence, and that observationally accessible information is the existence of non-factorizable joint structure itself. The present formulation provides both a fundamental quantity and practical diagnostics for its characterization.

A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering

2026-05-29T22:37:13Z

We propose a distribution-free statistical framework that converts arbitrary rewrite-based detectors into detectors with finite-sample false discovery rate (FDR) guarantees without retraining. Our key observation is that rewrite-based detection implicitly constructs knockoff samples, enabling LLM-generated text detection to be formulated as a multiple hypothesis testing problem with knockoff structure. This perspective separates the design of detection statistics from the control of false discoveries, allowing existing rewrite detectors to inherit finite-sample FDR guarantees through a simple calibration procedure. We demonstrate reliable FDR control with meaningful detection power across three detection models, 19 domains, and four LLMs.
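
One way such a calibration step could look is the standard knockoff+ threshold rule from the knockoff filter literature. The sketch assumes each text has already been reduced to a signed statistic `w`, with large positive values favouring a discovery; how the paper derives these statistics from a rewrite-based detector is not stated in the abstract, so the function name `knockoff_plus_threshold` and the example values are hypothetical.

```python
import numpy as np

def knockoff_plus_threshold(w, q):
    """Smallest t with (1 + #{w <= -t}) / max(1, #{w >= t}) <= q, else inf."""
    w = np.asarray(w, dtype=float)
    candidates = np.sort(np.abs(w[w != 0]))
    for t in candidates:
        fdp_hat = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
        if fdp_hat <= q:
            return float(t)
    return float("inf")

# Illustrative signed statistics for ten texts.
w = np.array([5.0, 4.0, 3.0, 2.5, 2.0, -1.0, 1.5, -0.5, 3.5, 4.5])
threshold = knockoff_plus_threshold(w, q=0.25)
selected = np.flatnonzero(w >= threshold).tolist()
print("threshold:", threshold)
print("selected indices:", selected)
```

The negative statistics act as an estimate of the number of false positives among the positive ones, which is what lets the threshold be chosen without distributional assumptions.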

Network knockoffs: controlling false discovery in dyadic space

2026-05-29T20:36:56Z

Phenomena such as epidemiological processes, hydrologic systems, social platforms, utility services, and supply chains can be represented as topological networks. A central question about these networks concerns connectivity and the permeability of edges. Dyadic regression and related approaches have been proposed to identify network features associated with pairwise node-level differences. In high-dimensional settings, it is important to control the number of spuriously selected features. However, controlling the false discovery rate for dyadic outcomes is challenging because dependence among dyads invalidates classic asymptotic procedures and complicates standard data splitting and knockoff approaches. We propose a novel knockoff variable selection procedure that simulates synthetic features directly on the topological network prior to constructing the augmented design matrix in dyadic space. Empirically, our method controls the false discovery rate for both node- and edge-level features. The Benjamini-Hochberg, Benjamini-Yekutieli, Storey Q-value, data-splitting, and standard knockoff procedures were all anticonservative. We applied our network knockoffs to assess the impassability of over 1000 stream barriers in North Carolina for Salvelinus fontinalis. Compared to data splitting and traditional knockoff approaches, our proposed approach selected a higher proportion of barriers previously assessed to impede fish movement.

Cluster Analysis with Resampling for Validation and Exploration (CARVE)

2026-05-29T20:09:20Z

Clustering is widely used across the sciences as the foundation for downstream data-driven scientific discoveries. However, clustering results are highly sensitive to the choice of algorithm, preprocessing, and the number of clusters $k$, producing scientific claims that are often not reproducible. The current state of the art for validating clustering solutions consists of clustering validation indices (CVIs) such as Silhouette, Davies-Bouldin, and Calinski-Harabasz, which rely on geometric assumptions that break down on the heavy-tailed, high-dimensional, and nonlinearly structured data encountered in biomedical research. Resampling-based alternatives, grounded in the ideas of clustering stability and generalizability, have been proposed but remain scattered across specialized tools with no unified, accessible software. We fill this gap with CARVE (Cluster Analysis with Resampling for Validation and Exploration), an open-source Python and R package that jointly evaluates multiple clustering algorithms and hyperparameters, returning stability and generalizability diagnostics at the global, cluster, and sample level together with principled selection rules and consensus-based cluster labels. Across six synthetic benchmarks, CARVE consistently recovers near-optimal clusterings where classical indices degrade substantially. On experimental genomics and proteomics data sets, CARVE recovers finer biological structure when classical CVIs collapse entirely. CARVE is available with a scikit-learn-compatible Python API and an analogous R interface compatible with Seurat workflows.