https://arxiv.org/api/+2/RrdpifgOkWcbiXVw17QwY+qA
2026-06-21T08:59:29Z

http://arxiv.org/abs/2602.00844v2
Multivariate Time Series Data Imputation via Distributionally Robust Regularization
2026-05-06T01:11:30Z
Multivariate time series imputation is often compromised by mismatch between the observed and true data distributions, a bias induced by the combined effects of time-series non-stationarity and systematic missingness. Standard methods that encourage point-wise reconstruction or direct distributional alignment may overfit these biased observations. We propose the Distributionally Robust Regularized Imputer Objective (DRIO), which jointly minimizes reconstruction error and the worst-case divergence between the imputer distribution and data distributions within a Wasserstein ambiguity set. We derive a tractable upper-bound surrogate that reduces infinite-dimensional optimization over measures to adversarial search over sample trajectories, and develop an alternating learning algorithm compatible with modern deep learning backbones. Comprehensive experiments on diverse real-world datasets show that DRIO consistently provides robust imputation and suggests improved downstream forecasting under various missingness scenarios.
2026-01-31T18:15:03Z
Che-Yi Liao, Zheng Dong, Gian-Gabriel Garcia, Kamran Paynabar

http://arxiv.org/abs/2605.04372v1
A Zero-Inflated Beta Mixture Model for Marginal Mediation Analysis with Compositional Microbiome Mediators
2026-05-06T00:35:25Z
The role of the microbiome in disease pathogenesis is an emerging field with strong evidence suggesting that dysbiosis is associated with precancerous and cancerous states. Microbiome data present substantial challenges for causal mediation analysis due to sparsity, compositional constraints, and latent heterogeneity. To address these issues, we propose a zero-inflated beta mixture (ZIBM) method for mediation analysis with compositional microbiome mediators.
The proposed method accommodates excess zeros through a zero-inflation component and captures heterogeneity in non-zero relative abundances using a beta mixture distribution. Within the potential-outcomes framework, the ZIBM provides estimates of marginal microbiome-mediated causal effects, and model parameters are estimated using an expectation-maximization algorithm. Simulation studies demonstrate that the ZIBM yields more accurate estimation and reliable inference under conditions commonly observed in microbiome data, compared with existing approaches. An application to a real microbiome study further illustrates its practical utility. These results indicate that the proposed method provides a more flexible and robust statistical framework for mediation analysis involving compositional microbiome data.
2026-05-06T00:35:25Z
19 pages including references; 2 figures; Seungjun Ahn, Quran Wu: These authors contributed equally
Seungjun Ahn, Quran Wu, Alicia Yang, Zhigang Li

http://arxiv.org/abs/2605.05255v1
Prediction of Drought and Flash Drought in Africa at the Seasonal-to-Subseasonal Scale using the Community Research Earth Digital Intelligence Twin Framework
2026-05-05T23:52:32Z
Droughts and flash droughts (rapidly developing droughts; FDs) remain impactful events that are known to desiccate landscapes and destroy crops. In particular, droughts in Africa are often more impactful than in other locations, such as the United States or Europe, because many regions in Africa depend heavily on local agriculture for sustenance. In recent years, large machine learning (ML) models, such as GraphCast and AIFS, have emerged as effective tools for global weather prediction. However, sparse observations and few ML studies in Africa have left it unclear whether these ML models retain their skill when focused on Africa.
As such, this project seeks to examine the predictability of drought and FD in Africa using a CrossFormer model based on the Community Research Earth Digital Intelligence Twin (CREDIT) framework developed by NSF NCAR. Our CrossFormer model, termed DroughtFormer, incorporates variables from the ERA5 and GLDAS2 reanalyses and the IMERG and MODIS satellite observations, and employs dry air mass and moisture conservation, to predict soil moisture, vegetation health, and other drought-related surface variables. While DroughtFormer displayed lower accuracy in predicting precipitation and FD indices, it showed significant skill in predicting the remaining variables, delivering stable and skillful forecasts out to 90-day lead times (either beating climatology or showing comparable skill). In particular, DroughtFormer skillfully represented climate anomalies for key variables, such as soil moisture (though it struggled with the magnitude of the anomalies). Thus, DroughtFormer showed significant promise in representing and predicting agricultural-level drought in a region that is heavily impacted by drought events.
2026-05-05T23:52:32Z
Stuart Edris, Amy McGovern, Jason Hickey

http://arxiv.org/abs/2605.04342v1
Adaptive Diagonal Loading for Norm Constrained Beamforming
2026-05-05T23:00:06Z
Reliable adaptive beamforming is critical for large microphone arrays operating in highly dynamic acoustic environments. In scenarios characterized by fast-moving talkers and interferers, the available sample support for estimating the spatial correlation matrix is often snapshot-deficient. This deficiency, coupled with array imperfections, degrades the White Noise Gain (WNG), leading to severe target signal cancellation. To ensure stable and robust beamforming, we propose a novel adaptive diagonal loading method that guarantees the WNG remains strictly within specified bounds.
By leveraging the Kantorovich inequality, we map the desired WNG to a strict upper bound on the condition number of the correlation matrix. Furthermore, we present three estimation techniques for the adaptive loading level, ranging from trace-based bounding to exact eigenvalue decomposition, offering scalable computational complexities of $\mathcal{O}(M)$, $\mathcal{O}(M^2)$, and $\mathcal{O}(M^3)$. Our approach demonstrates highly stable beamforming under fast-changing interference.
2026-05-05T23:00:06Z
5 pages, 5 figures
Manan Mittal, Ryan M. Corey, John R. Buck, Andrew C. Singer

http://arxiv.org/abs/2605.04286v1
Probabilistic Classification and Uncertainty Quantification of Sahara Desert Climate Using Feedforward Neural Networks
2026-05-05T20:44:58Z
Climate classification plays a vital role in agricultural planning, hydrological studies, and climate science. One of the most widely used systems for classifying global climate zones is the Köppen-Trewartha (KT) classification. However, the KT classification is fundamentally deterministic, offering discrete labels to spatial locations without accounting for uncertainties in classification. In this paper, we provide a framework for probabilistic modeling of climatic zones. We implement a feedforward artificial neural network (ANN) for classification, allowing for efficient, uncertainty-aware categorization of climatic regions, thereby offering a more nuanced understanding of transitional climate zones compared to traditional deterministic methods. We apply this method to the Sahara Desert region over the 30-year period of 1960-1989, using data at more than 400,000 space-time locations from the first 11 years to train our model. We assess the model's short- and long-term classification capabilities to evaluate its stability and accuracy over time. We also compare the probabilistic classification from our model with the traditional KT classification.
In addition, we use fluctuation analysis methods to highlight the temporal evolution of climatic zones across the Sahara region and identify areas undergoing significant flux of probabilities of their climate classes, providing insights into broader trends in desertification.
2026-05-05T20:44:58Z
Stephen Tivenan, Indranil Sahoo, Yanjun Qian

http://arxiv.org/abs/2605.16332v1
Data-Driven Climate Outage Risk Characterization and Resilience Analysis in Joint Power-Communication Networks
2026-05-05T19:16:43Z
Climate-driven power outages pose a growing threat to U.S. grid reliability, yet empirical outage studies and interdependency-based resilience analyses are rarely integrated. This paper presents a data-driven framework that integrates empirical outage characterization with cascade failure simulation in joint power-communication networks. Using the EAGLE-I national outage dataset (2015-2023, more than 525,000 records), we characterize the climate-outage landscape through descriptive analysis and hypothesis testing, finding that climate-related outages increase by roughly 9,100 events per year and impose a significantly greater severity burden on coastal states. An interpretable logistic regression model then identifies the main predictors of severe outage risk, with Severe Weather emerging as the dominant factor. Guided by these findings, we construct four geographically representative failure scenarios and evaluate them using MIIM-based cascade simulation on the IEEE 118-bus system with a communication network overlay. Coastal scenarios produce substantially larger resilience gaps than the inland case, with the Extreme Coastal Severe Weather scenario reducing post-cascade operability to 17.6 percent.
The results show that aggregate outage statistics alone underestimate coastal risk, as cross-layer cascade propagation amplifies geographic damage in ways revealed only through interdependency-aware simulation.
2026-05-05T19:16:43Z
Yoneke Graham, Gelila Webster, Tina Tran, Sohini Roy

http://arxiv.org/abs/2605.04212v1
BOIN Designs for Dose Escalation With Selected Dose Combinations in Oncology Phase I Trials
2026-05-05T18:53:18Z
In phase I dose escalation studies for dual-agent combinations, at least one drug often has an established monotherapy dose. Consequently, substantial prior clinical safety data often exist for one or more monotherapies, allowing the study to focus on a subset of selected dose combinations rather than exhaustively evaluating all possible dose combinations for two agents. The Bayesian Optimal Interval (BOIN) design framework is widely recognized for its robust performance and ease of implementation; however, the BOIN for combination design, abbreviated as BOIN-C in this paper, was originally developed to evaluate full combinations and may not be directly applicable to a subset of selected combinations. In this paper, we propose three extensions to the BOIN-C design to address scenarios involving selected dose combinations: (a) BOIN-CS: a generalized BOIN-C design to accommodate any subset of dose combinations. (b) BOIN-CE: exploration of new off-diagonal dose combinations when de-escalating. This option provides additional opportunities to treat patients with dose combinations that have not been administered. (c) BOIN-CB: Bayesian logistic regression model (BLRM)-guided BOIN design, which uses the BLRM to break the tie when two dose combinations have an equal posterior probability of being selected. This can be useful when the dose-toxicity relationship is expected to be reasonably aligned with a logistic relationship.
These study design options are motivated by practical considerations, and their operating characteristics are evaluated through extensive simulations under various scenarios, demonstrating satisfactory performance.
2026-05-05T18:53:18Z
Yuxuan Chen, Haiming Zhou, Keiko Nakajima, Philip He

http://arxiv.org/abs/2601.08116v3
Learning a Stochastic Differential Equation Model of Tropical Cyclone Intensification from Reanalysis and Observational Data
2026-05-05T17:48:08Z
Tropical cyclones (TCs) are among the most consequential weather hazards, yet estimates of their risk are limited by the relatively short historical record. To extend these records, researchers often generate large ensembles of synthetic storms using simplified models of cyclone intensification. Developing such models, however, has traditionally required substantial theoretical effort. Here we explore whether equation-discovery methods, a class of data-driven techniques designed to infer governing equations, can accelerate the process of developing simplified intensification models. Using observational storm data (IBTrACS) together with environmental conditions from reanalysis (ERA5), we learn a compact stochastic differential equation describing tropical cyclone intensity evolution. We focus on TCs because their dynamics are well studied and a hierarchy of reduced-order models exists, enabling direct comparison of the learned model to physically derived counterparts. We find that the learned model simulates synthetic TCs whose intensification statistics and hazard estimates are consistent with observations and competitive with a leading physics-based TC intensification model. Our model also reproduces known nonlinear dynamical behavior of tropical cyclones, including a saddle-node bifurcation as inner-core ventilation is increased. This result shows that equation-discovery approaches, when applied directly to storm intensity, can recover not only realistic statistics but also physically meaningful dynamical structure.
These findings highlight the potential for data-driven methods to complement existing theory and reduced-order models in the study of extreme weather.
2026-01-13T01:11:17Z
Kenneth Gee, Sai Ravela

http://arxiv.org/abs/2602.17043v2
Quantifying the limits of human athletic performance: A Bayesian analysis of elite decathletes
2026-05-05T15:28:25Z
Because the decathlon tests many facets of athleticism, including sprinting, throwing, jumping, and endurance, many consider it to be the ultimate test of athletic ability. On this view, estimating the maximal decathlon score and understanding what it would take to achieve that score provides insight into the upper limits of human athletic potential. To this end, we develop a Bayesian composition model for forecasting how individual decathletes perform in each of the 10 decathlon events over time. Besides capturing potential non-linear temporal trends in performance, our model carefully captures the dependence between performance in an event and all preceding events. Using our model, we can simulate and evaluate the distribution of the maximal possible scores and identify profiles of decathletes who could realistically attain scores approaching this limit.
2026-02-19T03:28:12Z
Paul-Hieu V. Nguyen, James M. Smoliga, Benton Lindaman, Sameer K. Deshpande

http://arxiv.org/abs/2605.03795v1
Graph Convolutional Support Vector Regression for Robust Spatiotemporal Forecasting of Urban Air Pollution
2026-05-05T14:24:49Z
Urban air quality forecasting is challenging because pollutant concentrations are nonlinear, nonstationary, spatiotemporally dependent, and often affected by anomalous observations caused by traffic congestion, industrial emissions, and seasonal meteorological variability. This study proposes a Graph Convolutional Support Vector Regression (GCSVR) framework for robust spatiotemporal forecasting of urban air pollution.
The model combines graph convolutional learning to capture inter-station spatial dependence with support vector regression to model nonlinear temporal dynamics while reducing sensitivity to outlier observations. The proposed framework is evaluated using air quality records from 37 monitoring stations in Delhi and 18 stations in Mumbai, representing inland and coastal metropolitan environments in India. Forecasting performance is assessed across multiple horizons and compared with established temporal and spatiotemporal benchmarks. The results show that GCSVR consistently improves predictive accuracy and maintains stable performance across seasons and outlier-prone pollution episodes. Statistical testing further confirms the reliability of the proposed approach across the two cities. Finally, conformal prediction is integrated with GCSVR to generate calibrated prediction intervals, enhancing its practical value for uncertainty-aware air quality monitoring and public health decision-making.
2026-05-05T14:24:49Z
Nourin Jahan, Madhurima Panja, Muhammed Navas T, Tanujit Chakraborty

http://arxiv.org/abs/2508.14936v3
Can synthetic data reproduce real-world findings in epidemiology? A replication study using adversarial random forests
2026-05-05T06:01:43Z
Synthetic data holds substantial potential to address practical challenges in epidemiology due to restricted data access and privacy concerns. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies for synthetic data often fail to directly reflect statistical utility and measure privacy risks sufficiently. Against this background, a critical underexplored question is whether synthetic data can reliably reproduce key findings from epidemiological research while preserving privacy. We propose adversarial random forests (ARF) as an efficient and convenient method for synthesizing tabular epidemiological data.
To evaluate its performance, we replicated statistical analyses from six epidemiological publications covering blood pressure, anthropometry, myocardial infarction, accelerometry, loneliness, and diabetes, from the German National Cohort (NAKO Gesundheitsstudie), the Bremen STEMI Registry U45 Study, and the Guelph Family Health Study. We further assessed how dataset dimensionality and variable complexity affect the quality of synthetic data, and contextualized ARF's performance by comparison with commonly used tabular data synthesizers in terms of utility, privacy, generalisation, and runtime. Across all replicated studies, results on ARF-generated synthetic data consistently aligned with original findings. Even for datasets with relatively low sample size-to-dimensionality ratios, replication outcomes closely matched the original results across descriptive and inferential analyses. Reduced dimensionality and variable complexity further enhanced synthesis quality. ARF demonstrated favourable performance regarding utility, privacy preservation, and generalisation relative to other synthesizers and superior computational efficiency.
2025-08-19T22:51:40Z
Jan Kapar, Kathrin Günther, Lori Ann Vallis, Klaus Berger, Nadine Binder, Hermann Brenner, Stefanie Castell, Beate Fischer, Volker Harth, Bernd Holleczek, Timm Intemann, Till Ittermann, André Karch, Thomas Keil, Lilian Krist, Berit Lange, Michael F. Leitzmann, Katharina Nimptsch, Nadia Obi, Iris Pigeot, Tobias Pischon, Tamara Schikowski, Börge Schmidt, Carsten Oliver Schmidt, Anja M. Sedlmair, Justine Tanoey, Harm Wienbergen, Andreas Wienke, Claudia Wigmann, Marvin N. Wright

http://arxiv.org/abs/2605.03240v1
On Model-Based Clustering With Entropic Optimal Transport
2026-05-05T00:17:27Z
We develop a new methodology for model-based clustering. Optimizing the log-likelihood provides a principled statistical framework for clustering, with solutions found via the EM algorithm.
However, because the log-likelihood is nonconvex, only convergence to stationary points can be guaranteed, and practitioners often use multiple starting points in the hope that one will converge to the global solution. We consider a new loss function based on entropic optimal transport that shares the same global optimum as the log-likelihood but has a much better-behaved landscape, thereby avoiding spurious local-optima configurations that are pervasive with the log-likelihood. Similar to the EM algorithm for the log-likelihood, this new loss can be optimized by the Sinkhorn-EM algorithm, which we show converges at a rate comparable to that of EM. By analyzing extensive numerical experiments and two real-world applications, image segmentation in C. elegans microscopy and clustering in spatial transcriptomics, we show that this new loss outperforms log-likelihood optimization, indicating that it represents a valuable clustering methodology for practitioners.
2026-05-05T00:17:27Z
This paper extends and substantially revises the preliminary results in [arXiv:2006.16548]
Gonzalo Mena

http://arxiv.org/abs/2605.16319v1
Forecasting Medium-Horizon Alzheimer's Disease Progression: Residual Gap-Aware Transformers for 24-Month CDR-SB Change from ADNI Clinical and Biomarker Histories
2026-05-04T23:05:28Z
Medium-horizon Alzheimer's disease progression prediction is difficult because future clinical scores can remain tied to baseline severity, while biomarker histories are irregular and incompletely observed. We develop an anchor-based analysis of 24-month Clinical Dementia Rating Sum of Boxes (CDR-SB) change using harmonized Alzheimer's Disease Neuroimaging Initiative (ADNI) tables. Each labeled sample is anchored at a mild cognitive impairment visit, uses only clinical and biomarker history observed at or before that anchor, and defines the response as CDR-SB at the future visit closest to 24 months within an 18--30 month window minus anchor CDR-SB.
The analytic cohort contains 2,600 labeled anchors from 858 participants and 7,276 longitudinal rows. We propose a residual gap-aware transformer that combines a mixed-effects statistical reference with transformer-based residual learning from pre-anchor clinical and biomarker histories. The model uses participant-level random intercepts in the mixed-effects reference, observation-level triplet tokenization for irregular histories, and a learned nonnegative time-gap penalty inside self-attention. We compare the proposed model with a Bayesian-information-criterion-selected linear mixed-effects baseline, GRU-D, and STraTS under repeated participant-level train--test splits. Across five participant-level random seeds, the proposed model achieves the best mean test performance across all reported metrics, reducing MSE by 13.1% and increasing prediction--observation correlation by 26.4% relative to the mixed-effects baseline. It also improves over both GRU-D and STraTS in mean error and correlation. These results show that statistical anchoring and gap-aware residual learning provide a useful structure for medium-horizon Alzheimer's disease progression prediction.
2026-05-04T23:05:28Z
Preprint; includes appendix, 4 figures, and 6 tables
Ran Tong, Tong Wang, Lanruo Wang, Yin Ni

http://arxiv.org/abs/2605.03193v1
Evaluating the probative value of forensic gait analysis evidence using empirical data
2026-05-04T22:22:39Z
Forensic gait analysis can aid the investigation of crimes through comparing features of gait captured in video footage. Modelling the probative value of gait evidence requires an understanding of the variation of features of gait between individuals in the population and within the same individuals. We address this question using a previously described population dataset and newly collected datasets with repeated observations of the same individuals on separate occasions.
In addition to exploring the level of variability, correlation between features of gait, and the effect of demographic factors, we developed a likelihood ratio model through recoding features of gait as dichotomous variables and dimension reduction using PCA. High correlations between some features were observed, confirming that they should not contribute independently to the weight of evidence. The likelihood ratio model produced misleading likelihood ratios in less than 10% of the comparisons using the first four principal components. However, the risk increases when within-individual variability is mis-specified. Therefore, while the current model provides assistance to the judgement of gait experts, human expertise is indispensable to decide whether or not the difference in walking and/or recording conditions between the reference and questioned footage could have caused any observed differences in the features of gait. We discuss future directions in understanding the sources of the variability and improving statistical modelling, and note the need to consider carefully how to select the relevant population for model fitting.
2026-05-04T22:22:39Z
Ruoyun Hui, Amy L Wilson, Colin Aitken, Ivan Birch, Nadia Asgeirsdottir, Graham Jackson

http://arxiv.org/abs/2605.03105v1
Pose Tracking with a Foundation Pose Model and an Ensemble Directional Kalman Filter
2026-05-04T19:45:23Z
This paper introduces the ensemble directional Kalman filter (EnDKF), an ensemble-based Kalman filtering approach for pose tracking that jointly estimates an object's position and attitude using ideas from directional statistics. The EnDKF integrates a unit-quaternion attitude representation to move beyond canonical Kalman filter mean and covariance assumptions that poorly capture directional uncertainty.
Experiments on a synthetic constant-velocity constant-angular-velocity system and a digital-twin head-tracking scenario using the FoundationPose algorithm demonstrate a significant reduction in error compared with using the measurements alone.
2026-05-04T19:45:23Z
Tianlu Lu, Asif Sijan, Thomas Noh, Huaijin Chen, Andrey A. Popov