Integrative learning of individualized treatment rules from multiple studies with partially overlapping treatments

2026-06-02T15:35:05Z

An individualized treatment rule (ITR) tailors treatments to a patient's specific characteristics. However, randomized controlled trials (RCTs) are often underpowered to detect the treatment effect heterogeneity needed for reliable ITR estimation. To address this limitation, there is growing interest in leveraging information from multiple studies to improve statistical power and support individualized decision-making. A key challenge in this context is that available RCTs may not evaluate the same set of treatments. In this paper, we propose an integrative learning framework that synthesizes evidence across multiple RCTs that share a common comparator but differ in their alternative treatment arms. Our method integrates information through a regularized weighted misclassification risk function and adaptively determines the contribution of each study to the ITRs of the others. We rigorously study the excess risk of the resulting estimator. Simulation studies demonstrate that the proposed approaches improve the estimation of both value and benefit functions. We illustrate the utility of our methodology using data from two landmark studies of major depressive disorder: the Establishing Moderators and Biosignatures of Antidepressant Response in Clinical Care study and the International Study to Predict Optimized Treatment in Depression study, both of which include a selective serotonin reuptake inhibitor as a common treatment arm. We find that the separate learning method outperforms one-size-fits-all methods, and our integrative methods further improve performance.

FlowSN: Neural Simulation-Based Inference under Realistic Selection Effects applied to Supernova Cosmology

2026-06-02T15:24:01Z

We present FlowSN, a statistical framework using simulation-based inference (SBI) with normalising flows to account for selection effects in observational astronomy. Failure to account for selection effects can lead to biased inference on global parameters. An example is Malmquist bias, where detection limits result in a sample skewed towards brighter objects. In Type Ia supernova (SN Ia) cosmology, these selection effects can systematically shift the inferred posterior distributions of cosmological parameters, necessitating the development of robust statistical frameworks to account for the biases. SBI enables us to implicitly learn probability distributions that are analytically intractable to calculate. In this work, we introduce a novel approach that employs a normalising flow to learn the non-analytic selected SN likelihood for a given survey from forward simulations, independent of the assumed cosmological model. The resulting likelihood approximation is incorporated into a hierarchical Bayesian framework and posterior sampling is performed using Hamiltonian Monte Carlo to obtain constraints on cosmological parameters conditioned on the observed data. The modular learnt likelihood approximation can be reused without retraining to evaluate different cosmological models, providing a key advantage over other SBI approaches. We demonstrate the performance of this methodology by training and testing the SBI technique using realistic LSST-like SNANA simulations for the first time. Our FlowSN approach yields accurate posterior estimates on cosmological parameters, including the dark energy equation of state $w_0$, that are an order of magnitude less biased than those obtained with conventional techniques and also exhibit improved frequentist calibration.
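
The Malmquist bias mentioned above can be illustrated with a toy simulation. This is only a sketch under assumed numbers (the mean absolute magnitude, scatter, distance-modulus range, and magnitude limit are all hypothetical and not taken from FlowSN or SNANA): objects are kept only if their apparent magnitude is brighter than a detection limit, and the mean absolute magnitude of the selected sample is compared with that of the full population.

```python
import random
import statistics


def simulate_malmquist(n=20000, true_mean=-19.3, scatter=0.15,
                       mag_limit=24.0, seed=0):
    """Return (population mean, selected-sample mean) of absolute magnitude.

    All numerical defaults are illustrative assumptions, not survey values.
    """
    rng = random.Random(seed)
    population, selected = [], []
    for _ in range(n):
        abs_mag = rng.gauss(true_mean, scatter)   # intrinsic brightness
        dist_mod = rng.uniform(42.0, 44.0)        # toy distance-modulus range
        app_mag = abs_mag + dist_mod              # what the survey observes
        population.append(abs_mag)
        if app_mag < mag_limit:                   # smaller magnitude = brighter
            selected.append(abs_mag)
    return statistics.mean(population), statistics.mean(selected)


pop_mean, sel_mean = simulate_malmquist()
print(f"population mean M = {pop_mean:.4f}, selected mean M = {sel_mean:.4f}")
```

Because magnitudes decrease with brightness, the selected mean comes out more negative than the population mean, which is the skew towards brighter objects that an inference framework has to correct for.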

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

2026-06-02T04:39:14Z

Text-to-image models now generate graphic design at production scale, yet their supervision still comes primarily from photo-style preference datasets with a single overall verdict per comparison. Designers evaluate designs along several distinct axes (e.g., typography, layout, color harmony) that a single preference label collapses. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.), a multi-dimensional preference dataset in which two disjoint cohorts of five professional designers each ranked outputs from four current text-to-image models across nine criteria along with per-image hallucination flags. We pair the dataset with two contributions. First, a criterion-agnostic signal-validation framework based on Kendall's $τ$, majority-vote probability, and Condorcet cycles against exact iid-uniform nulls; the analysis reveals significant but moderate designer agreement, with every TASTE criterion rejecting the random-rater null. Second, we benchmark preference models on TASTE and find that off-the-shelf VLM judges and dedicated T2I scorers fail to reach majority agreement with the designer panel, while a small MLP head trained directly on TASTE substantially narrows the gap to the single-rater ceiling, setting a baseline for future TASTE-trained preference models.
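
To make the agreement statistics concrete, the following is a minimal sketch of Kendall's τ between two tie-free rankings, a check for a Condorcet 3-cycle in the majority relation, and an exact null distribution obtained by enumerating all permutations. The function names and the dictionary representation (item to rank position) are assumptions made for illustration; the paper's actual framework may differ in its details.

```python
from itertools import combinations, permutations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings without ties (dict: item -> position)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


def has_condorcet_cycle(rankings):
    """True if strict majority preference contains a cycle a > b > c > a."""
    items = list(rankings[0])

    def beats(x, y):
        wins = sum(1 for r in rankings if r[x] < r[y])
        return wins > len(rankings) / 2

    return any(beats(a, b) and beats(b, c) and beats(c, a)
               for a, b, c in permutations(items, 3))


def exact_null_taus(n_items):
    """Tau of every possible ranking against a fixed one (uniform null)."""
    reference = {i: i for i in range(n_items)}
    return [kendall_tau(reference, dict(zip(range(n_items), perm)))
            for perm in permutations(range(n_items))]


null = exact_null_taus(4)   # 4 items, matching four models per comparison
print(len(null), sum(null) / len(null))
```

With four items there are only 24 possible rankings, so the null distribution can be enumerated exactly rather than approximated by sampling.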

TransGAN-WT: A Feature Extraction and Interactive Learning-Based Anomaly Detection Model for Wind Turbine Time Series Data

2026-06-02T03:57:08Z

As wind farms grow in scale and number, the daily operation and maintenance costs of wind turbines are rising. To reduce these costs and improve the reliability of wind turbines and system operation before catastrophic failures occur, it is crucial to monitor equipment operating status and detect failures at an early stage. Using working-condition data to assess abnormal operating states of wind turbines is therefore of great practical significance. However, existing anomaly detection methods can neither model relationships effectively in data containing large amounts of redundant information nor make reasonable use of valuable anomaly data. For this reason, this paper proposes an anomaly detection model that fuses a Transformer and a generative adversarial network. First, it reduces the missed-detection rate for minor-deviation anomalies by amplifying the reconstruction error. Second, it uses autoregressive inference to extract multimodal features, enhancing the stability and generalization ability of training. Finally, a temporal feature extraction module is constructed to promote interactive learning between features at different time scales and to effectively reduce temporal redundancy. Experiments on real wind turbine generator (WTG) datasets show that TransGAN-WT achieves an average F1 score of 96.10% across multiple wind turbine datasets, which is 5.84% and 2.89% higher than other state-of-the-art baseline methods. It also achieves a false positive rate (FPR) of 0.06%, and the Wilcoxon signed-rank test verifies that its performance improvement over the state-of-the-art baselines is statistically significant, effectively supporting the stable operation of wind turbines.

Marginalised Poisson Hurdle Model for Cross-Sectional Count Data with Excess Zeros

2026-06-02T01:57:05Z

Count data with excess zeros arise frequently in health economics and epidemiology. The standard Poisson Hurdle Model (PHM) parametrises the underlying Poisson rate directly, so its count-component coefficients are log-rate ratios rather than log-ratios of the marginal mean. Consequently, the incidence density ratio (IDR) from the PHM is neither exact nor constant across covariate profiles, complicating applied reporting. We propose the Marginalised Poisson Hurdle Model (MPHM), which reparametrises the count component so that the coefficient vector beta directly governs the marginal mean E[Y]. A nonlinear connector equation links the structural Poisson rate to this parametrised mean. We prove existence and uniqueness of the connector solution, develop a vectorised Brent's-method solver, derive the score equations and block-diagonal Fisher information, establish asymptotic normality, and prove that exp(beta) is exactly constant across all covariate values. A simulation study with n in {100, 250, 500, 1000}, zero proportion pi in {0.2, 0.4, 0.6, 0.8}, and R = 200 replications confirms consistency, near-zero bias, and 95% Wald coverage of 0.905-0.975 across all 16 scenarios. Applied to the NMES1988 physician visit data (n = 4,406), the MPHM yields IDR = 1.163 (95% CI: 1.150-1.177) per additional chronic condition - an exact, population-wide effect not derivable from the PHM. The MPHM resolves the non-constant IDR problem by directly parametrising E[Y]. The resulting IDR holds for every individual and the whole population without further marginalisation, substantially simplifying the reporting of covariate effects in health utilisation research.

Computing the final epidemic size distributions of a multi-type Galton--Watson process

2026-06-02T01:26:51Z

The Galton--Watson process (GWP) is a discrete-time branching process model that provides a powerful tool for analyzing epidemic data and estimating key epidemiological parameters such as the basic reproduction number. When used with surveillance-based cluster size data, the GWP can also elicit information about the extent of transmission heterogeneity, even when each transmission process is not directly observable. When cluster size distribution data are available, the parameters that govern the transmission can be statistically inferred by using the probability mass function that corresponds to the observed cluster size data. For multi-type GWPs, however, real-world applications remain limited, possibly because of the absence of conceptually and practically straightforward approaches for deriving the closed-form solution of the final size distribution. In the present study, we propose a framework for computing the final size distribution of multi-type GWPs, using a method for the choice of the Cauchy integral contour. We provide examples of how our framework can be applied to both simulated data and real-world data of Middle East respiratory syndrome, and discuss potential pitfalls surrounding the identifiability of parameters for statistical inference when using likelihoods that are not conditioned on extinction.
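
As a point of reference for the contour-integral idea, here is a minimal single-type sketch; it is not the multi-type framework of the paper. It assumes Poisson offspring with mean R and one initial case. The final size then satisfies P(Z = n) = (1/n) [s^{n−1}] G(s)^n with G(s) = exp(R(s − 1)), and the coefficient is extracted by applying the trapezoid rule to a Cauchy integral on a circle. The radius and number of nodes are illustrative choices. In this special case the result can be checked against the closed-form Borel distribution.

```python
import cmath
import math


def coefficient_by_contour(f, k, radius=1.0, n_nodes=256):
    """Approximate the s^k Taylor coefficient of f via a circular contour."""
    total = 0j
    for j in range(n_nodes):
        theta = 2.0 * math.pi * j / n_nodes
        s = radius * cmath.exp(1j * theta)
        total += f(s) * cmath.exp(-1j * k * theta)
    return (total / (n_nodes * radius ** k)).real


def final_size_pmf(n, r0, radius=1.0, n_nodes=256):
    """P(final size = n) for a single-type GWP with Poisson(r0) offspring."""
    def pgf_power(s):
        return cmath.exp(n * r0 * (s - 1.0))   # G(s)^n for Poisson offspring
    return coefficient_by_contour(pgf_power, n - 1, radius, n_nodes) / n


def borel_pmf(n, r0):
    """Closed-form check, valid only for the Poisson single-type case."""
    return math.exp(-r0 * n) * (r0 * n) ** (n - 1) / math.factorial(n)


for n in range(1, 6):
    print(n, final_size_pmf(n, 0.8), borel_pmf(n, 0.8))
```

The radius controls the trade-off between aliasing and round-off error in the numerical integral, which is one reason the choice of contour matters once closed forms are unavailable.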

Enhanced Renewable Energy Forecasting using Context-Aware Conformal Prediction

2026-06-01T21:54:40Z

Artificial intelligence (AI) is increasingly used to support renewable energy forecasting and grid operations. As renewable penetration grows, reliable probabilistic forecasting is becoming essential for managing uncertainty and supporting risk-aware operational decision-making. However, these forecasts often suffer from miscalibration due to temporal variability, changing weather conditions, and heterogeneous operating regimes. In many real-world settings, renewable energy forecasts are provided by external sources, vendors, or independently trained systems, making retraining infeasible because of limited model access or computational constraints. This creates a need for efficient and model-agnostic methods that can improve forecast reliability after they are produced. This paper presents Context-Aware Conformal Prediction (CACP), a framework for calibrating renewable energy forecasts. The proposed method relies on a weighting mechanism during the calibration procedure which assigns higher weights to historical observations that are more similar to the target forecasting condition. This enables adaptive prediction intervals that reflect local uncertainty regimes without requiring access to, or retraining of, the underlying forecasting model. Experiments are performed on a large-scale dataset from the National Renewable Energy Laboratory (NREL) day-ahead solar forecasting, covering multiple systems including MISO, ERCOT, and SPP. The results show that CACP improves the reliability-efficiency tradeoff at both site and system levels compared to NREL's base forecasting model and the other conformal prediction baselines. These results suggest that CACP can serve as a practical reliability-enhancement layer for trustworthy AI-enabled renewable energy forecasting and operational decision support.
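
The weighting mechanism described above can be sketched as a similarity-weighted quantile of calibration residuals. This is a simplified illustration, not the CACP algorithm itself: the Gaussian kernel, the scalar context variable, the bandwidth, and the function names are all assumptions, and the sketch omits the extra probability mass that formal weighted conformal prediction assigns to the test point.

```python
import math


def weighted_quantile(scores, weights, level):
    """Smallest score whose cumulative normalised weight reaches `level`."""
    pairs = sorted(zip(scores, weights))
    total = sum(weights)
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight
        if cumulative / total >= level:
            return score
    return pairs[-1][0]


def context_weighted_interval(point_forecast, target_context, calib_contexts,
                              calib_residuals, coverage=0.9, bandwidth=1.0):
    """Interval around a black-box forecast using context-similar history.

    Calibration points whose context is close to the target get more weight
    (Gaussian kernel; an illustrative choice).
    """
    weights = [math.exp(-0.5 * ((c - target_context) / bandwidth) ** 2)
               for c in calib_contexts]
    scores = [abs(r) for r in calib_residuals]
    q = weighted_quantile(scores, weights, coverage)
    return point_forecast - q, point_forecast + q


contexts = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]    # e.g. a cloud-cover index
residuals = [0.5, -1.0, 0.8, 4.0, -5.0, 6.0]    # observed minus forecast
print(context_weighted_interval(100.0, 0.0, contexts, residuals))
print(context_weighted_interval(100.0, 10.0, contexts, residuals))
```

Only the forecasts and past residuals are needed, so the underlying forecasting model is never accessed or retrained; the interval simply widens in contexts where past errors were larger.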

Beyond Empirical Bayes: A Hierarchical Bayesian Approach to Crash Rate Estimation with Missing Traffic Volume

2026-06-01T20:36:56Z

The Empirical Bayes (EB) procedure of Hauer et al. (2002) is the workhorse of highway safety analysis: it combines a Safety Performance Function with observed crash counts to produce shrinkage estimates of segment-level crash rates. EB delivers practicality by holding several quantities fixed at calibration: SPF coefficients, per-type overdispersion, observed ADT, and a fixed exposure exponent. These assumptions strain when ADT is missing on a majority of segments. We present a fully Bayesian hierarchical model that moves beyond EB by relaxing each of these assumptions in a single joint inference. Fit on Ohio's road inventory (408,304 segments, 2.9 million crashes, 2013-2025), the model jointly imputes missing ADT and estimates per-segment crash rates with uncertainty. Posterior predictive checks of an initial fixed-exposure model expose a tail misfit; relaxing the exposure structure to a per-functional-class exposure exponent and an estimated length exponent, in place of a single scalar and a fixed offset, resolves it and improves out-of-sample predictive accuracy (PSIS-LOO $Δ\mathrm{elpd}$ = 9,394, SE 238). Crash count is sublinear in traffic in every class (exposure exponents 0.49-0.70, all $<1$, the safety-in-numbers effect) and sublinear in segment length ($β_{\mathrm{len}} = 0.69$). Partial pooling substantially improves out-of-sample predictive accuracy over complete pooling (PSIS-LOO $Δ\mathrm{elpd}$ = 4,780, SE 225). The Bayesian ADT submodel attains $R^2_{\log} = 0.756$ by encoding county and functional class as hierarchical priors, versus $0.653$ for a LightGBM restricted to the same continuous predictors. The output is a posterior crash rate distribution per segment, replacing the median-by-type point estimates used in our prior risk-aware routing framework.
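
A short worked calculation shows what the reported sublinear exponents imply. It assumes the usual power-law reading of an exposure exponent, with expected crashes proportional to ADT^β (and to length^β_len); the helper names are hypothetical.

```python
def crash_multiplier(exposure_ratio, exponent):
    """Factor on expected crash count when exposure scales by exposure_ratio."""
    return exposure_ratio ** exponent


def per_vehicle_risk_multiplier(traffic_ratio, exponent):
    """Factor on crashes per unit of traffic under the same power law."""
    return traffic_ratio ** (exponent - 1.0)


# Exponent values taken from the abstract: ADT range 0.49-0.70, length 0.69.
for beta in (0.49, 0.70):
    print(f"beta={beta}: doubling ADT multiplies crashes by "
          f"{crash_multiplier(2.0, beta):.3f}, per-vehicle risk by "
          f"{per_vehicle_risk_multiplier(2.0, beta):.3f}")
print(f"doubling segment length multiplies crashes by "
      f"{crash_multiplier(2.0, 0.69):.3f}")
```

Under this reading, doubling traffic raises expected crashes by roughly 40% to 62% rather than 100%, so the risk per vehicle falls, which is the safety-in-numbers effect.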

Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset

2026-06-01T17:57:12Z

Decisions about managing patients on the heart transplant waitlist are currently made by committees of doctors who consider multiple factors, but the process remains largely ad hoc. With the growing volume of longitudinal patient, donor, and organ data collected by the United Network for Organ Sharing (UNOS) since 2018, there is increasing interest in analytical approaches to support clinical decision-making at the time of organ availability. In this study, we benchmark machine learning models that leverage longitudinal waitlist history data for time-dependent, time-to-event modeling of waitlist mortality. We train on 23,807 patient records with 77 variables and evaluate both survival prediction and discrimination at a 1-year horizon. Our best model achieves a C-Index of 0.94 and AUROC of 0.89, significantly outperforming previous models. Key predictors align with known risk factors while also revealing novel associations. Our findings can support urgency assessment and policy refinement in heart transplant decision making.

Probing Minimalist Phase Structure in LLMs: What Universal Dependencies Cannot Represent

2026-06-01T17:53:13Z

Structural probes train on Universal Dependencies (UD), which does not encode formal-syntactic abstractions such as phase boundaries or phase-internal cohesion. Whether large language models (LLMs) encode these remains an open question that UD-based probing cannot answer by construction. We evaluate structural probes on wh-movement stimuli where UD distances are invariant across conditions by design -- any non-zero effect therefore reflects structure beyond UD. The three conditions -- bare small clause, infinitival, and finite -- are ordered by the number of Minimalist Program (MP) phase boundaries the wh-element crosses. Across 13 LLMs from four families, we find a phase-count gradient on a cross-clause pair (12/13 models) and a 13/13 sign asymmetry on a within-clause pair whose UD distance is identical across conditions -- the latter specifically predicted by phase-internal cohesion, an MP abstraction invisible to UD by construction. Activation patching confirms the representations are causally active in 12/13 models. These findings suggest that distributional pretraining can induce representations aligned with formal-syntactic abstractions beyond the reach of annotation-based probing; UD-grounded probes provide a lower bound on syntactic encoding, not an upper bound.

Probabilistic storyline attribution using machine learning

2026-06-01T17:49:59Z

A fundamental goal in climate attribution is to estimate how forced climate change contributes to observed extreme weather events. The storyline attribution method compares an observed weather event, conditional on its atmospheric dynamic state (i.e., atmospheric circulation), in the current, 'factual' climate to an event with very similar circulation conditions in a hypothetical, 'counterfactual' climate. However, physical climate models cannot directly transfer these storyline counterfactuals across different climate forcing states. Statistical and machine learning techniques may overcome this limitation; yet, emulating circulation-conditional extreme events under different climate states is challenging. Here, we demonstrate distributional autoencoders (DAEs) as a versatile method for generating climate counterfactuals. They model the full distribution of spatially resolved European temperature fields conditional on the atmospheric circulation state and the mean global warming level. These distributions allow for deriving meaningful conditional probability ratios, which is a particular advantage of the DAE-based storyline approach. We train DAEs on fully coupled climate model simulations and we evaluate the modelled distributions across different factual and storyline-based counterfactual climate model simulations. In an illustrative case study, we revisit the 2003 European heatwave and we generate counterfactuals for a hypothetical '2003-like European heatwave' using ERA5 circulation, which we hypothesize to occur a quarter century (2028) and a half century (2053) after 2003. The conditional intensity would increase from 29.3 °C in 2003 to 30.3 °C and 32.1 °C in 2028 and 2053, respectively, and conditional probability ratios would be 2.1 and 3.2 when compared to 2003.

AI and physics-based weather forecasting: A comparative study

2026-06-01T17:21:26Z

In the last few years, AI-based models have become the centre of attention in weather forecasting due to their increasing accuracy and efficiency. Pioneering among weather services, ECMWF has developed its Artificial Intelligence Forecasting System (AIFS) model, which was first to provide data-driven ensemble forecasts in June 2024. Since July 2025, the AIFS ensemble model has been operational and runs in parallel with ECMWF's physics-based Integrated Forecasting System (IFS), which is considered the gold standard in weather prediction. The new AIFS model can generate forecasts ten times faster than the classical numerical weather prediction model, while consuming approximately a thousand times less energy. We present the results of our systematic assessment of the performance of the IFS and AIFS models by comparing the accuracy of raw and post-processed medium-range 10-m wind-speed ensemble forecasts generated operationally by the two models for the period between July and November 2025 for more than 9000 synoptic observation stations across the globe. The post-processed case involves the parametric ensemble model output statistics (EMOS) as well as the non-parametric quantile regression (QR) approach to correct any systematic inaccuracies in the raw forecasts. The predictive performance of raw IFS ensemble forecasts proves to be substantially superior to the skill of the raw AIFS predictions for all investigated forecast horizons. As expected, post-processing significantly improves the skill of both IFS and AIFS predictions, and, across most verification metrics, EMOS is superior to QR, especially for short lead times. Compared to the raw ensemble, the differences in skill between the matching IFS and AIFS predictions are substantially decreased by post-processing and are mostly significant at short lead times, when the IFS forecasts outperform their AIFS counterparts.

Optimal sequential two-stage Bayes Factor Design for two-arm clinical Phase II Trials with binary Endpoints

2026-06-01T15:53:21Z

Two-arm phase II clinical trials often benefit from an interim analysis that allows early stopping for futility, but Bayesian calibration of such designs is usually based on computationally intensive Monte Carlo simulation. In this work, a simulation-free methodology is developed to obtain Bayesian optimal two-stage designs in two-arm phase II trials with binary endpoints using Bayes factors as the primary measure of evidence. Building on recent matrix-search methods for fixed-sample two-arm Bayes factor designs and earlier correction formulas for one-arm two-stage designs, the proposed approach derives exact expressions for the operating characteristics of a two-stage two-arm design with a single futility interim. Bayesian power and type-I error are obtained by correcting the corresponding fixed-sample quantities for trajectories that would have been removed by early stopping, yielding a fully numerical calibration procedure that avoids Monte Carlo error entirely. The resulting method searches over admissible interim and final sample sizes to identify the optimal design that satisfies target constraints on Bayesian power, type-I error, and the probability of compelling evidence in favour of the null hypothesis, while minimizing the expected sample size under the null hypothesis. The methodology is illustrated in realistic phase II settings, including a detailed re-analysis of the riociguat trial in systemic sclerosis. Overall, the approach extends simulation-free Bayes factor design methodology to the practically important setting of two-arm two-stage phase II trials and provides a transparent basis for Bayesian design calibration and sensitivity analysis.
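
The simulation-free idea can be illustrated for the simpler fixed-sample case: with binary endpoints the outcome space is finite, so operating characteristics can be summed exactly instead of simulated. The sketch below makes several assumptions that may differ from the paper: a two-sided comparison (H0: equal response rates versus H1: independent rates), Beta(1, 1) priors, and an evidence threshold of 3. It does not reproduce the two-stage correction formulas.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bf10(y1, n1, y2, n2, a=1.0, b=1.0):
    """Bayes factor for H1 (independent rates) against H0 (common rate).

    Beta(a, b) priors are assumed on every rate; the binomial coefficients
    cancel between the two marginal likelihoods.
    """
    log_m1 = (log_beta(a + y1, b + n1 - y1) + log_beta(a + y2, b + n2 - y2)
              - 2.0 * log_beta(a, b))
    log_m0 = log_beta(a + y1 + y2, b + n1 + n2 - y1 - y2) - log_beta(a, b)
    return math.exp(log_m1 - log_m0)


def binom_pmf(y, n, p):
    return math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)


def prob_evidence_for_h1(n1, n2, p1, p2, threshold=3.0):
    """Exact probability that BF10 exceeds the threshold at fixed true rates."""
    total = 0.0
    for y1 in range(n1 + 1):
        for y2 in range(n2 + 1):
            if bf10(y1, n1, y2, n2) > threshold:
                total += binom_pmf(y1, n1, p1) * binom_pmf(y2, n2, p2)
    return total


print("power-like quantity:", prob_evidence_for_h1(30, 30, 0.2, 0.6))
print("type-I-error-like quantity:", prob_evidence_for_h1(30, 30, 0.4, 0.4))
```

Since the sums run over every possible outcome pair, the results carry no Monte Carlo error, which is the property a design search over candidate sample sizes can exploit.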

Bayesian Mixed Multidimensional Scaling for Auditory Processing

2026-06-01T15:36:21Z

The human brain distinguishes speech sounds by mapping acoustic signals into a latent perceptual space. This space can be estimated via multidimensional scaling (MDS), preserving the similarity structure in lower dimensions. However, individual and group-level heterogeneity, especially between native and non-native listeners, remains poorly understood. Prior approaches often ignore such variability or cannot capture shared structure, limiting principled comparisons. Moreover, the literature often focuses on latent distances rather than the underlying features themselves. To address these issues, we develop a Bayesian mixed MDS method that accounts for both subject- and group-level heterogeneity, recovers unique, identifiable latent features to facilitate their biological interpretation, and determines the effective dimensionality of the latent space in an automated, data-adaptive manner. Simulations and an auditory neuroscience application demonstrate how these features reconstruct observed distances and vary with individual and language background, revealing novel insights.

Bandwidth selection with a frequency-domain version of the AIC

2026-06-01T14:19:41Z

When it comes to estimating an unknown spectral density as simply and reliably as possible, parametric spectral density estimation using AR models and order selection via AIC is the method of choice. In contrast, no standard method has yet emerged for automatic nonparametric spectral density estimation, and there seems to be little willingness to weigh the advantages and disadvantages of different risk functions and the various methods for estimating them on a case-by-case basis, particularly because it is unclear whether the effort is even worthwhile without concrete prior information about the unknown spectral density. As a result, subjective visual methods are still widely used in practice to determine the appropriate smoothing parameter for a nonparametric estimation. This article aims to encourage the increased use of objective automatic methods by presenting evidence that using what is arguably the simplest and most straightforward frequency-domain version of the AIC for the automatic determination of an appropriate bandwidth enables results that are comparable to those obtained using the standard parametric approach. This evidence is based on both real-world time series and synthetic time series with spectral densities of varying complexity.