Evaluating the role of correlation among markers in prediction models

2026-06-01T10:51:27Z

Different methods have been employed to estimate models maximizing the area under the receiver operating characteristic curve (ROC-AUC). Once a model is developed, integrating novel biomarkers may improve its diagnostic ability. However, the discrimination improvement from adding a new biomarker is not always evident, even if the marker itself has good discriminatory power. The sign and magnitude of correlations between biomarkers may impact model performance. In this paper, we assess the effect of such correlations on the discrimination ability of predictive models. Under multivariate normality, we derive an expression for the maximum AUC as a function of the correlations between markers, illustrated graphically using surfaces. Logarithmic folded bivariate normal and Gamma simulations address skewed data cases. Additionally, AUC improvement was assessed by combining 1934 blood lipid metabolites determined by liquid chromatography in 44 pancreatic cancer cases and 38 controls from the PanGenMic Study. Our results show that negative correlations consistently maximize the combined AUC, offering the greatest improvements when markers have equal predictive ability, while positive correlations yield the least favorable results. Negative correlations remain optimal for markers with differing abilities, though positive correlations show slight benefits. Simulations with skewed distributions confirm these trends, emphasizing the role of asymmetry in marker selection. Real-world analysis of serum lipid-derived metabolites for detecting pancreatic ductal adenocarcinoma (PDAC) reinforces the influence of correlations on AUC optimization. These findings suggest that the sign and magnitude of inter-biomarker correlations should be considered when incorporating new markers into predictive algorithms.
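
The dependence of the combined AUC on the correlation can be illustrated with the classical binormal result for the best linear combination of markers. The sketch below is not the expression derived in the paper, which the abstract does not reproduce. It assumes two markers with unit variances and the same correlation rho in cases and controls, with standardized mean differences d1 and d2 (all three names are illustrative). Under those assumptions the maximum AUC is Phi(sqrt(M2 / 2)), where M2 = (d1^2 + d2^2 - 2 rho d1 d2) / (1 - rho^2) is the squared Mahalanobis distance between the two group means.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def max_auc_two_markers(d1: float, d2: float, rho: float) -> float:
    """Best linear-combination AUC for two binormal markers.

    Assumes (for illustration only) unit variances and a common
    correlation rho in both groups; d1 and d2 are the standardized
    mean differences between cases and controls.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Squared Mahalanobis distance between the two group means.
    m2 = (d1 * d1 + d2 * d2 - 2.0 * rho * d1 * d2) / (1.0 - rho * rho)
    return normal_cdf(math.sqrt(m2 / 2.0))


if __name__ == "__main__":
    # Two equally informative markers under three correlation values.
    for rho in (-0.5, 0.0, 0.5):
        print(rho, round(max_auc_two_markers(1.0, 1.0, rho), 4))
```

With two equally informative markers, this formula gives a higher AUC for negative rho than for zero or positive rho, in line with the pattern the abstract reports for the equal-ability case.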

ICCDesign: An R Package for the Design and Analysis of ICC-Based Reliability Studies with Continuous Responses

2026-06-01T10:49:38Z

The intraclass correlation coefficient (ICC) is among the most widely used statistics in reliability research, playing a central role in medical measurement, psychological assessment, and behavioral science. However, practical application of ICC faces two major obstacles. First, ICC can be organized into multiple forms under the McGraw and Wong (1996) framework -- including six widely reported standard forms and four additional design combinations -- and researchers must select the appropriate form based on their study design, yet existing guidelines are not always operationalized in software interfaces. Second, available R tools are highly fragmented: sample size calculation, ICC estimation with confidence intervals, and reliability evaluation are distributed across separate packages, compelling researchers to switch between tools and increasing the risk of analytical errors. This paper introduces the ICCDesign package, designed specifically to provide an integrated workflow for ICC-based reliability studies with continuous responses, assuming one continuous rating per subject-rater cell. The package integrates four core functionalities: (1) point estimation, ANOVA-based confidence intervals, and implemented hypothesis tests for supported ICC design combinations following the McGraw and Wong (1996) framework, with a built-in four-step decision framework guiding users toward an appropriate ICC form; (2) sample size planning based on Zou's (2012) closed-form formulas, supporting two planning modes and an inverse assurance calculation; (3) automated reliability evaluation based on Koo and Li (2016) criteria, with an uncertainty notification when the confidence interval spans the 0.75 good-reliability threshold; and (4) an interactive Shiny web application covering the main analysis and planning functionalities. ICCDesign is available from GitHub at https://github.com/KlariZhang/ICCDesign.
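
To make the estimation and evaluation steps concrete, here is a minimal Python sketch of the one-way random-effects, single-rating ICC from McGraw and Wong (1996), followed by a Koo and Li (2016) style label. It does not use or mirror the ICCDesign API. The function names, the handling of values exactly on a threshold, and the simple interval check are assumptions made for illustration.

```python
from typing import List


def icc_one_way(ratings: List[List[float]]) -> float:
    """One-way random-effects, single-rating ICC.

    ratings[i][j] is the rating of subject i by rater j, with exactly
    one rating per subject-rater cell.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k table with n >= 2 and k >= 2")
    row_means = [sum(row) / k for row in ratings]
    grand_mean = sum(row_means) / n
    # Between-subject and within-subject mean squares from one-way ANOVA.
    msr = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    denom = msr + (k - 1) * msw
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (msr - msw) / denom


def koo_li_label(icc: float) -> str:
    """Label following Koo and Li (2016); boundary handling is a choice made here."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


def spans_good_threshold(lower: float, upper: float, threshold: float = 0.75) -> bool:
    """True when a confidence interval straddles the good-reliability threshold."""
    return lower < threshold < upper
```

For example, `koo_li_label(icc_one_way([[1, 2], [3, 4], [5, 6]]))` labels a small three-subject, two-rater table. A real analysis would also need the confidence interval and the choice of ICC form that the package is described as handling.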

Convex Distance Operator Transport: A Convex and Geometry-Preserving Formulation

2026-06-01T10:38:09Z

We introduce Convex Distance Operator Transport (CDOT), the first convex optimal transport framework that aligns distributions across heterogeneous domains by jointly preserving feature correspondence and intrinsic geometric structure. Specifically, CDOT employs an operator-based regularization that aligns aggregated distance structures by introducing distance and conditional expectation operators. Consequently, the proposed regularization improves the robustness to local geometric variations. We further prove that the resulting CDOT discrepancy is a valid pseudometric on the space of attributed compact metric-measure spaces. In addition, we characterize the relationship between CDOT and Gromov--Wasserstein (GW) through a new notion of dispersion gap, formally elucidating the geometric source of non-convexity in GW compared to the convexity of CDOT. In the finite-sample regime, we derive a non-asymptotic risk bound decomposed into optimization and statistical errors, establishing risk consistency under a globally convergent Frank--Wolfe algorithm. Experiments on synthetic point clouds, brain connectomes, and graph classification benchmarks demonstrate better performance than existing methods, with stable and reliable behavior in practice.

Simultaneous estimation of the effective reproduction number and the time series of daily infections: Application to Covid-19

2026-06-01T10:21:21Z

The time-varying effective reproduction number is an important parameter for communication and policy decisions during an epidemic. In this paper, we present new statistical methods for estimating the reproduction number based on the popular model of Cori et al. (2013), which defines the effective reproduction number based on self-exciting dynamics of new infections. Such a model is conceptually simple and less susceptible to misspecifications than more complicated multi-compartment models. However, statistical inference is challenging, and the previous literature has relied on proxy data and/or a two-step approach in which the number of infections is first estimated. In contrast, we present a coherent Bayesian method that approximates the joint posterior of daily new infections and reproduction numbers using a novel Markov chain Monte Carlo (MCMC) algorithm. Comparing our method to the state-of-the-art three-step estimation procedure of Huisman et al. (2022), both using daily confirmed cases from Switzerland in the Covid-19 epidemic and simulated data, we find that our method is more accurate in terms of point estimates and uncertainty quantification, especially near the beginning and end of an observation period.
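
The self-exciting dynamics referred to above are usually written as a renewal equation: the expected number of new infections on day t equals R_t times a weighted sum of past infections, with weights given by the generation-interval distribution. The Python sketch below illustrates only that relationship and a naive ratio estimate of R_t when infections are observed directly. It is not the Bayesian MCMC method of the paper, and the function names and the weights in the usage note are hypothetical.

```python
from typing import List, Optional, Sequence


def infection_pressure(infections: Sequence[float], w: Sequence[float], t: int) -> float:
    """Weighted sum of past infections; w[s - 1] is the weight at lag s days."""
    max_lag = min(t, len(w))
    return sum(w[s - 1] * infections[t - s] for s in range(1, max_lag + 1))


def simulate_expected_infections(
    r: Sequence[float], w: Sequence[float], seed_infections: float
) -> List[float]:
    """Noise-free renewal recursion: I_t = R_t * sum_s w_s * I_{t-s}."""
    infections = [float(seed_infections)]
    for t in range(1, len(r)):
        infections.append(r[t] * infection_pressure(infections, w, t))
    return infections


def naive_reproduction_numbers(
    infections: Sequence[float], w: Sequence[float]
) -> List[Optional[float]]:
    """Ratio estimate R_t = I_t / pressure_t; None where it is undefined."""
    estimates: List[Optional[float]] = [None]  # no past infections at t = 0
    for t in range(1, len(infections)):
        pressure = infection_pressure(infections, w, t)
        estimates.append(infections[t] / pressure if pressure > 0 else None)
    return estimates
```

With illustrative weights `w = [0.2, 0.5, 0.3]` and a reproduction number that drops from 1.5 to 0.8, applying `naive_reproduction_numbers` to the output of `simulate_expected_infections` recovers the input values exactly, because no noise or reporting delay is involved. In practice infections are not observed directly, which is the difficulty the joint posterior approach is meant to address.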

PliableBVS: A flexible Bayesian variable selection method for modeling interactions with mandatory modifying variables

2026-06-01T10:07:53Z

High-dimensional interaction models are useful for studying, for example, how a large set of variables of interest, such as gene expression or other omics features, interact with a smaller set of modifying variables, such as clinical covariates. In this context, the pliable lasso has recently been proposed as an efficient method for screening large numbers of potential interaction terms under an asymmetric weak hierarchical constraint. In this work, we extend this framework by introducing PliableBVS, a Bayesian variable selection approach that preserves the hierarchical structure of the pliable lasso while inducing sparsity through spike-and-slab priors. The proposed model combines the continuous shrinkage effect of Bayesian lasso with a hierarchical spike-and-slab prior formulation that has two layers of decision variables: one governing the inclusion of main effects and another controlling the inclusion of interaction effects which is conditional on the inclusion of the corresponding main effects. This structure enables simultaneous selection of high-dimensional main and interaction effects within a coherent probabilistic framework. In simulation studies the proposed method outperforms the original pliable lasso in identifying active main and interaction effects, reducing false discoveries, and improving prediction accuracy in most scenarios. Applications with data from a labor onset study and a preeclampsia study demonstrate that PliableBVS selects biologically meaningful features and interactions.

Testing for Single-Population Ancestry in the Admixture Model

2026-06-01T09:47:59Z

The Admixture Model describes genetic marker data by representing each individual's genome as a mixture of contributions from $K$ ancestral populations, with the individual admixture vector summarizing the corresponding ancestry proportions. In population and forensic genetics, a key question is whether an individual's genome supports a predominantly single-ancestry interpretation or whether an admixed interpretation is more appropriate. We propose a statistical test for single-population ancestry in the supervised Admixture Model, where ancestral allele frequencies are treated as known. The test assesses whether the largest admixture component exceeds a practitioner-chosen dominance threshold, giving precise meaning to the notion of a sufficiently strong single-population contribution. To calibrate the test, we develop a constrained parametric bootstrap procedure that generates data under a null-constrained maximum likelihood estimator, accounting for the constrained hypothesis structure, the marker-wise heterogeneity and small sample sizes. Under standard regularity conditions, we prove that the proposed test has asymptotic level $α$ and is consistent, ensuring control of false single-ancestry declarations while reliably detecting dominant ancestry components. Simulation studies demonstrate good finite-sample performance across different numbers of ancestral populations, marker-panel sizes, dominance thresholds, and allele-frequency distributions. We further illustrate the practical utility of the method using data from the 1000 Genomes Project. The proposed framework delivers interpretable, threshold-based ancestry assessment with rigorous error control, and extends constrained bootstrap methodology to the independent but non-identically distributed setting of genetic marker data.

Return-to-Baseline Testing via Empirically Calibrated e-processes

2026-06-01T09:22:38Z

We consider the problem of detecting a Return to Baseline (RtB) in high-frequency monitoring data preceding and following an intervention, where the aim is to identify the time at which the data-generating distribution realigns with its pre-intervention distribution. We propose a sequential, distribution-free testing procedure that does not rely on specifying a null model and provides anytime-valid error control. The method relies on ideas from universal inference to define a discrepancy measure that is aggregated into a non-negative super-martingale, and is then empirically calibrated to form an e-process. The calibration is performed using the baseline data, and is thus subject-specific. We establish finite-sample bounds for the calibration error (under a flexible non-parametric assumption), discuss the impact of tuning parameters and computational complexity, and illustrate through simulations and a clinical case study that the procedure accurately detects RtB from monitoring data.

Spatial Capture-Recapture With Penalized Regression Splines to Flexibly Model Wildlife Density and Distribution

2026-06-01T09:00:41Z

Spatial capture-recapture models are routinely used to estimate the abundance and distribution of wild animal populations and involve a latent spatial point process of animal activity centres that describes the spatial distribution of individuals. While traditional spatial capture-recapture models use a Poisson process, the assumption of conditional independence between points is often violated in practice due to factors not included in the point process, such as social clustering, territoriality, or preferential selection of habitat due to unobserved covariates. Log-Gaussian Cox processes are commonly used in spatial statistics to overcome weaknesses of Poisson processes, but methods to fit them within spatial capture-recapture do not currently exist. Here, we present a spatial capture-recapture framework that allows for the use of penalized regression splines to describe the activity centre distribution, with model fitting via a Laplace-approximate penalized marginal maximum likelihood approach. Our method approximates the use of a log-Gaussian Cox process for activity centres, and allows flexible modelling of nonlinear effects of covariates on density. We illustrate the use of our method with a simulation study and two case studies. We demonstrate that, while population size estimates of traditional approaches are robust to density model misspecification, our approach substantially improves the estimation of spatial animal distributions.

Estimating conditional Mann-Whitney effects using pseudo-observation-based regression

2026-06-01T08:27:46Z

The Mann-Whitney effect is an effect measure for the order of two sample-specific outcome variables. It has the interpretation of a probability and also a connection to the area under the ROC curve. In the literature it has been considered for both ordinal and right-censored time-to-event outcomes. For both cases, the present paper introduces a distribution-free regression model that relates the Mann-Whitney effect to a linear combination of covariates. To fit the model, we develop a pseudo-observation-based procedure yielding consistent and asymptotically normal coefficient estimates. In addition, we propose bootstrap-based hypothesis tests to infer the effects of the covariates on the Mann-Whitney effect. A simulation study on the small-sample behavior of the proposed method demonstrates that the novel hypothesis tests keep up with the z-test of a Cox regression model. The new methods are used to analyze progression-free survival in breast cancer patients enrolled in the randomized phase III SUCCESS-A trial.
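
For two uncensored samples without covariates, the Mann-Whitney effect is commonly defined as P(X < Y) + 0.5 P(X = Y), and its empirical version coincides with the empirical AUC. The sketch below computes only this unconditional estimate by comparing all pairs. It does not implement the pseudo-observation regression or any handling of censoring, and the function name is illustrative.

```python
from typing import Sequence


def mann_whitney_effect(x: Sequence[float], y: Sequence[float]) -> float:
    """Empirical P(X < Y) + 0.5 * P(X = Y) over all (x_i, y_j) pairs."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    score = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                score += 1.0  # pair ordered in favour of y
            elif xi == yj:
                score += 0.5  # ties count one half
    return score / (len(x) * len(y))
```

A value of 0.5 indicates no tendency for either sample to produce larger outcomes. Counting ties as one half keeps the measure meaningful for ordinal data, where ties are frequent.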

A longitudinal Bayesian framework for estimating causal dose-response relationships

2026-06-01T08:23:36Z

Existing causal methods for time-varying exposure and time-varying confounding focus on estimating the average causal effect of a time-varying binary treatment on an end-of-study outcome, offering limited tools for characterizing marginal causal dose-response relationships under continuous exposures. We propose a scalable, nonparametric Bayesian framework for estimating marginal longitudinal causal dose-response functions with repeated outcome measurements. Our approach targets the average potential outcome at any fixed dose level and accommodates time-varying confounding through the generalized propensity score. The proposed approach embeds a Dirichlet process specification within a generalized estimating equations structure, capturing temporal correlation while making minimal assumptions about the functional form of the continuous exposure. We apply the proposed methods to monthly metro ridership and COVID-19 case data from major international cities, identifying causal relationships and the dose-response patterns between higher ridership and increased case counts.

A Uniform Improvement of the Benjamini-Hochberg Procedure using e-Closure

2026-06-01T08:06:18Z

This paper presents closed BH, a uniform improvement of the False Discovery Rate controlling method of Benjamini and Hochberg (BH). Closed BH is valid under the same assumption of Positive Regression Dependency on a Subset (PRDS) as BH. As a uniform improvement, closed BH never rejects fewer hypotheses than BH, but it may reject quite a few more. An increase in power is observed especially when the number of false null hypotheses is large. The novel method is constructed using the e-Closure principle, a recently derived general principle for multiple testing.
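
For reference, the baseline that closed BH improves upon is the classical BH step-up procedure: sort the m p-values, find the largest rank k with p_(k) <= k * alpha / m, and reject the k hypotheses with the smallest p-values. The Python sketch below implements only this classical procedure, not closed BH or the e-Closure construction, whose details are not given in the abstract. The function name is illustrative.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[int]:
    """Return the indices of hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    # Indices ordered from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank that meets its threshold.
        if p_values[idx] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])
```

Because the rule is step-up, a small p-value that misses its own threshold can still be rejected when a larger one meets its threshold. With p-values `[0.02, 0.03, 0.035, 0.04]` and alpha 0.05, the smallest value exceeds 0.0125, but the largest meets 0.05, so all four hypotheses are rejected.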

B-MASTER: Scalable Bayesian Multivariate Regression for Master Predictor Discovery in Colorectal Cancer Microbiome-Metabolite Profiles

2026-06-01T07:48:56Z

Motivation: The gut microbiome shapes cancer therapy response through its influence on host metabolism. While prior studies examine pairwise associations between individual genera and metabolites, there is limited methodology for identifying microbial genera that systematically regulate the overall metabolome. Scalable statistical tools are needed to uncover such system-level 'master predictors' in high-dimensional microbiome-metabolome data. Results: We introduce B-MASTER, a scalable Bayesian multivariate regression framework combining L1 sparsity and L2 group shrinkage to identify essential cross-metabolite regulators. A Gibbs sampler enables near-linear computational scaling, supporting models with millions of parameters. The method is supported by theoretical guarantees, including posterior contraction and selection consistency. Analysis of colorectal cancer microbiome-metabolome data reveals key microbial genera that govern global and cancer-associated metabolite patterns, highlighting system-level regulatory structure. Availability: The B-MASTER code, including demonstration scripts, is available at https://github.com/priyamdas2/B-MASTER. An archived snapshot of the code corresponding to this manuscript is available on Zenodo with DOI: 10.5281/zenodo.20484958.

LoopPerm-CPD: A Robust Loop Permutation Framework for Automatic Multiple Change-Point Detection in Longitudinal Data

2026-06-01T07:14:48Z

Human viral challenge studies, in which participants are deliberately inoculated with influenza strains such as H1N1 or H3N2 and monitored through longitudinal transcriptomic profiling before and after inoculation, are critical for characterizing dynamic biological immune responses to viral infection. A key analytical goal in such settings is to detect critical transition times, or change points, at which an underlying trajectory shifts direction or rate, indicating events such as the onset of an immune response or recovery. However, change-point detection in these longitudinal data is fundamentally challenging because observations are often sparse and irregularly spaced, sample sizes are small, outliers are common, and the number of change points is unknown in advance. To address these challenges, we propose LoopPerm-CPD, a robust change-point detection approach with a built-in loop permutation procedure for automatic multiple change-point detection. The method evaluates candidate slope change points and assesses their significance using within-subject circular permutation combined with binary segmentation, jointly estimating both the number and locations of change points. The accompanying R package, LoopPerm-CPD, implements this framework and flexibly accommodates generalized least squares, quantile regression, and quantile rank-score statistics for different types of longitudinal outcomes. The proposed approach is evaluated through simulations, demonstrating Type I error control and improved power compared with competing methods. Applied to real data, the framework identifies interpretable transition points in multiple human respiratory viral inoculation studies. Together, these results establish LoopPerm-CPD and its companion software as a robust and user-friendly tool for change-point detection in complex human longitudinal cohort data.

A theory of generalised coordinates for stochastic differential equations

2026-06-01T06:48:56Z

Stochastic differential equations are ubiquitous modelling tools in physics and the sciences. In most modelling scenarios, random fluctuations driving dynamics or motion have some non-trivial temporal correlation structure, which renders the SDE non-Markovian; a phenomenon commonly known as "colored" noise. Thus, an important objective is to develop effective tools for mathematically and numerically studying (possibly non-Markovian) SDEs. In this report, we formalise a mathematical theory for analysing and numerically studying SDEs based on so-called 'generalised coordinates of motion'. Like the theory of rough paths, we analyse SDEs pathwise for any given realisation of the noise, not solely probabilistically. Like the established theory of Markovian realisation, we realise non-Markovian SDEs as a Markov process in an extended space. Unlike the established theory of Markovian realisation however, the Markovian realisations here are accurate on short timescales and may be exact globally in time, when flows and fluctuations are analytic. This theory is exact for SDEs with analytic flows and fluctuations, and is approximate when flows and fluctuations are differentiable. It provides useful analysis tools, which we employ to solve linear SDEs with analytic fluctuations. It may also be useful for studying rougher SDEs, as these may be identified as the limit of smoother ones. This theory supplies effective, computationally straightforward methods for simulation, filtering and control of SDEs; amongst others, we re-derive generalised Bayesian filtering, a state-of-the-art method for time-series analysis. Looking forward, this report suggests that generalised coordinates have far-reaching applications throughout stochastic differential equations.

Sequential Bootstrap for Out-of-Bag Error Estimation: A 100-Seed Replication Study and Variance-Structure Analysis

2026-06-01T06:15:20Z

Out-of-Bag (OOB) estimation is the standard internal diagnostic for bootstrap-aggregated tree ensembles. Under the classical multinomial bootstrap, the number of distinct training observations in each replicate, $U_b$, is itself random, but its contribution to OOB-based variability has rarely been isolated empirically. We use Sequential Bootstrap (SB) -- a resampling scheme that holds $U_b$ at a fixed target $k_n = \lfloor 0.632 n\rfloor$ -- as a controlled perturbation of the bootstrap mechanism, and ask whether stabilizing $U_b$ produces any measurable change in OOB-based diagnostics. We reproduce Breiman's five OOB experimental families on twelve synthetic and real datasets, but unlike the three-seed presentation common in this literature, we run 100 independent random seeds with 50 internal replications per seed, enabling formal paired statistical comparison (Wilcoxon signed-rank, paired-$t$, Pitman--Morgan variance test). We report three findings. First, OOB means are essentially insensitive to stabilization of $U_b$: of 57 (experiment, dataset, metric) cells under 100 seeds, only 6 reach $p<0.05$ on the paired mean comparison, and 4 of those 6 point in the opposite direction from what a 3-seed reading would suggest. Second, a narrow but reproducible effect survives at the variance level: SB reduces the cross-seed standard deviation of node-level classification diagnostics on real datasets while slightly increasing it on synthetic ones (permutation $p=0.026$); the Vehicle dataset exhibits a 21% cross-seed sd reduction (Pitman--Morgan $p=0.017$). Third, several directional claims that appear stable across three seeds flip sign under 100-seed replication, illustrating the cost of underpowered replication protocols. We therefore treat SB as a diagnostic tool for probing the distinct-sample-count term in the variance of OOB estimators, not as an alternative to the classical bootstrap.
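
The contrast between the two resampling schemes can be sketched in a few lines of Python. Under the classical bootstrap, n draws with replacement give a random number of distinct observations with expectation n(1 - (1 - 1/n)^n), roughly 0.632 n. The sequential variant below keeps drawing until exactly k_n distinct observations have appeared. The stopping rule used here (stop as soon as the k_n-th distinct index is drawn) is an assumption made for illustration and may differ in detail from the scheme used in the study; all function names are hypothetical.

```python
import random
from typing import List, Set


def classical_bootstrap(n: int, rng: random.Random) -> List[int]:
    """n draws with replacement; the number of distinct indices is random."""
    return [rng.randrange(n) for _ in range(n)]


def sequential_bootstrap(n: int, rng: random.Random) -> List[int]:
    """Draw with replacement until floor(0.632 * n) distinct indices are seen."""
    target = (632 * n) // 1000  # integer arithmetic avoids float rounding
    sample: List[int] = []
    seen: Set[int] = set()
    while len(seen) < target:
        i = rng.randrange(n)
        sample.append(i)
        seen.add(i)
    return sample


def out_of_bag(n: int, sample: List[int]) -> List[int]:
    """Indices never drawn in the sample, used for OOB evaluation."""
    in_bag = set(sample)
    return [i for i in range(n) if i not in in_bag]


def expected_distinct(n: int) -> float:
    """Expected distinct count under the classical bootstrap."""
    return n * (1.0 - (1.0 - 1.0 / n) ** n)
```

In this variant every sequential replicate has the same out-of-bag set size, n - k_n, whereas the classical out-of-bag size varies from replicate to replicate. That variation is the source of variability the study isolates.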