https://arxiv.org/api/pCkW1sX9/IMhFeYti6+ATLZOsuw2026-06-10T13:23:00Z168622515http://arxiv.org/abs/2508.10207v1Examining the Association between Estimated Prevalence and Diagnostic Test Accuracy using Directed Acyclic Graphs2025-08-13T21:35:12ZThere have been reports of correlation between estimates of prevalence and test accuracy across studies included in diagnostic meta-analyses. It has been hypothesized that this unexpected association arises because of certain biases commonly found in diagnostic accuracy studies. A theoretical explanation has not been studied systematically. In this work, we introduce directed acyclic graphs to illustrate common structures of bias in diagnostic test accuracy studies and to define the resulting data-generating mechanism behind a diagnostic meta-analysis. Using simulation studies, we examine how these common biases can produce a correlation between estimates of prevalence and index test accuracy and what factors influence its magnitude and direction. We found that an association arises either in the absence of a perfect reference test or in the presence of a covariate that simultaneously causes spectrum effect and is associated with the prevalence (confounding). We also show that the association between prevalence and accuracy can be removed by appropriate statistical methods. In the risk of bias evaluation in diagnostic meta-analyses, an observed association between estimates of prevalence and accuracy should be explored to understand its source and to adjust for latent or observed variables if possible.2025-08-13T21:35:12ZYang LuRobert PlattNandini Dendukurihttp://arxiv.org/abs/2508.09563v1Performances and Correlations of Centrality Measures in Complex Networks2025-08-13T07:30:09ZNumerous centrality measures have been proposed to evaluate the importance of nodes in networks, yet comparative analyses of these measures remain limited. 
Based on 80 real-world networks, we conducted an empirical analysis of 16 representative centrality measures. In general, there exists a moderate to high level of correlation between node rankings derived from different measures. We identified two distinct communities: one comprising 4 measures and the other 7 measures. Measures within the same community exhibit exceptionally strong pairwise correlations. In contrast, the remaining five measures display markedly different behaviors, showing weak correlations not only among themselves but also with the other measures. This suggests that each of these five measures likely captures unique properties of node importance. Further analysis reveals that the distribution patterns of the most influential nodes identified by different centrality measures vary significantly: some measures tend to cluster influential nodes closely together, while others disperse them across distant locations within the network. Using the epidemic spreading model, we found that LocalRank, Subgraph Centrality, and Katz Centrality perform best in identifying the most influential single node, whereas Leverage Centrality, Collective Influence, and Cycle Ratio excel in identifying the most influential node sets. Overall, measures that identify influential nodes with larger topological distances between them tend to perform better in detecting influential node sets. 
Interestingly, despite being applied to the same dynamical process, when using two seemingly similar tasks, identifying influential nodes versus identifying influential node sets, to rank the performances of the 16 centrality measures, the resulting rankings are negatively correlated.2025-08-13T07:30:09ZYilin BiXinshan JiaoTao Zhouhttp://arxiv.org/abs/2508.09328v1Dynamic Survival Prediction using Longitudinal Images based on Transformer2025-08-12T20:31:55ZSurvival analysis utilizing multiple longitudinal medical images plays a pivotal role in the early detection and prognosis of diseases by providing insight beyond single-image evaluations. However, current methodologies often inadequately utilize censored data, overlook correlations among longitudinal images measured over multiple time points, and lack interpretability. We introduce SurLonFormer, a novel Transformer-based neural network that integrates longitudinal medical imaging with structured data for survival prediction. Our architecture comprises three key components: a Vision Encoder for extracting spatial features, a Sequence Encoder for aggregating temporal information, and a Survival Encoder based on the Cox proportional hazards model. This framework effectively incorporates censored data, addresses scalability issues, and enhances interpretability through occlusion sensitivity analysis and dynamic survival prediction. Extensive simulations and a real-world application in Alzheimer's disease analysis demonstrate that SurLonFormer achieves superior predictive performance and successfully identifies disease-related imaging biomarkers.2025-08-12T20:31:55ZBingfan LiuHaolun ShiJiguo Caohttp://arxiv.org/abs/2508.07864v1A Review and Classification of Model Uncertainty2025-08-11T11:26:44ZModel uncertainty is a crucial issue in statistics, econometrics and machine learning, yet its definition remains ambiguous and is subject to various interpretations in the literature. 
So far, there has not been a universally accepted definition of model uncertainty. We review different understandings of model uncertainty and categorize them into three distinct types: uncertainty about the true model, model selection uncertainty, and model selection instability. We further offer interpretations and examples for a better illustration of these definitions. We also discuss the potential consequences of neglecting model uncertainty in the process of conducting statistical inference, and provide effective solutions to these problems. Our aim is to help researchers better understand the concept of model uncertainty and obtain valid statistical inference results on the premise of its existence.2025-08-11T11:26:44ZGuangyuan CuiYuting WeiXinyu Zhanghttp://arxiv.org/abs/2508.07474v1The p-value from a fuzzy point of view2025-08-10T20:17:44ZThe purpose of the paper is to provide a new way of seeing the p-value in terms of a fuzzy membership function. According to the ASA's statement, we aim at removing the arbitrary choice of the significance level and at demonstrating that the p-value can be profitably interpreted from a fuzzy point of view. In particular, we propose a new class of membership functions by viewing the p-value as a function of the null hypothesis and we apply our approach to compare two independent binomial proportions. The proposed membership functions can also be employed to assess the precision of confidence intervals and the power of statistical tests.2025-08-10T20:17:44ZPiero Quattohttp://arxiv.org/abs/2312.13619v2The many routes to the ubiquitous Bradley-Terry model2025-08-07T10:50:50ZThe rating of items based on pairwise comparisons has been a topic of statistical investigation for many decades. Numerous approaches have been proposed. One of the best known is the Bradley-Terry model. This paper seeks to assemble and explain a variety of motivations for its use. 
Some are based on principles or on maximising an objective function; others are derived from well-known statistical models, or stylised game scenarios. They include both examples well-known in the literature as well as what are believed to be novel presentations.2023-12-21T07:14:19ZTo be published in Statistical ScienceIan HamiltonNick TawnDavid Firthhttp://arxiv.org/abs/2406.10612v2Producing treatment hierarchies in network meta-analysis using probabilistic models and treatment-choice criteria2025-08-06T14:05:25ZA key output of network meta-analysis (NMA) is the relative ranking of treatments; nevertheless, it has attracted substantial criticism. Existing ranking methods often lack clear interpretability and fail to adequately account for uncertainty, over-emphasizing small differences in treatment effects. We propose a novel framework to estimate treatment hierarchies in NMA using a probabilistic model, focusing on a clinically relevant treatment-choice criterion (TCC). Initially, we formulate a mathematical expression to define a TCC based on smallest worthwhile differences (SWD), converting NMA relative treatment effects into treatment preference format. This data is then synthesized using a probabilistic ranking model, assigning each treatment a latent 'ability' parameter, representing its propensity to yield clinically important and beneficial true treatment effects relative to the rest of the treatments in the network. Parameter estimation relies on the maximum likelihood theory, with standard errors derived asymptotically from Fisher's information matrix. To facilitate the use of our methods, we launched the R package mtrank. We applied our method to two clinical datasets: one comparing 18 antidepressants for major depression and another comparing 6 antihypertensives for the incidence of diabetes. Our approach provided robust, interpretable treatment hierarchies that account for a concrete TCC. 
We further examined the agreement between the proposed method and existing ranking metrics in 153 published networks, concluding that the degree of agreement depends on the precision of the NMA estimates. Our framework offers a valuable alternative for NMA treatment ranking, mitigating over-interpretation of minor differences. This enables more reliable and clinically meaningful treatment hierarchies.2024-06-15T12:26:09ZTheodoros EvrenoglouAdriani NikolakopoulouGuido SchwarzerGerta RückerAnna Chaimani10.1017/rsm.2026.10071http://arxiv.org/abs/2508.04080v1GeoSR: Cognitive-Agentic Framework for Probing Geospatial Knowledge Boundaries via Iterative Self-Refinement2025-08-06T04:45:34ZRecent studies have extended the application of large language models (LLMs) to geographic problems, revealing surprising geospatial competence even without explicit spatial supervision. However, LLMs still face challenges in spatial consistency, multi-hop reasoning, and geographic bias. To address these issues, we propose GeoSR, a self-refining agentic reasoning framework that embeds core geographic principles -- most notably Tobler's First Law of Geography -- into an iterative prediction loop. In GeoSR, the reasoning process is decomposed into three collaborating agents: (1) a variable-selection agent that selects relevant covariates from the same location; (2) a point-selection agent that chooses reference predictions at nearby locations generated by the LLM in previous rounds; and (3) a refine agent that coordinates the iterative refinement process by evaluating prediction quality and triggering further rounds when necessary. This agentic loop progressively improves prediction quality by leveraging both spatial dependencies and inter-variable relationships. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction. 
Experimental results show consistent improvements over standard prompting strategies, demonstrating that incorporating geostatistical priors and spatially structured reasoning into LLMs leads to more accurate and equitable geospatial predictions. The code of GeoSR is available at https://github.com/JinfanTang/GeoSR.2025-08-06T04:45:34Z16 pages, 9 figuresJinfan TangKunming WuRuifeng GongxieYuya HeYuankai Wuhttp://arxiv.org/abs/2508.03952v1A Blueprint to Design Curriculum and Pedagogy for Introductory Data Science2025-08-05T22:38:28ZAs the demand for jobs in data science increases, so does the demand for universities to develop and facilitate modernized data science curricula to train students for these positions. Yet, the development of these courses remains challenging, especially at the introductory level. To help instructors to meet this demand, we present a flexible blueprint that supports the development of a modernized introductory data science curriculum. This blueprint is narrated through the lens and experience in teaching the introductory data science course at \university{}. This is a large course that serves both STEM and non-STEM majors and includes the incorporation and facilitation of technologies such as R, RStudio, Quarto, Git, and GitHub. We identify and provide discussion around common challenges in teaching a modernized introductory data science course, detail a learning model for students to grow their understanding of data science concepts, and provide reproducible materials to help empower teachers to adopt and adapt such curriculum at their universities.2025-08-05T22:38:28Z33 pages, 4 figuresElijah MeyerMine Çetinkaya-Rundelhttp://arxiv.org/abs/2508.02966v1Measuring Human Leadership Skills with Artificially Intelligent Agents2025-08-05T00:05:54ZWe show that the ability to lead groups of humans is predicted by leadership skill with Artificially Intelligent agents. 
In a large pre-registered lab experiment, human leaders worked with AI agents to solve problems. Their performance on this 'AI leadership test' was strongly correlated with their causal impact on human teams, which we estimate by repeatedly randomly assigning leaders to groups of human followers and measuring team performance. Successful leaders of both humans and AI agents ask more questions and engage in more conversational turn-taking; they score higher on measures of social intelligence, fluid intelligence, and decision-making skill, but do not differ in gender, age, ethnicity or education. Our findings indicate that AI agents can be effective proxies for human participants in social experiments, which greatly simplifies the measurement of leadership and teamwork skills.2025-08-05T00:05:54ZBen WeidmannYixian XuDavid J. Deminghttp://arxiv.org/abs/2405.10453v2Expected Points Above Average: A Novel NBA Player Metric Based on Bayesian Hierarchical Modeling2025-08-01T20:45:17ZIn this paper, we propose two novel basketball metrics: ``expected points'' for team-based comparisons and ``expected points above average (EPAA)'' as a player-evaluation tool. Established within the Bayesian hierarchical model framework, teams and players are clustered based on their shooting propensities and abilities using posterior predictive distributions. We illustrate the concepts for the top 100 shot takers over the last decade and offer our metric as an additional metric for evaluating players. We compare our metrics to two traditional NBA player evaluation metrics: player efficiency rating and box plus/minus. Finally, we develop a Shiny web application that allows interested readers to make additional team and player comparisons.2024-05-16T21:40:42ZBenjamin WilliamsErin M. 
SchliepBailey FosdickRyan Elmorehttp://arxiv.org/abs/2310.13826v3A p-value for Process Tracing and other N=1 Studies2025-07-31T21:53:51ZWe introduce a method for calculating \(p\)-values to test causal hypotheses in qualitative research \emph{a la} process tracing. As in an experiment, our \(p\)-value tells us how often one would make the same or more compelling observations favoring one theory while entertaining a rival theory. We adapt Fisher's (1935) randomization-based urn model to the reality of qualitative researchers, who cannot randomize history, but can make observations about historical processes. Our test includes a method of sensitivity analysis which allows researchers to account for the possibility of observation bias, as well as a framework for representing the varying strength of individual pieces of evidence, altogether informing the robustness of qualitative causal inference. We provide simulations and replications of previously published work to illustrate how to execute our test using any type of qualitative data about events that took place within one case. This approach adds to the pluralistic turn in the use of probability theory in theory-testing process tracing by offering a simple model with provable conservatism, while relying on few assumptions the consequences of which can be directly assessed.2023-10-20T21:47:24ZMatias LopezJake Bowershttp://arxiv.org/abs/2507.23106v1Efficient inference of dynamic gene regulatory networks using discrete penalty2025-07-30T21:13:26ZGene regulatory networks (GRNs) orchestrate cellular decision making and survival strategies. Inferring the structure of these networks from high-dimensional transcriptomics data is a central challenge in systems biology. Traditional approaches to GRN inference, such as the graphical lasso and its joint extensions, rely on an $\ell_1$ penalty to induce sparsity but can bias network recovery and require extensive hyperparameter tuning. 
Here, we present a scalable framework for the joint inference of dynamic GRNs using a discrete $\ell_0$ penalty, enabling direct and unbiased control over network sparsity. Leveraging recent algorithmic advances, we efficiently solve the resulting mixed-integer optimization problem for populations structured as arbitrary tree hypergraphs, accommodating both continuous and categorical distinctions among biological samples. After validating our method on synthetic benchmarks, we apply it to single-cell and spatial transcriptomics data from glioblastoma (GBM) patient tumors. Our approach reconstructs gene networks across tumor clusters, maps network rewiring along hypoxia gradients, and reveals niche-specific differences between primary and recurrent tumors. By providing a robust and interpretable tool for GRN inference in complex tissues, our work facilitates high-resolution dissection of tumor heterogeneity and adaptation, with broad applicability to emerging large-scale transcriptomic datasets.2025-07-30T21:13:26ZVisweswaran RavikumarAaresh BhathenaWajd N Al-HolouSalar FattahiArvind Raohttp://arxiv.org/abs/2409.05764v2Jackknife Empirical Likelihood Ratio Test for Cauchy Distribution2025-07-30T15:21:38ZHeavy-tailed distributions, such as the Cauchy distribution, are acknowledged for providing more accurate models for financial returns, as the normal distribution is deemed insufficient for capturing the significant fluctuations observed in real-world assets. Data sets characterized by outlier sensitivity are critically important in diverse areas, including finance, economics, telecommunications, and signal processing. This article addresses a goodness-of-fit test for the Cauchy distribution. The proposed test utilizes empirical likelihood methods, including the jackknife empirical likelihood (JEL) and adjusted jackknife empirical likelihood (AJEL). Extensive Monte Carlo simulation studies are conducted to evaluate the finite sample performance of the proposed test. 
The application of the proposed test is illustrated through the analysis of two real data sets.2024-09-09T16:27:22Z15 pagesGanesh Vishnu AvhadAnanya LahiriSudheesh K. Kattumannilhttp://arxiv.org/abs/2507.22679v1An alternative method of adjusting for multiple comparison in medical research2025-07-30T13:42:36ZBackground Most methods of adjusting for multiplicity focus primarily on controlling type I errors and rarely consider type II errors. We propose a new method that considers controlling for false-positive findings while ensuring sufficient statistical power.
Methods We proposed a new multiple-testing correction method, the Beta-exponential Adjustment (BEA), that takes statistical power into account when controlling for type I errors while also considering the probability of type II errors. We conducted simulation studies to evaluate the performance characteristics of multiple testing correction procedures. We calculated sensitivity, specificity, and power separately for different sample sizes and numbers of biomarkers and compared them with the Bonferroni, Holm, and Benjamini-Hochberg (BH) correction methods.
Results The results demonstrated that our proposed BEA correction method exhibited the highest sensitivity across different sample sizes and numbers of biomarkers (e.g., a sensitivity of 0.8 for BEA versus 0.62 for BH with a sample size of 1000, 1000 tested biomarkers, and a positive rate of 30%). Across different sample sizes and numbers of biomarkers, the BEA correction method demonstrated specificity comparable to that of traditional methods. Moreover, we observed that the BEA correction had higher statistical power than the other methods when the outcome was relatively rare.
Conclusion We proposed the BEA multiple correction method to adjust for multiple comparisons while considering statistical power. The BEA method demonstrated higher sensitivity, comparable specificity, and higher statistical power compared with traditional correction methods under different conditions. The BEA correction method can be an alternative to traditional methods of adjusting for multiplicity, especially in studies with small sample sizes, rare outcomes, or a substantial number of biomarkers.2025-07-30T13:42:36Z20 pages, 5 figuresJiale LiZimu Wei
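
For readers unfamiliar with the three comparator procedures named in the abstract above (Bonferroni, Holm, and Benjamini-Hochberg), the following is a minimal sketch of how each turns raw p-values into adjusted p-values. It does not implement BEA, whose formula is not given in the abstract. The function names and the example p-values are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: textbook adjusted p-values for the three
# comparator procedures named in the abstract. BEA is NOT implemented here.
# Function names and the example p-values are hypothetical.

def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Step-down Holm adjustment (controls the family-wise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, value)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Step-up BH adjustment (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        i = order[rank]
        value = min(1.0, pvals[i] * m / (rank + 1))
        running_min = min(running_min, value)  # enforce monotonicity
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    example = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
    for name, fn in [("Bonferroni", bonferroni),
                     ("Holm", holm),
                     ("BH", benjamini_hochberg)]:
        adj = fn(example)
        rejected = sum(a <= 0.05 for a in adj)
        print(name, [round(a, 4) for a in adj], "rejections at 0.05:", rejected)
```

On the made-up input, the two family-wise procedures reject fewer hypotheses than BH at the 0.05 level, which is the type I versus type II trade-off the abstract is concerned with.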