https://arxiv.org/api/v/gI01mV9v4OwzVA7UTGmKHIq8I 2026-06-18T17:13:14Z 23571 375 15 http://arxiv.org/abs/2605.23208v1 A Direct Variance Estimation (DiVE) for Meta-Analysis of Median Differences 2026-05-22T03:52:33Z

Meta-analyses of two-group studies that report median differences typically rely on methods that require, in addition to the median difference and sample size, summary measures of dispersion such as quartiles or ranges. Studies that do not report such statistics are often excluded from the meta-analysis. Existing two-stage approaches first estimate the asymptotic variance of the median difference within each study under parametric assumptions, and then combine these study-specific estimates to obtain the pooled median difference and its variance. We propose Direct Variance Estimation (DiVE), a method that directly estimates the variance of the pooled difference using only study-level median differences and their sample sizes. A comprehensive simulation study across a wide range of distributional scenarios shows that DiVE performs comparably to or better than conventional two-stage methods, with clear advantages when the number of studies is small. A re-analysis of published meta-analyses demonstrates that DiVE enables the inclusion of studies lacking dispersion statistics, leading to a more comprehensive and potentially less biased synthesis of evidence.

2026-05-22T03:52:33Z Tadahisa Okuda Masataka Taguri Kenichi Hayashi http://arxiv.org/abs/2605.24056v1 Possession-Level Player Impact in the Pre-Play-by-Play NBA Era: A Video-Reconstructed RAPM Database, 1984--1996 2026-05-22T01:08:28Z

Regularized Adjusted Plus-Minus (RAPM) is the standard framework for estimating individual player impact in basketball. Its application requires possession-level stint data -- records of which five players shared the court for each contiguous sequence of possessions -- a form of data the NBA did not systematically record until the late 1990s. This paper describes the construction, methodology, and validation of the first possession-level player impact database for the pre-play-by-play NBA era, covering the regular seasons from 1984--85 through 1995--96. As of this writing, 2,179 regular-season games have been reconstructed across the twelve published seasons, comprising 435,760 total logged possessions and 1,012 distinct player-seasons. Every game was manually reconstructed from broadcast video: lineup changes were logged at every dead-ball substitution, possessions were tallied directly from footage, and points scored by each lineup were recorded. RAPM is estimated via weighted ridge regression applied to the reconstructed stint data, using the identical mathematical framework applied to modern play-by-play records. We provide a rigorous treatment of the reconstruction protocol, the formal properties of the estimation procedure, uncertainty quantification through posterior credible intervals, a multi-criterion validation framework, and an analysis of sampling properties at partial coverage. The resulting database is the only possession-level individual impact record for this era and provides a foundation for historical analysis that has until now been technically inaccessible.
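A minimal sketch of weighted ridge regression on stint data, to make the estimation step concrete. The encoding is an assumption, not taken from the paper: each stint row has +1 for home players on court, -1 for away players, the response is the home point margin per 100 possessions, and the weight is the number of possessions in the stint. The function names `solve` and `rapm` and the penalty value are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rapm(stints, n_players, lam):
    """Weighted ridge estimate: solve (X'WX + lam*I) beta = X'Wy.

    stints: list of (home_ids, away_ids, possessions, margin_per_100).
    """
    xtwx = [[lam if i == j else 0.0 for j in range(n_players)]
            for i in range(n_players)]
    xtwy = [0.0] * n_players
    for home, away, poss, margin in stints:
        # Assumed design row: +1 for home players, -1 for away players.
        row = {p: 1.0 for p in home}
        row.update({p: -1.0 for p in away})
        for i, xi in row.items():
            xtwy[i] += poss * xi * margin
            for j, xj in row.items():
                xtwx[i][j] += poss * xi * xj
    return solve(xtwx, xtwy)


# Toy example with two players facing each other over 10 possessions.
ratings = rapm([([0], [1], 10, 5.0)], n_players=2, lam=10.0)
```

The ridge penalty `lam` shrinks every rating toward zero, which is what keeps the estimates stable when players frequently share the court and their individual effects are hard to separate.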

2026-05-22T01:08:28Z Justin Jacobs http://arxiv.org/abs/2605.24050v1 More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries 2026-05-21T23:33:31Z

Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow -- by up to 21\% when scaling from a small set of helpful skills to a 202-skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known-helpful skills and loading the full library. Moreover, we decompose the pass rate drop, by conditioning on skill invocation (which skills the agent selects during a trajectory), into two effects: \emph{skill shadowing}, where the agent selects wrong skills more often as the library expands, and \emph{context overhead}, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize how much each contributes to the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the \emph{skill shadowing} effect grows with library size and contributes significantly to the performance degradation, whereas the \emph{context overhead} effect remains small and indistinguishable from zero. This observed asymmetry establishes that skill selection failure, not the enlarged context, is the primary bottleneck when expanding skill libraries.
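The abstract does not give the decomposition itself, so the following is only one possible way to split a pass rate drop by conditioning on whether the right skill was selected. It is an illustrative identity, not necessarily the authors' formulation. All names and numbers are hypothetical: `q_*` is the probability of correct selection, `a_*` the pass rate given correct selection, and `b_*` the pass rate given wrong selection, for the small and full libraries.

```python
def pass_rate(q, a, b):
    """Law of total probability over correct vs. wrong skill selection."""
    return q * a + (1.0 - q) * b


def decompose_drop(q_small, a_small, b_small, q_full, a_full, b_full):
    """Split pass_rate(small) - pass_rate(full) into two additive parts.

    shadowing: change caused by selecting the right skill less often.
    overhead:  change in execution quality with selection held at the
               full-library rate.
    """
    shadowing = (q_small - q_full) * (a_small - b_small)
    overhead = q_full * (a_small - a_full) + (1.0 - q_full) * (b_small - b_full)
    return shadowing, overhead


# Made-up inputs for illustration only.
shadowing, overhead = decompose_drop(0.9, 0.80, 0.20, 0.6, 0.78, 0.20)
drop = pass_rate(0.9, 0.80, 0.20) - pass_rate(0.6, 0.78, 0.20)
```

In this toy setting the two parts sum exactly to the total drop, and most of the drop comes from the selection term, which mirrors the qualitative pattern the abstract reports.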

2026-05-21T23:33:31Z Hongwen Song Vinson Song Vinson Wei http://arxiv.org/abs/2605.23048v1 StanBKT: Rethinking Parameter Estimation in Bayesian Knowledge Tracing 2026-05-21T21:27:10Z

Bayesian Knowledge Tracing (BKT) is a widely used and interpretable student modeling approach in intelligent tutoring systems and educational data mining. However, most implementations rely on expectation-maximization or related optimization methods that yield only point estimates, limiting uncertainty quantification and principled comparisons across learners and conditions. We introduce StanBKT, an open-source Python package for estimating BKT models using Bayesian inference in Stan. StanBKT provides a unified framework supporting Hamiltonian Monte Carlo, variational inference, Pathfinder, and optimization-based estimation while preserving the hidden Markov structure and interpretability of classical BKT. It supports standard, grouped, and hierarchical BKT models, flexible prior specification, posterior predictive inference, and utilities for visualization and diagnostics. We evaluate StanBKT on large-scale observational and controlled educational datasets. On the ASSISTments 2020 dataset, we show that supported inference methods achieve comparable predictive performance while differing in computational efficiency and posterior fidelity. We further demonstrate how posterior inference enables principled comparison of condition-specific parameters in an educational intervention involving perceptual cue manipulations. Results illustrate how uncertainty quantification facilitates more reliable interpretation of differences in learning, forgetting, guessing, and slipping parameters across experimental conditions. Overall, StanBKT extends BKT beyond point estimation by providing a flexible framework for probabilistic student modeling, uncertainty quantification, and hierarchical inference in educational data mining.
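For readers unfamiliar with the hidden Markov structure mentioned above, here is a minimal sketch of the classical BKT update for a single skill. It illustrates the textbook model only and is not the StanBKT API; the function name and parameter values are hypothetical.

```python
def bkt_update(p_know, correct, learn, guess, slip, forget=0.0):
    """One step of classical BKT.

    First condition the mastery probability on the observed response
    (Bayes' rule), then apply the learn/forget transition.
    """
    if correct:
        num = p_know * (1.0 - slip)
        den = num + (1.0 - p_know) * guess
    else:
        num = p_know * slip
        den = num + (1.0 - p_know) * (1.0 - guess)
    posterior = num / den
    return posterior * (1.0 - forget) + (1.0 - posterior) * learn


# Trace mastery across a short response sequence (True = correct answer).
p = 0.5
for response in [True, True, False, True]:
    p = bkt_update(p, response, learn=0.1, guess=0.2, slip=0.1)
```

A point-estimate fit returns single values for `learn`, `guess`, `slip`, and `forget`; a Bayesian fit instead returns a posterior distribution over each, which is what allows uncertainty-aware comparisons across conditions.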

2026-05-21T21:27:10Z 5 figures, 7 tables Siddhartha Pradhan Yanping Pei Morgan Lee Puyuan Zhang Erin Ottmar Adam C. Sales http://arxiv.org/abs/2605.22676v1 Comparison of probabilistic nowcasts and forecasts of SARS-CoV-2 variant proportions made by hierarchical multinomial linear regression models 2026-05-21T16:19:22Z

Nowcasting and forecasting of infectious diseases have become increasingly important since the SARS-CoV-2 pandemic. In particular, methods for modeling the composition of circulating variants at a given time have seen more use in part due to a large increase in the frequency of genomic sequencing conducted as a part of routine surveillance. However, methods must take into account that locations have different amounts of data and sometimes have different trends. We discuss hierarchical multinomial logistic regression (HMLR), a commonly used method for forecasting SARS-CoV-2 variants, which allows for data sharing across locations. We show how it has been used in the literature, and define a class of HMLR models for SARS-CoV-2 variant nowcasting and forecasting. We rigorously test a subset of this class of models using the framework of the US SARS-CoV-2 Variant Nowcast Hub, a collaborative modeling project that launched in 2024. We created two years of weekly predictions based on retrospective datasets, with the prediction dates ranging from Wednesday, August 3, 2022, to Wednesday, August 7, 2024. We tested 12 HMLR models against a baseline model on these datasets. We found that the HMLR models outperformed the baseline in terms of both probabilistic accuracy, as measured by the energy score, and point accuracy, as measured by the Brier score. Overall, we find that HMLR models perform best relative to the baseline model in locations with more data, and more complex HMLR models also show more improvement in those high-data locations; however, there is no one best model across all metrics, and simpler HMLR models perform better in low-data locations. We find that HMLR models perform well in practice for nowcasting and forecasting SARS-CoV-2 variants.
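The two metrics named above can be sketched from their generic definitions. How the Hub applies them (aggregation over dates, locations, and variants) is not specified in the abstract, so the code below assumes the plain multi-category Brier score and a sample-based energy score; the function names and inputs are hypothetical.

```python
import math


def brier_score(probs, observed_index):
    """Multi-category Brier score: squared error against the one-hot outcome."""
    return sum((p - (1.0 if k == observed_index else 0.0)) ** 2
               for k, p in enumerate(probs))


def energy_score(samples, observation):
    """Sample-based energy score: E||X - y|| - 0.5 * E||X - X'||.

    samples: list of equal-length vectors drawn from the forecast.
    Lower is better for both scores.
    """
    m = len(samples)
    to_obs = sum(math.dist(s, observation) for s in samples) / m
    spread = sum(math.dist(a, b) for a in samples for b in samples) / (m * m)
    return to_obs - 0.5 * spread


# Predicted variant proportions for three variants; variant 0 is observed.
bs = brier_score([0.7, 0.2, 0.1], observed_index=0)
es = energy_score([[0.0], [2.0]], [1.0])
```

The energy score rewards forecasts whose samples are both close to the observation and appropriately dispersed, which is why it is used for the full predictive distribution rather than for a single point prediction.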

2026-05-21T16:19:22Z 24 pages, 8 figures Isaac MacArthur Thomas Robacker Evan L. Ray Benjamin W. Rogers Nicholas G. Reich Maryclare Griffin http://arxiv.org/abs/2606.02592v1 Tracking Urban Atmospheric Pollutants using Sentinel-5P Satellite Data 2026-05-21T15:59:04Z

Urban nitrogen dioxide ($NO_2$) is a key indicator of combustion-related air pollution and exhibits strong spatial and temporal variability in cities. This study presents a satellite-based framework for tracking urban $NO_2$ pollution using tropospheric column observations from Sentinel-5P/TROPOMI over Guayas Province, Ecuador. Rather than estimating surface concentrations, the methodology emphasizes robust distributional metrics, including the median and upper-tail percentiles ($P_{90}$, $P_{95}$, and $P_{99}$), to characterize background conditions and localized pollution extremes at the canton scale. Multi-year satellite observations are aggregated annually and analyzed using unsupervised K-means clustering to identify characteristic pollution regimes without predefined thresholds. Results show that highly urbanized cantons consistently exhibit elevated extreme $NO_2$ values and greater variability, while less urbanized areas display lower and more homogeneous patterns. The proposed approach provides an interpretable and scalable tool for urban air-quality assessment in data-scarce regions using satellite observations alone. The implementation is publicly available on GitHub https://hvelesaca.github.io/sentinel-5P-clustering/.
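The pipeline described above (percentile features per canton, then K-means) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the percentile uses linear interpolation, the clustering is basic Lloyd's algorithm seeded with the first `k` points, and the column values are made up.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def features(values):
    """Median plus upper-tail percentiles P90, P95, P99."""
    return [percentile(values, q) for q in (50, 90, 95, 99)]


def kmeans(points, k, iters=50):
    """Lloyd's algorithm; centers are seeded with the first k points (assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical NO2 column values for four cantons (arbitrary units).
cantons = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50],
           [1.1, 2.1, 3.1, 4.1, 5.1], [11, 21, 31, 41, 51]]
labels = kmeans([features(c) for c in cantons], k=2)
```

Because the features include upper-tail percentiles, two cantons with similar medians but different pollution extremes can still land in different clusters.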

2026-05-21T15:59:04Z Alice Gomez-Cantos Henry O. Velesaca http://arxiv.org/abs/2509.05443v3 Multidimensional constructs and moderated linear and nonlinear factor analysis 2026-05-21T15:30:00Z

Multidimensional factor models with moderations on all model parameters have so far been limited to single-factor and two-factor models. This does not align well with existing psychological measures, which are commonly intended to assess 3-5 dimensions of a latent construct. In this paper, I introduce a multidimensional MNLFA model that permits the moderation of item intercepts, loadings, residual variances, factor means, variances, and correlations across three or more latent factors. I describe efforts to implement the model using Bayesian methods through Stan and penalized maximum likelihood approaches to stabilize estimation and detect partial measurement non-invariance while preserving model interpretability. Closed-form analytic gradients of the likelihood are derived, eliminating the need for costly numerical or MCMC-based approximations. I conclude by discussing the theoretical implications of penalization for measurement invariance, computational considerations, and future directions for extending the framework to categorical indicators, longitudinal data, and applied research contexts.

2025-09-05T18:49:54Z 22 pages, 2 figures R. Noah Padgett http://arxiv.org/abs/2605.22352v1 Spatiotemporal dynamics and ecological risk factors of highly pathogenic avian influenza A(H5N1) in Canadian wildlife: A One Health surveillance analysis 2026-05-21T11:41:20Z

Highly pathogenic avian influenza A(H5N1) has expanded geographically and ecologically, affecting wild birds, mammalian wildlife, domestic animals, and humans. Wildlife surveillance provides critical early warning for One Health preparedness, yet national-scale analyses integrating host ecology, spatial patterns, seasonality, viral lineage, and risk factors remain limited. This study analysed Canadian wildlife HPAI A(H5N1) surveillance records from 2022 to 2026 to characterise spatiotemporal dynamics and identify factors associated with detection counts. A retrospective analysis of 2,657 detections across 13 provinces and territories was conducted using descriptive epidemiology, spatial clustering methods, and Negative Binomial mixed models. Detections were predominantly avian, with waterfowl and raptors as the major host groups, while mammals accounted for a smaller but epidemiologically important proportion. Detection burden was highest in 2022, with increased activity in autumn and spring. Ontario, Alberta, and British Columbia were identified as major hotspots, with evidence of local clustering in parts of the Prairie region. Reassortant Eurasian-North American lineages dominated detections and were strongly associated with higher detection counts. Modelling results identified year, season, and lineage as key predictors. These findings support risk-based One Health surveillance prioritising high-burden regions, migration-associated periods, key avian host groups, reassortant viral lineages, and continued monitoring of mammalian wildlife.

2026-05-21T11:41:20Z Hammed Olawale Fatoyinbo Hoyeon Jeong http://arxiv.org/abs/2605.22243v1 Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies 2026-05-21T09:50:14Z

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are challenging to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. While complex machine learning models offer high performance, their "black-box" nature limits the clinical trust, transparency, and interpretability required for decision-making. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve predictive performance of existing interpretable statistical models. The developed framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate the patterns into three recommendation types: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing predictive performance of a baseline (i.e., no interactions or non-linear terms) Cox Proportional Hazards (CPH) model against an augmented CPH incorporating recommendations suggested by our method. The primary analysis predicts the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, including non-linear terms for two features, and including 221 suggested feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and so did calibration (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method also proved effective on two additional public datasets, demonstrating wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve both the process of developing high-dimensional transparent predictive models and their performance.

2026-05-21T09:50:14Z 41 pages, 7 figures Junyu Yan Damian Machlanski Kurt Butler Panagiotis Dimitrakopoulos Ewen M Harrison Bruce Guthrie Sotirios A Tsaftaris http://arxiv.org/abs/2509.24087v3 A penalized distributed lag non-linear Lee-Carter framework for regional weekly mortality forecasting 2026-05-21T09:10:39Z

Accurate forecasts of weekly mortality are essential for public health and the insurance industry. We develop a forecasting framework that extends the Lee-Carter model with age- and region-specific seasonal effects and penalized distributed lag non-linear components that capture the delayed and non-linear effects of heat, cold, and influenza on mortality. The model accommodates overdispersed mortality rates via a negative binomial distribution. We model the temporal dynamics of the latent factors in the model using SARIMA processes and capture cross-regional dependencies through a copula-based approach. Using regional French mortality data (1990-2019), we demonstrate that the proposed framework yields well-calibrated forecast distributions and improves predictive accuracy relative to benchmark models. The results further show substantial heterogeneity in temperature- and influenza-related relative risks between ages and regions. These findings underscore the importance of incorporating exogenous drivers and dependence structures into a weekly mortality forecasting framework.

2025-09-28T21:46:55Z Jens Robben Karim Barigou http://arxiv.org/abs/2509.20083v2 Rethinking player evaluation in sports: Goals above expectation and beyond 2026-05-21T09:08:06Z

A popular quantitative approach to evaluating player performance in sports involves comparing an observed outcome to the expected outcome ignoring player involvement, which is estimated using statistical or machine learning methods. In soccer, for instance, goals above expectation (GAX) of a player measure how often shots of this player led to a goal compared to the model-derived expected outcome of the shots. Typically, sports data analysts rely on flexible machine learning models, which are capable of handling complex nonlinear effects and feature interactions, but fail to provide valid statistical inference due to finite-sample bias and slow convergence rates. In this paper, we close this gap by presenting a framework for player evaluation with metrics derived from differences in actual and expected outcomes using flexible machine learning algorithms, which nonetheless allows for valid frequentist inference. We first show that the commonly used metrics are directly related to Rao's score test in parametric regression models for the expected outcome. Motivated by this finding and recent developments in double machine learning, we then propose the use of residualized versions of the original metrics. For GAX, the residualization step corresponds to an additional regression predicting whether a given player would take the shot under the circumstances described by the features. We further relate metrics in the proposed framework to player-specific effect estimates in interpretable semiparametric regression models, allowing us to infer directional effects, e.g., to determine players that have a positive impact on the outcome. Our primary use case is GAX in soccer. We further apply our framework to evaluate goal-stopping ability of goalkeepers, shooting skill in basketball, quarterback passing skill in American football, and injury-proneness of soccer players.
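A rough sketch of the two quantities discussed above, for one player. The plain GAX sum follows the description in the abstract. The residualized form is an assumption modelled on the usual double machine learning product of residuals; the paper's exact normalization and estimation details are not given here. The expected-goal values `xg` and the shot-taking propensities are treated as outputs of already-fitted models, and all numbers are made up.

```python
def gax(goals, xg, is_taker):
    """Goals above expectation: sum of (goal - xG) over the player's own shots."""
    return sum((y - m) * d for y, m, d in zip(goals, xg, is_taker))


def residualized_gax(goals, xg, is_taker, propensity):
    """Assumed residualized variant: outcome residual times treatment residual.

    propensity[i] is a model-based probability that this player would take
    shot i given the shot features.
    """
    return sum((y - m) * (d - e)
               for y, m, d, e in zip(goals, xg, is_taker, propensity))


# Four shots; the player of interest took the first two.
goals = [1, 0, 1, 0]
xg = [0.3, 0.2, 0.5, 0.1]
is_taker = [1, 1, 0, 0]
propensity = [0.5, 0.5, 0.25, 0.25]

plain = gax(goals, xg, is_taker)
resid = residualized_gax(goals, xg, is_taker, propensity)
```

The residualized version uses all shots, not only the player's own, and discounts shots the player was likely to take anyway, which is the role the additional regression plays in the description above.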

2025-09-24T13:03:15Z Robert Bajons Lucas Kook http://arxiv.org/abs/2511.04106v5 Sub-exponential Growth Dynamics in Complex Systems: A Piecewise Power-Law Model for the Diffusion of New Words and Names 2026-05-21T07:42:33Z

The diffusion of ideas and language in society has conventionally been described by S-shaped models, such as the logistic curve. However, the role of sub-exponential growth -- a slower-than-exponential pattern known in epidemiology -- has been largely overlooked in broader social phenomena. Here, we present a piecewise power-law model to characterize complex growth curves with a few parameters. We systematically analyzed a large-scale dataset of approximately one billion Japanese blog articles linked to Wikipedia vocabulary, and observed consistent patterns in web search trend data (English, Spanish, and Japanese). Our analysis of 2,963 items, selected for reliable estimation (e.g., sufficient duration/peak, monotonic growth), reveals that 1,625 (55%) diffusion patterns without abrupt level shifts were adequately described by one or two segments. For single-segment curves, we found that (i) the mode of the shape parameter $α$ was near 0.5, indicating prevalent sub-exponential growth; (ii) the peak diffusion scale is primarily determined by the growth rate $R$, with minor contributions from $α$ or the duration $T$; and (iii) $α$ showed a tendency to vary with the nature of the topic, being smaller for niche/local topics and larger for widely shared ones. Furthermore, a micro-behavioral model of outward (stranger) vs. inward (community) contact suggests that $α$ can be interpreted as an index of the preference for outward-oriented communication. These findings suggest that sub-exponential growth is a common pattern of social diffusion, and our model provides a practical framework for consistently describing, comparing, and interpreting complex and diverse growth curves.

2025-11-06T06:44:45Z Physical Review E (2026) Hayafumi Watanabe 10.1103/f3d5-2tb8 http://arxiv.org/abs/2604.02700v2 Wasserstein-Based Test for Empirical Measure Convergence of Dependent Sequences 2026-05-21T03:12:47Z

We develop Wasserstein-based hypothesis tests for empirical-measure convergence in stationary dependent sequences. For a known candidate invariant measure, $μ$, we study the statistic $T_n=\sqrt{n}\,W_1(\hatμ_n,μ)$ and establish asymptotic level-$α$ validity under the null, together with consistency under fixed alternatives. When the invariant measure is unknown, we derive the asymptotic law of the pairwise statistic $\sqrt{n}\,W_1(\hatμ_n^{(i)},\hatμ_n^{(j)})$ for independent trajectories and obtain a corresponding pairwise test, including Bonferroni control for multiple comparisons. To make this estimation feasible when the long-run covariance is unavailable in closed form, we introduce a finite-grid plug-in estimator and show that Gaussian critical values based on the estimated covariance consistently recover the corresponding oracle fixed-grid estimation. Simulation experiments in both linear and nonlinear dynamical settings illustrate the oracle and plug-in regimes, along with the resulting coverage probability and power.
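As a small illustration of the pairwise statistic $\sqrt{n}\,W_1(\hatμ_n^{(i)},\hatμ_n^{(j)})$, the sketch below computes it for two one-dimensional trajectories of equal length, where $W_1$ between empirical measures reduces to the mean absolute difference of sorted values. The restriction to one dimension and equal sample sizes is an assumption made for brevity, and the function names are hypothetical. Critical values, which require the long-run covariance estimate, are not covered.

```python
import math


def wasserstein1_equal_n(x, y):
    """W1 between two 1-D empirical measures with the same number of atoms."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


def pairwise_statistic(x, y):
    """sqrt(n) * W1 between the empirical measures of two trajectories."""
    return math.sqrt(len(x)) * wasserstein1_equal_n(x, y)


# Two short trajectories; the second is the first shifted by one unit.
stat = pairwise_statistic([3.0, 0.0, 2.0, 1.0], [1.0, 2.0, 3.0, 4.0])
```

Sorting both samples gives the optimal coupling in one dimension, so no general optimal-transport solver is needed for this case.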

2026-04-03T03:49:40Z Alexander Yordanov Peter Hristov http://arxiv.org/abs/2605.21793v1 Targeted maximum likelihood estimation of vaccine effectiveness and immune correlates in test-negative design studies with missing data 2026-05-20T22:38:26Z

The test-negative design (TND) is a resource-efficient observational study design that can assess vaccine effectiveness and exposure-proximal immune correlates of disease. The TND enrolls symptomatic individuals seeking diagnostic testing and compares case status by an exposure variable, such as vaccination status or immune marker level, that is measured at testing. While the TND reduces confounding by healthcare-seeking behavior, other sources of confounding may remain. TND studies may also have missing data in the exposure variable due to incomplete records or two-phase sampling designs. We present a targeted maximum likelihood estimation approach involving a semiparametric logistic regression model that targets a causal conditional risk ratio of symptomatic disease in the healthcare-seeking population. Under causal and missing at random assumptions, our method produces an efficient, asymptotically linear estimator that provides flexible, data-driven confounding control and valid causal inference when analyzing TND studies with missing exposure variable data. We evaluate our method's finite sample properties using plasmode simulations of a two-phase TND immune correlates study. We also apply our method to assess COVID-19 vaccine effectiveness and antibody marker correlates of COVID-19 from TND study cohorts derived from the Moderna Coronavirus Efficacy phase 3 trial.

2026-05-20T22:38:26Z 52 pages, 14 figures Leah I. B. Andrews Lars van der Laan Peter B. Gilbert http://arxiv.org/abs/2605.21782v1 A Scalable Parametric Item Calibration Engine (SPICE) for Explanatory IRT with Sparse Data 2026-05-20T22:22:06Z

We describe a Bayesian multidimensional explanatory IRT model, an associated Markov Chain Monte Carlo (MCMC) estimation procedure, and the corresponding calibration software, designed for psychometric analyses of large numbers of sparsely-linked persons and items. Such data structures can arise, for example, from adaptive assessments using large banks of automatically generated items, with individual test takers receiving a very small proportion of the entire bank. We discuss how our choices for model specification, data structures, and algorithm implementation combine to create a scalable method for explanatory IRT that can support a variety of psychometric operations with sparse data.

2026-05-20T22:22:06Z Steven W. Nydick Manqian Liao J. R. Lockwood