http://arxiv.org/abs/2510.22550v1
Regularization method in the variable selection for logistic regression on BRFSS data
Updated: 2025-10-26T06:23:27Z
Stroke remains a leading cause of death and disability worldwide, yet effective prediction of stroke risk using large-scale population data remains challenging due to data imbalance and high-dimensional features. In this study, we develop and evaluate regularized logistic regression models for stroke prediction using data from the 2022 Behavioral Risk Factor Surveillance System (BRFSS), comprising 445132 U.S. adult respondents and 328 health-related variables. To address data imbalance, we apply several resampling techniques including oversampling, undersampling, class weighting, and the Synthetic Minority Oversampling Technique (SMOTE). We further employ Lasso, Elastic Net, and Group Lasso regularization methods to perform feature selection and dimensionality reduction. Model performance is assessed using ROC-AUC, sensitivity, and specificity metrics. Among all methods, the Lasso-based model achieved the highest predictive performance (AUC = 0.761), while the Group Lasso method identified a compact set of key predictors: Age, Heart Disease, Physical Health, and Dental Health. These findings demonstrate the potential of regularized regression techniques for interpretable and efficient prediction of stroke risk from large-scale behavioral health data.
Published: 2025-10-26T06:23:27Z
Authors: Jinbo Niu

http://arxiv.org/abs/2510.20738v1
Optimizing Feature Ordering in Radar Charts for Multi-Profile Comparison
Updated: 2025-10-23T16:56:32Z
Radar charts are widely used to visualize multivariate data and compare multiple profiles across features. However, the visual clarity of radar charts can be severely compromised when feature values alternate drastically in magnitude around the circle, causing areas to collapse, which misrepresents relative differences.
In the present work we introduce a permutation optimization strategy that reorders features to minimize polygon ``spikiness'' across multiple profiles simultaneously. The method is combinatorial (exhaustive search) for moderate numbers of features and uses a lexicographic minimax criterion that first considers overall smoothness (mean jump) and then the largest single jump as a tie-breaker. This preserves more global information and produces visually balanced arrangements. We discuss complexity, practical bounds, and relations to existing approaches that either change the visualization (e.g., OrigamiPlot) or learn orderings (e.g., Versatile Ordering Network). An example with two profiles and $p=6$ features (before/after ordering) illustrates the qualitative improvement.
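The exhaustive search with a lexicographic criterion described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not define a "jump", so here it is assumed to be the absolute difference between circularly adjacent feature values, pooled over all profiles. The function names (`jumps`, `score`, `best_order`) and the data are hypothetical.

```python
from itertools import permutations


def jumps(profile, order):
    """Absolute differences between circularly adjacent features (assumed definition)."""
    p = len(order)
    return [abs(profile[order[i]] - profile[order[(i + 1) % p]]) for i in range(p)]


def score(profiles, order):
    """Lexicographic key: (mean jump over all profiles, largest single jump)."""
    all_jumps = [j for prof in profiles for j in jumps(prof, order)]
    return (sum(all_jumps) / len(all_jumps), max(all_jumps))


def best_order(profiles):
    """Exhaustive search; feature 0 is pinned because rotations give the same polygon."""
    p = len(profiles[0])
    candidates = ((0,) + rest for rest in permutations(range(1, p)))
    # Tuples compare lexicographically, so the max jump only breaks ties in the mean.
    return min(candidates, key=lambda order: score(profiles, order))


# Made-up data: two profiles, six features, values alternating high/low.
profiles = [
    [9, 1, 8, 2, 9, 1],
    [8, 2, 9, 1, 7, 3],
]
identity = tuple(range(6))
order = best_order(profiles)
print("before:", score(profiles, identity))
print("after: ", order, score(profiles, order))
```

Pinning the first feature cuts the search from p! to (p-1)! orderings. Even so, the factorial growth is why an exhaustive search of this kind is only practical for moderate numbers of features.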
Keywords: data visualization, radar charts, combinatorial optimization, minimax optimization, feature ordering
Published: 2025-10-23T16:56:32Z
Authors: Albert Dorador

http://arxiv.org/abs/2508.12982v3
Revisiting Functional Derivatives in Multi-object Tracking
Updated: 2025-10-23T13:42:37Z
Probability generating functionals (PGFLs) are efficient and powerful tools for tracking independent objects in clutter. It was shown that PGFLs could be used for the elegant derivation of practical multi-object tracking algorithms, e.g., the probability hypothesis density (PHD) filter. However, derivations using PGFLs use the so-called functional derivatives whose definitions usually appear too complicated or heuristic, involving Dirac delta ``functions''. This paper begins by comparing different definitions of functional derivatives and exploring their relationships and implications for practical applications. It then proposes a rigorous definition of the functional derivative, utilizing straightforward yet precise mathematics for clarity. Key properties of the functional derivative are revealed and discussed.
Published: 2025-08-18T14:58:50Z
Comments: submitted to SIAM Journal on Control and Optimization
Authors: Jan Krejčí, Ondřej Straka, Petr Girg, Jiří Benedikt

http://arxiv.org/abs/2510.20259v1
Unifying Boxplots: A Multiple Testing Perspective
Updated: 2025-10-23T06:30:48Z
Tukey's boxplot is a foundational tool for exploratory data analysis, but its classic outlier-flagging rule does not account for the sample size, and subsequent modifications have often been presented as separate, heuristic adjustments. In this paper, we propose a unifying framework that recasts the boxplot and its variants as graphical implementations of multiple testing procedures. We demonstrate that Tukey's original method is equivalent to an unadjusted procedure, while existing sample-size-aware modifications correspond to controlling the Family-Wise Error Rate (FWER) or the Per-Family Error Rate (PFER).
This perspective not only systematizes existing methods but also naturally leads to new, more adaptive constructions. We introduce a boxplot motivated by the False Discovery Rate (FDR), and show how our framework provides a flexible pipeline for integrating state-of-the-art robust estimation techniques directly into the boxplot's graphical format. By connecting a classic graphical tool to the principles of multiple testing, our work provides a principled language for comparing, critiquing, and extending outlier detection rules for modern exploratory analysis.
Published: 2025-10-23T06:30:48Z
Authors: Bowen Gang, Hongmei Lin, Tiejun Tong

http://arxiv.org/abs/2510.20100v1
Factors Associated with Unit-Specific Failure in a University-Level Statistics Course
Updated: 2025-10-23T01:00:11Z
This study investigates the factors associated with failure in each of the four thematic units of a General Statistics course offered at a private university in Colombia. Unlike traditional analyses that treat performance as a single outcome, this research disaggregates results by unit: Exploratory Data Analysis, Probability and Random Variables, Statistical Inference, and Linear Regression -- highlighting distinct challenges across content areas. Based on a sample of 186 undergraduate students from Engineering, Geology, and Interactive Design programs, the study combines exam performance data with self-perceived preparedness surveys to develop unit-specific logistic regression models. The findings reveal consistent structural disadvantages for students from non-engineering programs, especially in concept-heavy units such as Inference and Regression. Academic stage and perception of competence also emerged as important predictors, though their effects varied across units. The results align with prior research on statistical thinking and self-efficacy, and support the need for targeted pedagogical interventions and curricular alignment.
This disaggregated approach offers a more nuanced understanding of academic vulnerability in statistics education and contributes to the design of evidence-based, context-sensitive strategies to reduce failure and improve learning outcomes.
Published: 2025-10-23T01:00:11Z
Authors: Biviana Marcela Suarez Sierra

http://arxiv.org/abs/2510.18094v1
On the Kolmogorov Distance of Max-Stable Distributions
Updated: 2025-10-20T20:42:06Z
In this contribution, we derive explicit bounds on the Kolmogorov distance for multivariate max-stable distributions with Fréchet margins. We formulate those bounds in terms of (i) Wasserstein distances between de Haan representers, (ii) total variation distances between spectral/angular measures - removing the dimension factor from earlier results in the canonical sphere case - and (iii) discrepancies of the Psi-functions in the inf-argmax decomposition. Extensions to different margins and Archimax/clustered Archimax copulas are further discussed. Examples include logistic, comonotonic, independent and Brown-Resnick models.
Published: 2025-10-20T20:42:06Z
Comments: 17 pages
Authors: Enkelejd Hashorva

http://arxiv.org/abs/2510.17554v1
Mendelian randomization in a multi-ancestry world: reflections and practical advice
Updated: 2025-10-20T14:02:13Z
Many Mendelian randomization (MR) papers have been conducted only in people of European ancestry, limiting transportability of results to the global population. Expanding MR to diverse ancestry groups is essential to ensure equitable biomedical insights, yet presents analytical and conceptual challenges. This review examines the practical challenges of MR analyses beyond the European-only context, including use of data from multi-ancestry, mismatched ancestry, and admixed populations. We explain how apparent heterogeneity in MR estimates between populations can arise from differences in genetic variant frequencies and correlation patterns, as well as from differences in the distribution of phenotypic variables, complicating the detection of true differences in the causal pathway.
We summarize published strategies for selecting genetic instruments and performing analyses when working with limited ancestry-specific data, discussing the assumptions needed in each case for incorporating external data from different ancestry populations. We conclude that differences in MR estimates by ancestry group should be interpreted cautiously, with consideration of how the identified differences may arise due to social and cultural factors. Corroborating evidence of a biological mechanism altering the causal pathway is needed to support a conclusion of differing causal pathways between ancestry groups.
Published: 2025-10-20T14:02:13Z
Comments: 20 pages, 2 figures
Authors: Amy M. Mason, Verena Zuber, Gibran Hemani, Elena Raffetti, Yu Xu, Amanda H. W Chong, Benjamin Woolf, Elias Allara, Dipender Gill, Opeyemi Soremekun, Stephen Burgess

http://arxiv.org/abs/2510.16986v1
Adaptive Sample Sharing for Linear Regression
Updated: 2025-10-19T20:03:48Z
In many business settings, task-specific labeled data are scarce or costly to obtain, which limits supervised learning on a specific task. To address this challenge, we study sample sharing in the case of ridge regression: leveraging an auxiliary data set while explicitly protecting against negative transfer. We introduce a principled, data-driven rule that decides how many samples from an auxiliary dataset to add to the target training set. The rule is based on an estimate of the transfer gain, i.e., the marginal reduction in the predictive error. Building on this estimator, we derive finite-sample guarantees: under standard conditions, the procedure borrows when it improves parameter estimation and abstains otherwise. In the Gaussian feature setting, we analyze which data set properties ensure that borrowing samples reduces the predictive error.
We validate the approach in synthetic and real datasets, observing consistent gains over strong baselines and single-task training while avoiding negative transfer.
Published: 2025-10-19T20:03:48Z
Authors: Hamza Cherkaoui, Hélène Halconruy, Yohan Petetin

http://arxiv.org/abs/2510.13270v1
Power-laws in phylogenetic trees and the preferential coalescent
Updated: 2025-10-15T08:16:58Z
Phylogenetic trees capture evolutionary relationships among species and reflect the forces that shaped them. While many studies rely on branch length information, the topology of phylogenetic trees (particularly their degree of imbalance) offers a robust framework for inferring evolutionary dynamics when timing data is uncertain. Classical metrics, such as the Colless and Sackin indices, quantify tree imbalance and have been extensively used to characterize phylogenies. Empirical phylogenies typically show intermediate imbalance, falling between perfectly balanced and highly skewed trees. This regime is marked by a power-law relationship between subtree sizes and their cumulative sizes, governed by a characteristic exponent. Although a recent niche-size model replicates this scaling, its mathematical origin and the exponent's value remain unclear. We present a generative model inspired by Kingman's coalescent that incorporates niche-like dynamics through preferential node coalescence. This process maps to Smoluchowski's coagulation kinetics and is described by a generalized Smoluchowski equation.
Our model produces imbalanced trees with power-law exponents matching empirical and numerical observations, revealing the mathematical basis of observed scaling laws and offering new tools to interpret tree imbalance in evolutionary contexts.
Published: 2025-10-15T08:16:58Z
Comments: 7 pages
Authors: Stephan Kleinbölting, Nigel Goldenfeld, Johannes Berg

http://arxiv.org/abs/2510.10382v3
Examining the Interface Design of Tidyverse
Updated: 2025-10-15T01:52:41Z
The tidyverse is a popular meta-package comprising several core R packages to aid in various data science tasks, including data import, manipulation and visualisation. Although functionalities offered by the tidyverse can generally be replicated using other packages, its widespread adoption in both teaching and practice indicates there are factors contributing to its preference, despite some debate over its usage. This suggests that particular aspects, such as interface design, may play a significant role in its selection. Examining the interface design can potentially reveal aspects that aid the design process for developers. While Tidyverse has been lauded for adopting a user-centered design, arguably some elements of the design focus on the work domain instead of the end-user. We examine the Tidyverse interface design via the lens of human-computer interaction, with an emphasis on data visualisation and data wrangling, to identify factors that might serve as a model for developers designing their packages. We recommend that developers adopt an iterative design that is informed by user feedback, analysis and complete coverage of the work domain, and ensure perceptual visibility of system constraints and relationships.
Published: 2025-10-12T00:43:12Z
Authors: Emi Tanaka

http://arxiv.org/abs/2502.16988v2
A tutorial on optimal dynamic treatment regimes
Updated: 2025-10-10T04:15:43Z
A dynamic treatment regime is a sequence of treatment decision rules tailored to an individual's evolving status over time.
In precision medicine, much focus has been placed on finding an optimal dynamic treatment regime which, if followed by everyone in the population, would yield the best outcome on average; and extensive investigation has been conducted from both methodological and applications standpoints. The aim of this tutorial is to provide readers who are interested in optimal dynamic treatment regimes with a systematic, detailed but accessible introduction, including the formal definition and formulation of this topic within the framework of causal inference, identification assumptions required to link the causal quantity of interest to the observed data, existing statistical models and estimation methods to learn the optimal regime from data, and application of these methods to both simulated and real data.
Published: 2025-02-24T09:24:51Z
Authors: Chunyu Wang, Brian DM Tom

http://arxiv.org/abs/2510.05857v1
Missing Data Imputation in the Context of Propensity Score Analysis: A Systematic Review
Updated: 2025-10-07T12:25:22Z
Missing data is a common challenge in observational studies. Another challenge stems from the observational nature of the study itself. Here, propensity score analysis can be used as a technique to replicate conditions similar to those found in clinical trials. With regard to the missing data, a majority of studies only analyze the complete cases, but this has several pitfalls. In this review, we investigate which methods are used for the handling of missing data in the context of propensity score analyses. Therefore, we searched PubMed for the keywords propensity score and missing data, restricting our search to the time between January 2010 and February 2024. The PRISMA statement was followed in this review. A total of 147 articles were included in the analyses. A major finding of this study is that although the usage of multiple imputation (MI) has risen over time, only a limited number of studies describe the mechanism of missing data and the details of the MI algorithm.
Keywords: Missing data, Propensity Score, Observational Data, Multiple Imputation, Systematic Review
Published: 2025-10-07T12:25:22Z
Authors: Saghar Garayemi, Reza Ali Akbari Khoei, Sarah Friedrich

http://arxiv.org/abs/2510.05800v1
Bridging the Gap Between Methodological Research and Statistical Practice: Toward "Translational Simulation Research"
Updated: 2025-10-07T11:18:35Z
Simulations are valuable tools for empirically evaluating the properties of statistical methods and are primarily employed in methodological research to draw general conclusions about methods. In addition, they can often be useful to applied statisticians, who may rely on published simulation results to select an appropriate statistical method for their application. However, on the one hand, applying published simulation results directly to practical settings is frequently challenging, as the scenarios considered in methodological studies rarely align closely enough with the characteristics of specific real-world applications to be truly informative. Applied statisticians, on the other hand, may struggle to construct their own simulations or to adapt methodological research to better reflect their specific data due to time constraints and limited programming expertise. We propose bridging this gap between methodological research and statistical practice through a translational approach by developing dedicated software along with simulation studies, which should abstract away the coding-intensive aspects of running simulations while still offering sufficient flexibility in parameter selection to meet the needs of applied statisticians. We demonstrate this approach using two practical examples, illustrating that the concept of translational simulation can be implemented in practice in different ways.
In the first example - simulation-based evaluation of power in two-arm randomized clinical trials with an ordinal endpoint - the solution we discuss is a Shiny web application providing a graphical user interface for running informative simulations in this context. For the second example - assessing the impact of measurement error in multivariable regression - a less labor-intensive approach is suggested, involving the provision of user-friendly, well-structured, and modular analysis code.
Published: 2025-10-07T11:18:35Z
Authors: Anne-Laure Boulesteix, Patrick Callahan, Luzia Hanssum, Vincent Gaertner, Eva Hoster

http://arxiv.org/abs/2510.03512v1
Comparison of Parametric versus Machine-learning Multiple Imputation in Clinical Trials with Missing Continuous Outcomes
Updated: 2025-10-03T20:51:50Z
The use of flexible machine-learning (ML) models to generate imputations of missing data within the framework of Multiple Imputation (MI) has recently gained traction, particularly in observational settings. For randomised controlled trials (RCTs), it is unclear whether ML approaches to MI provide valid inference, and whether they outperform parametric MI approaches under complex data generating mechanisms. We conducted two simulations in RCT settings that have incomplete continuous outcomes but fully observed covariates. We compared Complete Cases, standard MI (MI-norm), MI with predictive mean matching (MI-PMM) and ML-based approaches to MI, including classification and regression trees (MI-CART), Random Forests (MI-RF) and SuperLearner when outcomes are missing completely at random or missing at random conditional on treatment/covariate. The first simulation explored non-linear covariate-outcome relationships in the presence/absence of covariate-treatment interactions. The second simulation explored skewed repeated measures, motivated by a trial with digital outcomes.
In the absence of interactions, we found that Complete Cases yields reliable inference; MI-norm performs similarly, except when missingness depends on the covariate. ML approaches can lead to smaller mean squared error than Complete Cases and MI-norm in specific non-linear settings, but provide unreliable inference for others. MI-PMM can lead to unreliable inference in several settings. In the presence of complex treatment-covariate interactions, performing MI separately by arm, either with MI-norm, MI-RF or MI-CART, provides inference that has comparable or better properties than Complete Cases when the analysis model omits the interaction. For ML approaches, we observed unreliable inference in terms of bias in the estimated effect and/or its standard error when Rubin's Rules are implemented.
Published: 2025-10-03T20:51:50Z
Authors: Mia S. Tackney, Jonathan W. Bartlett, Elizabeth Williamson, Kim May Lee

http://arxiv.org/abs/2510.06238v1
Uncertainty Quantification In Surface Landmines and UXO Classification Using MC Dropout
Updated: 2025-10-03T03:01:22Z
Detecting surface landmines and unexploded ordnances (UXOs) using deep learning has shown promise in humanitarian demining. However, deterministic neural networks can be vulnerable to noisy conditions and adversarial attacks, leading to missed detection or misclassification. This study introduces the idea of uncertainty quantification through Monte Carlo (MC) Dropout, integrated into a fine-tuned ResNet-50 architecture for surface landmine and UXO classification, which was tested on a simulated dataset. Integrating the MC Dropout approach helps quantify epistemic uncertainty, providing an additional metric for prediction reliability, which could be helpful to make more informed decisions in demining operations. Experimental results on clean, adversarially perturbed, and noisy test images demonstrate the model's ability to flag unreliable predictions under challenging conditions.
This proof-of-concept study highlights the need for uncertainty quantification in demining, raises awareness about the vulnerability of existing neural networks in demining to adversarial threats, and emphasizes the importance of developing more robust and reliable models for practical applications.
Published: 2025-10-03T03:01:22Z
Comments: This work has been accepted and presented at IGARSS 2025 and will appear in the IEEE IGARSS 2025 proceedings
Authors: Sagar Lekhak, Emmett J. Ientilucci, Dimah Dera, Susmita Ghosh