https://arxiv.org/api/0uvFTLPRIWKP61D3sB/XvhZGLh8 2026-06-10T18:24:31Z 1686 300 15

http://arxiv.org/abs/2503.14650v1 Resolving Jeffreys-Lindley Paradox 2025-03-18T18:59:13Z The Jeffreys-Lindley paradox is a case where frequentist and Bayesian hypothesis testing methodologies contradict each other. This has caused confusion among data analysts selecting a methodology for their statistical inference tasks. Though the paradox goes back to the mid-1930s, so far no satisfactory resolution has been given for it. In this paper we show that it arises mainly due to the simple fact that, in the frequentist approach, the difference between the hypothesized parameter value and the observed estimate of the parameter is assessed in terms of the standard error of the estimate, no matter what the actual numerical difference is and how small the standard error is, whereas in the Bayesian methodology it has no effect due to the definition of the Bayes factor in the context, even though such an assessment is present. In fact, the paradox is an instance of conflict between statistical and practical significance and a result of using a sharp null hypothesis to approximate an acceptable small range of values for the parameter. The occurrence of type-I error that is allowed in frequentist methodology plays an important role in the paradox. Therefore, the paradox is not a conflict between two inference methodologies but an instance of their conclusions not agreeing. 2025-03-18T18:59:13Z 10 pages Priyantha Wijayatunga

http://arxiv.org/abs/2312.04610v7 Data-Driven Semi-Supervised Machine Learning with Safety Indicators for Abnormal Driving Behavior Detection 2025-03-18T12:13:54Z Detecting abnormal driving behavior is critical for road traffic safety and the evaluation of drivers' behavior.
With the advancement of machine learning (ML) algorithms and the accumulation of naturalistic driving data, many ML models have been adopted for abnormal driving behavior detection (also referred to in this paper as "anomalies"). Most existing ML-based detectors rely on (fully) supervised ML methods, which require substantial labeled data. However, ground truth labels are not always available in the real world, and labeling large amounts of data is tedious. Thus, there is a need to explore unsupervised or semi-supervised methods to make the anomaly detection process more feasible and efficient. To fill this research gap, this study analyzes large-scale real-world data revealing several abnormal driving behaviors (e.g., sudden acceleration, rapid lane-changing) and develops a hierarchical extreme learning machine (HELM)-based semi-supervised ML method using partly labeled data to detect the identified abnormal driving behaviors. Moreover, previous ML-based approaches predominantly utilized basic vehicle motion features (such as velocity and acceleration) to label and detect abnormal driving behaviors, while this study seeks to introduce event-level safety indicators as input features for ML models to improve detection performance. Results from extensive experiments demonstrate the effectiveness of the proposed semi-supervised ML model with the introduced safety indicators serving as important features. The proposed semi-supervised ML method outperforms other baseline semi-supervised or unsupervised methods: for example, it delivers the best accuracy at 99.58% and the best F1-score at 0.9913. The ablation study further highlights the significance of safety indicators for advancing the detection performance of abnormal driving behaviors. 
2023-12-07T16:16:09Z 16 pages, 10 figures, accepted by the 103rd Transportation Research Board (TRB) Annual Meeting, accepted and published by Transportation Research Record: Journal of the Transportation Research Board Yongqi Dong Lanxin Zhang Haneen Farah Arkady Zgonnikov Bart van Arem 10.1177/03611981241306752

http://arxiv.org/abs/2407.11824v2 The Future of Data Science Education 2025-03-17T19:06:45Z The definition of Data Science is a hotly debated topic. For many, the definition is a simple shortcut to Artificial Intelligence or Machine Learning. However, there is far more depth and nuance to the field of Data Science than a simple shortcut can provide. The School of Data Science at the University of Virginia has developed a novel model for the definition of Data Science. This model is based on identifying a unified understanding of the data work done across all areas of Data Science. It represents a generational leap forward in how we understand and teach Data Science. In this paper we will present the core features of the model and explain how it unifies various concepts going far beyond the analytics component of AI. From this foundation we will present our Undergraduate Major curriculum in Data Science and demonstrate how it prepares students to be well-rounded Data Science team members and leaders. The paper will conclude with an in-depth overview of the Foundations of Data Science course designed to introduce students to the field while also implementing proven STEM oriented pedagogical methods. These include, for example, specifications grading, active learning lectures, guest lectures from industry experts and weekly gamification labs.
2024-07-16T15:11:54Z 12 pages, 5 figures, published at the 53rd Annual Southeast Decision Science Institute 2024, won best paper for Innovation track Brian Wright Peter Alonzi Ali Rivera

http://arxiv.org/abs/2503.13574v1 The Hall of Fame Cut in Major League Baseball 2025-03-17T13:12:22Z I present a simple and transparent standard for career greatness in baseball: any major league player with H > 2500 or HR > 350 or K > 2800 or W > 240 makes my Hall of Fame Cut. Rate statistics are avoided due to small sample issues and to ensure the standard is permanent once achieved. Hits and home runs were chosen to represent the two extremes of batting styles. Strikeouts are chosen as the most fundamental unit of pitching performance whereas wins are included in deference to their historical importance as a benchmark. Most major league batters and pitchers in the elected Hall of Fame also make my Hall of Fame Cut but my quantitative standard shifts attention to several under-appreciated players, such as Johnny Damon and Bartolo Colon, and allows us to celebrate recent and active players without the waiting period (5 years post-retirement) needed for Hall of Fame election. My Hall of Fame Cut is also agnostic to performance enhancement or off-field issues and strongly favors longevity over peak performance. 2025-03-17T13:12:22Z Shane T. Jensen

http://arxiv.org/abs/2503.11289v1 A quantile-based bivariate distribution 2025-03-14T10:53:16Z In this paper we present a flexible bivariate distribution specified by a quantile function. The distribution contains as special cases new bivariate exponential, Pareto I, Pareto II, beta, power, log logistic and uniform distributions and also can approximate many other continuous models. Various $L$-moment based properties of the distribution such as covariance, coskewness, cokurtosis, $L$-correlation, etc. are discussed. The distribution is used to model two real data sets. 2025-03-14T10:53:16Z Shifna P R N. Unnikrishnan Nair S. M.
Sunoj

http://arxiv.org/abs/2411.16531v3 Good intentions, unintended consequences: exploring forecasting harms 2025-03-12T19:17:45Z Organizations worldwide that rely on data-driven approaches regularly employ forecasting methods to enhance their planning and decision-making processes. While extensive research has examined the harms associated with traditional machine learning applications, relatively little attention has been given to the ethical implications of time series forecasting. However, forecasting presents distinct ethical challenges due to its diverse organizational applications, varied objectives, and unique data processing, model development, and evaluation workflows. These distinctions complicate the direct application of existing machine learning harm taxonomies to common forecasting scenarios. To address this gap, we conduct multiple interviews with industry experts and academic researchers, systematically identifying and analyzing underexplored domains, use cases, and potential risks associated with forecasting. Our objective is to develop a novel taxonomy of forecasting-specific harms. Drawing inspiration from Microsoft Azure taxonomy for responsible innovation, we integrate a human-led inductive coding approach with AI-driven analysis to extract key categories of harm in forecasting. This taxonomy aims to support researchers and practitioners by fostering ethical reflection on their decision-making throughout the forecasting process. Additionally, we seek to establish a research agenda focused on identifying measures to mitigate potential harms in forecasting. By highlighting unique risks within forecasting, our work contributes to the broader discourse on machine learning ethics. 2024-11-25T16:18:02Z 42 pages Bahman Rostami-Tabar Travis Greene Galit Shmueli Rob J.
Hyndman

http://arxiv.org/abs/2405.11284v3 The Logic of Counterfactuals and the Epistemology of Causal Inference 2025-03-12T02:08:24Z The 2021 Nobel Prize in Economics recognized an epistemology of causal inference based on the Rubin causal model (Rubin 1974), which merits broader attention in philosophy. This model, in fact, presupposes a logical principle of counterfactuals, Conditional Excluded Middle (CEM), the locus of a pivotal debate between Stalnaker (1968) and Lewis (1973) on the semantics of counterfactuals. Proponents of CEM should recognize that this connection points to a new argument for CEM -- a Quine-Putnam indispensability argument grounded in the Nobel-winning applications of the Rubin model in health and social sciences. To advance the dialectic, I challenge this argument with an updated Rubin causal model that retains its successes while dispensing with CEM. This novel approach combines the strengths of the Rubin causal model and a causal model familiar in philosophy, the causal Bayes net. The takeaway: deductive logic and inductive inference, often studied in isolation, are deeply interconnected. 2024-05-18T13:09:33Z Hanti Lin

http://arxiv.org/abs/2503.08743v1 Hard negative sampling in hyperedge prediction 2025-03-11T09:21:56Z Hypergraph, which allows each hyperedge to encompass an arbitrary number of nodes, is a powerful tool for modeling multi-entity interactions. Hyperedge prediction is a fundamental task that aims to predict future hyperedges or identify existent but unobserved hyperedges based on those observed. In link prediction for simple graphs, most observed links are treated as positive samples, while all unobserved links are considered as negative samples. However, this full-sampling strategy is impractical for hyperedge prediction, because the number of unobserved hyperedges in a hypergraph significantly exceeds the number of observed ones.
Therefore, one has to utilize some negative sampling methods to generate negative samples, ensuring their quantity is comparable to that of positive samples. In current hyperedge prediction, randomly selecting negative samples is a routine practice. But through experimental analysis, we discover a critical limitation of random selection: the generated negative samples are too easily distinguishable from positive samples. This leads to premature convergence of the model and reduces the accuracy of prediction. To overcome this issue, we propose a novel method to generate negative samples, named hard negative sampling (HNS). Unlike traditional methods that construct negative hyperedges by selecting node sets from the original hypergraph, HNS directly synthesizes negative samples in the hyperedge embedding space, thereby generating more challenging and informative negative samples. Our results demonstrate that HNS significantly enhances both accuracy and robustness of the prediction. Moreover, as a plug-and-play technique, HNS can be easily applied in the training of various hyperedge prediction models based on representation learning. 2025-03-11T09:21:56Z 24 pages, 8 figures Zhenyu Deng Tao Zhou Yilin Bi

http://arxiv.org/abs/2503.05963v1 Bayesian Graph Traversal 2025-03-07T22:05:06Z This research considers Bayesian decision-analytic approaches toward the traversal of an uncertain graph. Namely, a traveler progresses over a graph in which rewards are gained upon a node's first visit and costs are incurred for every edge traversal. The traveler knows the graph's adjacency matrix and his starting position but does not know the rewards and costs. The traveler is a Bayesian who encodes his beliefs about these values using a Gaussian process prior and who seeks to maximize his expected utility over these beliefs.
Adopting a decision-analytic perspective, we develop sequential decision-making solution strategies for this coupled information-collection and network-routing problem. We show that the problem is NP-Hard and derive properties of the optimal walk. These properties provide heuristics for the traveler's problem that balance exploration and exploitation. We provide a practical case study focused on the use of unmanned aerial systems for public safety and empirically study policy performance in myriad Erdos-Renyi settings. 2025-03-07T22:05:06Z 26 pages, 7 tables, 2 figures William N. Caballero Phillip R. Jenkins David Banks Matthew Robbins

http://arxiv.org/abs/2503.03484v1 The impact of the storytelling fallacy on real data examples in methodological research 2025-03-05T13:21:08Z The term "researcher degrees of freedom" (RDF), which was introduced in metascientific literature in the context of the replication crisis in science, refers to the extent of flexibility a scientist has in making decisions related to data analysis. These choices occur at all stages of the data analysis process. In combination with selective reporting, RDF may lead to over-optimistic statements and an increased rate of false positive findings. Even though the concept has been mainly discussed in fields such as epidemiology or psychology, similar problems affect methodological statistical research. Researchers who develop and evaluate statistical methods are left with a multitude of decisions when designing their comparison studies. This leaves room for an over-optimistic representation of the performance of their preferred method(s). The present paper defines and explores a particular RDF that has not been previously identified and discussed. When interpreting the results of real data examples that are most often part of methodological evaluations, authors typically tell a domain-specific "story" that best supports their argumentation in favor of their preferred method.
However, there are often plenty of other plausible stories that would support different conclusions. We define the "storytelling fallacy" as the selective use of anecdotal domain-specific knowledge to support the superiority of specific methods in real data examples. While such examples fed by domain knowledge play a vital role in methodological research, if deployed inappropriately they can also harm the validity of conclusions on the investigated methods. The goal of our work is to create awareness for this issue, fuel discussions on the role of real data in generating evidence in methodological research and warn readers of methodological literature against naive interpretations of real data examples. 2025-03-05T13:21:08Z Maximilian M. Mandl Frank Weber Tobias Wöhrle Anne-Laure Boulesteix

http://arxiv.org/abs/2503.02645v1 A Generalized Theory of Mixup for Structure-Preserving Synthetic Data 2025-03-03T14:28:50Z Mixup is a widely adopted data augmentation technique known for enhancing the generalization of machine learning models by interpolating between data points. Despite its success and popularity, limited attention has been given to understanding the statistical properties of the synthetic data it generates. In this paper, we delve into the theoretical underpinnings of mixup, specifically its effects on the statistical structure of synthesized data. We demonstrate that while mixup improves model performance, it can distort key statistical properties such as variance, potentially leading to unintended consequences in data synthesis. To address this, we propose a novel mixup method that incorporates a generalized and flexible weighting scheme, better preserving the original data's structure. Through theoretical developments, we provide conditions under which our proposed method maintains the (co)variance and distributional properties of the original dataset.
Numerical experiments confirm that the new approach not only preserves the statistical characteristics of the original data but also sustains model performance across repeated synthesis, alleviating concerns of model collapse identified in previous research. 2025-03-03T14:28:50Z Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025 Chungpa Lee Jongho Im Joseph H. T. Kim

http://arxiv.org/abs/2409.16527v2 Approximation of Smooth Numbers for Harmonic Samples A Stein method Approach 2025-03-03T05:27:19Z We present a de Bruijn type approximation for quantifying the content of m-smooth numbers, derived from samples obtained through a probability measure over the set of integers less than or equal to n, with point mass function at k inversely proportional to k. Our analysis is based on a stochastic representation of the measure of interest, utilizing weighted independent geometric random variables. This representation is analyzed through the lens of Stein's method for the Dickman distribution. A pivotal element of our arguments relies on precise estimations concerning the regularity properties of the solution to the Dickman Stein equation for Heaviside functions, recently developed by Bhattacharjee and Schulte. Remarkably, our arguments remain mostly in the realm of probability theory, with Mertens' first and third theorems standing as the only number theory estimations required. 2024-09-25T00:38:07Z Arturo Jaramillo Xiaochuan Yang

http://arxiv.org/abs/2001.04110v3 Resolving the induction problem: Can we state with complete confidence via induction that the sun rises forever? 2025-02-28T09:02:54Z Induction is a form of reasoning that starts with a particular example and generalizes to a rule, namely, a hypothesis. However, establishing the truth of a hypothesis is problematic due to the potential occurrence of conflicting events, also known as the induction problem.
The sunrise problem, first introduced by Laplace (1814), is a quintessential example of probability-based induction. In his solution, a zero probability is always assigned to the hypothesis that the sun rises forever, regardless of the number of observations made. This is a symptom of a fundamental deficiency of probability-based induction: A hypothesis can never be accepted via the Bayes-Laplace approach. Alternative priors have been proposed to address this issue, but they have failed to fully overcome the deficiency. We investigate why this occurs and demonstrate that the confidence does not exhibit such a deficiency, as it is not a probability and therefore does not adhere to Bayes' rule. The confidence is neither a likelihood to allow not only a reconciliation between epistemic and aleatory interpretations of probability but also a resolution in agreement with the evidence by enabling us to accept a hypothesis with complete confidence as a rational decision. 2020-01-13T09:20:28Z Youngjo Lee

http://arxiv.org/abs/2502.20281v1 Data Jamboree: A Party of Open-Source Software Solving Real-World Data Science Problems 2025-02-27T17:08:38Z The evolving focus in statistics and data science education highlights the growing importance of computing. This paper presents the Data Jamboree, a live event that combines computational methods with traditional statistical techniques to address real-world data science problems. Participants, ranging from novices to experienced users, followed workshop leaders in using open-source tools like Julia, Python, and R to perform tasks such as data cleaning, manipulation, and predictive modeling. The Jamboree showcased the educational benefits of working with open data, providing participants with practical, hands-on experience. We compared the tools in terms of efficiency, flexibility, and statistical power, with Julia excelling in performance, Python in versatility, and R in statistical analysis and visualization.
The paper concludes with recommendations for designing similar events to encourage collaborative learning and critical thinking in data science. 2025-02-27T17:08:38Z The New England Journal of Statistics in Data Science 2025 Lucy D'Agostino McGowan Shannon Tass Sam Tyner HaiYing Wang Jun Yan 10.51387/25-NEJSDS79

http://arxiv.org/abs/2410.11399v2 Convergence to the Truth 2025-02-25T01:42:32Z This article reviews and develops an epistemological tradition in the philosophy of science, known as convergentism, which holds that inference methods should be assessed based on their ability to converge to the truth across a range of possible scenarios. Emphasis is placed on its historical origins in the work of C. S. Peirce and its recent developments in formal epistemology and data science (including statistics and machine learning). Comparisons are made with three other traditions: (1) explanationism, which holds that theory choice should be guided by a theory's overall balance of explanatory virtues, such as simplicity and fit with data; (2) instrumentalism, which maintains that scientific inference should be driven by the goal of obtaining useful models rather than true theories; and (3) Bayesianism, which shifts the focus from all-or-nothing beliefs to degrees of belief. 2024-10-15T08:44:14Z Hanti Lin