http://arxiv.org/abs/2412.13116v1 Equity in the Use of ChatGPT for the Classroom: A Comparison of the Accuracy and Precision of ChatGPT 3.5 vs. ChatGPT4 with Respect to Statistics and Data Science Exams 2024-12-17T17:38:13Z

A college education has historically been seen as a method of moving upward with regard to income brackets and social status. Indeed, many colleges recognize this connection and seek to enroll talented low-income students. While these students might have their education, books, room, and board paid for, there are other items that they might be expected to use that are not part of most college scholarship packages. One such item that has recently surfaced is access to generative AI platforms. The most popular of these platforms is ChatGPT, and it has a paid version (ChatGPT4) and a free version (ChatGPT3.5). We seek to explore differences between the free and paid versions in the context of homework questions and data analyses as might be seen in a typical introductory statistics course. We determine the extent to which students who cannot afford newer and faster versions of generative AI programs would be disadvantaged in terms of writing such projects and learning these methods.

2024-12-17T17:38:13Z Originally submitted for review in May of 2024 but rejected 6 months later Monnie McGee Bivin Sadler http://arxiv.org/abs/2412.11211v1 Deep Learning-based Approaches for State Space Models: A Selective Review 2024-12-15T15:04:35Z

State-space models (SSMs) offer a powerful framework for dynamical system analysis, wherein the temporal dynamics of the system are assumed to be captured through the evolution of the latent states, which govern the values of the observations. This paper provides a selective review of recent advancements in deep neural network-based approaches for SSMs, and presents a unified perspective for discrete-time deep state space models and continuous-time ones such as latent neural Ordinary Differential and Stochastic Differential Equations. It starts with an overview of the classical maximum likelihood based approach for learning SSMs, reviews the variational autoencoder as a general learning pipeline for neural network-based approaches in the presence of latent variables, and discusses in detail representative deep learning models that fall under the SSM framework. Very recent developments, where SSMs are used as standalone architectural modules for improving efficiency in sequence modeling, are also examined. Finally, examples involving mixed-frequency and irregularly spaced time series data are presented to demonstrate the advantage of SSMs in these settings.

2024-12-15T15:04:35Z Jiahe Lin George Michailidis http://arxiv.org/abs/2401.01973v2 Facilitating the Integration of Ethical Reasoning into Quantitative Courses: Stakeholder Analysis, Ethical Practice Standards, and Case Studies 2024-12-10T20:07:52Z

Case studies are typically used to teach 'ethics', but in quantitative courses it can seem distracting, for both instructor and learner, to introduce a case analysis. Moreover, case analyses are typically focused on issues relating to people: obtaining consent, dealing with research team members, and/or potential institutional policy violations. While relevant to some research, not all students in quantitative courses plan to become researchers, and ethical practice is an essential topic for students of mathematics, statistics, data science, and computing regardless of whether or not the learner intends to do research. Ethical reasoning is a way of thinking that requires the individual to assess what they know about a potential ethical problem (their prerequisite knowledge), and in some cases, how behaviors they observe, are directed to perform, or have performed, diverge from what they know to be ethical behavior. Ethical reasoning is a learnable, improvable set of knowledge, skills, and abilities that enable learners to recognize what they do and do not know about what constitutes 'ethical practice' of a discipline, and in some cases, to contemplate alternative decisions about how to first recognize, and then proceed past, or respond to, such divergences. A stakeholder analysis is part of prerequisite knowledge, and can be used whether there is or is not an actual case or situation to react to. In courses with mainly quantitative content, a stakeholder analysis is a useful tool for instruction and assessment. It can be used to both integrate authentic ethical content and encourage careful quantitative thought. It is a mistake to treat 'training in ethical practice' and 'training in responsible conduct of research' as the same thing. This paper discusses how to introduce ethical reasoning, stakeholder analysis, and ethical practice standards authentically in quantitative courses.

2024-01-03T20:44:55Z To appear (verbatim) in H. Doosti, (Ed.). Ethics in Statistics. Cambridge, UK: Ethics International Press. Originally published (verbatim) in Proceedings of the 2022 Joint Statistical Meetings, Washington, DC. Alexandria, VA: American Statistical Association. pp. 1493-1519 Rochelle E. Tractenberg Suzanne Thorton http://arxiv.org/abs/2403.10987v3 Risk Quadrangle and Robust Optimization Based on Extended $\varphi$-Divergence 2024-12-10T03:20:46Z

The Fundamental Risk Quadrangle (FRQ) is a unified framework linking risk management, statistical estimation, and optimization. Distributionally robust optimization (DRO) based on $\varphi$-divergence minimizes the maximal expected loss, where the maximum is over a $\varphi$-divergence ambiguity set. This paper introduces the \emph{extended} $\varphi$-divergence and the extended $\varphi$-divergence quadrangle, which integrates DRO into the FRQ framework. We derive the primal and dual representations of the quadrangle elements (risk, deviation, regret, error, and statistic). The dual representation provides an interpretation of classification, portfolio optimization, and regression as robust optimization based on the extended $\varphi$-divergence. The primal representation offers tractable formulations of these robust optimizations as convex optimization. We provide illustrative examples showing that many common problems, such as least-squares regression, quantile regression, support vector machines, and CVaR optimization, fall within this framework. Additionally, we conduct a case study to visualize the optimal solution of the inner maximization in robust optimization.

2024-03-16T17:56:31Z Cheng Peng Anton Malandii Stan Uryasev http://arxiv.org/abs/2412.04735v1 A dynamical measure of algorithmically infused visibility 2024-12-06T02:51:39Z

This work focuses on the nature of visibility in societies where the behaviours of humans and algorithms influence each other - termed algorithmically infused societies. We propose a quantitative measure of visibility, with implications and applications to an array of disciplines including communication studies, political science, marketing, technology design, and social media analytics. The measure captures the basic characteristics of the visibility of a given topic, in algorithm/AI-mediated communication/social media settings. Topics, when trending, are ranked against each other, and the proposed measure combines the following two attributes of a topic: (i) the amount of time a topic spends at different ranks, and (ii) the different ranks the topic attains. The proposed measure incorporates a tunable parameter, termed the discrimination level, whose value determines the relative weights of the two attributes that contribute to visibility. Analysis of a large-scale, real-time dataset of trending topics, from one of the largest social media platforms, demonstrates that the proposed measure can explain a large share of the variability of the accumulated views of a topic.

2024-12-06T02:51:39Z 28 pages, 5 figures Shaojing Sun Zhiyuan Liu David Waxman http://arxiv.org/abs/2312.04898v2 Quantifying the effectiveness of linear preconditioning in Markov chain Monte Carlo 2024-12-04T08:40:49Z

We study linear preconditioning in Markov chain Monte Carlo. We consider the class of well-conditioned distributions, for which several mixing time bounds depend on the condition number $κ$. First we show that well-conditioned distributions exist for which $κ$ can be arbitrarily large and yet no linear preconditioner can reduce it. We then impose two sets of extra assumptions under which a linear preconditioner can significantly reduce $κ$. For the random walk Metropolis we further provide upper and lower bounds on the spectral gap with tight $1/κ$ dependence. This allows us to give conditions under which linear preconditioning can provably increase the gap. We then study popular preconditioners such as the covariance, its diagonal approximation, the Hessian at the mode, and the QR decomposition. We show conditions under which each of these reduces $κ$ to near its minimum. We also show that the diagonal approach can in fact \textit{increase} the condition number. This is of interest as diagonal preconditioning is the default choice in well-known software packages. We conclude with a numerical study comparing preconditioners in different models, and showing how proper preconditioning can greatly reduce compute time in Hamiltonian Monte Carlo.
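
As a rough illustration of what a diagonal preconditioner does to a condition number, the sketch below treats the simplest possible case: a two-dimensional Gaussian target with covariance [[a, b], [b, c]]. For such a target the Hessian of the potential is the inverse covariance, which has the same condition number as the covariance itself. Rescaling each coordinate by its standard deviation turns the covariance into the correlation matrix. The function names are illustrative assumptions, not taken from the paper, and the sketch does not attempt to reproduce the paper's result that diagonal preconditioning can increase the condition number.

```python
import math


def eigenvalues_2x2(a, b, c):
    """Eigenvalues (ascending) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius


def condition_number(a, b, c):
    """Ratio of largest to smallest eigenvalue of [[a, b], [b, c]]."""
    smallest, largest = eigenvalues_2x2(a, b, c)
    if smallest <= 0:
        raise ValueError("matrix must be positive definite")
    return largest / smallest


def diagonal_precondition(a, b, c):
    """Rescale each coordinate by its standard deviation.

    The rescaled covariance is the correlation matrix [[1, r], [r, 1]].
    """
    r = b / math.sqrt(a * c)
    return 1.0, r, 1.0


if __name__ == "__main__":
    cases = {
        "badly scaled, uncorrelated": (100.0, 0.0, 1.0),
        "well scaled, correlated": (1.0, 0.9, 1.0),
        "badly scaled and correlated": (100.0, 9.0, 1.0),
    }
    for name, cov in cases.items():
        before = condition_number(*cov)
        after = condition_number(*diagonal_precondition(*cov))
        print(f"{name}: kappa before = {before:.2f}, after = {after:.2f}")
```

In this toy setting the diagonal rescaling removes the part of the ill-conditioning that comes from unequal scales, but leaves the part that comes from correlation untouched.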

2023-12-08T08:29:47Z Max Hird Samuel Livingstone http://arxiv.org/abs/2412.02969v1 Unified Inductive Logic: From Formal Learning to Statistical Inference to Supervised Learning 2024-12-04T02:31:31Z

While the traditional conception of inductive logic is Carnapian, I develop a Peircean alternative and use it to unify formal learning theory, statistics, and a significant part of machine learning: supervised learning. Some crucial standards for evaluating non-deductive inferences have been assumed separately in those areas, but can actually be justified by a unifying principle.

2024-12-04T02:31:31Z Hanti Lin http://arxiv.org/abs/2412.02367v1 Internalist Reliabilism in Statistics and Machine Learning: Thoughts on Jun Otsuka's Thinking about Statistics 2024-12-03T10:47:24Z

Otsuka (2023) argues for a correspondence between data science and traditional epistemology: Bayesian statistics is internalist; classical (frequentist) statistics is externalist, owing to its reliabilist nature; model selection is pragmatist; and machine learning is a version of virtue epistemology. Where he sees diversity, I see an opportunity for unity. In this article, I argue that classical statistics, model selection, and machine learning share a foundation that is reliabilist in an unconventional sense that aligns with internalism. Hence a unification under internalist reliabilism.

2024-12-03T10:47:24Z The Asian Journal of Philosophy 3, 81 (2024) Hanti Lin 10.1007/s44204-024-00210-6 http://arxiv.org/abs/2411.19902v1 Noncommutative Model Selection for Data Clustering and Dimension Reduction Using Relative von Neumann Entropy 2024-11-29T18:04:11Z

We propose a pair of completely data-driven algorithms for unsupervised classification and dimension reduction, and we empirically study their performance on a number of data sets, both simulated data in three dimensions and images from the COIL-20 data set. The algorithms take as input a set of points sampled from a uniform distribution supported on a metric space, the latter embedded in an ambient metric space, and they output a clustering or reduction of dimension of the data. They work by constructing a natural family of graphs from the data and selecting the graph which maximizes the relative von Neumann entropy of certain normalized heat operators constructed from the graphs. Once the appropriate graph is selected, the eigenvectors of the graph Laplacian may be used to reduce the dimension of the data, and clusters in the data may be identified with the kernel of the associated graph Laplacian. Notably, these algorithms do not require information about the size of a neighborhood or the desired number of clusters as input, in contrast to popular algorithms such as $k$-means, and even more modern spectral methods such as Laplacian eigenmaps, among others. In our computational experiments, our clustering algorithm outperforms $k$-means clustering on data sets with non-trivial geometry and topology, in particular data whose clusters are not concentrated around a specific point, and our dimension reduction algorithm is shown to work well in several simple examples.
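
The last step described above, reading clusters off the kernel of a graph Laplacian, rests on a standard fact: for an undirected graph with Laplacian L = D - A, the kernel of L is spanned by the indicator vectors of the connected components. The minimal sketch below assumes the graph has already been chosen; the entropy-based graph selection from the paper is not implemented. The function names and the small example graph are hypothetical.

```python
from collections import deque


def connected_components(n, edges):
    """Label each of the n vertices with the index of its connected component."""
    adjacency = [[] for _ in range(n)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    labels = [-1] * n
    count = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Breadth-first search from an unlabelled vertex
        labels[start] = count
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if labels[neighbour] == -1:
                    labels[neighbour] = count
                    queue.append(neighbour)
        count += 1
    return labels


def laplacian_times(n, edges, vector):
    """Compute L v for the unweighted Laplacian L = D - A without forming L."""
    result = [0.0] * n
    for u, v in edges:
        result[u] += vector[u] - vector[v]
        result[v] += vector[v] - vector[u]
    return result


if __name__ == "__main__":
    # Hypothetical graph: a path 0-1-2 and a separate edge 3-4
    n, edges = 5, [(0, 1), (1, 2), (3, 4)]
    labels = connected_components(n, edges)
    print("cluster labels:", labels)
    for cluster in sorted(set(labels)):
        indicator = [1.0 if label == cluster else 0.0 for label in labels]
        # Each component indicator lies in the kernel of L
        print(cluster, laplacian_times(n, edges, indicator))
```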

2024-11-29T18:04:11Z 20 pages Araceli Guzmán-Tristán Antonio Rieser http://arxiv.org/abs/2411.19140v1 Examining Multimodal Gender and Content Bias in ChatGPT-4o 2024-11-28T13:41:44Z

This study investigates ChatGPT-4o's multimodal content generation, highlighting significant disparities in its treatment of sexual content and nudity versus violent and drug-related themes. Detailed analysis reveals that ChatGPT-4o consistently censors sexual content and nudity, while showing leniency towards violence and drug use. Moreover, a pronounced gender bias emerges, with female-specific content facing stricter regulation compared to male-specific content. This disparity likely stems from media scrutiny and public backlash over past AI controversies, prompting tech companies to impose stringent guidelines on sensitive issues to protect their reputations. Our findings emphasize the urgent need for AI systems to uphold genuine ethical standards and accountability, transcending mere political correctness. This research contributes to the understanding of biases in AI-driven language and multimodal models, calling for more balanced and ethical content moderation practices.

2024-11-28T13:41:44Z 17 pages, 4 figures, 3 tables. Conference: "14th International Conference on Artificial Intelligence, Soft Computing and Applications (AIAA 2024), London, 23-24 November 2024" It will be published in the proceedings "David C. Wyld et al. (Eds): IoTE, CNDC, DSA, AIAA, NLPTA, DPPR - 2024" Roberto Balestri http://arxiv.org/abs/2411.18838v1 Contrasting the optimal resource allocation to cybersecurity and cyber insurance using prospect theory versus expected utility theory 2024-11-28T00:59:48Z

Protecting against cyber-threats is vital for every organization and can be done by investing in cybersecurity controls and purchasing cyber insurance. However, these are interlinked since insurance premiums could be reduced by investing more in cybersecurity controls. The expected utility theory and the prospect theory are two alternative theories explaining decision-making under risk and uncertainty, which can inform strategies for optimizing resource allocation. While the former is considered a rational approach, research has shown that most people make decisions consistent with the latter, including on insurance uptakes. We compare and contrast these two approaches to provide important insights into how the two approaches could lead to different optimal allocations resulting in differing risk exposure as well as financial costs. We introduce the concept of a risk curve and show that identifying the nature of the risk curve is a key step in deriving the optimal resource allocation.

2024-11-28T00:59:48Z Chaitanya Joshi Jinming Yang Sergeja Slapnicar Ryan K L Ko http://arxiv.org/abs/1212.6339v2 A paradox on the spectral representation of stationary random processes 2024-11-25T21:04:31Z

In this note our aim is to show a paradox in the spectral representation of stationary random processes.

2012-12-27T10:33:34Z We find that this paradox is not true Mohammad Mohammadi Adel Mohammadpour Afshin Parvardeh http://arxiv.org/abs/2409.14606v5 A Modified Satterthwaite (1941,1946) Effective Degrees of Freedom Approximation 2024-11-22T17:54:44Z

This study introduces a correction to the approximation of effective degrees of freedom as proposed by Satterthwaite (1941, 1946), specifically addressing scenarios where component degrees of freedom are small. The correction is grounded in analytical results concerning the moments of standard normal random variables. This modification is applicable to complex variance estimates that involve both small and large degrees of freedom, offering an enhanced approximation of the higher moments required by Satterthwaite's framework. Additionally, this correction extends and partially validates the empirically derived adjustment by Johnson & Rust (1992), as it is based on theoretical foundations rather than simulations used to derive empirical transformation constants.
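
For orientation, the classical Satterthwaite approximation that this abstract starts from estimates the effective degrees of freedom of a linear combination of independent variance estimates as (sum of a_i s_i^2)^2 divided by the sum of (a_i s_i^2)^2 / df_i. The sketch below implements only that classical formula; the small-degrees-of-freedom correction proposed in the paper is not given in the abstract and is not reproduced here. The function name and argument names are illustrative assumptions.

```python
def satterthwaite_df(variances, dfs, weights=None):
    """Classical Satterthwaite effective degrees of freedom.

    variances: independent variance estimates s_i^2
    dfs:       degrees of freedom of each estimate
    weights:   coefficients a_i of the linear combination (default: all 1)
    """
    if weights is None:
        weights = [1.0] * len(variances)
    if not (len(variances) == len(dfs) == len(weights)):
        raise ValueError("variances, dfs and weights must have equal length")
    terms = [w * v for w, v in zip(weights, variances)]
    numerator = sum(terms) ** 2
    denominator = sum(t ** 2 / d for t, d in zip(terms, dfs))
    return numerator / denominator


if __name__ == "__main__":
    # Welch-type example: variance of a difference of two sample means,
    # s1^2 / n1 + s2^2 / n2, with hypothetical sample sizes and variances.
    n1, n2 = 10, 15
    s1_sq, s2_sq = 4.0, 9.0
    df = satterthwaite_df([s1_sq, s2_sq], [n1 - 1, n2 - 1],
                          weights=[1.0 / n1, 1.0 / n2])
    print(f"effective degrees of freedom: {df:.2f}")
```

When the component degrees of freedom are small, the variance estimates entering this formula are themselves noisy, which is the regime the abstract says the correction targets.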

2024-09-22T21:54:15Z Matthias von Davier http://arxiv.org/abs/2407.19433v2 How Books Tell a History of Statistics in Portugal: Works of Foreigners, Estrangeirados, and Others 2024-11-20T23:05:37Z

Foreigners and "estrangeirados", an expression meaning "people who went to a foreign country ["estrangeiro"] to further their education", had a leading role in the development of Mathematical Statistics in Portugal. As far as Statistics is concerned, "estrangeirados" in the nineteenth century were mainly liberal intellectuals exiled for political reasons. From 1930 onwards, the research funding authority sent university professors abroad and hired foreign researchers to stay in Portuguese institutions, and some of them were instrumental in the importation of new concepts and methods of inferential statistics. After 1970, there was a huge program of sending young researchers abroad for doctoral studies. At the same time, many new universities and polytechnic institutes were created in Portugal. After that, aside from foreigners who chose to have a research career in those institutions and the "estrangeirados" who had returned and created programs of doctoral studies, others, who had not had the opportunity to study abroad, began to play a decisive role in the development of Statistics in Portugal. The publication of handbooks on Probability and Statistics, theses and core papers in Portuguese scientific journals, and also of works for the layman, reveals how Statistics progressed from a descriptive to a mathematical discipline used for inference in all fields of knowledge, from natural sciences to methodology of scientific research.

2024-07-28T09:16:58Z 67 pages, 34 figures Communications in Mathematics, Volume 32 (2024), Issue 3 (Special issue: Portuguese Mathematics) (November 22, 2024) cm:14005 Dinis Pestana Rui Santos 10.46298/cm.14005 http://arxiv.org/abs/2402.19162v2 A Bayesian approach to uncover local and temporal determinants of heterogeneity in repeated cross-sectional health surveys 2024-11-15T14:18:17Z

In several countries, including Italy, a prominent approach to population health surveillance involves conducting repeated cross-sectional surveys at short intervals of time. These surveys gather information on the health status of individual respondents, including details on their behaviours, risk factors, and relevant socio-demographic information. While the collected data undoubtedly provide valuable information, modelling such data presents several challenges. For instance, in health risk models, it is essential to consider behavioural information, local and temporal dynamics, and disease co-occurrence. In response to these challenges, our work proposes a multivariate temporal logistic model for chronic disease diagnoses at the local level. Linear predictors are modelled using individual risk factor covariates and a latent individual propensity to diseases. Leveraging a state space formulation of the model, we construct a framework in which temporal heterogeneity in regression coefficients is informed by exogenous information at the local level, corresponding to different contextual risk factors that may affect the occurrence of chronic diseases in different ways. To explore the utility and the effectiveness of our method, we analyse behavioural and risk factor surveillance data collected in Italy (PASSI), a country well known for its marked administrative, social and territorial diversity, which is reflected in high variability in morbidity among population subgroups.

2024-02-29T13:45:36Z revised Mattia Stival Lorenzo Schiavon Stefano Campostrini 10.1093/jrsssa/qnae138