http://arxiv.org/abs/2412.13116v1
Equity in the Use of ChatGPT for the Classroom: A Comparison of the Accuracy and Precision of ChatGPT 3.5 vs. ChatGPT4 with Respect to Statistics and Data Science Exams
2024-12-17T17:38:13Z
A college education historically has been seen as a method of moving upward with regard to income brackets and social status. Indeed, many colleges recognize this connection and seek to enroll talented low income students. While these students might have their education, books, room, and board paid, there are other items that they might be expected to use that are not part of most college scholarship packages. One of those items that has recently surfaced is access to generative AI platforms. The most popular of these platforms is ChatGPT, and it has a paid version (ChatGPT4) and a free version (ChatGPT3.5). We seek to explore differences in the free and paid versions in the context of homework questions and data analyses as might be seen in a typical introductory statistics course. We determine the extent to which students who cannot afford newer and faster versions of generative AI programs would be disadvantaged in terms of writing such projects and learning these methods.
2024-12-17T17:38:13Z
Originally submitted for review in May of 2024 but rejected 6 months later
Monnie McGee, Bivin Sadler

http://arxiv.org/abs/2412.11211v1
Deep Learning-based Approaches for State Space Models: A Selective Review
2024-12-15T15:04:35Z
State-space models (SSMs) offer a powerful framework for dynamical system analysis, wherein the temporal dynamics of the system are assumed to be captured through the evolution of the latent states, which govern the values of the observations.
This paper provides a selective review of recent advancements in deep neural network-based approaches for SSMs, and presents a unified perspective for discrete time deep state space models and continuous time ones such as latent neural Ordinary Differential and Stochastic Differential Equations. It starts with an overview of the classical maximum likelihood based approach for learning SSMs, reviews the variational autoencoder as a general learning pipeline for neural network-based approaches in the presence of latent variables, and discusses in detail representative deep learning models that fall under the SSM framework. Very recent developments, where SSMs are used as standalone architectural modules for improving efficiency in sequence modeling, are also examined. Finally, examples involving mixed frequency and irregularly-spaced time series data are presented to demonstrate the advantage of SSMs in these settings.
2024-12-15T15:04:35Z
Jiahe Lin, George Michailidis

http://arxiv.org/abs/2401.01973v2
Facilitating the Integration of Ethical Reasoning into Quantitative Courses: Stakeholder Analysis, Ethical Practice Standards, and Case Studies
2024-12-10T20:07:52Z
Case studies are typically used to teach 'ethics', but in quantitative courses it can seem distracting, for both instructor and learner, to introduce a case analysis. Moreover, case analyses are typically focused on issues relating to people: obtaining consent, dealing with research team members, and/or potential institutional policy violations. While relevant to some research, not all students in quantitative courses plan to become researchers, and ethical practice is an essential topic for students of mathematics, statistics, data science, and computing regardless of whether or not the learner intends to do research.
Ethical reasoning is a way of thinking that requires the individual to assess what they know about a potential ethical problem (their prerequisite knowledge), and in some cases, how behaviors they observe, are directed to perform, or have performed, diverge from what they know to be ethical behavior. Ethical reasoning is a learnable, improvable set of knowledge, skills, and abilities that enable learners to recognize what they do and do not know about what constitutes 'ethical practice' of a discipline, and in some cases, to contemplate alternative decisions about how to first recognize, and then proceed past, or respond to, such divergences. A stakeholder analysis is part of prerequisite knowledge, and can be used whether there is or is not an actual case or situation to react to. In courses with mainly quantitative content, a stakeholder analysis is a useful tool for instruction and assessment. It can be used to both integrate authentic ethical content and encourage careful quantitative thought. It is a mistake to treat 'training in ethical practice' and 'training in responsible conduct of research' as the same thing. This paper discusses how to introduce ethical reasoning, stakeholder analysis, and ethical practice standards authentically in quantitative courses.
2024-01-03T20:44:55Z
To appear (verbatim) in H. Doosti, (Ed.). Ethics in Statistics. Cambridge, UK: Ethics International Press. Originally published (verbatim) in Proceedings of the 2022 Joint Statistical Meetings, Washington, DC. Alexandria, VA: American Statistical Association. pp. 1493-1519
Rochelle E. Tractenberg, Suzanne Thorton

http://arxiv.org/abs/2403.10987v3
Risk Quadrangle and Robust Optimization Based on Extended $\varphi$-Divergence
2024-12-10T03:20:46Z
The Fundamental Risk Quadrangle (FRQ) is a unified framework linking risk management, statistical estimation, and optimization.
Distributionally robust optimization (DRO) based on $\varphi$-divergence minimizes the maximal expected loss, where the maximum is over a $\varphi$-divergence ambiguity set. This paper introduces the \emph{extended} $\varphi$-divergence and the extended $\varphi$-divergence quadrangle, which integrates DRO into the FRQ framework. We derive the primal and dual representations of the quadrangle elements (risk, deviation, regret, error, and statistic). The dual representation provides an interpretation of classification, portfolio optimization, and regression as robust optimization based on the extended $\varphi$-divergence. The primal representation offers tractable formulations of these robust optimizations as convex optimization. We provide illustrative examples showing that many common problems, such as least-squares regression, quantile regression, support vector machines, and CVaR optimization, fall within this framework. Additionally, we conduct a case study to visualize the optimal solution of the inner maximization in robust optimization.
2024-03-16T17:56:31Z
Cheng Peng, Anton Malandii, Stan Uryasev

http://arxiv.org/abs/2412.04735v1
A dynamical measure of algorithmically infused visibility
2024-12-06T02:51:39Z
This work focuses on the nature of visibility in societies where the behaviours of humans and algorithms influence each other - termed algorithmically infused societies. We propose a quantitative measure of visibility, with implications and applications to an array of disciplines including communication studies, political science, marketing, technology design, and social media analytics. The measure captures the basic characteristics of the visibility of a given topic, in algorithm/AI-mediated communication/social media settings. Topics, when trending, are ranked against each other, and the proposed measure combines the following two attributes of a topic: (i) the amount of time a topic spends at different ranks, and (ii) the different ranks the topic attains.
The proposed measure incorporates a tunable parameter, termed the discrimination level, whose value determines the relative weights of the two attributes that contribute to visibility. Analysis of a large-scale, real-time dataset of trending topics, from one of the largest social media platforms, demonstrates that the proposed measure can explain a large share of the variability of the accumulated views of a topic.
2024-12-06T02:51:39Z
28 pages, 5 figures
Shaojing Sun, Zhiyuan Liu, David Waxman

http://arxiv.org/abs/2312.04898v2
Quantifying the effectiveness of linear preconditioning in Markov chain Monte Carlo
2024-12-04T08:40:49Z
We study linear preconditioning in Markov chain Monte Carlo. We consider the class of well-conditioned distributions, for which several mixing time bounds depend on the condition number $κ$. First we show that well-conditioned distributions exist for which $κ$ can be arbitrarily large and yet no linear preconditioner can reduce it. We then impose two sets of extra assumptions under which a linear preconditioner can significantly reduce $κ$. For the random walk Metropolis we further provide upper and lower bounds on the spectral gap with tight $1/κ$ dependence. This allows us to give conditions under which linear preconditioning can provably increase the gap. We then study popular preconditioners such as the covariance, its diagonal approximation, the Hessian at the mode, and the QR decomposition. We show conditions under which each of these reduces $κ$ to near its minimum. We also show that the diagonal approach can in fact \textit{increase} the condition number. This is of interest as diagonal preconditioning is the default choice in well-known software packages.
We conclude with a numerical study comparing preconditioners in different models, and showing how proper preconditioning can greatly reduce compute time in Hamiltonian Monte Carlo.
2023-12-08T08:29:47Z
Max Hird, Samuel Livingstone

http://arxiv.org/abs/2412.02969v1
Unified Inductive Logic: From Formal Learning to Statistical Inference to Supervised Learning
2024-12-04T02:31:31Z
While the traditional conception of inductive logic is Carnapian, I develop a Peircean alternative and use it to unify formal learning theory, statistics, and a significant part of machine learning: supervised learning. Some crucial standards for evaluating non-deductive inferences have been assumed separately in those areas, but can actually be justified by a unifying principle.
2024-12-04T02:31:31Z
Hanti Lin

http://arxiv.org/abs/2412.02367v1
Internalist Reliabilism in Statistics and Machine Learning: Thoughts on Jun Otsuka's Thinking about Statistics
2024-12-03T10:47:24Z
Otsuka (2023) argues for a correspondence between data science and traditional epistemology: Bayesian statistics is internalist; classical (frequentist) statistics is externalist, owing to its reliabilist nature; model selection is pragmatist; and machine learning is a version of virtue epistemology. Where he sees diversity, I see an opportunity for unity. In this article, I argue that classical statistics, model selection, and machine learning share a foundation that is reliabilist in an unconventional sense that aligns with internalism.
Hence a unification under internalist reliabilism.
2024-12-03T10:47:24Z
The Asian Journal of Philosophy 3, 81 (2024)
Hanti Lin
10.1007/s44204-024-00210-6

http://arxiv.org/abs/2411.19902v1
Noncommutative Model Selection for Data Clustering and Dimension Reduction Using Relative von Neumann Entropy
2024-11-29T18:04:11Z
We propose a pair of completely data-driven algorithms for unsupervised classification and dimension reduction, and we empirically study their performance on a number of data sets, both simulated data in three dimensions and images from the COIL-20 data set. The algorithms take as input a set of points sampled from a uniform distribution supported on a metric space, the latter embedded in an ambient metric space, and they output a clustering or reduction of dimension of the data. They work by constructing a natural family of graphs from the data and selecting the graph which maximizes the relative von Neumann entropy of certain normalized heat operators constructed from the graphs. Once the appropriate graph is selected, the eigenvectors of the graph Laplacian may be used to reduce the dimension of the data, and clusters in the data may be identified with the kernel of the associated graph Laplacian. Notably, these algorithms do not require information about the size of a neighborhood or the desired number of clusters as input, in contrast to popular algorithms such as $k$-means, and even more modern spectral methods such as Laplacian eigenmaps, among others.
In our computational experiments, our clustering algorithm outperforms $k$-means clustering on data sets with non-trivial geometry and topology, in particular data whose clusters are not concentrated around a specific point, and our dimension reduction algorithm is shown to work well in several simple examples.
2024-11-29T18:04:11Z
20 pages
Araceli Guzmán-Tristán, Antonio Rieser

http://arxiv.org/abs/2411.19140v1
Examining Multimodal Gender and Content Bias in ChatGPT-4o
2024-11-28T13:41:44Z
This study investigates ChatGPT-4o's multimodal content generation, highlighting significant disparities in its treatment of sexual content and nudity versus violent and drug-related themes. Detailed analysis reveals that ChatGPT-4o consistently censors sexual content and nudity, while showing leniency towards violence and drug use. Moreover, a pronounced gender bias emerges, with female-specific content facing stricter regulation compared to male-specific content. This disparity likely stems from media scrutiny and public backlash over past AI controversies, prompting tech companies to impose stringent guidelines on sensitive issues to protect their reputations. Our findings emphasize the urgent need for AI systems to uphold genuine ethical standards and accountability, transcending mere political correctness. This research contributes to the understanding of biases in AI-driven language and multimodal models, calling for more balanced and ethical content moderation practices.
2024-11-28T13:41:44Z
17 pages, 4 figures, 3 tables. Conference: "14th International Conference on Artificial Intelligence, Soft Computing and Applications (AIAA 2024), London, 23-24 November 2024" It will be published in the proceedings "David C. Wyld et al.
(Eds): IoTE, CNDC, DSA, AIAA, NLPTA, DPPR - 2024"
Roberto Balestri

http://arxiv.org/abs/2411.18838v1
Contrasting the optimal resource allocation to cybersecurity and cyber insurance using prospect theory versus expected utility theory
2024-11-28T00:59:48Z
Protecting against cyber-threats is vital for every organization and can be done by investing in cybersecurity controls and purchasing cyber insurance. However, these are interlinked since insurance premiums could be reduced by investing more in cybersecurity controls. The expected utility theory and the prospect theory are two alternative theories explaining decision-making under risk and uncertainty, which can inform strategies for optimizing resource allocation. While the former is considered a rational approach, research has shown that most people make decisions consistent with the latter, including on insurance uptake. We compare and contrast these two approaches to provide important insights into how they could lead to different optimal allocations, resulting in differing risk exposure as well as financial costs.
We introduce the concept of a risk curve and show that identifying the nature of the risk curve is a key step in deriving the optimal resource allocation.
2024-11-28T00:59:48Z
Chaitanya Joshi, Jinming Yang, Sergeja Slapnicar, Ryan K L Ko

http://arxiv.org/abs/1212.6339v2
A paradox on the spectral representation of stationary random processes
2024-11-25T21:04:31Z
In this note our aim is to show a paradox in the spectral representation of stationary random processes.
2012-12-27T10:33:34Z
We find that this paradox is not true
Mohammad Mohammadi, Adel Mohammadpour, Afshin Parvardeh

http://arxiv.org/abs/2409.14606v5
A Modified Satterthwaite (1941,1946) Effective Degrees of Freedom Approximation
2024-11-22T17:54:44Z
This study introduces a correction to the approximation of effective degrees of freedom as proposed by Satterthwaite (1941, 1946), specifically addressing scenarios where component degrees of freedom are small. The correction is grounded in analytical results concerning the moments of standard normal random variables. This modification is applicable to complex variance estimates that involve both small and large degrees of freedom, offering an enhanced approximation of the higher moments required by Satterthwaite's framework. Additionally, this correction extends and partially validates the empirically derived adjustment by Johnson & Rust (1992), as it is based on theoretical foundations rather than simulations used to derive empirical transformation constants.
2024-09-22T21:54:15Z
Matthias von Davier

http://arxiv.org/abs/2407.19433v2
How Books Tell a History of Statistics in Portugal: Works of Foreigners, Estrangeirados, and Others
2024-11-20T23:05:37Z
Foreigners and "estrangeirados", an expression meaning "people going to a foreign country ["estrangeiro"] getting there further education", had a leading role in the development of Mathematical Statistics in Portugal.
In what concerns Statistics, "estrangeirados" in the nineteenth century were mainly liberal intellectuals exiled for political reasons. From 1930 onwards, the research funding authority sent university professors abroad, and hired foreign researchers to stay in Portuguese institutions, and some of them were instrumental in the importation of new concepts and methods of inferential statistics. After 1970, there was a huge program of sending young researchers abroad for doctoral studies. At the same time, many new universities and polytechnic institutes were created in Portugal. After that, aside from foreigners who chose to have a research career in those institutions and the "estrangeirados" who had returned and created programs of doctoral studies, others, who had not had the opportunity of studying abroad, began to play a decisive role in the development of Statistics in Portugal. The publication of handbooks on Probability and Statistics, theses and core papers in Portuguese scientific journals, and also of works for the layman, reveals how Statistics progressed from a descriptive to a mathematical discipline used for inference in all fields of knowledge, from natural sciences to methodology of scientific research.
2024-07-28T09:16:58Z
67 pages, 34 figures
Communications in Mathematics, Volume 32 (2024), Issue 3 (Special issue: Portuguese Mathematics) (November 22, 2024) cm:14005
Dinis Pestana, Rui Santos
10.46298/cm.14005

http://arxiv.org/abs/2402.19162v2
A Bayesian approach to uncover local and temporal determinants of heterogeneity in repeated cross-sectional health surveys
2024-11-15T14:18:17Z
In several countries, including Italy, a prominent approach to population health surveillance involves conducting repeated cross-sectional surveys at short intervals of time. These surveys gather information on the health status of individual respondents, including details on their behaviours, risk factors, and relevant socio-demographic information.
While the collected data undoubtedly provides valuable information, modelling such data presents several challenges. For instance, in health risk models, it is essential to consider behavioural information, local and temporal dynamics, and disease co-occurrence. In response to these challenges, our work proposes a multivariate temporal logistic model for chronic disease diagnoses at local level. Linear predictors are modelled using individual risk factor covariates and a latent individual propensity to diseases. Leveraging a state space formulation of the model, we construct a framework in which temporal heterogeneity in regression coefficients is informed by exogenous information at local level, corresponding to different contextual risk factors that may affect the occurrence of chronic diseases in different ways. To explore the utility and the effectiveness of our method, we analyse behavioural and risk factor surveillance data collected in Italy (PASSI), a country well known for its highly peculiar administrative, social and territorial diversities, which are reflected in high variability in morbidity among population subgroups.
2024-02-29T13:45:36Z
revised
Mattia Stival, Lorenzo Schiavon, Stefano Campostrini
10.1093/jrsssa/qnae138