https://arxiv.org/api/tzCihuVFi5eTpNM0Vo7mq3ZGcRw2026-06-10T11:22:51Z168619515http://arxiv.org/abs/2510.00900v1How can the use of different modes of survey data collection introduce bias? A simple introduction to mode effects using directed acyclic graphs (DAGs)2025-10-01T13:44:00ZSurvey data are self-reported data collected directly from respondents by a questionnaire or an interview and are commonly used in epidemiology. Such data are traditionally collected via a single mode (e.g. face-to-face interview alone), but use of mixed-mode designs (e.g. offering face-to-face interview or online survey) has become more common. This introduces two key challenges. First, individuals may respond differently to the same question depending on the mode; these differences due to measurement are known as 'mode effects'. Second, different individuals may participate via different modes; these differences in sample composition between modes are known as 'mode selection'. Where recognised, mode effects are often handled by straightforward approaches such as conditioning on survey mode. However, while reducing mode effects, this and other equivalent approaches may introduce collider bias in the presence of mode selection. The existence of mode effects and the consequences of naïve conditioning may be underappreciated in epidemiology. This paper offers a simple introduction to these challenges using directed acyclic graphs by exploring a range of possible data structures. 
We discuss the potential implications of using conditioning- or imputation-based approaches and outline the advantages of quantitative bias analyses for dealing with mode effects.2025-10-01T13:44:00ZGeorgia D TomovaRichard J SilverwoodPeter WG TennantLiam Wright10.1093/aje/kwag017http://arxiv.org/abs/2509.26141v1CLT for LES of real valued random centrosymmetric matrices2025-09-30T11:58:04ZWe study the fluctuations of the eigenvalues of real valued large centrosymmetric random matrices via its linear eigenvalue statistic. This is essentially a central limit theorem (CLT) for sums of dependent random variables. The dependence among them leads to behavior that differs from the classical CLT. The main contribution of this article is finding the expression of the variance of the limiting Gaussian distribution. The crux of the proof lies in combinatorial arguments that involve counting overlapping loops in complete undirected weighted graphs with growing degrees.2025-09-30T11:58:04ZIndrajit JanaSunita Ranihttp://arxiv.org/abs/2510.03266v1Variational Autoencoders-based Detection of Extremes in Plant Productivity in an Earth System Model2025-09-26T22:03:20ZClimate anomalies significantly impact terrestrial carbon cycle dynamics, necessitating robust methods for detecting and analyzing anomalous behavior in plant productivity. This study presents a novel application of variational autoencoders (VAE) for identifying extreme events in gross primary productivity (GPP) from Community Earth System Model version 2 simulations across four AR6 regions in the Continental United States. We compare VAE-based anomaly detection with traditional singular spectral analysis (SSA) methods across three time periods: 1850-80, 1950-80, and 2050-80 under the SSP585 scenario. 
The VAE architecture employs three dense layers and a latent space with an input sequence length of 12 months, trained on a normalized GPP time series to reconstruct the GPP and identify anomalies based on reconstruction errors. Extreme events are defined using 5th percentile thresholds applied to both VAE and SSA anomalies. Results demonstrate strong regional agreement between VAE and SSA methods in spatial patterns of extreme event frequencies, despite VAE producing higher threshold values (179-756 GgC for VAE vs. 100-784 GgC for SSA across regions and periods). Both methods reveal increasing magnitudes and frequencies of negative carbon cycle extremes toward 2050-80, particularly in Western and Central North America. The VAE approach shows comparable performance to established SSA techniques, while offering computational advantages and enhanced capability for capturing non-linear temporal dependencies in carbon cycle variability. Unlike SSA, the VAE method does not require one to define the periodicity of the signals in the data; it discovers them from the data.2025-09-26T22:03:20ZBharat SharmaJitendra Kumarhttp://arxiv.org/abs/2502.11820v3A Diagnostic to Find and Help Combat Stochastic Positivity Issues -- with a Focus on Continuous Treatments2025-09-25T13:50:13ZThe positivity assumption is central in the identification of a causal effect, and especially the stochastic variant is an issue many applied researchers face, yet is rarely discussed, especially in conjunction with continuous treatments or Modified Treatment Policies. One common recommendation for dealing with a violation is to change the estimand. However, an applied researcher is faced with two problems: First, how can she tell whether there is a stochastic positivity violation given her estimand of interest, preferably without having to estimate a model first? 
Second, if she finds a problem with stochastic positivity, how should she change her estimand in order to arrive at an estimand which does not face the same issues? We suggest a novel diagnostic which allows the researcher to answer both questions by providing insights into how well an estimation for a certain estimand can be made for each observation using the data at hand. We provide a simulation study on the general behaviour of different Modified Treatment Policies (MTPs) at different levels of stochastic positivity violations and show how the diagnostic helps understand where bias is to be expected. We illustrate the application of our proposed diagnostic in a pharmacoepidemiological study based on data from CHAPAS-3, a trial comparing different treatment regimens for children living with HIV.2025-02-17T14:13:09Z33 pages (24 without appendix), 12 figures (7 without appendix)Katharina RingMichael Schomaker10.1515/jci-2025-0007http://arxiv.org/abs/2509.19511v1A direct approach for full-field state-parameter estimation from fusion of noncollocated multi-rate sensor data using UKF-based algorithms2025-09-23T19:27:37ZHeterogeneous sensor setups may entail measurements recorded at varying sampling frequencies, commonly known as multi-rate data. For system identification and state estimation with such data, existing studies mostly focus on data fusion algorithms that utilize acceleration measurements, with collocated measurements of other types at lower sampling frequencies, to estimate the displacement at the collocated location with the sampling frequency of the acceleration measurements. The obtained displacements, along with the available acceleration measurements, are then utilized for system identification. 
This paper introduces a direct and straightforward methodology aimed at estimating the states (i.e., displacements and velocities) along with the unknown structural parameters from fused multi-rate data through Unscented Kalman Filter (UKF) based algorithms with a modification during measurement update. By utilizing all available measurements at any time instant, which can differ due to the multi-rate nature, and by modifying the non-linear measurement equation of the system accordingly at the considered time instant, the UKF framework is suitably tailored for direct applications with multi-rate measurements. The approach is demonstrated with a variety of numerical and laboratory-scale experiments, including fusion of higher sampling frequency acceleration data with lower sampling frequency displacement, axial strain, or bending strain data. The results show that the approach is successful in accurately estimating full-field states and parameters. The state estimates compare well with those obtained using existing data fusion algorithms. The advantages of the approach lie in not requiring collocated sensing, in its generalizability for different types of measurements, in its simplicity and ease of implementation, and in achieving both the state and parameter estimates simultaneously.2025-09-23T19:27:37Z14 pages, 11 figuresDhiraj GhoshAdrita KunduSuparno Mukhopadhyayhttp://arxiv.org/abs/2509.19123v1George Udny Yule and the Interpretation of Regression Betas2025-09-23T15:10:13ZInitially applied in astronomy and geodesy, the linear regression model aimed to find the best estimates for parameters with predefined meanings. E.g., orbital elements, geodetic constants. As its use expanded to other disciplines, often to summarize data without an underlying theoretical model, the need for a general interpretation of regression betas arose. Early attempts by Galton and Karl Pearson met with mixed success. G. U. 
Yule was the first to develop a general statistical interpretation, the culmination of efforts begun in 1896. Yule's interpretation is based on the partial regression theorem, which he proved in 1907.2025-09-23T15:10:13ZFrancesco Coriellihttp://arxiv.org/abs/2509.17122v1Insensitivity-induced potential non-uniqueness in system identification of Bouc-Wen models2025-09-21T15:26:50ZDuring system identification of a structural system with Bouc-Wen (BW) restoring force mechanisms, the estimated BW parameters may be different for different sets of input-output measurements, indicating potential non-uniqueness in the parameter estimates. Nonetheless, the non-unique and incorrectly estimated BW parameters may result in dynamic responses and hysteretic behaviours which are very similar to those obtained for the correct system. In this work, the existence of alternate sets of BW parameters, which result in hysteretic restoring force behaviour similar to the true system, is studied analytically. Approximate expressions for the rate of change of the hysteretic force with deformation are derived and analyzed in detail. It is shown that alternate sets of BW parameters with significant deviations from a set of "true" BW parameters may exist, which cause the rate of change of the hysteretic force, and consequently the restoring force behaviour itself, to remain very similar to that obtained with the "true" BW parameters. The existence of these alternate parameters results in potential non-uniqueness of the BW parameter estimates, despite satisfying analytical identifiability requirements. Furthermore, the deviations of the alternate BW parameters depend on the magnitudes of the "true" BW parameters as well as the extent of the hysteretic action being developed by the input excitation. The results are illustrated using different inputs: sinusoidal, El Centro motion, and a suite of ground motions compatible with the Kanai-Tajimi spectrum. 
The results of this work help in a better understanding of the potential non-uniqueness issues associated with the estimation of the BW parameters from measured responses using any system identification technique, which is caused by the insensitivity of these parameters towards the dynamic responses of the structure.2025-09-21T15:26:50Z23 pages, 18 figuresAdrita KunduSuparno Mukhopadhyayhttp://arxiv.org/abs/2503.21719v3The Principle of Redundant Reflection2025-09-18T15:45:04ZThe fact that redundant information does not update a rational belief implies that rational beliefs are updated using Bayes rule. In the framework of Hild (1998a), this is true under mild conditions for discrete, continuous, and arbitrary measure spaces. We prove this result and illustrate it with two examples.2025-03-27T17:31:22Z11 pages, 0 figuresMartin MetodievMaarten MarsmanLourens WaldorpQuentin F. GronauEric-Jan Wagenmakershttp://arxiv.org/abs/2506.22236v3A Plea for History and Philosophy of Statistics and Machine Learning2025-09-18T10:12:59ZThe integration of the history and philosophy of statistics was initiated at least by Hacking (1975) and advanced by Hacking (1990), Mayo (1996), and Zabell (2005), but it has not received sustained follow-up. Yet such integration is more urgent than ever, as the recent success of artificial intelligence has been driven largely by machine learning -- a field historically developed alongside statistics. Today, the boundary between statistics and machine learning is increasingly blurred. What we now need is integration, twice over: of history and philosophy, and of two fields they engage -- statistics and machine learning. I present a case study of a philosophical idea in machine learning (and in formal epistemology) whose root can be traced back to an often under-appreciated insight in Neyman and Pearson's 1936 work (a follow-up to their 1933 classic). 
This leads to the articulation of an epistemological principle -- largely implicit in, but shared by, the practices of frequentist statistics and machine learning -- which I call achievabilism: the thesis that the correct standard for assessing non-deductive inference methods should not be fixed, but should instead be sensitive to what is achievable in specific problem contexts. Another integration also emerges at the level of methodology, combining two ends of the philosophy of science spectrum: history and philosophy of science on the one hand, and formal epistemology on the other hand.2025-06-27T13:59:08ZHanti Linhttp://arxiv.org/abs/2001.10488v4Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications2025-09-17T07:38:53Z(The third edition corrects minor typos and adds 3 chapters synthesized from published papers plus an appendix on maximum entropy distributions.) The monograph investigates the misapplication of conventional statistical techniques to fat tailed distributions and looks for remedies, when possible.
Switching from thin tailed to fat tailed distributions requires more than "changing the color of the dress". Traditional asymptotics deal mainly with either $n=1$ or $n=\infty$, and the real world is in between, under the "laws of the medium numbers" -- which vary widely across specific distributions. Both the law of large numbers and the generalized central limit mechanisms operate in highly idiosyncratic ways outside the standard Gaussian or Levy-Stable basins of convergence.
A few examples:
+ The sample mean is rarely in line with the population mean, with an effect on "naive empiricism", but can sometimes be estimated via parametric methods.
+ The "empirical distribution" is rarely empirical.
+ Parameter uncertainty has compounding effects on statistical metrics.
+ Dimension reduction (principal components) fails.
+ Inequality estimators (GINI or quantile contributions) are not additive and produce wrong results.
+ Many "biases" found in psychology become entirely rational under more sophisticated probability distributions.
+ Most of the failures of financial economics, econometrics, and behavioral economics can be attributed to using the wrong distributions.
This book, the first volume of the Technical Incerto, weaves a narrative around published journal articles.2020-01-24T14:45:55ZThird Revised Edition, 2025Nassim Nicholas Talebhttp://arxiv.org/abs/2510.01210v1Minimum Sample Size Calculation for Multivariable Regression of Continuous Outcomes in Chemometrics for Astrobiology and Planetary Science2025-09-17T01:35:14ZOver the last few decades, prediction models have become a fundamental tool in statistics, chemometrics, and related fields. However, to ensure that such models have high value, the inferences that they generate must be reliable. In this regard, the internal validity of a prediction model might be threatened if it is not calibrated with a sufficiently large sample size, as problems such as overfitting may occur. Such situations would be highly problematic in many fields, including space science, as the resulting inferences from prediction models often inform scientific inquiry about planetary bodies such as Mars. Therefore, to better inform the development of prediction models, we applied a theory-based guidance from the biomedical domain for establishing what the minimum sample size is under a range of conditions for continuous outcomes. This study aims to disseminate existing research criteria in biomedical research to a broader audience, specifically focusing on their potential applicability and utility within the field of chemometrics. As such, the paper emphasizes the importance of interdisciplinarity, bridging the gap between the medical domain and chemometrics. Lastly, we provide several examples of work in the context of space science. This work will be the foundation for more evidence-based model development and ensure rigorous predictive modelling in the search for life and possible habitable environments.2025-09-17T01:35:14Z19 pagesM. KonstantinidisE. A. LallaS. J. GonzalezJ. ManriqueG. Lopez-ReyesA. BarlowE. SawyersB. BarriosM. G. 
Dalyhttp://arxiv.org/abs/2412.12233v2Russian roulette: The need for stochastic potential outcomes when utilities depend on counterfactuals2025-09-16T16:23:51ZIt has been proposed in medical decision analysis to express the ``first do no harm'' principle as an asymmetric utility function in which the loss from killing a patient would count more than the gain from saving a life. Such a utility depends on unrealized potential outcomes, and we show how this yields a paradoxical decision recommendation in a simple hypothetical example involving games of Russian roulette. The problem is resolved if we abandon the stable unit treatment value assumption (SUTVA) and allow the potential outcomes to be random variables. This leads us to conclude that, if you are interested in this sort of asymmetric utility function, you need to move to the stochastic potential outcome framework. We discuss the implications of the choice of parameterization in this setting.2024-12-16T15:29:18ZAndrew GelmanJonas M. Mikhaeil10.1093/biomet/asaf062http://arxiv.org/abs/2412.14222v2A Survey on Large Language Model-based Agents for Statistics and Data Science2025-09-14T04:25:33ZIn recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. 
Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.2024-12-18T15:03:26ZAm. Statist. (2025) 1-14Maojun SunRuijian HanBinyan JiangHouduo QiDefeng SunYancheng YuanJian Huang10.1080/00031305.2025.2561140http://arxiv.org/abs/2509.07147v2On the Ambiguities of Incompatibility in Frequentist Inference2025-09-12T05:17:41ZThe interpretation of the P-value and its monotone transform s=-log2(p), or S-value, remains debated despite decades of dedicated literature. Within the neo-Fisherian framework, these values are often described as indices of (in)compatibility between the observed data and a set of ideal assumptions (i.e., the statistical model). In this regard, this paper proposes the distinction between two domains: the model domain, where assumptions are taken as perfectly true and every admissible outcome is, by construction, fully compatible with the model; and the real domain, where assumptions may fail and face empirical scrutiny. I argue that, although interpreted through an objective numerical index, any level of incompatibility can arise only in the latter domain, where the epistemic status of the model under examination is uncertain and a genuine conflict between data and hypotheses can therefore occur. The extent to which P- and S-values are taken as indicating incompatibility is a matter of contextual judgment. Within this framework, descriptive approaches serve to quantify the numerical values of P and S; these can be interpreted as indicative of a certain degree (or amount) of incompatibility between data and hypotheses once causal knowledge of the data-generating process and information about the costs and benefits of related decisions become clearer. 
Although the distinction between the model domain and the real domain may appear merely theoretical or even philosophical, I argue that this perspective is useful for developing a clear mental representation of how statistical estimates should be evaluated in practical settings and applications.2025-09-08T18:53:19ZAlessandro Rovettahttp://arxiv.org/abs/2409.16613v5Oral exams in introductory statistics class with non-native English speakers2025-09-12T00:08:31ZOral exams are a powerful tool to assess students' learning. This is particularly important in introductory statistics classes where students struggle to grasp various topics like the interpretation of probability, $p$-values and more. The challenge of acquiring conceptual understanding is only heightened when students are learning in a second language. In this paper, I share my experience administering oral exams to an introductory statistics class of non-native English speakers at a Japanese university. I explain the context of the university and course, before detailing the exam. Of particular interest is the relationship between exam performance and English proficiency. The results showed little relationship between the two, meaning the exam seemed to truly test students' statistical knowledge rather than their English ability. I close with encouragements and recommendations for practitioners hoping to implement similar oral exams, focusing on the unique difficulties faced by students not learning in their mother tongue.2024-09-25T04:31:07ZEric Yanchenko