http://arxiv.org/abs/2511.17575v1 Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models 2025-11-14T23:05:59Z

We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this symbol-level framework, which assumes no morphology, syntax, or semantics, we derive several structural results. First, word lengths follow a geometric distribution governed solely by the probability of the space symbol. Second, the expected number of words of a given length, and the expected number of distinct words of that length, admit closed-form expressions based on a coupon-collector argument. This yields a critical word length k* at which word types transition from appearing many times on average to appearing at most once. Third, combining the exponential growth of the number of possible strings of length k with the exponential decay of the probability of each string, we obtain a Zipf-type rank-frequency law p(r) proportional to r^{-alpha}, with an exponent determined explicitly by the alphabet size and the space probability. Our contribution is twofold. Mathematically, we give a unified derivation linking word lengths, vocabulary growth, critical length, and rank-frequency structure in a single explicit model. Conceptually, we argue that this provides a structurally grounded null model for both natural-language word statistics and token statistics in large language models. The results show that Zipf-like patterns can arise purely from combinatorics and segmentation, without optimization principles or linguistic organization, and help clarify which phenomena require deeper explanation beyond random-text structure.
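
The first structural claim (geometric word lengths governed only by the space probability) can be checked with a short simulation. This is a minimal sketch, not the paper's code: the function name, the parameter `space_prob`, and the sample sizes are assumptions chosen for illustration. Letter identities are ignored because they do not affect word lengths.

```python
import random
from collections import Counter

def simulate_word_lengths(n_symbols, space_prob, seed=0):
    """Draw n_symbols i.i.d. symbols; each is a space with probability
    space_prob, otherwise some letter. Return the lengths of the maximal
    blocks of non-space symbols (the 'words')."""
    rng = random.Random(seed)
    lengths = []
    current = 0
    for _ in range(n_symbols):
        if rng.random() < space_prob:
            if current > 0:          # a space closes the current word
                lengths.append(current)
            current = 0
        else:
            current += 1             # any letter extends the current word
    return lengths                   # an unfinished trailing word is dropped

q = 0.2                              # hypothetical space probability
lengths = simulate_word_lengths(200_000, q)
counts = Counter(lengths)
mean_length = sum(lengths) / len(lengths)
share_len_1 = counts[1] / len(lengths)

# Under a geometric law P(L = k) = (1 - q)^(k - 1) * q for k >= 1,
# the mean length is 1/q and the share of one-letter words is q.
print(mean_length, 1 / q)
print(share_len_1, q)
```

With a large enough sample, the empirical mean should sit close to 1/q and the share of one-letter words close to q, which is what a geometric distribution on k = 1, 2, ... predicts.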

2025-11-14T23:05:59Z Vladimir Berman http://arxiv.org/abs/2507.09007v2 Possibilistic inferential models: a review 2025-11-12T13:11:39Z

An inferential model (IM) is a model describing the construction of provably reliable, data-driven uncertainty quantification and inference about relevant unknowns. IMs and Fisher's fiducial argument have similar objectives, but a fundamental distinction between the two is that the former doesn't require that uncertainty quantification be probabilistic, offering greater flexibility and allowing for a proof of its reliability. Important recent developments have been made thanks in part to newfound connections with the imprecise probability literature, in particular, possibility theory. The brand of possibilistic IMs studied here is straightforward to construct, has very strong frequentist-like reliability properties, and offers fully conditional, Bayesian-like (imprecise) probabilistic reasoning. This paper reviews these key recent developments, describing the new theory, methods, and computational tools. A generalization of the basic possibilistic IM is also presented, making new and unexpected connections with ideas in modern statistics and machine learning, e.g., bootstrap and conformal prediction.

2025-07-11T20:16:52Z Journal of the American Statistical Association, volume 121, pages 807--826, 2026 Ryan Martin 10.1080/01621459.2025.2606127 http://arxiv.org/abs/2511.07628v1 Beyond Correctness: Evaluating and Improving LLM Feedback in Statistical Education 2025-11-10T20:56:26Z

Large language models (LLMs) have been proposed as scalable tools to address the gap between the importance of individualized written feedback and the practical challenges of providing it at scale. However, concerns persist regarding the accuracy, depth, and pedagogical value of their feedback responses. The present study investigates the extent to which LLMs can generate feedback that aligns with educational theory and compares techniques to improve their performance. Using mock in-class exam data from two consecutive years of an introductory statistics course at LMU Munich, we evaluated GPT-generated feedback against an established but expanded pedagogical framework. Four enhancement methods were compared in a highly standardized setting that makes meaningful comparisons possible: using a state-of-the-art model, zero-shot prompting, few-shot prompting, and supervised fine-tuning using Low-Rank Adaptation (LoRA). Results show that while all LLM setups reliably provided correctness judgments and explanations, their ability to deliver contextual feedback and suggestions on how students can monitor and regulate their own learning remained limited. Among the tested methods, zero-shot prompting achieved the strongest balance between quality and cost, while fine-tuning required substantially more resources without yielding clear advantages. For educators, this suggests that carefully designed prompts can substantially improve the usefulness of LLM feedback, making it a promising tool, particularly in large introductory courses where students would otherwise receive little or no written feedback.

2025-11-10T20:56:26Z Niklas Ippisch Markus Herklotz Anna-Carolina Haensch Carsten Schwemmer http://arxiv.org/abs/2511.04903v1 Efficacy Analysis in Clinical Trials: A Comprehensive Review of Statistical and Machine Learning Approaches 2025-11-07T01:05:12Z

Efficacy testing is a cornerstone of clinical trials, ensuring that medical interventions achieve their intended therapeutic effects. Over the decades, a wide range of statistical methodologies have been developed to address the complexities of clinical trial data, including parametric, nonparametric, Bayesian, and machine learning approaches. Parametric methods, such as t-tests, ANOVA, and LMMs, have traditionally been the foundation of efficacy testing due to their efficiency under well-defined assumptions. Nonparametric techniques, including the Friedman test, Brunner-Munzel test, and modern extensions like nparLD, have emerged as robust alternatives, particularly for skewed, ordinal, or non-normal data. Bayesian methodologies have enabled the incorporation of prior information and uncertainty quantification, while machine learning techniques, such as deep learning and reinforcement learning, are revolutionizing trial designs and outcome predictions. Despite these advancements, significant gaps remain, including challenges in handling high-dimensional data, missingness, and ensuring equitable efficacy testing across diverse populations. This review provides a comprehensive overview of these statistical methods, highlighting their applications, strengths, limitations, and future directions. By bridging traditional statistical frameworks with modern computational techniques, the field can continue to advance toward more reliable and personalized clinical trial methodologies.

2025-11-07T01:05:12Z Dhrubajyoti Ghosh Samhita Pal 10.51387/26-NEJSDS104 http://arxiv.org/abs/2502.07948v2 The nature of mathematical models 2025-11-06T11:11:43Z

Modeling has become a widespread, useful tool in mathematics applied to diverse fields, from physics to economics to biomedicine. Practitioners of modeling may use algebraic or differential equations, to the elements of which they attribute an intuitive relationship with some relevant aspect of reality they wish to represent. More sophisticated expressions may include stochasticity, either as observation error or system noise. However, a clear, unambiguous mathematical definition of what a model is, and of the relationship between the model and the real-life phenomena it purports to represent, has so far not been formulated. The present work aims to fill this gap, motivating the definition of a mathematical model as an operator on a Hilbert space of random variables, identifying the experimental realization as the map between the theoretical space of model construction and the computational space of statistical model identification, and tracing the relationship of the geometry of the model manifold in the abstract setting with the corresponding geometry of the prediction surfaces in statistical estimation.

2025-02-11T20:54:47Z 23 pages, 3 figures Andrea De Gaetano http://arxiv.org/abs/2511.04213v1 Can we trust LLMs as a tutor for our students? Evaluating the Quality of LLM-generated Feedback in Statistics Exams 2025-11-06T09:18:54Z

One of the central challenges for instructors is offering meaningful individual feedback, especially in large courses. Faced with limited time and resources, educators are often forced to rely on generalized feedback, even when more personalized support would be pedagogically valuable. To overcome this limitation, one potential technical solution is to utilize large language models (LLMs). For an exploratory study using a new platform connected with LLMs, we conducted an LLM-corrected mock exam during the "Introduction to Statistics" lecture at the University of Munich (Germany). The online platform allows instructors to upload exercises along with the correct solutions. Students complete these exercises and receive overall feedback on their results, as well as individualized feedback generated by GPT-4 based on the correct answers provided by the lecturers. The resulting dataset comprised task-level information for all participating students, including individual responses and the corresponding LLM-generated feedback. Our systematic analysis revealed that approximately 7% of the 2,389 feedback instances contained errors, ranging from minor technical inaccuracies to conceptually misleading explanations. Further, using a combined feedback framework approach, we found that the feedback predominantly focused on explaining why an answer was correct or incorrect, with fewer instances providing deeper conceptual insights, learning strategies or self-regulatory advice. These findings highlight both the potential and the limitations of deploying LLMs as scalable feedback tools in higher education, emphasizing the need for careful quality monitoring and prompt design to maximize their pedagogical value.

2025-11-06T09:18:54Z Preprint Markus Herklotz Niklas Ippisch Anna-Carolina Haensch http://arxiv.org/abs/2511.03242v1 Topography, climate, land cover, and biodiversity: Explaining endemic richness and management implications on a Mediterranean island 2025-11-05T07:09:18Z

Island endemism is shaped by complex interactions among environmental, ecological, and evolutionary factors, yet the relative contributions of topography, climate, and land cover remain incompletely quantified. We investigated the drivers of endemic plant richness across Crete, a Mediterranean biodiversity hotspot, using spatially explicit data on species distributions, topographic complexity, climatic variability, land cover, and soil characteristics. Artificial Neural Network models, a machine learning tool, were employed to assess the relative importance of these predictors and to identify hotspots of endemism. We found that total species richness, elevation range, and climatic variability were the strongest predictors of endemic richness, reflecting the role of biodiversity, topographic heterogeneity, and climatic gradients in generating diverse habitats and micro-refugia that promote speciation and buffer extinction risk. Endemic hotspots only partially overlapped with areas of high total species richness, indicating that total species richness was the best of the surrogates examined, yet an imperfect one. These environmentally heterogeneous areas also provide critical ecosystem services, including soil stabilization, pollination, and cultural value, which are increasingly threatened by tourism, renewable energy development, land-use change, and climate impacts. Our findings underscore the importance of prioritizing mountainous and climatically variable regions in conservation planning, integrating ecosystem service considerations, and accounting for within-island spatial heterogeneity. By explicitly linking the environmental drivers of endemism to both biodiversity patterns and ecosystem function, this study provides a framework for evidence-based conservation planning in Crete and other Mediterranean islands with similar geological and biogeographic contexts.

2025-11-05T07:09:18Z Aristides Moustakas Ioannis N Vogiatzakis http://arxiv.org/abs/2511.02881v1 From Hume to Jaynes: Induction as the Logic of Plausible Reasoning 2025-11-04T09:02:52Z

The problem of induction has persisted since Hume exposed the logical gap between repeated observation and universal inference. Traditional attempts to resolve it have oscillated between two extremes: the probabilistic optimism of Laplace and Jeffreys, who sought to quantify belief through probability, and the critical skepticism of Popper, who replaced confirmation with falsification. Both approaches, however, assume that induction must deliver certainty or its negation. In this paper, I argue that the problem of induction dissolves when recast in terms of logical coherence (understood as internal consistency of credences under updating) rather than truth. Following E. T. Jaynes, probability is interpreted not as frequency or decision rule but as the extension of deductive logic to incomplete information. Under this interpretation, Bayes's theorem is not an empirical statement but a consistency condition that constrains rational belief updating. Induction thus emerges as the special case of deductive reasoning applied to uncertain premises. Falsification appears as the limiting form of Bayesian updating when new data drive posterior plausibility toward zero, while the Bayes Factor quantifies the continuous spectrum of evidential strength. Through analytical examples, including Laplace's sunrise problem, Jeffreys's mixed prior, and confidence-based reformulations, I show that only the logic of plausible reasoning unifies these perspectives without contradiction. Induction, properly understood, is not the leap from past to future but the discipline of maintaining coherence between evidence, belief, and information.
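
For readers unfamiliar with Laplace's sunrise problem mentioned above, the standard result is the rule of succession: under a uniform prior on the unknown success probability, after s successes in n trials the posterior predictive probability of another success is (s + 1)/(n + 2). The sketch below is an illustration of that textbook formula only; the function name is hypothetical and nothing here is taken from the paper's own examples.

```python
from fractions import Fraction

def rule_of_succession(successes, trials):
    """Posterior predictive probability of success on the next trial,
    assuming a uniform (Beta(1, 1)) prior on the success probability."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)

# No data: the prediction is the prior mean, 1/2.
print(rule_of_succession(0, 0))
# Ten sunrises observed on ten days: 11/12, high but never exactly 1.
print(rule_of_succession(10, 10))
```

The prediction approaches 1 as uninterrupted successes accumulate but never reaches it, which is the sense in which the updating yields graded plausibility rather than certainty.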

2025-11-04T09:02:52Z Tommaso Costa http://arxiv.org/abs/2510.13389v2 Understanding and Using the Relative Importance Measures Based on Orthogonalization and Reallocation 2025-11-03T05:28:34Z

A class of relative importance measures based on orthogonalization and reallocation, ORMs, has been found to effectively approximate the General Dominance index (GD). In particular, Johnson's Relative Weight (RW) has been deemed the most successful ORM in the literature. Nevertheless, the theoretical foundation of the ORMs remains unclear. To further understand the ORMs, we provide a generalized framework that breaks down the ORM into two functional steps: orthogonalization and reallocation. To assess the impact of each step on the performance of ORMs, we conduct extensive Monte Carlo simulations under various predictors' correlation structures and response variable distributions. Our findings reveal that Johnson's minimal transformation consistently outperforms other common orthogonalization methods. We also summarize the performance of reallocation methods under four scenarios of predictors' correlation structures in terms of the first principal component and the variance inflation factor (VIF). This analysis provides guidelines for selecting appropriate reallocation methods in different scenarios, illustrated with real-world dataset examples. Our research offers a deeper understanding of ORMs and provides valuable insights for practitioners seeking to accurately measure variable importance in various modeling contexts.

2025-10-15T10:28:09Z 20 pages, 10 figures Tien-En Chang Argon Chen http://arxiv.org/abs/2511.00982v1 The Neutrality Boundary Framework: Quantifying Statistical Robustness Geometrically 2025-11-02T15:50:21Z

We introduce the Neutrality Boundary Framework (NBF), a set of geometric metrics for quantifying statistical robustness and fragility as the normalized distance from the neutrality boundary, the manifold where the effect equals zero. The neutrality boundary value nb in [0,1) provides a threshold-free, sample-size invariant measure of stability that complements traditional effect sizes and p-values. We derive the general form nb = |Delta - Delta_0| / (|Delta - Delta_0| + S), where S>0 is a scale parameter for normalization; we prove boundedness and monotonicity, and provide domain-specific implementations: Risk Quotient (binary outcomes), partial eta^2 (ANOVA), and Fisher z-based measures (correlation). Unlike threshold-dependent fragility indices, NBF quantifies robustness geometrically across arbitrary significance levels and statistical contexts.
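
The general form stated in the abstract, nb = |Delta - Delta_0| / (|Delta - Delta_0| + S), is simple enough to sketch directly. The function and argument names below are assumptions; in particular, the abstract says S is a scale parameter with domain-specific choices, so here it is left as a free input rather than computed from data.

```python
def neutrality_boundary(delta, delta_0=0.0, scale=1.0):
    """Normalized distance of an effect `delta` from the neutral value
    `delta_0`, following nb = |d| / (|d| + S) with d = delta - delta_0.
    `scale` plays the role of S and must be strictly positive."""
    if scale <= 0:
        raise ValueError("scale (S) must be > 0")
    distance = abs(delta - delta_0)
    return distance / (distance + scale)

# An effect sitting on the neutrality boundary gives nb = 0.
print(neutrality_boundary(0.0))
# Effects farther from the boundary give larger values, always below 1.
print(neutrality_boundary(1.0, scale=1.0))   # 0.5
print(neutrality_boundary(3.0, scale=1.0))   # 0.75
```

Because the distance appears in both numerator and denominator, the value is bounded in [0, 1) and increases monotonically with the distance from the boundary for any fixed S, matching the boundedness and monotonicity properties the abstract mentions.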

2025-11-02T15:50:21Z 8 pages, no figures Thomas F. Heston http://arxiv.org/abs/2503.10710v3 How causal perspectives can inform neuroscience data analysis 2025-11-01T01:41:09Z

Over the past two decades, considerable strides have been made in advancing neuroscientific techniques, yet challenges remain in attributing causality to observed associations. This review addresses a fundamental issue in observational neuroscience studies and advocates for incorporating causal inference frameworks into standard practice. We systematically introduce necessary definitions and concepts, emphasizing how causal assumptions underlie statistical analyses even when not explicitly stated. Through a running example on sleep quality and white matter integrity, we illustrate how persistent challenges, including confounding and selection biases, can be conceptualized and addressed using causal frameworks. We demonstrate practical approaches for making assumption violations transparent through hands-on examples: supplementary case studies using multi-site harmonization and head motion exclusion procedures provide step-by-step diagnostic techniques for checking covariate overlap and identifying selection bias through exclusion pattern analysis. We explore how these causal perspectives can inform both experimental design and analytical choices, particularly for observational studies where traditional randomization is infeasible. Together, we believe this framework offers concrete tools for strengthening causal interpretations and inspiring more robust approaches to problems in neuroscience.

2025-03-12T22:20:24Z Eric W. Bridgeford Brian S. Caffo Maya B. Mathur Russell A. Poldrack http://arxiv.org/abs/2409.14284v5 Survey Data Integration for Distribution Function Estimation 2025-10-30T15:34:00Z

Estimates of finite population cumulative distribution functions (CDFs) and quantiles are critical for policy-making, resource allocation, and public health planning. For instance, federal finance agencies may require accurate estimates of the proportion of individuals with income below the federal poverty line to determine funding eligibility, while health organizations may rely on precise quantile estimates of key health variables to guide local health interventions. Despite growing interest in survey data integration, research on the integration of probability and nonprobability samples to estimate CDFs and quantiles remains limited. In this study, we propose a novel residual-based CDF estimator that integrates information from a probability sample with data from potentially large nonprobability samples. Our approach leverages shared covariates observed in both datasets, while the response variable is available only in the nonprobability sample. Using a semiparametric approach, we train an outcome model on the nonprobability sample and incorporate model residuals with sampling weights from the probability sample to estimate the CDF of the target variable. Based on this CDF estimator, we define a quantile estimator and introduce linearization and bootstrap methods for variance estimation of both the CDF and quantile estimators. Under certain regularity conditions, we establish the asymptotic properties, including bias and variance, of the CDF estimator. Our empirical findings support the theoretical results and demonstrate the favorable performance of the proposed estimators relative to plug-in mass imputation estimators and the naïve estimators derived from the nonprobability sample only. A real data example is presented to illustrate the proposed estimators.

2024-09-22T01:09:19Z Jeremy Flood Sayed Mostafa http://arxiv.org/abs/2510.26177v1 Variable selection in spatial lag models using the focussed information criterion 2025-10-30T06:35:04Z

Spatial regression models have a variety of applications in several fields ranging from economics to public health. Typically, it is of interest to select important exogenous predictors of the spatially autocorrelated response variable. In this paper, we propose variable selection in linear spatial lag models by means of the focussed information criterion (FIC). The FIC-based variable selection involves the minimization of the asymptotic risk in the estimation of a certain parametric focus function of interest under potential model misspecification. We systematically investigate the key asymptotics of the maximum likelihood estimators under the sequence of locally perturbed mutually contiguous probability models. Using these results, we obtain the expressions for the bias and the variance of the estimated focus leading to the desired FIC formula. We provide practically useful focus functions that account for various spatial characteristics such as mean response, variability in the estimation and spatial spillover effects. Furthermore, we develop an averaged version of the FIC that incorporates varying covariate levels while evaluating the models. The empirical performance of the proposed methodology is demonstrated through simulations and real data analysis.

2025-10-30T06:35:04Z 20 pages, 2 figures, 3 tables Sagar Pandhare Divya Kappara Siuli Mukhopadhyay http://arxiv.org/abs/2510.23830v1 Statistical estimation of $π$: varying choices over dimensions 2025-10-27T20:16:49Z

This article studies statistical estimation of $π$ based on the fact that the ratio of the volumes of a $d$-dimensional hypersphere and a $d$-dimensional hypercube is a certain function of $π$, and the function depends on the dimension $d$. The estimation of $π$ is carried out for various choices of $d$ (strictly speaking, $d\in\{1, 2, \ldots, 20\}$) using the idea of Monte Carlo simulations. Various intriguing facts are observed, and the estimation of $π$ using infinite dimensional observations is outlined. Moreover, the R codes associated with relevant numerical studies are provided.
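
The volume-ratio idea can be sketched as follows. The unit ball in d dimensions has volume π^(d/2) / Γ(d/2 + 1) and the enclosing cube [-1, 1]^d has volume 2^d, so the fraction of uniform cube points that land inside the ball can be inverted for π. The paper provides R code; the Python below is an independent illustration, with the function name, seed, and sample sizes chosen as assumptions.

```python
import math
import random

def estimate_pi(d, n_samples, seed=0):
    """Monte Carlo estimate of pi from the ball-to-cube volume ratio in
    dimension d, using points drawn uniformly from the cube [-1, 1]^d."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        squared_norm = sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(d))
        if squared_norm <= 1.0:
            inside += 1
    ratio = inside / n_samples
    # ratio ~ pi^(d/2) / (Gamma(d/2 + 1) * 2^d); solve for pi.
    return (ratio * 2 ** d * math.gamma(d / 2 + 1)) ** (2 / d)

for d in (2, 4, 6):
    print(d, estimate_pi(d, 100_000))
```

Even dimensions are used in the usage loop because Γ(d/2 + 1) is then a plain factorial; for odd d the gamma value itself involves the square root of π, which makes the inversion somewhat circular. The hit rate also shrinks quickly as d grows, so the estimate becomes noisier in high dimensions for a fixed number of samples.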

2025-10-27T20:16:49Z 8 pages, 4 figures. This is a preliminary draft. The manuscript will be updated further before formal communication Syon Bhattacharjee Subhra Sankar Dhar http://arxiv.org/abs/2505.20822v2 Larger cities, more commuters, more crime? The role of inter-city commuting in the scaling of urban crime 2025-10-27T11:09:03Z

Cities attract a daily influx of non-resident commuters, reflecting their roles within wider urban networks -- not as isolated places. However, it remains unclear how this interconnectivity shapes the way crime scales with population, given that larger cities tend to receive more commuters and experience more crime. In this work, we investigate how inter-city commuting relates to the population-crime relationship. We find that larger cities receive proportionately more commuters, which in turn is associated with higher levels of burglary, drug possession, robbery, shoplifting, and theft. For example, each 1% increase in inbound commuters corresponds to a 0.32% rise in theft and 0.20% rise in burglary, holding population size constant. We demonstrate that models incorporating both population size and commuter inflows explain variation in these offenses better than population-only models. Our findings underscore the importance of considering how cities are connected -- not just their population size -- in disentangling the population-crime relationship.
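
The reported figures read as elasticities: a coefficient of 0.32 means a 1% change in commuters goes with roughly a 0.32% change in theft. The sketch below assumes a power-law (log-log) relationship, crime proportional to commuters raised to the elasticity, which is a common reading of such statements but is not spelled out in the abstract; the function name is hypothetical.

```python
def percent_change_in_outcome(elasticity, percent_change_in_input):
    """Percent change in an outcome y when the input x changes by the given
    percent, assuming y is proportional to x ** elasticity with all other
    variables held fixed."""
    factor = (1 + percent_change_in_input / 100) ** elasticity
    return (factor - 1) * 100

# Elasticities quoted in the abstract for theft and burglary.
for offense, elasticity in (("theft", 0.32), ("burglary", 0.20)):
    print(offense,
          percent_change_in_outcome(elasticity, 1),    # 1% more commuters
          percent_change_in_outcome(elasticity, 10))   # 10% more commuters
```

For small changes the result is close to elasticity times the percent change; for larger changes the power-law form gives slightly less than that linear approximation because both elasticities are below 1.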

2025-05-27T07:31:43Z 19 pages, 3 figures Simon Puttock Umberto Barros Diego Pinheiro Marcos Oliveira