http://arxiv.org/abs/2407.18835v5
Title: Robust Estimation of Polychoric Correlation
Updated: 2025-12-01T11:58:15Z
Abstract: Polychoric correlation is often an important building block in the analysis of rating data, particularly for structural equation models. However, the commonly employed maximum likelihood (ML) estimator is highly susceptible to misspecification of the polychoric correlation model, for instance through violations of latent normality assumptions. We propose a novel estimator that is designed to be robust against partial misspecification of the polychoric model, that is, when the model is misspecified for an unknown fraction of observations, such as careless respondents. To this end, the estimator minimizes a robust loss function based on the divergence between observed frequencies and theoretical frequencies implied by the polychoric model. In contrast to existing literature, our estimator makes no assumption on the type or degree of model misspecification. It furthermore generalizes ML estimation, is consistent as well as asymptotically normally distributed, and comes at no additional computational cost. We demonstrate the robustness and practical usefulness of our estimator in simulation studies and an empirical application on a Big Five administration. In the latter, the polychoric correlation estimates of our estimator and ML differ substantially, which, after further inspection, is likely due to the presence of careless respondents that the estimator helps identify.
Published: 2024-07-26T15:54:37Z
Comment: 78 pages (37 main text), 21 figures (9 in main text), 10 tables (5 in main text).
This is the final version of this article, as accepted in Psychometrika.
Journal reference: Psychometrika 91 (2026) 247-278
Authors: Max Welz, Patrick Mair, Andreas Alfons
DOI: 10.1017/psy.2025.10066

http://arxiv.org/abs/2512.00976v1
Title: Subgroup Validity in Machine Learning for Echocardiogram Data
Updated: 2025-11-30T16:47:04Z
Abstract: Echocardiogram datasets enable training deep learning models to automate interpretation of cardiac ultrasound, thereby expanding access to accurate readings of diagnostically-useful images. However, the gender, sex, race, and ethnicity of the patients in these datasets are underreported and subgroup-specific predictive performance is unevaluated. These reporting deficiencies raise concerns about subgroup validity that must be studied and addressed before model deployment. In this paper, we show that current open echocardiogram datasets are unable to assuage subgroup validity concerns. We improve sociodemographic reporting for two datasets: TMED-2 and MIMIC-IV-ECHO. Analysis of six open datasets reveals no consideration of gender-diverse patients and insufficient patient counts for many racial and ethnic groups. We further perform an exploratory subgroup analysis of two published aortic stenosis detection models on TMED-2. We find insufficient evidence for subgroup validity for sex, racial, and ethnic subgroups. Our findings highlight that more data for underrepresented subgroups, improved demographic reporting, and subgroup-focused analyses are needed to prove subgroup validity in future work.
Published: 2025-11-30T16:47:04Z
Authors: Cynthia Feeney, Shane Williams, Benjamin S. Wessler, Michael C. Hughes

http://arxiv.org/abs/2512.00183v1
Title: Incorporating Missingness in a Framework for Generating Realistic Synthetic Randomized Controlled Trial Data
Updated: 2025-11-28T19:45:14Z
Abstract: The current literature regarding generation of complex, realistic synthetic tabular data, particularly for randomized controlled trials (RCTs), often ignores missing data. However, missing data are common in RCT data and often are not Missing Completely At Random.
We bridge the gap of determining how best to generate realistic synthetic data while also accounting for the missingness mechanism. We demonstrate how to generate synthetic missing values while ensuring that synthetic data mimic the targeted real data distribution. We propose and empirically compare several data generation frameworks utilizing various strategies for handling missing data (complete case, inverse probability weighting, and multiple imputation) by quantifying generation performance through a range of metrics. Focusing on the Missing At Random setting, we find that incorporating additional models to account for the missingness always outperformed a complete case approach.
Published: 2025-11-28T19:45:14Z
Authors: Niki Z. Petrakos, Erica E. M. Moodie, Nicolas Savy

http://arxiv.org/abs/2209.09299v4
Title: Finite- and Large- Sample Inference for Model and Coefficients in High-dimensional Linear Regression with Repro Samples
Updated: 2025-11-27T00:18:29Z
Abstract: In this paper, we present a novel and effective inference approach to conduct both finite- and large-sample inference for high-dimensional linear regression models. This approach is developed under the so-called repro samples framework, in which we conduct statistical inference by creating and studying the behavior of artificial samples that are obtained by mimicking the sampling mechanism of the data. We construct confidence sets for (a) the true model corresponding to the nonzero coefficients, (b) a single or any collection of regression coefficients, and (c) both the model and regression coefficients jointly. To facilitate the constructions of these confidence sets and overcome computational difficulties of searching all possible models, we use an innovative Fisher inversion technique to construct a model candidate set that includes the true sparse model with the probability close to 1 for models with both Gaussian and non-Gaussian errors.
The proposed approach fills in two major gaps in the high-dimensional regression literature: (1) lack of effective approaches to addressing model selection uncertainty and providing valid inference for the underlying true model; (2) lack of effective inference approaches to guaranteeing finite-sample performance. We provide both finite-sample and asymptotic results to theoretically guarantee the performance of the proposed methods. In addition, our numerical results demonstrate that the proposed methods are valid and achieve better coverage with smaller confidence sets than the current state-of-the-art approaches, such as debiasing and bootstrap approaches.
Published: 2022-09-19T18:48:16Z
Authors: Peng Wang, Min-Ge Xie, Linjun Zhang

http://arxiv.org/abs/2511.18225v1
Title: Adaptive Conformal Prediction for Quantum Machine Learning
Updated: 2025-11-23T00:04:03Z
Abstract: Quantum machine learning seeks to leverage quantum computers to improve upon classical machine learning algorithms. Currently, robust uncertainty quantification methods remain underdeveloped in the quantum domain, despite the critical need for reliable and trustworthy predictions. Recent work has introduced quantum conformal prediction, a framework that produces prediction sets that are guaranteed to contain the true outcome with user-specified probability. In this work, we formalise how the time-varying noise inherent in quantum processors can undermine conformal guarantees, even when calibration and test data are exchangeable. To address this challenge, we draw on Adaptive Conformal Inference, a method which maintains validity over time via repeated recalibration. We introduce Adaptive Quantum Conformal Prediction (AQCP), an algorithm which preserves asymptotic average coverage guarantees under arbitrary hardware noise conditions.
Empirical studies on an IBM quantum processor demonstrate that AQCP achieves target coverage levels and exhibits greater stability than quantum conformal prediction.
Published: 2025-11-23T00:04:03Z
Comment: 26 pages, 5 figures
Authors: Douglas Spencer, Samual Nicholls, Michele Caprio

http://arxiv.org/abs/2511.16447v1
Title: Integrating Deep Learning and Spatial Statistics in Marine Ecosystem Monitoring
Updated: 2025-11-20T15:15:32Z
Abstract: In ecology, photogrammetry is a crucial method for efficiently collecting non-destructive samples of natural environments. When estimating the spatial distribution of animals, detecting objects in large-scale images becomes crucial. Object detection models enable large-scale analysis but introduce uncertainty because detection probability depends on various factors. To address detection bias, we model the distribution of a species of benthic animals (holothurians) in an area of the Italian Tyrrhenian coast near Giglio Island using a Thinned Log-Gaussian Cox Process (LGCP). We assume that a "true" intensity function accurately describes the distribution, while the observed process, resulting from independent thinning, is represented by a degraded intensity. The detection function controls the thinning mechanism, influenced by the object's location and other detection-related features. We use manual identification of holothurians as our benchmark.
We compare automatic detection with this benchmark, an unthinned LGCP, and the thinned model to highlight the improvements gained from the proposed approach. Our method allows researchers to use photogrammetry, automatically identify objects of interest, and correct biases and approximations caused by the observation process.
Published: 2025-11-20T15:15:32Z
Authors: Gian Mario Sangiovanni, Gianluca Mastrantonio, Daniele Ventura, Alessio Pollice, Giovanna Jona Lasinio

http://arxiv.org/abs/2511.14607v1
Title: A System Dynamics Approach to Evaluating Sludge Management Strategies in Vinasse Treatment: Cost-Benefit Analysis and Scenario Assessment
Updated: 2025-11-18T15:59:19Z
Abstract: In the Chilean local alcohol industry (pisco industry), each liter of alcohol produced generates 10-15 liters of vinasse, the main wastewater of the process. To comply with industrial waste regulations, vinasse must be stored, which enables evaporation, leaving behind a residual sludge. However, treating vinasse remains an environmental and industrial challenge, because its high nutrient concentration and acidity can degrade soil quality and harm surrounding vegetation. While previous studies have modeled sludge generation and transport in urban water systems, research on industrial wastewater, such as that of the alcohol industry, remains limited, which hampers the search for opportunities to improve the treatment process. This paper proposes a System Dynamics Model (SDM) to assess the costs associated with three management strategies: natural drying of vinasse, relocation to an alternative site, and implementation of a coagulation-flocculation treatment to accelerate sludge production. This paper makes two contributions. First, we describe a pioneering SDM applicable to sludge management, which includes variables such as sludge transport, coagulant quantity, and ambient temperature to build hypothetical scenarios that affect the treatment processes of vinasse.
Second, we present the expected results of the associated costs of the scenarios proposed in the model, helping decision-makers to manage vinasse. The model is calibrated with historical data provided by a company in the North of Chile, helping to improve the decision-making for vinasse treatment.
Published: 2025-11-18T15:59:19Z
Authors: Agustin Olivares, Paul Leger, Rodrigo Poblete

http://arxiv.org/abs/2511.14546v1
Title: PLS-SEM-power: A Shiny App and R package for Computing Required Sample Size and Minimum Detectable Effect Size in PLS-SEMs
Updated: 2025-11-18T14:50:18Z
Abstract: Despite its evanescent nature, statistical power is crucial for planning Partial Least Squares Structural Equation Modelling (PLS-SEM) studies. This brief paper introduces PLS-SEM-power, a Shiny Application and R package that implements the inverse square root method by Kock and Hadaya (2018) to calculate both the minimum required sample size (a priori analysis) and the Minimum Detectable Effect Size (MDES, sensitivity analysis), given a chosen significance level (alpha level) at 80% power (1 - beta). The application provides an intuitive user interface, facilitating reproducible and easily accessible analyses in diverse research contexts.
Published: 2025-11-18T14:50:18Z
Comment: For the associated Shiny App, see https://aleansani.shinyapps.io/pls-sem-power; for the user guide and code, see https://github.com/AleAnsani/plssempower
Authors: Alessandro Ansani, Elena Rinallo

http://arxiv.org/abs/2511.14052v1
Title: Making Evidence Actionable in Adaptive Learning
Updated: 2025-11-18T02:06:08Z
Abstract: Adaptive learning often diagnoses precisely yet intervenes weakly, yielding help that is mistimed or misaligned. This study presents evidence supporting an instructor-governed feedback loop that converts concept-level assessment evidence into vetted micro-interventions. The adaptive learning algorithm contains three safeguards: adequacy as a hard guarantee of gap closure, attention as a budgeted constraint for time and redundancy, and diversity as protection against overfitting to a single resource.
We formalize intervention assignment as a binary integer program with constraints for coverage, time, difficulty windows informed by ability estimates, prerequisites encoded by a concept matrix, and anti-redundancy enforced through diversity. Greedy selection serves low-richness and tight-latency regimes, gradient-based relaxation serves rich repositories, and a hybrid method transitions along a richness-latency frontier. In simulation and in an introductory physics deployment with one thousand two hundred four students, both solvers achieved full skill coverage for essentially all learners within bounded watch time. The gradient-based method reduced redundant coverage by approximately twelve percentage points relative to greedy and harmonized difficulty across slates, while greedy delivered comparable adequacy with lower computational cost in scarce settings. Slack variables localized missing content and supported targeted curation, sustaining sufficiency across subgroups. The result is a tractable and auditable controller that closes the diagnostic-pedagogical loop and delivers equitable, load-aware personalization at classroom scale.
Published: 2025-11-18T02:06:08Z
Authors: Amirreza Mehrabi, Jason W. Morphew, Breejha Quezada, N. Sanjay Rebello

http://arxiv.org/abs/2511.17575v1
Title: Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models
Updated: 2025-11-14T23:05:59Z
Abstract: We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this symbol-level framework, which assumes no morphology, syntax, or semantics, we derive several structural results. First, word lengths follow a geometric distribution governed solely by the probability of the space symbol.
Second, the expected number of words of a given length, and the expected number of distinct words of that length, admit closed-form expressions based on a coupon-collector argument. This yields a critical word length k* at which word types transition from appearing many times on average to appearing at most once. Third, combining the exponential growth of the number of possible strings of length k with the exponential decay of the probability of each string, we obtain a Zipf-type rank-frequency law p(r) proportional to r^{-alpha}, with an exponent determined explicitly by the alphabet size and the space probability.
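Editorial illustration (not part of the abstract): the first structural result, geometric word lengths, can be checked with a short simulation. This is a minimal sketch, not the author's code. The function names, the four-letter alphabet, the uniform letter probabilities, and the space probability q = 0.2 are all assumptions chosen for illustration. If a space is drawn with probability q, a word of length k needs k - 1 further non-space draws followed by a space, so P(length = k) = (1 - q)^(k-1) q and the mean length is 1/q.

```python
import random
from collections import Counter


def random_text(n_symbols, alphabet="abcd", p_space=0.2, seed=0):
    """Draw n_symbols i.i.d. symbols: a space with probability p_space,
    otherwise a letter chosen uniformly from alphabet (uniformity is an
    illustrative assumption; word lengths do not depend on it)."""
    rng = random.Random(seed)
    return "".join(
        " " if rng.random() < p_space else rng.choice(alphabet)
        for _ in range(n_symbols)
    )


def word_lengths(text):
    """Lengths of the maximal blocks of non-space symbols.
    str.split() with no argument splits on runs of whitespace."""
    return [len(word) for word in text.split()]


q = 0.2
lengths = word_lengths(random_text(200_000, p_space=q))
counts = Counter(lengths)

mean_len = sum(lengths) / len(lengths)   # geometric law predicts 1 / q
frac_len1 = counts[1] / len(lengths)     # geometric law predicts q

print(f"words: {len(lengths)}")
print(f"mean length: {mean_len:.3f} (predicted {1 / q:.3f})")
for k in range(1, 6):
    predicted = (1 - q) ** (k - 1) * q
    print(f"P(len={k}): observed {counts[k] / len(lengths):.4f}, predicted {predicted:.4f}")
```

The first and last words of the sample can be cut off by the text boundary, which is negligible at this sample size. Counting distinct words per length in the same sample would be a natural next step for exploring the critical length k*.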
Our contribution is twofold. Mathematically, we give a unified derivation linking word lengths, vocabulary growth, critical length, and rank-frequency structure in a single explicit model. Conceptually, we argue that this provides a structurally grounded null model for both natural-language word statistics and token statistics in large language models. The results show that Zipf-like patterns can arise purely from combinatorics and segmentation, without optimization principles or linguistic organization, and help clarify which phenomena require deeper explanation beyond random-text structure.
Published: 2025-11-14T23:05:59Z
Authors: Vladimir Berman

http://arxiv.org/abs/2507.09007v2
Title: Possibilistic inferential models: a review
Updated: 2025-11-12T13:11:39Z
Abstract: An inferential model (IM) is a model describing the construction of provably reliable, data-driven uncertainty quantification and inference about relevant unknowns. IMs and Fisher's fiducial argument have similar objectives, but a fundamental distinction between the two is that the former doesn't require that uncertainty quantification be probabilistic, offering greater flexibility and allowing for a proof of its reliability. Important recent developments have been made thanks in part to newfound connections with the imprecise probability literature, in particular, possibility theory. The brand of possibilistic IMs studied here are straightforward to construct, have very strong frequentist-like reliability properties, and offer fully conditional, Bayesian-like (imprecise) probabilistic reasoning. This paper reviews these key recent developments, describing the new theory, methods, and computational tools.
A generalization of the basic possibilistic IM is also presented, making new and unexpected connections with ideas in modern statistics and machine learning, e.g., bootstrap and conformal prediction.
Published: 2025-07-11T20:16:52Z
Authors: Ryan Martin

http://arxiv.org/abs/2511.07628v1
Title: Beyond Correctness: Evaluating and Improving LLM Feedback in Statistical Education
Updated: 2025-11-10T20:56:26Z
Abstract: Large language models (LLMs) have been proposed as scalable tools to address the gap between the importance of individualized written feedback and the practical challenges of providing it at scale. However, concerns persist regarding the accuracy, depth, and pedagogical value of their feedback responses. The present study investigates the extent to which LLMs can generate feedback that aligns with educational theory and compares techniques to improve their performance. Using mock in-class exam data from two consecutive years of an introductory statistics course at LMU Munich, we evaluated GPT-generated feedback against an established but expanded pedagogical framework. Four enhancement methods were compared in a highly standardized setting, making meaningful comparisons possible: Using a state-of-the-art model, zero-shot prompting, few-shot prompting, and supervised fine-tuning using Low-Rank Adaptation (LoRA). Results show that while all LLM setups reliably provided correctness judgments and explanations, their ability to deliver contextual feedback and suggestions on how students can monitor and regulate their own learning remained limited. Among the tested methods, zero-shot prompting achieved the strongest balance between quality and cost, while fine-tuning required substantially more resources without yielding clear advantages.
For educators, this suggests that carefully designed prompts can substantially improve the usefulness of LLM feedback, making it a promising tool, particularly in large introductory courses where students would otherwise receive little or no written feedback.
Published: 2025-11-10T20:56:26Z
Authors: Niklas Ippisch, Markus Herklotz, Anna-Carolina Haensch, Carsten Schwemmer

http://arxiv.org/abs/2511.05834v1
Title: Impacts of Data Splitting Strategies on Parameterized Link Prediction Algorithms
Updated: 2025-11-08T03:52:22Z
Abstract: Link prediction is a fundamental problem in network science, aiming to infer potential or missing links based on observed network structures. With the increasing adoption of parameterized models, the rigor of evaluation protocols has become critically important. However, a previously common practice of using the test set during hyperparameter tuning has led to human-induced information leakage, thereby inflating the reported model performance. To address this issue, this study introduces a novel evaluation metric, Loss Ratio, which quantitatively measures the extent of performance overestimation. We conduct large-scale experiments on 60 real-world networks across six domains. The results demonstrate that the information leakage leads to an average overestimation of about 3.6%, with the bias reaching over 15% for specific algorithms. Meanwhile, heuristic and random-walk-based methods exhibit greater robustness and stability.
The analysis uncovers a pervasive information leakage issue in link prediction evaluation and underscores the necessity of adopting standardized data splitting strategies to enable fair and reproducible benchmarking of link prediction models.
Published: 2025-11-08T03:52:22Z
Comment: 18 pages, 3 figures
Authors: Xinshan Jiao, Yuxin Luo, Yilin Bi, Tao Zhou

http://arxiv.org/abs/2511.04903v1
Title: Efficacy Analysis in Clinical Trials: A Comprehensive Review of Statistical and Machine Learning Approaches
Updated: 2025-11-07T01:05:12Z
Abstract: Efficacy testing is a cornerstone of clinical trials, ensuring that medical interventions achieve their intended therapeutic effects. Over the decades, a wide range of statistical methodologies have been developed to address the complexities of clinical trial data, including parametric, nonparametric, Bayesian, and machine learning approaches. Parametric methods, such as t-tests, ANOVA, and LMMs, have traditionally been the foundation of efficacy testing due to their efficiency under well-defined assumptions. Nonparametric techniques, including the Friedman test, Brunner-Munzel test, and modern extensions like nparLD, have emerged as robust alternatives, particularly for skewed, ordinal, or non-normal data. Bayesian methodologies have enabled the incorporation of prior information and uncertainty quantification, while machine learning techniques, such as deep learning and reinforcement learning, are revolutionizing trial designs and outcome predictions. Despite these advancements, significant gaps remain, including challenges in handling high-dimensional data, missingness, and ensuring equitable efficacy testing across diverse populations. This review provides a comprehensive overview of these statistical methods, highlighting their applications, strengths, limitations, and future directions.
By bridging traditional statistical frameworks with modern computational techniques, the field can continue to advance toward more reliable and personalized clinical trial methodologies.
Published: 2025-11-07T01:05:12Z
Authors: Dhrubajyoti Ghosh, Samhita Pal

http://arxiv.org/abs/2502.07948v2
Title: The nature of mathematical models
Updated: 2025-11-06T11:11:43Z
Abstract: Modeling has become a widespread, useful tool in mathematics applied to diverse fields, from physics to economics to biomedicine. Practitioners of modeling may use algebraic or differential equations, to the elements of which they attribute an intuitive relationship with some relevant aspect of reality they wish to represent. More sophisticated expressions may include stochasticity, either as observation error or system noise. However, a clear, unambiguous mathematical definition of what a model is and of what is the relationship between the model and the real-life phenomena it purports to represent has so far not been formulated. The present work aims to fill this gap, motivating the definition of a mathematical model as an operator on a Hilbert space of random variables, identifying the experimental realization as the map between the theoretical space of model construction and the computational space of statistical model identification, and tracing the relationship of the geometry of the model manifold in the abstract setting with the corresponding geometry of the prediction surfaces in statistical estimation.
Published: 2025-02-11T20:54:47Z
Comment: 23 pages, 3 figures
Authors: Andrea De Gaetano