Modeling diffusion in networks with communities: a multitype branching process approach

2025-12-08T09:51:48Z

The dynamics of diffusion in complex networks are widely studied to understand how entities, such as information, diseases, or behaviors, spread in an interconnected environment. Complex networks often present community structure, and tools to analyze diffusion processes on networks with communities are needed. In this paper, we develop theoretical tools using multi-type branching processes to model and analyze diffusion processes, following a simple contagion mechanism, across a broad class of networks with community structure. We show how, by using limited information about the network -- the degree distribution within and between communities -- we can calculate standard statistical characteristics of propagation dynamics, such as the extinction probability, hazard function, and cascade size distribution. These properties can be estimated not only for the entire network but also for each community separately. Furthermore, we estimate the probability of spread crossing from one community to another where it is not currently spreading. We demonstrate the accuracy of our framework by applying it to two specific examples: the Stochastic Block Model and a log-normal network with community structure. We show how the initial seeding location affects the observed cascade size distribution on a heavy-tailed network and that our framework accurately captures this effect.

Evidence and Elimination: A Bayesian Interpretation of Falsification in Scientific Practice

2025-12-07T10:32:51Z

The classical conception of falsification presents scientific theories as entities that are decisively refuted when their predictions fail. This picture has long been challenged by both philosophical analysis and scientific practice, yet the relationship between Popper's eliminative view of theory testing and Bayesian model comparison remains insufficiently articulated. This paper develops a unified account in which falsification is reinterpreted as a Bayesian process of model elimination. A theory is not rejected because it contradicts an observation in a logical sense; it is eliminated because it assigns vanishing integrated probability to the data in comparison with an alternative model. This reinterpretation resolves the difficulties raised by the Duhem-Quine thesis, clarifies the status of auxiliary hypotheses, and explains why ad hoc modifications reduce rather than increase theoretical credibility. The analysis is illustrated through two classical episodes in celestial mechanics, the discovery of Neptune and the anomalous precession of Mercury. In the Neptune case, an auxiliary hypothesis internal to Newtonian gravity dramatically increases the marginal likelihood of the theory, preserving it from apparent refutation. In the Mercury case, no permissible auxiliary modification can rescue the Newtonian model, while general relativity assigns high probability to the anomaly without adjustable parameters. The resulting posterior collapse provides a quantitative realisation of Popper's eliminative criterion. Bayesian model comparison therefore supplies the mathematical structure that Popper's philosophy lacked and offers a coherent account of scientific theory change as a process of successive eliminations within a space of competing models.

Interpret the estimand framework from a causal inference perspective

2025-12-06T07:52:00Z

The estimand framework proposed by ICH in 2017 has brought fundamental changes in the pharmaceutical industry. It clearly describes how a treatment effect in a clinical question should be precisely defined and estimated, through attributes including treatments, endpoints and intercurrent events. However, ideas around the estimand framework are commonly in text, and different interpretations on this framework may exist. This article aims to interpret the estimand framework through its underlying theories, the causal inference framework based on potential outcomes. The statistical origin and formula of an estimand is given through the causal inference framework, with all attributes translated into statistical terms. We describe how five strategies proposed by ICH to analyze intercurrent events are incorporated in the statistical formula of an estimand, and we also suggest a new strategy to analyze intercurrent events. The roles of target populations and analysis sets in the estimand framework are compared and discussed based on the statistical formula of an estimand. This article recommends continuing studying causal inference theories behind the estimand framework and improving the estimand framework with greater methodological comprehensibility and availability.

Robustness Test for AI Forecasting of Hurricane Florence Using FourCastNetv2 and Random Perturbations of the Initial Condition

2025-12-04T23:47:21Z

Understanding the robustness of a weather forecasting model with respect to input noise or different uncertainties is important in assessing its output reliability, particularly for extreme weather events like hurricanes. In this paper, we test sensitivity and robustness of an artificial intelligence (AI) weather forecasting model: NVIDIAs FourCastNetv2 (FCNv2). We conduct two experiments designed to assess model output under different levels of injected noise in the models initial condition. First, we perturb the initial condition of Hurricane Florence from the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) dataset (September 13-16, 2018) with varying amounts of Gaussian noise and examine the impact on predicted trajectories and forecasted storm intensity. Second, we start FCNv2 with fully random initial conditions and observe how the model responds to nonsensical inputs. Our results indicate that FCNv2 accurately preserves hurricane features under low to moderate noise injection. Even under high levels of noise, the model maintains the general storm trajectory and structure, although positional accuracy begins to degrade. FCNv2 consistently underestimates storm intensity and persistence across all levels of injected noise. With full random initial conditions, the model generates smooth and cohesive forecasts after a few timesteps, implying the models tendency towards stable, smoothed outputs. Our approach is simple and portable to other data-driven AI weather forecasting models.

Invited Discussion of "Model Uncertainty and Missing Data: An Objective Bayesian Perspective" by Gonzalo García-Donato , María Eugenia Castellanos , Stefano Cabras Alicia Quirós , and Anabel Forte

2025-12-02T22:15:13Z

The article by Garc{í}a-Donato and co-authors addresses the dual challenges of accounting for model uncertainty and missing data within the Gaussian regression frameworks from an objective Bayesian perspective. Thru the use of an imputation $g$-prior that replaces $X_γ^TX_γ$ for model $γ$ in the covariance of $β_γ$ with $Σ_{X_γ}$, the authors develop a coherent approach to addressing the missing data problem and model uncertainty simultaneously with random $X_γ$ in the missing at random (MAR) or missing completely at random (MCAR) settings, while still being computationally tractable. I discuss the connection of the imputation $g$-prior to the $g$-prior with imputed $X$, and to model selection for graphical models that provide an alternative justification for the $g$-prior for random $X$s.

Geometric Modeling of Hippocampal Tau Deposition: A Surface-Based Framework for Covariate Analysis and Off-Target Contamination Detection

2025-12-02T18:58:40Z

We introduce a framework combining geometric modeling with disease progression analysis to investigate tau deposition in Alzheimer's disease (AD) using positron emission tomography (PET) data. Focusing on the hippocampus, we construct a principal surface that captures the spatial distribution and morphological changes of tau pathology. By projecting voxels onto this surface, we quantify tau coverage, intensity, and thickness through bidirectional projection distances and interpolated standardized uptake value ratios (SUVR). This low-dimensional embedding preserves spatial specificity while mitigating multiple comparison issues. Covariate effects are analyzed using a two-stage regression model with inverse probability weighting to adjust for signal sparsity and selection bias. Using the SuStaIn model, we identify subtypes and stages of AD, revealing distinct tau dynamics: the limbic-predominant subtype shows age-related nonlinear accumulation in coverage and thickness, whereas the posterior subtype exhibits uniform SUVR increases across disease progression. Model-based predictions show that hippocampal tau deposition follows a structured spatial trajectory expanding bidirectionally with increasing thickness, while subtype differences highlight posterior hippocampal involvement consistent with whole-brain patterns. Finally, directional signal patterns on the principal surface reveal contamination from the choroid plexus, demonstrating the broader applicability of the proposed framework across modalities including amyloid PET.

Q-triplet characterization of atmospheric time series at Antofagasta: A missing values problem

2025-12-02T14:28:57Z

Located in northern Chile (23.7°S, 70.4°W), Antofagasta has an exceptionally arid and stable climate characterized by minimal precipitation and consistent weather patterns. Nevertheless, despite these climate conditions being meaningful for several research and practical applications, our understanding of weather dynamics remains limited. The available meteorological data from 1969 to 2016 is analogical, which presents a significant challenge to analyze because these records are riddled with missing values, some measurements were taken at irregular measuring intervals, making it an interesting puzzle to grasp the Antofagasta's climate scenario. To overcome this issue, we present a comprehensive statistical analysis of atmospheric temperature, pressure, and humidity time series. Our analytical approach involves the q-triplet calculation method, serving as a powerful tool to identify distinctive behavior within systems under non-equilibrium states. Our results suggest that, in general, the q-triplet values satisfy the condition $q_\text{sens}<1

Robust Estimation of Polychoric Correlation

2025-12-01T11:58:15Z

Polychoric correlation is often an important building block in the analysis of rating data, particularly for structural equation models. However, the commonly employed maximum likelihood (ML) estimator is highly susceptible to misspecification of the polychoric correlation model, for instance through violations of latent normality assumptions. We propose a novel estimator that is designed to be robust against partial misspecification of the polychoric model, that is, when the model is misspecified for an unknown fraction of observations, such as careless respondents. To this end, the estimator minimizes a robust loss function based on the divergence between observed frequencies and theoretical frequencies implied by the polychoric model. In contrast to existing literature, our estimator makes no assumption on the type or degree of model misspecification. It furthermore generalizes ML estimation, is consistent as well as asymptotically normally distributed, and comes at no additional computational cost. We demonstrate the robustness and practical usefulness of our estimator in simulation studies and an empirical application on a Big Five administration. In the latter, the polychoric correlation estimates of our estimator and ML differ substantially, which, after further inspection, is likely due to the presence of careless respondents that the estimator helps identify.

Subgroup Validity in Machine Learning for Echocardiogram Data

2025-11-30T16:47:04Z

Echocardiogram datasets enable training deep learning models to automate interpretation of cardiac ultrasound, thereby expanding access to accurate readings of diagnostically-useful images. However, the gender, sex, race, and ethnicity of the patients in these datasets are underreported and subgroup-specific predictive performance is unevaluated. These reporting deficiencies raise concerns about subgroup validity that must be studied and addressed before model deployment. In this paper, we show that current open echocardiogram datasets are unable to assuage subgroup validity concerns. We improve sociodemographic reporting for two datasets: TMED-2 and MIMIC-IV-ECHO. Analysis of six open datasets reveals no consideration of gender-diverse patients and insufficient patient counts for many racial and ethnic groups. We further perform an exploratory subgroup analysis of two published aortic stenosis detection models on TMED-2. We find insufficient evidence for subgroup validity for sex, racial, and ethnic subgroups. Our findings highlight that more data for underrepresented subgroups, improved demographic reporting, and subgroup-focused analyses are needed to prove subgroup validity in future work.

Incorporating Missingness in a Framework for Generating Realistic Synthetic Randomized Controlled Trial Data

2025-11-28T19:45:14Z

The current literature regarding generation of complex, realistic synthetic tabular data, particularly for randomized controlled trials (RCTs), often ignores missing data. However, missing data are common in RCT data and often are not Missing Completely At Random. We bridge the gap of determining how best to generate realistic synthetic data while also accounting for the missingness mechanism. We demonstrate how to generate synthetic missing values while ensuring that synthetic data mimic the targeted real data distribution. We propose and empirically compare several data generation frameworks utilizing various strategies for handling missing data (complete case, inverse probability weighting, and multiple imputation) by quantifying generation performance through a range of metrics. Focusing on the Missing At Random setting, we find that incorporating additional models to account for the missingness always outperformed a complete case approach.

Finite- and Large- Sample Inference for Model and Coefficients in High-dimensional Linear Regression with Repro Samples

2025-11-27T00:18:29Z

In this paper, we present a novel and effective inference approach to conduct both finite- and large-sample inference for high-dimensional linear regression models. This approach is developed under the so-called repro samples framework, in which we conduct statistical inference by creating and studying the behavior of artificial samples that are obtained by mimicking the sampling mechanism of the data. We construct confidence sets for (a) the true model corresponding to the nonzero coefficients, (b) a single or any collection of regression coefficients, and (c) both the model and regression coefficients jointly. To facilitate the constructions of these confidence sets and overcome computational difficulties of searching all possible models, we use an innovative Fisher inversion technique to construct a model candidate set that includes the true sparse model with the probability close to 1 for models with both Gaussian and non-Gaussian errors. The proposed approach fills in two major gaps in the high-dimensional regression literature: (1) lack of effective approaches to addressing model selection uncertainty and providing valid inference for the underlying true model; (2) lack of effective inference approaches to guaranteeing finite-sample performance. We provide both finite-sample and asymptotic results to theoretically guarantee the performance of the proposed methods. In addition, our numerical results demonstrate that the proposed methods are valid and achieve better coverage with smaller confidence sets than the current state-of-the-art approaches, such as debiasing and bootstrap approaches.

Integrating Deep Learning and Spatial Statistics in Marine Ecosystem Monitoring

2025-11-20T15:15:32Z

In ecology, photogrammetry is a crucial method for efficiently collecting non-destructive samples of natural environments. When estimating the spatial distribution of animals, detecting objects in large-scale images becomes crucial. Object detection models enable large-scale analysis but introduce uncertainty because detection probability depends on various factors. To address detection bias, we model the distribution of a species of benthic animals (holothurians) in an area of the Italian Tyrrhenian coast near Giglio Island using a Thinned Log-Gaussian Cox Process (LGCP). We assume that a "true" intensity function accurately describes the distribution, while the observed process, resulting from independent thinning, is represented by a degraded intensity. The detection function controls the thinning mechanism, influenced by the object's location and other detection-related features. We use manual identification of holothurians as our benchmark. We compare automatic detection with this benchmark, an unthinned LGCP, and the thinned model to highlight the improvements gained from the proposed approach.Our method allows researchers to use photogrammetry, automatically identify objects of interest, and correct biases and approximations caused by the observation process.

A System Dynamics Approach to Evaluating Sludge Management Strategies in Vinasse Treatment: Cost-Benefit Analysis and Scenario Assessment

2025-11-18T15:59:19Z

In the Chilean local alcohol industry (pisco indus- try), for one liter of alcohol produced 10-15 liters of vinasse as the main wastewater of the process. To comply with industrial waste regulations, vinasse must be stored, which enables evaporation, leaving behind a residual sludge. However, treating vinasse remains an environmental and industrial challenge, having a high nutrient concentration and acidity that can degrade soil quality and harm surrounding vegetation. While previous studies have modeled sludge generation and transport in urban water systems, research on industrial wastewater, such as the alcohol industry, remains limited, affecting the search for opportunities to improve the treatment process.. This paper proposes a System Dynamics Model (SDM) to assess the costs associated with three management strategies: natural drying of vinasse, relocation to an alternative site, and implementation of a coagulation- flocculation treatment to accelerate sludge production. This paper makes two contributions. First, we describe a pioneer SDM applicable to sludge management, which includes variables such as sludge transport, coagulant quantity, and ambient temperature to make hypothetical scenarios that affect the treatment processes of vinasse. Second, we present the expected results of the associated costs of the scenarios proposed in the model, helping decision-makers to manage vinasse. The model is calibrated with historical data provided by a company in the North of Chile, helping to improve the decision-making for vinasse treatment.

PLS-SEM-power: A Shiny App and R package for Computing Required Sample Size and Minimum Detectable Effect Size in PLS-SEMs

2025-11-18T14:50:18Z

Despite its evanescent nature, statistical power is crucial for planning Partial Least Squares Structural Equation Modelling (PLS-SEM) studies. This brief paper introduces PLS-SEM-power, a Shiny Application and R package that implements the inverse square root method by Kock and Hadaya (2018) to calculate both the minimum required sample size (a priori analysis) and the Minimum Detectable Effect Size (MDES, sensitivity analysis), given a chosen significance level (alpha level) at 80% power (1 - beta). The application provides an intuitive user interface, facilitating reproducible and easily accessible analyses in diverse research contexts.

Making Evidence Actionable in Adaptive Learning

2025-11-18T02:06:08Z

Adaptive learning often diagnoses precisely yet intervenes weakly, yielding help that is mistimed or misaligned. This study presents evidence supporting an instructor-governed feedback loop that converts concept-level assessment evidence into vetted micro-interventions. The adaptive learning algorithm contains three safeguards: adequacy as a hard guarantee of gap closure, attention as a budgeted constraint for time and redundancy, and diversity as protection against overfitting to a single resource. We formalize intervention assignment as a binary integer program with constraints for coverage, time, difficulty windows informed by ability estimates, prerequisites encoded by a concept matrix, and anti-redundancy enforced through diversity. Greedy selection serves low-richness and tight-latency regimes, gradient-based relaxation serves rich repositories, and a hybrid method transitions along a richness-latency frontier. In simulation and in an introductory physics deployment with one thousand two hundred four students, both solvers achieved full skill coverage for essentially all learners within bounded watch time. The gradient-based method reduced redundant coverage by approximately twelve percentage points relative to greedy and harmonized difficulty across slates, while greedy delivered comparable adequacy with lower computational cost in scarce settings. Slack variables localized missing content and supported targeted curation, sustaining sufficiency across subgroups. The result is a tractable and auditable controller that closes the diagnostic-pedagogical loop and delivers equitable, load-aware personalization at classroom scale.