Measuring Poverty and Inequality with Reduced Data: A Machine Learning Approach Using Nigerian Household Data

2026-05-29T19:33:31Z

Reliable measurement of income and consumption is essential for monitoring poverty and inequality in low- and middle-income countries, yet full household surveys are costly and difficult to implement regularly. This paper examines whether reduced survey instruments can preserve key distributional information. We apply Random Forest Recursive Feature Elimination (RF-RFE) to the 2018/19 Nigeria General Household Survey-Panel to identify the income sources, consumption categories and household characteristics that best classify individuals within the welfare distribution. The analysis focuses on three outcomes: poverty status, location in the quintile distribution and position relative to the Gini-based inequality line. The survey's post-planting and post-harvest periods allow us to assess performance under different seasonal contexts. Results show that RF-RFE achieves strong classification accuracy with few predictors. For consumption, poverty status and inequality-line position are accurately predicted using a small set of expenditure categories, while quintile classification reaches about 80 percent accuracy for seasonal consumption and 60--65 percent for annual consumption predicted from a single seasonal visit. For income, poverty status reaches around 90 percent accuracy with five predictors, and inequality-line position is largely captured by labour earnings. The findings suggest that machine-learning methods can help improve survey design and reduce data requirements while retaining much of the distributional information needed to measure and monitor poverty and inequality.

When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE

2026-05-29T18:47:27Z

InfoNCE is the standard contrastive learning objective, but its softmax form is not only a computational convenience: it also encodes a statistical assumption about how the top-scoring example is selected. Using extreme value theory, we show that this assumption is often misaligned with the normalized embedding setting used in modern contrastive learning. Motivated by this mismatch, we propose \textsc{WEINCE}, a simple modification of InfoNCE that uses anchor-wise online batch statistics to blend the usual softmax logits with an endpoint shortfall correction, adding no trainable parameters. Across five vision benchmarks, \textsc{WEINCE} yields consistent improvements in frozen-feature evaluation. These results show that a more faithful statistical treatment of hard negatives can improve contrastive objectives.
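
For reference, the softmax form of the standard InfoNCE loss for a single anchor can be sketched as below. This is only the baseline objective, not the \textsc{WEINCE} correction, whose exact form the abstract does not give. The function name `info_nce` and the default `temperature=0.1` are illustrative assumptions.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.1):
    """Standard InfoNCE loss for one anchor (hypothetical helper).

    The loss is the negative log-softmax of the positive logit among
    the positive and all negative logits.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability (log-sum-exp trick)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[0]

# With one negative scoring exactly as high as the positive,
# the softmax is uniform over two candidates and the loss is log(2).
loss_tie = info_nce(0.5, [0.5])
# Raising the positive score lowers the loss.
loss_better = info_nce(0.9, [0.5])
```

With normalized embeddings the scores are cosine similarities bounded in [-1, 1], which is the bounded-score setting the abstract refers to.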

Addressing errors in multiple variables using generalized raking and cumulative probability models

2026-05-29T17:34:03Z

Routinely collected data, such as electronic health record (EHR) data, are frequently used for biomedical research, but these data are prone to errors, which can bias study findings. Validating data in subsamples of records can reduce bias, and the efficiency of estimates can be improved by incorporating in analyses both the error-prone data available on the entire cohort and the validated data available on the subsample. One approach to incorporate both data sources is with generalized raking, which calibrates validation sampling weights using error-prone data from the entire cohort. Motivated by an EHR study of maternal weight gain during pregnancy with a validation subsample, we develop and illustrate generalized raking techniques for cumulative probability models (CPMs). CPMs are robust, rank-based and semiparametric models for continuous, ordinal, or mixed type outcome data. We develop efficient generalized raking estimators for CPMs, evaluate their performance relative to competing methods, and demonstrate the utility and strengths of generalized raking with CPMs in a study that examines factors associated with weight gain during pregnancy.

Highest Posterior Density Intervals of Unimodal Distributions As Analogues to Profile Likelihood Ratio Confidence Intervals

2026-05-29T16:29:44Z

In Bayesian statistics, the highest posterior density (HPD) interval is often used to describe properties of a posterior distribution. As a method for estimating confidence intervals (CIs), the HPD has two main desirable properties. Firstly, it is the shortest interval to have a specified coverage probability. Secondly, every point inside the HPD interval has a density greater than every point outside the interval. However, the HPD interval is sometimes criticized for not being transformation invariant. We make the case that under certain conditions the HPD interval is a natural analog to the frequentist profile likelihood ratio confidence interval (LRCI). Our main result is a proof that, under specified conditions, the HPD interval with respect to the density mode is transformation invariant for monotonic functions in a manner similar to a profile LRCI.
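
The behaviour of the ordinary HPD interval under a monotone transformation can be checked numerically. The sketch below is a minimal illustration and not the construction from the paper. It assumes a unimodal distribution represented by a sample and takes the shortest window covering the requested mass. The helper name `hpd_interval` and the quantile-grid "sample" are assumptions made for the example.

```python
import math
from statistics import NormalDist

def hpd_interval(samples, mass=0.95):
    """Shortest interval covering `mass` of the samples (unimodal case)."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of points the interval must cover
    # Slide a window of k consecutive order statistics and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Deterministic stand-in for a posterior sample: an evenly spaced
# quantile grid of a standard normal distribution.
n = 2001
x = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
y = [math.exp(v) for v in x]  # monotone transform; y is log-normal

lo_x, hi_x = hpd_interval(x)
lo_y, hi_y = hpd_interval(y)
# Mapping the endpoints of the x-interval through exp() does not give
# the HPD interval of y: the latter sits closer to zero.
```

Here the interval for `x` is symmetric about zero, while the interval for `y` is shifted toward the origin relative to `(exp(lo_x), exp(hi_x))`.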

Bayesian Nonparametric Clustering to Support Medical Decision-Making: A Variational Inference Approach

2026-05-29T16:29:35Z

Medical decision-making increasingly requires rapid and reliable assignment of patients to disease subtypes, as many diseases are no longer treated as single entities. For example, cancer patients may be stratified into aggressive and non-aggressive subtypes, with different treatment strategies for each group. We propose a Bayesian nonparametric approach based on a Dirichlet process mixture model for clustering individuals into disease subtypes. We implement a coordinate ascent variational inference algorithm, yielding an effective and computationally efficient alternative to Markov chain Monte Carlo (MCMC), to support medical decision-making. In synthetic experiments, we demonstrate that the proposed approach accurately assigns observations to their ground-truth clusters, achieving strong performance across evaluation metrics such as homogeneity and completeness. Additionally, we illustrate that the proposed approach achieves a substantial improvement in computational cost compared to MCMC, without a loss of accuracy that would increase the risk of misdiagnosis.

Assessing Racial Disparities in Healthcare Expenditures via Mediator Distribution Shifts

2026-05-29T15:54:34Z

Racial disparities in healthcare expenditures are well-documented, yet the underlying drivers remain complex. This study develops a framework to decompose such disparities through shifts in the distributions of mediating variables, rather than treating race itself as a manipulable exposure. We define disparities as differences in covariate-adjusted outcome distributions across racial groups, and decompose the total disparity into a component attributable to differences in mediator distributions and a residual component that remains after equalizing those distributions. Using data from the Medical Expenditure Panel Survey (MEPS), we examine the extent to which expenditure disparities would persist or be reduced if mediators such as socioeconomic status (SES), insurance access, health behaviors, or health status were equalized across racial groups. To ensure valid inference, we derive asymptotically linear estimators based on influence-function techniques and flexible machine learning, including super learners and a two-part model designed for the zero-inflated, right-skewed nature of expenditure data. Applying this framework to MEPS data from 2009 and 2016, we observe substantial disparities across all pairwise racial comparisons, with the largest gaps between non-Hispanic Whites and Hispanics in both years. Differences in SES and health status were the largest contributors to these disparities, with insurance access also playing a meaningful role, particularly for Hispanic populations, whereas health behaviors contributed minimally. Residual disparities persisted, especially in comparisons involving non-Hispanic Whites, suggesting the influence of unmeasured or structural factors.

A Dynamic Latent Space Model for Healthcare Mobility Networks: the Italian National Health Service case

2026-05-29T14:58:50Z

Healthcare mobility -- patients seeking treatment outside their territory of residence -- represents a major source of inequality and financial imbalance in decentralised health systems. In Italy, persistent north-south asymmetries in patient flows among Local Health Authorities (ASLs) have reinforced existing disparities within the National Health Service; yet the structural organisation and temporal dynamics of these flows remain poorly understood at the sub-regional level. We propose a Bayesian dynamic latent space model for directed weighted networks with a hurdle negative binomial likelihood, and apply it to administrative discharge records on mobility for hip replacement procedures among 109 Italian ASLs over 2018-2024. The model jointly addresses excess zeros, overdispersion and network dependence, while capturing directional heterogeneity through multiplicative sender and receiver effects and controlling for differences in territorial size via an appropriate exposure term. Applied to Italian mobility data, the model reveals the evolving geometry of the healthcare system, quantifies the disruption induced by the COVID-19 pandemic, and uncovers structural asymmetries in outward propensity and ASL attractiveness. The framework provides a flexible tool for the statistical analysis of dynamic healthcare mobility networks with direct relevance to the monitoring and evaluation of territorial healthcare provision.

The Effect of Mobility Trajectory Sparsity on Epidemic Modeling Outcomes

2026-05-29T13:14:32Z

GPS mobility data are increasingly used in epidemic modeling, allowing the construction of co-location networks or population flows. These trajectories typically exhibit high temporal sparsity because data collection is opportunistic and tied to phone use. Despite growing awareness of this limitation, the analysis and treatment of the resulting biases have been largely overlooked in existing epidemic modeling studies, raising concerns about the robustness of downstream inferences. We introduce a principled framework to quantify the impact of trajectory sparsity on key epidemic modeling outcomes across different levels of missingness. Our approach leverages a highly complete dataset that contains both near-complete and sparse GPS trajectories. Near-complete trajectories provide baseline epidemic outcomes, while sparse trajectories provide realistic missingness patterns that we impose on the baseline to measure bias. In this way, we show how missing records can result in substantial underestimation of key measures of epidemic intensity, explained not only by the amount of missing data, but also by more complex features of data missingness that should be taken into account when designing correction methods. Finally, we propose and evaluate a correction based on inverse probability weighting of network edges before epidemic model calibration, which is shown to reduce bias and parameter misspecification. We also demonstrate this correction on a separate anonymized sample from a commercial GPS mobility dataset and report on its effect. Together, our findings provide a first rigorous quantification of trajectory-sparsity bias in epidemic modeling, offering initial guidance on the treatment of this issue.

Subjective Time Deformation in Intertemporal Choice: A Functional Data Analysis Approach

2026-05-29T12:59:42Z

Intertemporal choice data are usually summarized through scalar discount-rate parameters or fitted by predetermined parametric discount functions, although relevant information may lie in the shape of the whole discounting trajectory. This paper proposes a Functional Data Analysis framework for reconstructing and analyzing implicit subjective-time trajectories from discrete intertemporal equivalence judgments. Monetary equivalence responses from a multilingual questionnaire are transformed into individual discount curves, regularized by monotone smoothing, and used to recover normalized implicit subjective-time trajectories. The trajectories are examined through derivative summaries, Functional Principal Component Analysis, and clustering on standardized component scores. The empirical application, based on 107 participants, shows that heterogeneity in intertemporal choice is not fully captured by scalar discount-rate variation. The first two functional principal components explain 97.44% of the variability, indicating a low-dimensional structure. Functional clustering identifies three stable profiles of temporal deformation, supported by bootstrap stability analysis and sensitivity checks on components, algorithms, distances, smoothing specifications, and outlier treatment. Parametric benchmarks based on exponential, Weber-Fechner, and Stevens specifications provide accurate fits for many individuals, but do not fully recover the functional clustering structure. The comparison with explicit subjective-time perception measures reveals only partial alignment between implicit trajectories reconstructed from choices and directly reported temporal perception. Functional Data Analysis provides an applied statistical framework for representing intertemporal choice heterogeneity as variation in functional shape, complementing scalar discount-rate and parametric subjective-time models.

A Kernel Score Perspective on Forecast Disagreement and the Linear Pool

2026-05-29T10:45:24Z

This paper generalizes several results on linear pooling from squared error loss to all kernel scores. The latter are a rich family of scoring rules that covers point and distribution forecasts for univariate and multivariate, discrete and continuous settings. Its members include the Continuous Ranked Probability Score for univariate distribution forecasting and the Energy Score for multivariate distribution forecasting. Our results indicate that forecast disagreement (measured as the average pairwise divergence of all component distributions) has important implications for the linear pool's performance. The results are useful for understanding and designing linear pools in general combination settings. In particular, they motivate using the linear pool (as opposed to other combination formulas) and yield a novel condition under which equal combination weights are optimal under a given kernel scoring rule.
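
The squared-error starting point that the paper generalizes can be illustrated with the classical identity for an equally weighted combination of point forecasts: the loss of the pooled forecast equals the average component loss minus a disagreement term. The sketch below checks that identity numerically. It covers only this special case and is not the kernel-score result of the paper. The function name `pool_decomposition` and the example numbers are assumptions.

```python
def pool_decomposition(forecasts, outcome):
    """Squared-error decomposition for an equally weighted pool.

    Returns (pooled_loss, average_loss, disagreement), which satisfy
    pooled_loss = average_loss - disagreement.
    """
    n = len(forecasts)
    pooled = sum(forecasts) / n
    pooled_loss = (pooled - outcome) ** 2
    average_loss = sum((f - outcome) ** 2 for f in forecasts) / n
    # Disagreement: half the average squared distance over all ordered pairs.
    disagreement = sum(
        (a - b) ** 2 for a in forecasts for b in forecasts
    ) / (2 * n * n)
    return pooled_loss, average_loss, disagreement

pooled_loss, average_loss, disagreement = pool_decomposition([1.0, 2.0, 4.0], 3.0)
```

Because the disagreement term is non-negative, the pooled forecast is never worse than the average component under squared error, and the gain grows with disagreement among the components.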

Learning-to-Defer in Non-Stationary Time Series via Switching State-Space Models

2026-05-29T08:03:31Z

Learning-to-defer (L2D) routes each decision to a system's own predictor or to an external expert. Streaming time-series settings break the offline-L2D assumptions: the data are non-stationary, expert availability shifts over time, and the internal predictor is trained online. We propose L2D-SLDS, a one-stage online L2D framework based on a factorized switching linear-Gaussian state-space model over all potential residuals: a discrete regime, a shared global factor, and per-expert idiosyncratic states. The always-observed internal residual continuously updates beliefs about every unqueried expert through the shared factor, and a learner-aware query score balances immediate cost against latent-state information gain and one-step learner improvement. We prove an oracle inequality against a time-varying learn-and-defer comparator, decomposing regret into a query-bonus budget, an SLDS predictive-cost-error term~$\mathcal{E}_{\mathrm{SLDS}}$, and the internal learner's interval dynamic regret. On synthetic, Melbourne, Jena, and 24-expert Delhi benchmarks, L2D-SLDS is competitive with or improves on contextual- and non-stationary-bandit baselines while deferring on ${<}2\%$ of real-data rounds.

Coordination without communication: beyond optimisation and geometric Brownian motion

2026-05-29T07:35:55Z

We introduce a physically grounded framework for coordination in a population based on information-constrained feedback in a partially observed stochastic dynamical system. Population size evolves as a continuous-time birth-death Markov process whose transition rates respond to a shared stochastic measurement signal correlated with the underlying population state. Individuals neither communicate directly nor optimise strategies; instead, coordination emerges from macro-to-micro feedback mediated by imperfect common information. We show that geometric Brownian motion arises as a limiting case of the conditional dynamics when measurement strength and population statistics satisfy suitable conditions. More generally, varying the signal-to-noise properties of the measurement channel produces a wider class of stochastic growth processes, including diffusive and jump-like regimes, even though ensemble-average growth remains exponential. In an appropriate limit the framework recovers the stochastic multiplicative growth model of Peters and Adamou, providing a physical interpretation of coordination as inference and feedback under partial observability.

Bayesian Classification with Probit-link Split-and-merge Gaussian Process Prior in EEG-based Brain-Computer Interfaces

2026-05-29T03:05:06Z

A Brain-Computer Interface (BCI) speller system based on Event-Related Potentials (ERPs) enables users to select characters by detecting brain responses to visual stimuli, recorded through electroencephalogram (EEG). One challenge is to accurately identify target-related responses, such as the P300 component. However, existing methods tend to ignore feature selection, perform feature selection without interpretability, or require large computational effort or data manipulation. To address these limitations, we propose a novel Bayesian generative modeling framework for the binary classification of EEG responses to stimuli. Our approach employs a Probit-link Split-and-merge Gaussian Process (P-SMGP) prior to perform spatial-temporal feature selection, effectively capturing the distinctions between target and non-target ERP responses. Through both simulation studies and real EEG data analysis, we show that our approach can reduce computational complexity and provide statistical interpretations of transformed ERP functions while maintaining comparable prediction accuracy. These findings underscore the value of interpretable, stimulus-level modeling for advancing predictive and personalized BCI systems.

Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors

2026-05-28T23:25:19Z

Commercial Microwave Links (CMLs) offer dense spatial coverage for rainfall sensing but produce path-integrated measurements that make accurate ground-level reconstruction challenging. Existing methods typically oversimplify CMLs as point sensors and neglect line integration relating rainfall to signal attenuation, resulting in degraded performance under heterogeneous precipitation. In this work, we view rain field reconstruction as a Bayesian inverse problem with Diffusion Models (DMs) as high-fidelity spatial priors. We show that diffusion models better preserve key rainfall statistics compared to censored Gaussian processes. Framing rainfall estimation as a Bayesian inverse problem with a DM prior enables training-free posterior sampling using a broad family of methods, including Plug-and-Play, Sequential Monte Carlo, and Replica Exchange methods. Experiments on synthetic and real-world datasets demonstrate consistent improvements over established CML-based reconstruction baselines.

A Bayesian Framework for Uncertainty-Aware Estimation of Main Pulmonary Artery Velocity Profiles from Phase-Contrast MRI

2026-05-28T22:18:39Z

Computational cardiovascular flow models are highly sensitive to prescribed inlet velocity profiles. While imaging-derived velocity fields provide physiologically realistic information, they can introduce increased preprocessing complexity, imaging noise, and computational burden. Simplified analytical formulations are computationally efficient but may not fully capture subject-specific flow characteristics. In this study, we present an uncertainty-aware framework that combines two-dimensional phase-contrast magnetic resonance imaging (2D PC-MRI) with mechanistic velocity-profile formulations to generate subject-specific pulmonary artery velocity representations. Imaging-derived radial velocity distributions were constructed from main pulmonary artery (MPA) PC-MRI data in canine and swine subjects using elliptical radial binning and normalization. Power-law and Womersley velocity-profile formulations were fitted within a Bayesian inference framework while accounting for uncertainty associated with imaging measurements and model representation. The two formulations were compared using regional and global weighted root mean square error (wRMSE) metrics. Both models demonstrated close agreement with the imaging-derived velocity profiles across subjects. Although the Womersley formulation provided greater flexibility near the vessel wall, it did not result in statistically significant improvements in fitting performance compared with the simpler power-law model. The proposed framework provides low-dimensional, physiologically interpretable, and uncertainty-aware velocity-profile representations that may serve as computationally efficient alternatives for subject-specific cardiovascular flow modeling.