http://arxiv.org/abs/2501.06540v2 Copula-enhanced Vision Transformer for high myopia diagnosis through OU UWF fundus images 2026-05-01T14:31:51Z

The advancement of AI-assisted myopia screening necessitates the joint diagnosis of both-eye (OU) high myopia (HM) status and the prediction of axial length (AL). This clinical requirement introduces a complex mixed-type (binary-continuous) multitask learning task with bi-domain (OU) image covariates, giving rise to two key challenges: i) capturing the inter-ocular asymmetry of OU images within a cutting-edge foundation model; ii) modeling and estimating the conditional dependence structure among mixed-type multivariate responses given image covariates. We address the challenges by: i) imposing residual adapters on the Vision Transformer foundation model to capture the OU similarity and heterogeneity simultaneously; ii) developing a four-dimensional copula loss that is implementable in PyTorch based on a latent variable expression for the Gaussian copula likelihood, and proposing a computationally efficient fast Monte Carlo Expectation Maximization (fMCEM) algorithm to estimate copula parameters. We further formulate a specific overfitting problem in multitask learning, called the stronger covariance phenomenon. We show how this phenomenon disturbs the estimation of copula parameters and theoretically demonstrate the numerical stability of the proposed fMCEM algorithm against that disturbance. The application to our annotated OU ultra-widefield fundus image dataset and simulation on synthetic data demonstrate that our method stably enhances the predictive capabilities on both classification and regression tasks.

2025-01-11T13:23:56Z Chong Zhong Yunhao Liu Yang Li Xiang Fu Jin Yang Danjuan Yang Meiyan Li Jinfeng Xu Aiyi Liu Alan H. Welsh Xingtao Zhou Bo Fu Catherine C. Liu

http://arxiv.org/abs/2605.00470v1 Robust spatial scalar-on-function regression: A Fisher-consistent redescending M-estimation approach 2026-05-01T07:14:03Z

We develop a Fisher-consistent redescending robust estimator for the spatial scalar-on-function regression model, where a scalar response depends on both a functional predictor and a spatial autoregressive lag. Existing estimation procedures for this model are typically based on likelihood methods or monotone-loss robust M-estimators. They may be highly sensitive to vertical outliers, leverage points in the functional predictor, and numerical instability induced by strong spatial dependence. To address these issues, we propose a new estimation framework that first applies robust functional principal component analysis to obtain a contamination-resistant finite-dimensional representation of the functional predictor and then estimates the resulting spatial regression model through a bias-corrected system of M-estimating equations. The proposed method allows redescending loss functions, including Andrews' sine and Danish losses, and jointly estimates the regression coefficients, spatial dependence parameter, and scale parameter within a unified Fisher-consistent framework. For computation, we develop a hybrid IRLS-Newton algorithm that combines weighted least-squares updates for the regression parameters with a Newton-Raphson update for the spatial parameter. We establish Fisher consistency, consistency, asymptotic normality, and the asymptotic distribution of the reconstructed slope function. Monte Carlo experiments show that the proposed estimators remain competitive under clean data and substantially outperform classical and Huber-type robust competitors under contamination, particularly in severe outlier settings. An application to French air-quality data further demonstrates improved predictive performance and stable estimation of spatial dependence. Our method has been implemented in the fcsar R package.
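To make the contrast between monotone and redescending losses concrete, the following is a minimal sketch of the two score (psi) functions. It is not taken from the fcsar package: the function names, the parameterization of Andrews' sine, and the tuning constants (1.339 and 1.345, conventional textbook defaults) are assumptions made for illustration.

```python
import math


def andrews_psi(u: float, c: float = 1.339) -> float:
    """Andrews' sine score: sin(u / c) inside [-c*pi, c*pi], zero outside.

    The score returns to zero for large residuals (it "redescends"),
    so gross outliers get no influence on the estimating equations.
    The constant c is an assumed conventional default.
    """
    return math.sin(u / c) if abs(u) <= c * math.pi else 0.0


def huber_psi(u: float, k: float = 1.345) -> float:
    """Huber score: identity near zero, clipped at +/- k (monotone)."""
    return max(-k, min(k, u))


if __name__ == "__main__":
    # Scaled residuals ranging from small to a gross outlier.
    for u in (0.5, 2.0, 10.0):
        print(u, andrews_psi(u), huber_psi(u))
```

For the gross outlier the Huber score stays at its bound while the Andrews score is exactly zero, which is the behaviour the abstract refers to as redescending.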

2026-05-01T07:14:03Z 51 pages, 7 figures, 6 tables Muge Mutis Ufuk Beyaztas Han Lin Shang

http://arxiv.org/abs/2202.00814v4 Adjustment for Unmeasured Spatial Confounding in Settings of Continuous Exposure Conditional on the Binary Exposure Status: Conditional Generalized Propensity Score-Based Spatial Matching 2026-04-30T21:47:55Z

Propensity score (PS) matching to estimate causal effects of exposure is biased when unmeasured spatial confounding exists. Some exposures are continuous yet dependent on a binary variable (e.g., level of a contaminant (continuous) within a specified radius from residence (binary)). Further, unmeasured spatial confounding may vary by spatial patterns for both continuous and binary attributes of exposure. We propose a new generalized propensity score (GPS) matching method for such settings, referred to as conditional GPS (CGPS)-based spatial matching (CGPSsm). A motivating example is to investigate the association between proximity to refineries with high petroleum production and refining (PPR) and stroke prevalence in the southeastern United States. CGPSsm matches exposed observational units (e.g., exposed participants) to unexposed units by their spatial proximity and GPS integrated with spatial information. GPS is estimated by separately estimating PS for the binary status (exposed vs. unexposed) and CGPS on the binary status. CGPSsm maintains the salient benefits of PS matching and spatial analysis: straightforward assessments of covariate balance and adjustment for unmeasured spatial confounding. Simulations showed that CGPSsm can adjust for unmeasured spatial confounding. Using our example, we found a positive association between PPR and stroke prevalence. Our R package, CGPSspatialmatch, has been made publicly available.

2022-02-01T23:54:00Z Online supplementary materials are appended at the bottom of the main pdf As of 2026, under revision at a method-oriented journal Honghyok Kim Michelle Bell

http://arxiv.org/abs/2603.02275v2 A Comparative Study of UMAP and Other Dimensionality Reduction Methods 2026-04-30T21:12:27Z

Uniform Manifold Approximation and Projection (UMAP) is a widely used manifold learning technique for dimensionality reduction. This paper studies UMAP, supervised UMAP, and several competing dimensionality reduction methods, including Principal Component Analysis (PCA), Kernel PCA, Sliced Inverse Regression (SIR), Kernel SIR, and t-distributed Stochastic Neighbor Embedding, through a comprehensive comparative analysis. Although UMAP has attracted substantial attention for preserving local and global structures, its supervised extensions, particularly for regression settings, remain rather underexplored. We provide a systematic evaluation of supervised UMAP for both regression and classification using simulated and real datasets, with performance assessed via predictive accuracy on low-dimensional embeddings. Our results show that supervised UMAP performs well for classification but exhibits limitations in effectively incorporating response information for regression, highlighting an important direction for future development.

2026-03-01T17:37:29Z 31 pages, 4 figures Guanzhe Zhang Shanshan Ding Zhezhen Jin

http://arxiv.org/abs/2501.18383v3 A tutorial on conducting sample size and power calculations for detecting treatment effect heterogeneity in cluster randomized trials with linear mixed models 2026-04-30T20:58:49Z

Cluster-randomized trials (CRTs) are a well-established class of designs for evaluating community-based interventions. An essential task in planning these trials is determining the number of clusters and cluster sizes needed to achieve sufficient statistical power for detecting a clinically relevant effect size. While methods for evaluating the average treatment effect (ATE) for the entire study population are well-established, sample size methods for testing heterogeneity of treatment effects (HTEs), i.e., treatment-covariate interaction or difference in subpopulation-specific treatment effects, in CRTs have only recently been developed. For pre-specified analyses of HTEs in CRTs, effect-modifying covariates should, ideally, be accompanied by sample size or power calculations to ensure the trial has adequate power for the planned analyses. Power analysis for testing HTEs is more complex than for ATEs due to the additional design parameters that must be specified. Power and sample size formulas for testing HTEs via linear mixed effects (LME) models have been separately derived for different cluster-randomized designs, including single and multi-period parallel designs, crossover designs, and stepped-wedge designs, and for continuous and binary outcomes. This tutorial provides a consolidated reference guide for these methods and enhances their accessibility through an online R Shiny calculator. We further discuss key considerations for conducting sample size and power calculations to test pre-specified HTE hypotheses in CRTs, highlighting the importance of specifying advance estimates of intracluster correlation coefficients for both outcomes and covariates, and their implications for power. The sample size methodology and calculator functionality are demonstrated through a real CRT example.

2025-01-30T14:32:57Z v3: accepted, 33 pages (19 main, supplemental 14); v2: revision under review, 36 pages (main 22, supplemental 14); v1: 28 pages, 4 tables, 5 figures International Journal of Epidemiology (2026) Volume 55, Issue 3 Mary Ryan Baumann Monica Taljaard Patrick J. Heagerty Michael O. Harhay Guangyu Tong Rui Wang Fan Li 10.1093/ije/dyag069

http://arxiv.org/abs/2509.17960v2 Everything all at once: On choosing an estimand for multi-component environmental exposures 2026-04-30T20:22:03Z

Many research questions -- particularly those in environmental health -- do not involve binary exposures. In environmental epidemiology, this includes multivariate exposure mixtures with nondiscrete components. Causal inference estimands and estimators to quantify the relationship between an exposure mixture and an outcome are relatively few. We propose an approach to quantify a relationship between a shift in the exposure mixture and the outcome -- either in the single timepoint or longitudinal setting. The shift in the exposure mixture can be defined flexibly in terms of shifting one or more components, including examining interaction between mixture components, and in terms of shifting the same or different amounts across components. The estimand we discuss has an interpretation similar to that of a main effect regression coefficient. First, we focus on choosing a shift in the exposure mixture supported by observed data. We demonstrate how to assess extrapolation and modify the shift to minimize reliance on extrapolation. Second, we propose estimating the relationship between the exposure mixture shift and outcome completely nonparametrically, using machine learning in model-fitting. This contrasts with other current approaches, which employ parametric modeling for at least some relationships; we would like to avoid this because parametric modeling assumptions in complex, nonrandomized settings are tenuous at best. We are motivated by longitudinal data on pesticide exposures among participants in the CHAMACOS Maternal Cognition cohort. We examine the relationship between longitudinal exposure to agricultural pesticides and risk of hypertension. We provide step-by-step code to facilitate the easy replication and adaptation of the approaches we use.

2025-09-22T16:15:53Z Kara E. Rudolph Shodai Inose Nicholas Williams Ivan Diaz Lucia Calderon Jacqueline M. Torres Marianthi-Anna Kioumourtzoglou

http://arxiv.org/abs/2605.00175v1 Using Linked Micromaps to Explore Complex Structures in Official Statistics 2026-04-30T19:50:33Z

Over the past decade, researchers have focused increasing levels of attention on the use of survey and non-survey data to inform decision-making by multiple stakeholders. Work with such data generally requires extensive exploration before a statistics practitioner focuses on specific steps in model building and inference. For many of the resulting initial exploratory analyses, crucial issues center on the extent to which empirical results may vary over geography and subpopulations. Such information is usually presented in tabular form, which can be difficult for stakeholders and decision makers to understand and to utilize. To address these issues, this paper uses data from the U.S. Bureau of Labor Statistics to illustrate a suite of tools known as linked micromaps. These applications show how linked micromaps can help stakeholders better understand and view descriptive statistics for populations and subpopulations, explore multivariate relationships and ordinal structure, and discover patterns of heterogeneity across time and space. In addition, this paper comments briefly on the prospective use of linked micromaps in model-building and analysis of multiple components of uncertainty.

2026-04-30T19:50:33Z Randall Powers Darcy Steeg Morris John Eltinge Wendy Martinez

http://arxiv.org/abs/2605.00171v1 Adaptive Norm-Based Regularization for Neural Networks 2026-04-30T19:43:44Z

In this paper, we study norm-based regularization methods for neural networks. We compare existing penalization approaches and introduce two regularization strategies that extend classical ridge- and lasso-type penalties to neural network models. The first strategy modifies weight decay by incorporating the covariance structure of the input features into a ridge-type $\ell_2$ penalty, allowing regularization to account for feature dependence. The second combines an $\ell_1$ sparsity penalty with covariance-aware $\ell_2$ regularization, producing neural network weights that are both sparse and structurally informed. Monte Carlo simulations are used to evaluate these methods under different data-generating settings, followed by two real-data applications on building cooling-load prediction and leukemia cell-type classification from high-dimensional gene expression data. Across simulated and real-data examples, the proposed regularizers improve predictive performance on unseen data and provide more effective complexity control than standard norm-based penalties, particularly when features are correlated or high-dimensional.

2026-04-30T19:43:44Z 37 pages, 9 figures Muhammad Qasim Farrukh Javed

http://arxiv.org/abs/2605.00108v1 Urban Science Beyond Samples: Up-to-Date Street Network Models and Indicators for Every Urban Area in the World 2026-04-30T18:03:46Z

Urban planners need up-to-date, global, and consistent street network models and indicators to measure resilience and performance, model accessibility, and target local quality-of-life interventions. This article presents up-to-date street network models and indicators for every urban area in the world. It uses 2025 urban area boundaries from the Global Human Settlement Layer, allowing users to join these data to hundreds of other urban attributes. Its workflow ingests 180 million OpenStreetMap nodes and 360 million OpenStreetMap edges across 10,351 urban areas in 189 countries. The code, models, and indicators are publicly available for reuse. These resources unlock worldwide urban street network science beyond samples as well as local analyses in under-resourced regions where models and indicators are otherwise less accessible.

2026-04-30T18:03:46Z Environment and Planning B: Urban Analytics and City Science, 2026 Geoff Boeing

http://arxiv.org/abs/2503.24324v2 Mitigating Financial Risk from Climate-Induced Agricultural Price Volatility 2026-04-30T16:05:07Z

Agricultural price volatility, driven by market dynamics and meteorological factors such as temperature and precipitation, poses challenges for sustainable finance, planning, and policy. This study analyzes the impact of climate on crop price volatility for soybean in Madhya Pradesh (India) and Illinois (US), rice in Assam (India), wheat in North Dakota (US), cotton in Gujarat (India), and corn in Iowa (US). Using CMIP6 climate projections from the Copernicus Climate Change Service, we examine historical climate patterns and evaluate two future scenarios: SSP2-4.5 (moderate) and SSP5-8.5 (severe). We estimate conditional price volatility using the Exponential Generalized Autoregressive Conditional Heteroskedasticity (EGARCH) model, and forecast this volatility with a Seasonal Autoregressive Integrated Moving Average with Exogenous Regressors (SARIMAX) model that incorporates meteorological variables. Finally, we apply the Black-Scholes framework to evaluate the cost of put-option-based insurance, which provides protection to farmers against adverse price drops linked to climate change. Our results highlight the role of meteorological data in improving agricultural risk modelling, enabling better design of insurance mechanisms, price stabilization tools, and sustainable policy interventions under climate uncertainty.
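The final step, pricing put-option-based insurance, can be illustrated with the textbook Black-Scholes formula for a European put. This is only a sketch under stated assumptions: the function names and all numeric inputs are hypothetical, and feeding a forecast volatility in as `vol` is an assumption about how the pieces would connect, not a description of the authors' code.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_put(spot: float, strike: float, rate: float,
                      vol: float, maturity: float) -> float:
    """Black-Scholes price of a European put.

    spot, strike: current and insured price levels (same units)
    rate: continuously compounded risk-free rate (per year)
    vol: annualised volatility of log returns
    maturity: time to expiry in years
    """
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * vol ** 2) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return (strike * math.exp(-rate * maturity) * norm_cdf(-d2)
            - spot * norm_cdf(-d1))


if __name__ == "__main__":
    # Hypothetical inputs: the same contract priced at two volatility levels.
    low = black_scholes_put(spot=100.0, strike=100.0, rate=0.05,
                            vol=0.20, maturity=1.0)
    high = black_scholes_put(spot=100.0, strike=100.0, rate=0.05,
                             vol=0.35, maturity=1.0)
    print(f"premium at 20% vol: {low:.4f}, at 35% vol: {high:.4f}")
```

Because the put price increases with volatility, a higher forecast volatility translates directly into a higher insurance premium in this framework.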

2025-03-31T17:11:00Z 15 pages, 11 figures Sourish Das Sudeep Shukla Abbinav Sankar Kailasam Anish Rai Sejal Garg Anirban Chakraborti

http://arxiv.org/abs/2502.19234v2 Arctic teleconnection on climate and ozone pollution in the polar jet stream path of eastern US 2026-04-30T15:28:49Z

Arctic sea-ice loss is a defining feature of climate change and offers insight into its impact on mid-latitude air quality. Here, we investigate how variability in Arctic sea-ice extent (ASI) affects ground-level ozone ($O_3$) across eastern US states through physically and chemically mediated atmospheric pathways. Using observations and causal-inference methods grounded in atmospheric dynamics, we show that ASI drives wintertime ozone variability primarily via indirect meteorological mechanisms, including changes in humidity, temperature, and atmospheric circulation along the polar and subtropical jet streams. Inland regions exhibit the strongest sensitivity, while coastal areas are modulated by marine boundary-layer processes. Seasonal contrasts reveal that Arctic-driven dynamics suppress ozone in winter but can enhance accumulation under certain summer conditions. These findings highlight the importance of Arctic-midlatitude teleconnections in shaping regional air quality and underscore the need to integrate large-scale climate processes into ozone management and climate adaptation strategies.

2025-02-26T15:44:20Z 19 pages, 6 figures K Shuvo Bakar Sourish Das Sudeep Shukla Anirban Chakraborti

http://arxiv.org/abs/2512.20914v2 Invariant Feature Extraction Through Conditional Independence and the Optimal Transport Barycenter Problem: the Gaussian case 2026-04-30T15:07:29Z

A methodology is developed to extract $d$ invariant features $W=f(X)$ that predict a response variable $Y$ without being confounded by variables $Z$ that may influence both $X$ and $Y$. The methodology's main ingredient is the penalization of any statistical dependence between $W$ and $Z$ conditioned on $Y$, replaced by the more readily implementable plain independence between $W$ and the random variable $Z_Y = T(Z,Y)$ that solves the [Monge] Optimal Transport Barycenter Problem for $Z\mid Y$. In the Gaussian case considered in this article, the two statements are equivalent. When the true confounders $Z$ are unknown, other measurable contextual variables $S$ can be used as surrogates, a replacement that involves no relaxation in the Gaussian case if the covariance matrix $\Sigma_{ZS}$ has full range. The resulting linear feature extractor adopts a closed form in terms of the first $d$ eigenvectors of a known matrix. The procedure extends with little change to more general, non-Gaussian / non-linear cases.

2025-12-24T03:39:18Z Ian Bounos Pablo Groisman Mariela Sued Esteban Tabak

http://arxiv.org/abs/2604.27892v1 Prediction-powered Inference by Mixture of Experts 2026-04-30T14:08:17Z

The rapidly expanding artificial intelligence (AI) industry has produced diverse yet powerful prediction tools, each with its own network architecture, training strategy, data-processing pipeline, and domain-specific strengths. These tools create new opportunities for semi-supervised inference, in which labeled data are limited and expensive to obtain, whereas unlabeled data are abundant and widely available. Given a collection of predictors, we treat them as a mixture of experts (MOE) and introduce an MOE-powered semi-supervised inference framework built upon prediction-powered inference (PPI). Motivated by the variance reduction principle underlying PPI, the proposed framework seeks the mixture of experts that achieves the smallest possible variance. Compared with standard PPI, the MOE-powered inference framework adapts to the unknown performance of individual predictors, benefits from their collective predictive power, and enjoys a best-expert guarantee. The framework is flexible and applies to mean estimation, linear regression, quantile estimation, and general M-estimation. We develop non-asymptotic theory for the MOE-powered inference framework and establish upper bounds on the coverage error of the resulting confidence intervals. Numerical experiments demonstrate the practical effectiveness of MOE-powered inference and corroborate our theoretical findings.
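As background for the variance-reduction principle mentioned above, here is a minimal sketch of the standard prediction-powered mean estimator for a single predictor, together with a naive "pick the lowest-variance expert" baseline. It does not implement the mixture weighting proposed in the paper; the function names and the toy data are hypothetical.

```python
from statistics import mean, variance


def ppi_mean(y_labeled, pred_labeled, pred_unlabeled):
    """Prediction-powered estimate of E[Y] and its estimated variance.

    estimate = mean of predictions on unlabeled data
               + mean of (label - prediction) on labeled data (the rectifier).
    """
    rectifier = [y - f for y, f in zip(y_labeled, pred_labeled)]
    estimate = mean(pred_unlabeled) + mean(rectifier)
    var_est = (variance(pred_unlabeled) / len(pred_unlabeled)
               + variance(rectifier) / len(rectifier))
    return estimate, var_est


def best_single_expert(y_labeled, experts):
    """Return (name, estimate, variance) for the expert with smallest variance.

    experts maps a name to (pred_labeled, pred_unlabeled). This is a simple
    baseline, not the mixture-of-experts procedure itself.
    """
    results = {
        name: ppi_mean(y_labeled, pl, pu) for name, (pl, pu) in experts.items()
    }
    name = min(results, key=lambda k: results[k][1])
    return (name, *results[name])


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0]
    experts = {
        "biased_but_precise": ([1.5, 2.5, 3.5], [2.0, 3.0, 4.0, 5.0]),
        "noisy": ([0.0, 4.0, 2.0], [1.0, 6.0, 2.0, 5.0]),
    }
    print(best_single_expert(y, experts))
```

The rectifier term removes the bias of a predictor, so a systematically biased but precise expert can still yield a low-variance estimate.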

2026-04-30T14:08:17Z Yanwu Gu Linglong Kong Dong Xia

http://arxiv.org/abs/2604.27831v1 Optimal allocation of trials to sub-regions in crop variety testing with multiple years and correlated genotype effects 2026-04-30T13:18:59Z

Plant breeding and variety trials are usually conducted in multiple environments sampled from a defined target population of environments in order to characterize the performance of breeding lines or varieties. When the population is large and heterogeneous, it may be sub-divided into sub-regions or zones according to administrative and agro-ecological criteria. Analysis then focuses on prediction of performance in the individual sub-regions. Modelling the genotype effect in each sub-region as random, information can be borrowed across sub-regions using best linear unbiased prediction based on a suitable variance-covariance matrix for the genotype-zone effects. Here, we consider the important case where kinship or pedigree information is available for the genotypes under test. This information can be integrated into the variance-covariance matrix for genotype-zone effects. The objective we pursue here is to determine the optimal allocation of a fixed budget of trials to sub-regions. This design problem is solved using a combination of theory and explicit equations on the one hand and numerical optimization on the other. Our proposed novel approach allows obtaining the optimal allocation when the number of genotypes is in the hundreds, a common setting in large plant breeding programs as well as in variety testing for economically important crops.

2026-04-30T13:18:59Z Maryna Prus Lenka Filová Hans-Peter Piepho Waqas Ahmed Malik

http://arxiv.org/abs/2604.27732v1 A Note on the Generalized Cape Cod Reserving Method 2026-04-30T11:23:31Z

Claims reserving is one of the most important actuarial tasks in non-life insurance modeling. There are several popular methods to perform claims reserving, such as the chain-ladder (CL), the Bornhuetter--Ferguson (BF), and the generalized Cape Cod (GCC) methods. These methods were originally introduced as deterministic algorithms, and only later were they lifted to stochastic models that allow claims prediction uncertainty to be analyzed. This holds true for the CL and the BF methods, but not for the GCC method. The purpose of this article is to close this gap and derive an analytical formula for the mean squared error of prediction (MSEP) of the GCC method.
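To illustrate what a deterministic reserving algorithm looks like, the sketch below implements the basic chain-ladder calculation on a cumulative run-off triangle. It shows CL only, not the GCC method or its MSEP formula; the function names and the toy triangle are hypothetical, and a square triangle (as many accident years as development periods) is assumed.

```python
def development_factors(triangle):
    """Volume-weighted chain-ladder factors from a cumulative triangle.

    triangle[i] lists the observed cumulative claims of accident year i;
    later accident years have fewer observed development periods.
    """
    n = len(triangle)
    factors = []
    for j in range(n - 1):
        # Use only accident years observed at both period j and j + 1.
        rows = [row for row in triangle if len(row) > j + 1]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    return factors


def chain_ladder_ultimates(triangle):
    """Project each accident year's latest cumulative value to ultimate."""
    factors = development_factors(triangle)
    ultimates = []
    for row in triangle:
        value = row[-1]
        for f in factors[len(row) - 1:]:
            value *= f
        ultimates.append(value)
    return ultimates


if __name__ == "__main__":
    triangle = [
        [100.0, 150.0, 165.0],
        [110.0, 165.0],
        [120.0],
    ]
    ultimates = chain_ladder_ultimates(triangle)
    # Reserve = projected ultimate minus the latest observed cumulative value.
    reserves = [u - row[-1] for u, row in zip(ultimates, triangle)]
    print(development_factors(triangle), ultimates, reserves)
```

The output is a point estimate only; quantifying its prediction uncertainty is what a stochastic model adds.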

2026-04-30T11:23:31Z Ronald Richman Mario V. Wüthrich