https://arxiv.org/api/tYji+b/1iVBSSDC0D4kMPSigBsk 2026-03-21T00:39:02Z 34634 135 15 http://arxiv.org/abs/2603.14129v1 Semiparametric copula-based quantile regression for semicontinuous outcomes with application to healthcare data 2026-03-14T21:30:29Z A semiparametric copula-based two-part quantile regression framework is developed for the analysis of semicontinuous outcomes characterized by a point mass at zero and a continuous positive component. The proposed approach models the occurrence and magnitude processes separately and links them through copula-based conditional distributions, allowing for flexible dependence structures and nonlinear covariate effects across quantiles. Large-sample properties of the resulting estimator are established, and extensive simulation studies demonstrate improved finite-sample performance relative to logistic/linear quantile regression, particularly under nonlinear dependence and substantial zero inflation. An application to healthcare data illustrates how the proposed method provides a nuanced characterization of the association between social deprivation and uncompensated and charity care burdens, revealing heterogeneous and nonlinear effects that are not captured by competing approaches. 2026-03-14T21:30:29Z 25 pages, 2 figures Guanjie Lyu Mohamed Belalia Abdulkadir Hussein http://arxiv.org/abs/2512.06522v2 Hierarchical Clustering With Confidence 2026-03-14T20:41:53Z Agglomerative hierarchical clustering is one of the most widely used approaches for exploring how observations in a dataset relate to each other. However, its greedy nature makes it highly sensitive to small perturbations in the data, often producing different clustering results and making it difficult to separate genuine structure from spurious patterns. In this paper, we show how randomizing hierarchical clustering can be useful not just for measuring stability but also for designing valid hypothesis testing procedures based on the clustering results. 
We propose a simple randomization scheme together with a method for constructing a valid p-value at each node of the hierarchical clustering dendrogram that quantifies evidence against performing the greedy merge. Our test controls the Type I error rate, works with any hierarchical linkage without case-specific derivations, and simulations show it is substantially more powerful than existing selective inference approaches. To demonstrate the practical utility of our p-values, we develop an adaptive $α$-spending procedure that estimates the number of clusters, with a probabilistic guarantee on overestimation. Experiments on simulated and real data show that this estimate yields powerful clustering and can be used, for example, to assess clustering stability across multiple runs of the randomized algorithm. 2025-12-06T18:18:20Z 57 Pages, 11 Figures, 2 Algorithms Di Wu Jacob Bien Snigdha Panigrahi http://arxiv.org/abs/2502.19275v4 Deep Computerized Adaptive Testing 2026-03-14T20:31:17Z Computerized adaptive tests (CATs) play a crucial role in educational assessment and diagnostic screening in behavioral health. Unlike traditional linear tests that administer a fixed set of pre-assembled items, CATs adaptively tailor the test to an examinee's latent trait level by selecting a smaller subset of items based on their previous responses. Existing CAT frameworks predominantly rely on item response theory (IRT) models with a single latent variable, a choice driven by both conceptual simplicity and computational feasibility. However, many real-world item response datasets exhibit complex, multi-factor structures, limiting the applicability of CATs in broader settings. In this work, we develop a novel CAT system that incorporates multivariate latent traits, building on recent advances in Bayesian sparse multivariate IRT. 
Our approach leverages direct sampling from the latent factor posterior distributions, significantly accelerating existing information-theoretic item selection criteria by eliminating the need for computationally intensive Markov Chain Monte Carlo (MCMC) simulations. Recognizing the potential sub-optimality of existing item selection rules, which are often based on myopic one-step-lookahead optimization of some information-theoretic criterion, we propose a double deep Q-learning algorithm to learn an optimal item selection policy. Through simulation and real-data studies, we demonstrate that our approach not only accelerates existing item selection methods but also highlights the potential of reinforcement learning in CATs. 2025-02-26T16:30:30Z Jiguang Li Robert Gibbons Veronika Rockova http://arxiv.org/abs/2510.10870v2 Transfer Learning with Distance Covariance for Random Forest: Error Bounds and an EHR Application 2026-03-14T20:30:18Z We propose a method for transfer learning in nonparametric regression using a random forest (RF) with distance covariance-based feature weights, assuming the unknown source and target regression functions are sparsely different. Our method obtains residuals from a source domain-trained Centered RF (CRF) in the target domain, then fits another CRF to these residuals with feature splitting probabilities proportional to feature-residual sample distance covariance. We derive an upper bound on the mean square error rate of the procedure as a function of sample sizes and difference dimension, theoretically demonstrating transfer learning benefits in random forests. A major difficulty for transfer learning in random forests is the lack of explicit regularization in the method. Our results explain why shallower trees with preferential selection of features lead to both lower bias and lower variance for fitting a low-dimensional function. 
We show that in the residual random forest, this implicit regularization is enabled by sample distance covariance. In simulations, we show that the results obtained for the CRFs also hold numerically for the standard RF (SRF) method with data-driven feature split selection. Beyond transfer learning, our results also show the benefit of distance-covariance-based weights on the performance of RF when some features dominate. Our method shows significant gains in predicting the mortality of ICU patients in smaller-bed target hospitals using a large multi-hospital dataset of electronic health records for 200,000 ICU patients. 2025-10-13T00:31:56Z Chenze Li Subhadeep Paul http://arxiv.org/abs/2603.16937v1 Integrating Explainable Machine Learning and Mixed-Integer Optimization for Personalized Sleep Quality Intervention 2026-03-14T19:54:07Z Sleep quality is influenced by a complex interplay of behavioral, environmental, and psychosocial factors, yet most computational studies focus mainly on predictive risk identification rather than actionable intervention design. Although machine learning models can accurately predict subjective sleep outcomes, they rarely translate predictive insights into practical intervention strategies. To address this gap, we propose a personalized predictive-prescriptive framework that integrates interpretable machine learning with mixed-integer optimization. A supervised classifier trained on survey data predicts sleep quality, while SHAP-based feature attribution quantifies the influence of modifiable factors. These importance measures are incorporated into a mixed-integer optimization model that identifies minimal and feasible behavioral adjustments, while modelling resistance to change through a penalty mechanism. The framework achieves strong predictive performance, with a test F1-score of 0.9544 and an accuracy of 0.9366. 
Sensitivity and Pareto analyses reveal a clear trade-off between expected improvement and intervention intensity, with diminishing returns as additional changes are introduced. At the individual level, the model generates concise recommendations, often suggesting one or two high-impact behavioral adjustments and sometimes recommending no change when expected gains are minimal. By integrating prediction, explanation, and constrained optimization, this framework demonstrates how data-driven insights can be translated into structured and personalized decision support for sleep improvement. 2026-03-14T19:54:07Z 34 Pages. 7 Tables. 6 Figures Mahfuz Ahmed Anik Mohsin Mahmud Topu Azmine Toushik Wasi Md Isfar Khan MD Manjurul Ahsan http://arxiv.org/abs/2603.14094v1 Maximin Robust Bayesian Experimental Design 2026-03-14T19:40:39Z We address the brittleness of Bayesian experimental design under model misspecification by formulating the problem as a max-min game between the experimenter and an adversarial nature subject to information-theoretic constraints. We demonstrate that this approach yields a robust objective governed by Sibson's $α$-mutual information (MI), which identifies the $α$-tilted posterior as the robust belief update and establishes the Rényi divergence as the appropriate measure of conditional information gain. To mitigate the bias and variance of nested Monte Carlo estimators needed to estimate Sibson's $α$-MI, we adopt a PAC-Bayes framework to search over stochastic design policies, yielding rigorous high-probability lower bounds on the robust expected information gain that explicitly control finite-sample error. 2026-03-14T19:40:39Z 11 pages + 15 in appendix, 5 figures Hany Abdulsamad Sahel Iqbal Christian A.
Naesseth Takuo Matsubara Adrien Corenflos http://arxiv.org/abs/2603.14092v1 Soft Mean Expected Calibration Error (SMECE): A Calibration Metric for Probabilistic Labels 2026-03-14T19:33:53Z The Expected Calibration Error (ECE), the dominant calibration metric in machine learning, compares predicted probabilities against empirical frequencies of binary outcomes. This is appropriate when labels are binary events. However, many modern settings produce labels that are themselves probabilities rather than binary outcomes: a radiologist's stated confidence, a teacher model's soft output in knowledge distillation, a class posterior derived from a generative model, or an annotator agreement fraction. In these settings, ECE commits a category error: it discards the probabilistic information in the label by forcing it into a binary comparison. The result is not a noisy approximation that more data will correct. It is a structural misalignment that persists and converges to the wrong answer with increasing precision as sample size grows. We introduce the Soft Mean Expected Calibration Error (SMECE), a calibration metric for settings where labels are probabilistic. The modification to the ECE formula is one line: replace the empirical hard-label fraction in each prediction bin with the mean probability label of the samples in that bin. SMECE reduces exactly to ECE when labels are binary, making it a strict generalisation. 2026-03-14T19:33:53Z Michael Leznik http://arxiv.org/abs/2603.14070v1 Structured Credal Learning 2026-03-14T18:26:29Z Real-world learning tasks often encounter uncertainty due to covariate shift and noisy or inconsistent labels. However, existing robust learning methods merge these effects into a single distributional uncertainty set. In this work, we introduce a novel structured credal learning framework that explicitly separates these two sources.
Specifically, we derive geometric bounds on the total variation diameter of structured credal sets and demonstrate how this quantity decomposes into contributions from covariate shift and expected label disagreement. This decomposition reveals a gating effect: covariate shift modulates how much label disagreement contributes to the joint uncertainty, such that seemingly benign covariate shifts can substantially increase the effective uncertainty. We also establish finite-sample concentration bounds in a fixed covariate regime and demonstrate that this quantity can be efficiently estimated. Lastly, we show that robust optimization over these structured credal sets reduces to a tractable discrete min-max problem, avoiding ad-hoc robustness parameters. Overall, our approach provides a principled and practical foundation for robust learning under combined covariate and label mechanism ambiguity. 2026-03-14T18:26:29Z Varun Venkatesh Eyke Hüllermeier Bernd Bischl Mina Rezaei http://arxiv.org/abs/2603.13935v1 A two-sample test for symmetric positive definite matrix distributions using Wishart kernel density estimators 2026-03-14T13:06:34Z We develop a nonparametric two-sample test for distributions supported on the cone of symmetric positive definite matrices. The procedure relies on the Wishart kernel density estimator (KDE) introduced by Belzile et al. (2025), whose support-adaptive kernel alleviates boundary bias by remaining confined to the cone. Our test statistic is the rescaled integrated squared difference between two Wishart KDEs and can be expressed as a two-sample $V$-statistic via an explicit closed-form overlap of Wishart kernels, avoiding numerical integration. Under the null hypothesis of equal densities, we derive the asymptotic distribution in both the common shrinking-bandwidth and fixed-bandwidth regimes. The proposed method provides a kernel-based competitor to the empirical Laplace-transform two-sample test of Lukić (2024).
Unlike the orthogonally invariant Hankel-transform test of Lukić and Milošević (2024), our statistic can detect alternatives that differ only through eigenvector structure, for instance, Wishart models with the same shape parameter and the same scale eigenvalues but different orientations. 2026-03-14T13:06:34Z 34 pages, 0 figures Frédéric Ouimet http://arxiv.org/abs/2511.03596v2 Accounting for Heavy Censoring in Evaluating the Risk Stratification Abilities of Existing Models for Time to Diagnosis of Huntington Disease 2026-03-14T12:59:31Z Huntington disease (HD) is a neurodegenerative disease with progressively worsening symptoms. Accurately modeling time to HD diagnosis is essential for clinical trial design. Langbehn's model, the CAG-Age Product (CAP) model, the Prognostic Index Normed (PIN) model, and the Multivariate Risk Score (MRS) model have all been proposed for this task. However, these models may yield conflicting predictions and few studies have systematically compared their performance. Further, those that have could be misleading due to testing the models on the same data used to train them and failing to account for high rates of right censoring (80%+) in performance metrics. We discuss the theoretical foundations of these models, offering intuitive comparisons about their practical feasibility. We externally validate their risk stratification abilities using data from the ENROLL-HD study and two censoring-appropriate performance metrics, guiding model selection for HD clinical trial design. As these models were developed in HD studies that ended more than a decade ago, we compared their predictive performance using published parameters versus updated ones (re-estimated using ENROLL-HD). We show how these models can be used to estimate sample sizes for an HD clinical trial. Based on either metric and using published or updated parameters, the MRS model, which incorporates the most covariates, performed best. 
However, the simpler PIN model offered similarly good performance while requiring fewer variables, many of which would require patients to undergo additional tests. In illustrating an HD clinical trial design, we defined an optimal threshold based on model performance metrics to determine which patients are more likely to be diagnosed. Sample size calculations using an optimal threshold based on metrics that did not account for censoring, as in previous studies, are shown to lead to underpowered trials. 2025-11-05T16:16:48Z 16 pages, 4 tables, 2 figures Kyle F. Grosser Abigail G. Foes Stellen Li Vraj Parikh Tanya P. Garcia Sarah C. Lotspeich http://arxiv.org/abs/2603.13930v1 Spatially Varying Coefficient Mallows Model Averaging 2026-03-14T12:59:18Z Model averaging, as an appealing ensemble technique, strategically integrates all valuable information from candidate models to construct fast and accurate predictions. Despite having been widely practiced in many fields such as cross-sectional data, censored data and longitudinal data, its application to spatial data characterized by inherent spatial heterogeneity remains surprisingly limited. To mitigate the risk of model misspecification and enhance the flexibility of prediction, we propose a combined estimator constructed by computing the weighted average of estimators derived from a set of spatially varying coefficient candidate models. Herein, the model weights are determined via a Mallows-type criterion, which dynamically calibrates the relative importance of individual candidate models in the ensemble. Theoretically, we establish desirable asymptotic properties under two practical scenarios. First, in the case where all candidate models are misspecified, the proposed model averaging estimator attains asymptotic optimality in the sense that it minimizes the squared error loss function asymptotically.
Second, when the candidate model set encompasses at least one quasi-correct model, the weights assigned by the Mallows-type criterion asymptotically concentrate on the quasi-correct models, and the resulting model averaging estimator converges in probability to the true conditional mean. Both simulation studies and a real-world empirical example demonstrate that the proposed method generally outperforms competing approaches in terms of predictive accuracy and robustness. 2026-03-14T12:59:18Z Zhuang Yong Lv Jing Tingting Li http://arxiv.org/abs/2603.13848v1 A family of divergence-based correlation measures for contingency tables under bivariate normality 2026-03-14T09:02:21Z We propose a family of association measures for two-way contingency tables whose latent distribution can be assumed to be bivariate normal. When this assumption holds, the power-divergence measuring departure from independence can be approximated in closed form as a function of the latent correlation coefficient. By inverting this relationship, we obtain a family of measures $ρ_{(λ)}$, indexed by a scalar parameter $-1 \leq λ \leq 1$, that directly approximates the latent correlation. Special cases include the informational measure of correlation proposed by Linfoot (1957) at $λ = 0$ and Pearson's contingency coefficient $C$ at $λ = 1$. Additionally, we derive asymptotic distributions via the delta method and construct two families of confidence intervals. Simulation studies confirm that the proposed measures approximate the true latent correlation more faithfully than conventional divergence-based measures, and that they successfully distinguish between weak and moderate associations where existing measures tend to give indistinguishable values. Compared with the polychoric correlation coefficient, the proposed measures are computed several thousand times faster and remain numerically stable even when the latent correlation is close to one.
2026-03-14T09:02:21Z Wataru Urasaki http://arxiv.org/abs/2603.13762v1 Learning the Optimal Composite Mediator: Closed-Form Solution and Inference 2026-03-14T05:20:29Z Understanding how an exposure transmits its effect through high-dimensional intermediaries is a central problem in observational research. We study the problem of finding a composite mediator that maximises the indirect effect of an exposure on an outcome in a linear structural equation model. Although the objective is non-convex in the weight vector, a geometric argument yields a closed-form global solution: the optimal weight bisects the angle between two computable path vectors in a weighted inner product space, recovered via two linear solves. The resulting algorithm, MaxIE, runs at the same cost as ordinary least squares, orders of magnitude lower than numerical optimisation, with a dual formulation for settings where mediators outnumber observations. The same path vectors yield a test for the global null that no composite mediator exists, with t(p-1) in the classical and t(n-2) in the dual regime. Power is characterised analytically as a function of the population path angle; simulations confirm size control and the power characterisation. Applied to a UK Biobank proteomics dataset (n=38,383, p=2,916), the method rejects the global null (p-value = 6.4e-9) and identifies the optimal proteomic composite mediating age's effect on dementia. 2026-03-14T05:20:29Z Zihuai He http://arxiv.org/abs/2602.02319v5 Leave-One-Out Neighborhood Smoothing for Graphons: Berry-Esseen Bounds, Confidence Intervals, and Honest Tuning 2026-03-14T03:24:00Z Neighborhood smoothing methods achieve minimax-optimal rates for estimating edge probabilities under graphon models, but their use for statistical inference has remained limited.
The main obstacle is that classical neighborhood smoothers select data-driven neighborhoods and average edges using the same adjacency matrix, inducing complex dependencies that invalidate standard concentration and normal approximation arguments. We introduce a leave-one-out modification of neighborhood smoothing for undirected simple graphs. When estimating a single entry P_ij, the neighborhood of node i is constructed from an adjacency matrix in which the jth row and column are set to zero, thereby decoupling neighborhood selection from the edges being averaged. We show that this construction restores conditional independence of the centered summands, enabling the use of classical probabilistic tools for inference. Under piecewise Lipschitz graphon assumptions and logarithmic degree growth, we derive variance-adaptive concentration inequalities based on Bousquet's inequality and establish Berry-Esseen bounds with explicit rates for the normalized estimation error. These results yield both finite-sample and asymptotic confidence intervals for individual edge probabilities. The same leave-one-out structure also supports an honest cross-validation scheme for tuning parameter selection, for which we prove an oracle inequality. The proposed estimator retains the optimal row-wise mean-squared error rates of classical neighborhood smoothing while providing valid entrywise uncertainty quantification. 2026-02-02T16:46:56Z Behzad Aalipur Rachel Kilby http://arxiv.org/abs/2603.13704v1 A Kernel-Based Nonparametric Test for Conditional Independence of Functional Data 2026-03-14T02:20:05Z Conditional independence is a fundamental concept in many areas of statistical research, including, for example, sufficient dimension reduction, causal inference, and statistical graphical models. In many modern applications, data arise in the form of random functions, making it important to determine whether two random functions are conditionally independent given a third. 
However, to the best of our knowledge, existing conditional independence tests in the literature apply only to multivariate data, and extensions to the functional setting are not available. To fill this gap, we develop a kernel-based test for conditional independence of random functions based on the conjoined conditional covariance operator (CCCO). We rigorously derive the asymptotic distribution of the CCCO estimator using a recently established sharpened convergence rate for the regression operator (Choi et al., 2026). Based on this result, we construct a test statistic using the spectral decomposition of the operator appearing in the asymptotic distribution. The proposed method is illustrated through applications to an activity and biometrics dataset and a macroeconomic dataset. 2026-03-14T02:20:05Z Yin Tang Bing Li