http://arxiv.org/abs/2511.09890v3 A Clustering Approach for Basket Trials Based on Treatment Response Trajectories 2026-06-03T23:27:09Z

Heterogeneity in efficacy is sometimes observed across baskets in basket trials. In this study, we propose a model-free clustering framework that groups baskets based on transition probabilities derived from the trajectories of treatment response, rather than relying solely on a single efficacy endpoint such as the objective response rate. The number of clusters is not predetermined but is automatically determined in a data-driven manner based on the similarity structure among baskets. After clustering, baskets within the same cluster are analyzed using a hierarchical Bayesian model. This framework aims to improve the estimation precision of efficacy endpoints and enhance statistical power while maintaining the type I error rate at the nominal level. The performance of the proposed method was evaluated through simulation studies. The results demonstrated that the proposed method can accurately identify cluster structures in heterogeneous settings and, even under such conditions, maintain the type I error rate at the nominal level while improving statistical power.

2025-11-13T02:51:25Z Masahiro Kojima Keisuke Hanada Atsuya Sato http://arxiv.org/abs/2510.05085v2 WOW: WAIC-Optimized Gating of Mixture Priors for External Data Borrowing 2026-06-03T23:02:50Z

The integration of external data using Bayesian mixture priors has become a powerful approach in clinical trials, offering significant potential to improve trial efficiency. Despite their strengths in analytical tractability and practical flexibility, existing methods such as the robust meta-analytic-predictive (rMAP) and self-adapting mixture (SAM) often presume borrowing without rigorously assessing whether external information is appropriate to incorporate. When external and concurrent data are discordant, excessive borrowing can bias estimation and lead to misleading conclusions. To address this, we introduce WOW, a Kullback-Leibler-based gating strategy guided by the widely applicable information criterion (WAIC). Within the mixture-prior framework, WAIC-Optimized Weighting (WOW) conducts a preliminary compatibility assessment between external and concurrent trial data to determine eligibility for borrowing. Only if this gating criterion is satisfied does borrowing proceed; a downstream mixture prior procedure, using user-specified fixed or adaptive weights, can then be applied to determine the amount of borrowing. Simulation studies demonstrate that incorporating the WOW strategy before Bayesian mixture prior borrowing methods effectively mitigates excessive borrowing and improves estimation accuracy. A real-data illustration further highlights the feasibility and interpretability of the proposed gate-then-borrow strategy. By providing a practical safeguard against inappropriate borrowing, WOW strengthens the reliability of mixture-prior methods and supports better decision-making in clinical trials.

2025-10-06T17:53:05Z Shouhao Zhou Qiuxin Gao Chenqi Fu Yanxun Xu http://arxiv.org/abs/2606.05488v1 Sparse Functional Singular Value Decomposition for Biclustering and Triclustering Longitudinal Data 2026-06-03T22:26:01Z

Identifying subtypes of complex conditions, such as Inflammatory Bowel Disease (IBD), often requires capturing latent patterns in longitudinal omics data. However, these data are typically high-dimensional, sparsely sampled, and irregularly observed over time, posing substantial challenges for conventional (bi)clustering and functional data analysis methods. We propose Tri-SfSVD, a unified sparse functional Singular Value Decomposition framework for discovering biclusters and triclusters in longitudinal data. Unlike existing functional biclustering methods that rely on ad hoc imputation or enforce restrictive shape-homogeneity assumptions, Tri-SfSVD integrates continuous trajectory estimation with simultaneous subject, feature, and temporal selection within a single optimization framework. By imposing sparse penalties across subjects, variables, and temporal subregions, the proposed method works directly on observed data to uncover localized structures at the subject, subject-feature, and subject-feature-time levels. Extensive simulations demonstrate that Tri-SfSVD outperforms existing approaches in high-dimensional settings. Applied to IBD multi-omics data, the method identified three biclusters linking sample clusters with distinct IBD-related clinical characteristics to microbial pathway groups associated with specific bacterial taxa, providing interpretable subject-pathway associations for characterizing disease heterogeneity. Applied to multi-channel EEG data, the method identified three triclusters linking sample clusters with distinct alcohol-related phenotypes to localized brain activity patterns, including subgroup differences separated by temporal subregions within the same spatial region.

2026-06-03T22:26:01Z Yue Zhao Thierry Chekouo Sandra Safo http://arxiv.org/abs/2606.09892v1 LMT: A Bayesian Framework for Causal Discovery from Textual Alarm Records in Manufacturing Systems 2026-06-03T19:42:17Z

Textual event records, such as alarm logs, have become an increasingly common data source in engineering and manufacturing systems. Beyond identifying correlations or recurring patterns, engineers are often interested in understanding which types of events causally trigger or influence other events during system operation. Textual event descriptions may contain semantic clues about such causal relationships, and recent large language models (LLMs) provide a promising tool for extracting these signals. However, relying solely on LLM-encoded textual information is insufficient for accurate causal discovery, since semantic patterns do not directly reveal causal mechanisms and may confuse causation with correlation or frequent sequential patterns. To address these challenges, we propose LMT, a Bayesian causal discovery framework for engineering event data that jointly leverages textual descriptions and timestamps. Specifically, LMT first uses LLMs to extract semantic causal signals from event descriptions and constructs a prior distribution over causal graphs among event types or event clusters. It then incorporates temporal evidence through a Poisson-process-based likelihood, allowing the LLM-informed prior to be refined by timestamp-based statistical evidence. By integrating the textual and temporal information, LMT produces a causal graph that is both interpretable and data-supported. Simulation studies show that the proposed framework is effective across different settings and is especially advantageous in small-sample alarm-event scenarios.

2026-06-03T19:42:17Z 19 pages Xiaofeng Xiao Jianhong Chen Qiuzhuang Sun Naichen Shi Xubo Yue http://arxiv.org/abs/2606.05374v1 Analyzing spatial point processes degraded by displacement and imperfect detection 2026-06-03T19:21:53Z

Spatial point processes are a valuable tool for probabilistic modeling to explain location data. However, the data themselves are often observed imperfectly. In order to perform accurate inference, one must account for these imperfections, which we refer to as degradation. We consider two forms of degradation for spatial Poisson processes: thinning and displacement. First, we provide some theoretical results on model identifiability, showing that, under weak conditions, one can jointly learn the scale of the displacement, a parametric form of thinning, and a nonparametric intensity function. The ability to learn all of these components and the resulting improvements for inference compared to the conceptual non-degraded but misspecified model are shown empirically via simulation study. Finally, we apply this approach to North Atlantic right whale call data from Cape Cod Bay.
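The two forms of degradation named in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' model: the homogeneous intensity, the linear detection probability, the Gaussian displacement, and the function name `degrade_points` are all hypothetical choices made for illustration.

```python
import numpy as np

def degrade_points(points, detect_prob, displacement_sd, rng):
    """Thin a point pattern, then displace the points that survive.

    points          : (n, 2) array of true locations
    detect_prob     : callable mapping (n, 2) locations to detection probabilities
    displacement_sd : standard deviation of the isotropic Gaussian displacement
    """
    # Thinning: each point is kept independently with its detection probability.
    keep = rng.random(len(points)) < detect_prob(points)
    retained = points[keep]
    # Displacement: each retained point is observed with additive Gaussian error.
    return retained + rng.normal(0.0, displacement_sd, size=retained.shape)

rng = np.random.default_rng(1)
# Homogeneous Poisson process on the unit square with expected count 500 (assumed).
n_points = rng.poisson(500)
points = rng.random((n_points, 2))
# Hypothetical detection probability that increases with the first coordinate.
observed = degrade_points(points, lambda p: 0.2 + 0.6 * p[:, 0], 0.02, rng)
```

Fitting a model to `observed` as if it were the true pattern ignores both the lost points and the location error, which is the misspecification the abstract refers to.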

2026-06-03T19:21:53Z Kevin M. Collins Erin M. Schliep Alan E. Gelfand Tina M. Yack Christopher W. Clark Robert S. Schick http://arxiv.org/abs/2603.12427v2 Variational Bayes and Truncation approximations for Enriched Dirichlet process mixtures 2026-06-03T18:32:43Z

A common impediment to inference for Bayesian nonparametric models is the need for complex MCMC algorithms, long computational run-times for large datasets, or both. We propose solutions here for Enriched Dirichlet process mixtures (EDPM). We derive a variational Bayes estimator based on a previously developed truncation approximation for EDPMs. The variational Bayes estimator can be used in two ways: 1) to develop a more efficient truncation approximation; 2) as good initial values for a blocked Gibbs sampler based on this more efficient truncation approximation or for a Pólya urn sampler. We derive the accuracy of this more efficient truncation approximation and demonstrate how this allows for simple implementation of a blocked Gibbs sampler for EDPMs in NIMBLE. We confirm the validity of the approximations by simulations and illustrate the approach on a real data set.

2026-03-12T20:20:18Z Somnath Bhadra Michael J. Daniels http://arxiv.org/abs/2509.11381v3 Accuracy Limits of Causal Trees for Individualized Treatment Effects 2026-06-03T18:26:24Z

Recursive decision trees are widely used to estimate heterogeneous causal treatment effects in experimental and observational studies. These methods are typically implemented using CART-type recursive partitioning, with splitting criteria designed to identify variation in treatment effects across covariate-defined subgroups. We study causal tree estimators based on adaptive recursive partitioning and establish lower bounds on their estimation accuracy. The class we analyze includes versions with and without sample splitting, based on common treatment effect and squared-error splitting criteria. Even in a constant-effect benchmark with randomized treatment assignment, causal trees constructed via standard CART-type splitting rules can have uniform-norm errors that decrease more slowly than any power of the sample size. The underlying mechanism is that greedy recursive partitioning selects highly imbalanced splits with nonvanishing probability, producing terminal nodes containing very few observations and leading to large estimation variance. We further show that sample splitting, often called "honesty," does not remove this limitation. As a consequence, causal tree estimators may converge arbitrarily slowly uniformly over the covariate space. At the same time, these estimators can have small integrated mean squared error, showing that average accuracy can mask local inaccuracy. Our results also clarify the role of balanced partition assumptions in existing theoretical guarantees for causal forests and related ensemble methods.

2025-09-14T18:29:45Z Matias D. Cattaneo Jason M. Klusowski Ruiqi Rae Yu http://arxiv.org/abs/2606.05324v1 Optimizing Irreversible Perturbations of the Unadjusted Langevin Algorithm 2026-06-03T18:10:12Z

Irreversible perturbations accelerate the convergence of Langevin dynamics, breaking detailed balance while preserving the invariant measure. The design of optimal irreversible perturbations has been studied in the continuous-time Gaussian setting, but extensions to non-Gaussian target distributions, and the impact of time discretization on the design of optimal perturbations, have not been well understood. Numerical discretizations of Langevin dynamics introduce bias, which is typically exacerbated by irreversible perturbations; handling this interaction demands a joint treatment of acceleration and accuracy. This paper develops a systematic framework for optimizing position-independent irreversible perturbations of the unadjusted Langevin algorithm (ULA). We formulate a constrained optimization problem that simultaneously accounts for mixing efficiency and discretization bias, where the former is characterized by a spectral gap analogue and the latter is quantified via a weighted expected squared jump distance. Within this framework, we derive an explicit characterization of the optimal position-independent irreversible perturbation. Extensive numerical experiments demonstrate that our design yields faster convergence with controlled bias, and improves mean squared estimation errors compared to other choices of irreversible perturbation.
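For readers unfamiliar with the object studied in the abstract above, the following sketch shows one common way a position-independent irreversible perturbation enters ULA: the drift is multiplied by I + J with J a constant skew-symmetric matrix. The function name `perturbed_ula`, the Gaussian target, the step size, and the particular J are illustrative assumptions. Nothing here reproduces the paper's optimal design.

```python
import numpy as np

def perturbed_ula(grad_log_pi, x0, step, n_steps, J, rng):
    """Unadjusted Langevin algorithm with drift (I + J) grad log pi.

    J must be skew-symmetric; in continuous time such a perturbation leaves
    the target pi invariant, while the discretized chain is biased.
    """
    J = np.asarray(J, dtype=float)
    if not np.allclose(J, -J.T):
        raise ValueError("J must be skew-symmetric")
    x = np.asarray(x0, dtype=float).copy()
    drift_matrix = np.eye(x.size) + J
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        noise = rng.standard_normal(x.size)
        # Euler-Maruyama step of dX = (I + J) grad log pi(X) dt + sqrt(2) dW.
        x = x + step * drift_matrix @ grad_log_pi(x) + np.sqrt(2.0 * step) * noise
        samples[k] = x
    return samples

# Standard Gaussian target in two dimensions: grad log pi(x) = -x.
rng = np.random.default_rng(0)
J = np.array([[0.0, 1.0], [-1.0, 0.0]])  # hypothetical perturbation
samples = perturbed_ula(lambda x: -x, np.array([3.0, -3.0]), 0.05, 20000, J, rng)
```

For this linear example the stationary variance of the discretized chain works out to 1 / (1 - step) per coordinate, compared with 1 / (1 - step / 2) for plain ULA (J = 0), a small instance of the perturbation enlarging the discretization bias that the abstract mentions.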

2026-06-03T18:10:12Z 60 pages, 30 figures, 1 algorithm, 1 table Qianyu Julie Zhu Youssef Marzouk Konstantinos Spiliopoulos Benjamin Zhang http://arxiv.org/abs/2606.05317v1 A Family of Quantile Functions Useful in Clinical Studies 2026-06-03T18:06:26Z

Motivated by upper-tail quantile-domain summaries, we study the quantile-based effectiveness persistence function defined as the ratio between the tail mean and the quantile function. We derive statistical properties of this measure and consider a rational (Möbius) specification of the quantile-based effectiveness persistence function. Under natural boundary conditions, this specification reduces to a canonical form. The resulting canonical family defines a two-parameter class of nonnegative distributions through its quantile function. Various properties, including descriptive measures, L-moments, and quantile-based reliability concepts, are derived for this class. Estimation of the model parameters using maximum likelihood is also developed. The proposed family is illustrated using a real survival dataset.

2026-06-03T18:06:26Z Sankaran P. G. Prasanth V. P. Midhu N. N http://arxiv.org/abs/2205.08609v3 Bagged Polynomial Regression and Neural Networks 2026-06-03T17:47:33Z

Climate and environmental applications increasingly rely on high-dimensional prediction from remote sensing and other scientific data. Neural networks (NN) can deliver strong accuracy in these settings, but they are often hard to audit and hard to align with domain knowledge. As an alternative, we propose bagged polynomial regression with random projections (BPR), an econometrics-native ensemble that averages many regularized low-degree polynomial models fit on randomly selected covariate groups. We provide novel finite-sample and asymptotic risk bounds and show how covariate partitioning can improve rates for smooth target functions by controlling dictionary basis growth. Rate improvements may be particularly relevant for the estimation of marginal effects. In an application to satellite-based crop classification using optical and radar imagery, BPR matches NN accuracy while remaining straightforward to diagnose. We provide practical transparency tools, coefficient summaries and partial-dependence diagnostics, that show BPR captures intuitive feature relationships that NNs do not.

2022-05-17T19:55:56Z Sylvia Klosin Jaume Vives-i-Bastida http://arxiv.org/abs/2601.05669v2 Two-Stage Robust Sparse Gradient Methods for Regression Under Heavy-Tailed Designs 2026-06-03T16:09:43Z

We study high-dimensional sparse regression under simultaneous heavy-tailed covariates and noise. Heavy-tailed data affect sparse optimization in two different ways: extreme covariates can destabilize the gradient field during global localization, while heavy-tailed noise limits the final statistical accuracy during local refinement. Motivated by this two-phase structure, we propose two-stage RIGHT, a robust sparse first-order method based on coordinate-wise median-of-means (MoM) gradient estimation and delayed sample splitting. The MoM gradient estimator is computationally simple, compatible with hard-thresholded updates, and admits phase-adaptive concentration bounds whose rates depend on the current localization radius. Delayed splitting reuses data during global localization and reserves fresh batches for the shorter refinement stage, reducing the sample-splitting cost. The theoretical results reveal a decoupled rate structure: the design-tail index controls gradient stability and sample complexity, whereas the noise-tail index controls the final statistical rate. We also provide phase-wise lower-bound benchmarks showing that the design-driven localization barrier is intrinsic. Extensive simulation experiments and real data analysis showcase the efficacy of the proposed method over existing competitors.
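The two building blocks named in the abstract above, coordinate-wise median-of-means (MoM) gradient estimation and hard-thresholded updates, can be sketched as follows. This is a generic single-stage illustration for squared loss, not the two-stage RIGHT procedure: there is no delayed sample splitting, and the function names, step size, block count, and simulated Student-t data are all assumptions.

```python
import numpy as np

def mom_gradient(X, y, beta, n_blocks, rng):
    """Coordinate-wise median-of-means estimate of the squared-loss gradient."""
    blocks = np.array_split(rng.permutation(X.shape[0]), n_blocks)
    # One empirical gradient per block, then a median across blocks per coordinate.
    block_grads = np.stack(
        [X[idx].T @ (X[idx] @ beta - y[idx]) / len(idx) for idx in blocks]
    )
    return np.median(block_grads, axis=0)

def hard_threshold(v, sparsity):
    """Keep the `sparsity` largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-sparsity:]
    out[keep] = v[keep]
    return out

def robust_sparse_descent(X, y, sparsity, step, n_iter, n_blocks, rng):
    """Iterative hard thresholding driven by MoM gradients (illustrative)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = mom_gradient(X, y, beta, n_blocks, rng)
        beta = hard_threshold(beta - step * grad, sparsity)
    return beta

# Simulated heavy-tailed design and noise (Student-t, assumed for illustration).
rng = np.random.default_rng(0)
n, p = 2000, 50
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.standard_t(df=4, size=(n, p))
y = X @ beta_true + rng.standard_t(df=3, size=n)
beta_hat = robust_sparse_descent(
    X, y, sparsity=3, step=0.2, n_iter=100, n_blocks=20, rng=rng
)
```

The median across blocks limits the influence of a few extreme rows of `X` on any single gradient coordinate, which is the stabilizing role the abstract attributes to the MoM estimator.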

2026-01-09T09:40:21Z Kaiyuan Zhou Xiaoyu Zhang Wenyang Zhang Di Wang http://arxiv.org/abs/2606.05026v1 Removal of Multivariate Environmental Influences in Structural Health Monitoring through Conditional Covariances and Supervised Learning 2026-06-03T15:55:24Z

In structural health monitoring (SHM) systems, data is collected from a multitude of sensors measuring, for example, vibration or strain in the structure, along with additional features that capture environmental or operational information. It is well known that changes in the measured sensor outputs do not necessarily originate from structural damage but are often induced by environmental changes. One popular approach to account for these effects is regressing the system outputs on the confounding factors, also known as "response surface modeling". Afterward, the predicted values are subtracted from the observed ones to obtain corrected data with the environmental effects (supposedly) removed. However, the evaluation of real-world SHM data shows that environmental conditions may affect not only the expected output values but also higher-order statistical moments, particularly the variances of and the covariances and correlations between the output quantities, such as eigenfrequencies of different modes or strain sensors at different locations. By construction, the (supervised) machine learning techniques commonly used for response surface modeling cannot account for those higher-order effects. To address these issues, we present and discuss several approaches for identifying and quantifying multivariate confounding effects on output covariances and correlations: a nonparametric, kernel-based estimator, a random forest, a semiparametric additive model, and a deep learning approach. Furthermore, we show how the resulting conditional covariance matrices can be used in an SHM pipeline. We compare the competing methods on both artificial data and real-world load test data from the Vahrendorfer Stadtweg bridge in Hamburg, Germany, as well as eigenfrequency data from the railway bridge KW51 near Leuven, Belgium.
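The response-surface step described in the abstract above, and the reason it leaves second-order effects behind, can be seen in a small synthetic example. This is a sketch under assumptions: a linear regression stands in for the supervised learner, and the temperature-dependent noise level and the name `remove_mean_effect` are hypothetical.

```python
import numpy as np

def remove_mean_effect(env, outputs):
    """Regress each output on the environmental variable and return residuals."""
    design = np.column_stack([np.ones(len(env)), env])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return outputs - design @ coef

rng = np.random.default_rng(0)
n = 2000
temperature = rng.uniform(0.0, 1.0, size=n)
# Both the mean and the noise level of the outputs depend on temperature.
noise_sd = 0.1 + 0.5 * temperature
outputs = np.column_stack([
    5.0 - 0.8 * temperature + noise_sd * rng.standard_normal(n),
    9.0 - 1.2 * temperature + noise_sd * rng.standard_normal(n),
])
residuals = remove_mean_effect(temperature, outputs)

# The corrected data are uncorrelated with temperature in the mean,
# but their variance still differs between warm and cold conditions.
hot = temperature > 0.5
var_hot = residuals[hot].var(axis=0)
var_cold = residuals[~hot].var(axis=0)
```

Here `var_hot` is clearly larger than `var_cold` even though the mean trend has been removed, which is the kind of remaining environmental influence that conditional covariance estimation targets.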

2026-06-03T15:55:24Z 25 pages, 8 figures Lizzie Neumann Philipp Wittenberg Jan Gertheiss http://arxiv.org/abs/2603.13704v2 A Kernel-Based Nonparametric Test for Conditional Independence of Functional Data 2026-06-03T15:49:44Z

Conditional independence is a fundamental concept in many areas of statistical research, including, for example, sufficient dimension reduction, causal inference, and statistical graphical models. In many modern applications, data arise in the form of random functions, making it important to determine whether two random functions are conditionally independent given a third. However, to the best of our knowledge, existing conditional independence tests in the literature apply only to multivariate data, and extensions to the functional setting are not available. To fill this gap, we develop a kernel-based test for conditional independence of random functions based on the conjoined conditional covariance operator (CCCO). We rigorously derive the asymptotic distribution of the CCCO estimator using a recently established sharpened convergence rate for the regression operator (Choi et al., 2026). Based on this result, we construct a test statistic using the spectral decomposition of the operator appearing in the asymptotic distribution. The proposed method is illustrated through applications to an activity and biometrics dataset and a macroeconomic dataset.

2026-03-14T02:20:05Z Yin Tang Bing Li http://arxiv.org/abs/2605.20468v2 CASCADE Conformal Prediction: Uncertainty-Adaptive Prediction Intervals for Two-Stage Clinical Decision Support 2026-06-03T15:45:32Z

Effective medication management in Parkinson's Disease (PD) is challenging due to heterogeneous disease progression, variable patient response, and medication side effects. While AI models can forecast levodopa equivalent daily dose (LEDD) as a measure of medication needs, standard uncertainty quantification often fails to communicate the reliability of these predictions, treating high- and low-confidence clinical decisions identically. We introduce CASCADE (Calibrated Adaptive Scaling via Conformal And Distributional Estimation), a novel conformal prediction framework that propagates epistemic uncertainty from a screening classifier to adapt downstream predictions. Unlike standard conformal methods that rely on auxiliary residual regression, we leverage epistemic uncertainty from a primary classification task (identifying whether a medication change is needed) to dynamically scale the prediction intervals of a secondary regression task (predicting how much change). By mapping Venn-Abers multi-probabilistic uncertainty directly to non-conformity scores, our framework achieves continuous risk adaptation. We demonstrate that this "cascade effect" produces highly efficient intervals for confident patients (38.9% narrower than standard conformal baselines) while automatically expanding intervals to ensure robust coverage for uncertain cases, bridging the gap between discrete clinical decision-making and continuous dose forecasting in PD.

2026-05-19T20:30:10Z Accepted to ICML 2026 AgenticUQ Workshop. 14 Pages, 3 Figures Ricardo Diaz-Rincon Muxuan Liang Adolfo Ramirez-Zamora Benjamin Shickel http://arxiv.org/abs/2604.14486v3 Tweedie Calculus 2026-06-03T15:19:13Z

Tweedie's formula, which under Gaussian noise expresses the posterior mean of a latent variable directly from the observed-data density, is a cornerstone of empirical Bayes and measurement-error analysis. No general theory, however, explains when analogous identities hold, how they are structured, or how to derive them for non-Gaussian noise and for posterior functionals other than the mean. This paper develops such a framework for additive-noise models. I characterize when conditional expectations of an unobserved latent variable, given the observed signal, admit direct expressions in terms of the observed density -- identities I call Tweedie representations -- and show that they are governed by a linear map, the Tweedie functional. Under general conditions, I prove that this functional exists, is unique, and is continuous. I provide a constructive method for its computation based on Fourier analysis: the functional is obtained by extending the inverse Fourier transform of an explicit tempered distribution. The theory yields posterior-mean formulas for non-Gaussian noise and provides new representations for nonlinear posterior functionals. Applications include Laplace mechanisms in differential privacy and heteroskedastic Gaussian sequence models in compound decision problems.
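For reference, the classical Gaussian case that the abstract above starts from can be written out as follows. The symbols (latent variable \theta with prior G, noise variance \sigma^2, marginal density p, Gaussian density \varphi_\sigma) are notation assumed here, not taken from the paper.

```latex
% Model: Y = \theta + \varepsilon, with \varepsilon \sim N(0, \sigma^2) independent of \theta,
% and p the marginal density of Y.
\[
  \mathbb{E}[\theta \mid Y = y] \;=\; y + \sigma^{2}\,\frac{d}{dy}\log p(y).
\]
% Sketch of the derivation: p(y) = \int \varphi_\sigma(y - \theta)\, dG(\theta), so
% p'(y) = \int \frac{\theta - y}{\sigma^{2}}\,\varphi_\sigma(y - \theta)\, dG(\theta)
%       = \frac{p(y)}{\sigma^{2}}\,\bigl(\mathbb{E}[\theta \mid Y = y] - y\bigr),
% and dividing by p(y) and rearranging gives the identity.
```

The right-hand side depends only on the observed-data density p, which is the property the abstract generalizes to other noise distributions and other posterior functionals.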

2026-04-15T23:53:41Z Santiago Torres