https://arxiv.org/api/vMFqEE4mE1g7ZHJ+kRzwAhaFBto 2026-06-13T18:40:18Z 23522 105 15 http://arxiv.org/abs/2511.09890v3 A Clustering Approach for Basket Trials Based on Treatment Response Trajectories 2026-06-03T23:27:09Z

Heterogeneity in efficacy is sometimes observed across baskets in basket trials. In this study, we propose a model-free clustering framework that groups baskets based on transition probabilities derived from the trajectories of treatment response, rather than relying solely on a single efficacy endpoint such as the objective response rate. The number of clusters is not predetermined but is automatically determined in a data-driven manner based on the similarity structure among baskets. After clustering, baskets within the same cluster are analyzed using a hierarchical Bayesian model. This framework aims to improve the estimation precision of efficacy endpoints and enhance statistical power while maintaining the type I error rate at the nominal level. The performance of the proposed method was evaluated through simulation studies. The results demonstrated that the proposed method can accurately identify cluster structures in heterogeneous settings and, even under such conditions, maintain the type I error rate at the nominal level while improving statistical power.
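The abstract does not specify how transition probabilities are estimated or how the similarity structure is turned into clusters. As a rough illustration only, the sketch below assumes each patient's trajectory is a sequence of discrete response states (a hypothetical coding 0 = PD, 1 = SD, 2 = PR, 3 = CR), estimates a first-order transition matrix per basket by counting, and groups baskets whose matrices lie within a distance threshold, so the number of clusters falls out of the data rather than being fixed. The Frobenius distance, the threshold value, and single-linkage grouping are assumptions, not the authors' method.

```python
import numpy as np

N_STATES = 4  # hypothetical coding: 0=PD, 1=SD, 2=PR, 3=CR


def transition_matrix(trajectories, n_states=N_STATES):
    """Estimate a first-order transition matrix from a list of state sequences."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        for a, b in zip(traj[:-1], traj[1:]):
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed departures are left as all zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def cluster_baskets(matrices, threshold):
    """Single-linkage grouping: baskets closer than `threshold` end up together."""
    k = len(matrices)
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(k):
        for j in range(i + 1, k):
            # np.linalg.norm of a 2-D array defaults to the Frobenius norm.
            if np.linalg.norm(matrices[i] - matrices[j]) < threshold:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(k)]
    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    labels = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return [labels[r] for r in roots]


baskets = {  # hypothetical trajectories for three baskets
    "A": [[1, 2, 3], [1, 2, 2]],
    "B": [[1, 2, 3], [1, 2, 2]],
    "C": [[1, 0, 0], [1, 1, 0]],
}
mats = [transition_matrix(t) for t in baskets.values()]
print(cluster_baskets(mats, threshold=0.5))  # [0, 0, 1]
```

Baskets A and B share identical transition patterns and are grouped, while C, dominated by transitions into progression, forms its own cluster. In the framework described above, each resulting cluster would then be analyzed with a hierarchical Bayesian model, which this sketch does not cover.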

2025-11-13T02:51:25Z Masahiro Kojima Keisuke Hanada Atsuya Sato http://arxiv.org/abs/2606.05420v1 Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers 2026-06-03T20:38:10Z

The rapid proliferation of hyperscale data centers (HDCs) in the US, mainly driven by the adoption of artificial intelligence, has raised concerns about this industry's environmental footprint. We compiled facility-level information on 403 US hyperscale data centers operating between May 2024 and April 2025 and estimated their electricity consumption, electricity sources, and attributable CO2 emissions. Across different facility-load scenarios, these HDCs consumed approximately 68-99 TWh of electricity and were associated with about 37-54 million metric tons of CO2. Under the central scenario, HDC electricity demand corresponded to approximately 1.8% of total US electricity consumption, with roughly 54% of attributed generation supplied by fossil-fuel sources. The HDC electricity-weighted average carbon intensity was approximately 545 gCO2/kWh, about 48% above the contemporaneous US national grid-average carbon intensity of 370 gCO2/kWh. Our approach provides an attributional tool for assessing the environmental footprint of hyperscale data centers using the most recent EPA eGRID plant-level data.
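The reported ranges are internally consistent: dividing the emission bounds by the electricity bounds gives an implied intensity close to the stated 545 gCO2/kWh. A quick unit check, using only the figures quoted above (1 million metric tons per TWh equals 1 kg/kWh, i.e. 1000 g/kWh):

```python
def implied_intensity_g_per_kwh(emissions_mt, electricity_twh):
    """Convert million metric tons of CO2 per TWh into grams of CO2 per kWh.

    1 Mt = 1e12 g and 1 TWh = 1e9 kWh, so Mt/TWh * 1000 = g/kWh.
    """
    return emissions_mt / electricity_twh * 1000.0


low = implied_intensity_g_per_kwh(37, 68)   # lower-bound scenario
high = implied_intensity_g_per_kwh(54, 99)  # upper-bound scenario
ratio_to_grid = 545 / 370                   # HDC intensity relative to the grid average
print(round(low), round(high), round(ratio_to_grid, 2))  # 544 545 1.47
```

With the rounded published figures the ratio is about 1.47 (47% above the grid average); the stated 48% presumably reflects unrounded inputs.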

2026-06-03T20:38:10Z Gianluca Guidi Francesca Dominici Tiziano Squartini Callaway Sprinkle Jonathan Gilmour Kevin Butler Eric Bell Scott Delaney Falco J. Bargagli-Stoffi http://arxiv.org/abs/2604.26634v2 Electricity price forecasting across Norway's five bidding zones in the post-crisis era 2026-06-03T19:38:26Z

Norway's electricity market is heavily dominated by hydropower, but the 2021-2022 energy crisis and stronger integration with Continental Europe have fundamentally altered price formation, reducing the reliability of forecasting models calibrated on historical data. Despite the critical need for updated models, a unified benchmark evaluating feature contributions across all structurally diverse Norwegian bidding zones is still lacking. Here we present a comprehensive evaluation of one-step-ahead forecasting of the Nord Pool market across all five Norwegian bidding zones. We constructed a multimodal hourly dataset spanning 2019-2025 and evaluated eight forecasting model families, including Light Gradient Boosting Machine (LightGBM), autoregressive models with exogenous variables, and advanced deep learning architectures, using a strictly causal test set. We implemented robust rolling-origin backtesting, leave-one-group-out feature ablation, and conditional regime analysis to dissect model performance and feature utility. Our results show that LightGBM achieves the best performance in every zone, with mean absolute error ranging from 1.60 to 5.58 euros per megawatt-hour, while a ridge-regularized autoregressive model with exogenous variables remains a highly competitive linear benchmark in northern zones. Feature ablation reveals that models relying solely on lagged prices and calendar variables achieve high accuracy and often match or closely approach the performance of the full multimodal model. However, conditional regime analysis demonstrates that external features such as reservoir levels and gas prices remain crucial for stratifying forecast errors, which consistently increase under stressed market regimes. This highlights the practical value of model interpretability and regime awareness for decision-makers facing structural changes in market dynamics.
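A minimal sketch of rolling-origin, one-step-ahead backtesting, assuming a ridge-regularized ARX model built from lagged prices plus an exogenous matrix. The lag set, ridge penalty, daily refit schedule, and synthetic data are illustrative choices, not the paper's configuration. Each forecast for hour t uses only lagged prices and exogenous values at t, so the exogenous columns must be known at forecast time (for example, calendar features) to keep the evaluation causal.

```python
import numpy as np


def make_design(prices, exog, lags):
    """Row t holds [1, prices[t-l] for l in lags, exog[t]] and targets prices[t]."""
    start = max(lags)
    rows = []
    for t in range(start, len(prices)):
        rows.append(np.concatenate(([1.0], prices[[t - l for l in lags]], exog[t])))
    return np.array(rows), prices[start:]


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution; the intercept column is not penalized."""
    penalty = alpha * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


def rolling_origin_mae(prices, exog, lags=(1, 2, 24), alpha=1.0, initial_train=500, step=24):
    """Refit on an expanding window at each origin and score the next `step` one-step-ahead forecasts."""
    X, y = make_design(prices, exog, lags)
    errors = []
    for origin in range(initial_train, len(y), step):
        beta = ridge_fit(X[:origin], y[:origin], alpha)   # train only on rows before the origin
        stop = min(origin + step, len(y))
        errors.extend(np.abs(X[origin:stop] @ beta - y[origin:stop]))
    return float(np.mean(errors)), len(errors)


rng = np.random.default_rng(0)
hours = np.arange(2000)
exog = np.column_stack([np.sin(2 * np.pi * hours / 24)])  # calendar-like feature known in advance
prices = 50 + 10 * exog[:, 0] + rng.normal(0, 1, hours.size)  # synthetic prices, not Nord Pool data
mae, n = rolling_origin_mae(prices, exog)
print(round(mae, 2), n)
```

A leave-one-group-out ablation could reuse the same loop, dropping one block of exogenous columns at a time and comparing the resulting MAE with the full model.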

2026-04-29T13:02:02Z This version removes variables unavailable at prediction time to eliminate look-ahead leakage, clarifies the forecasting task definition, and updates the results and discussion accordingly. All tables and figures have been recomputed. My Thi Diem Phan Trung Tuyen Truong Hoai Phuong Ha Dat Thanh Nguyen http://arxiv.org/abs/2606.05324v1 Optimizing Irreversible Perturbations of the Unadjusted Langevin Algorithm 2026-06-03T18:10:12Z

Irreversible perturbations accelerate the convergence of Langevin dynamics, breaking detailed balance while preserving the invariant measure. The design of optimal irreversible perturbations has been studied in the continuous-time Gaussian setting, but its extension to non-Gaussian target distributions, and the impact of time discretization on the design of optimal perturbations, remain poorly understood. Numerical discretizations of Langevin dynamics introduce bias, which is typically exacerbated by irreversible perturbations; handling this interaction demands a joint treatment of acceleration and accuracy. This paper develops a systematic framework for optimizing position-independent irreversible perturbations of the unadjusted Langevin algorithm (ULA). We formulate a constrained optimization problem that simultaneously accounts for mixing efficiency and discretization bias, where the former is characterized by a spectral gap analogue and the latter is quantified via a weighted expected squared jump distance. Within this framework, we derive an explicit characterization of the optimal position-independent irreversible perturbation. Extensive numerical experiments demonstrate that our design yields faster convergence with controlled bias and reduces mean squared estimation errors compared with other choices of irreversible perturbation.
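A common way to write a position-independent irreversible perturbation of ULA is to replace the drift -∇U with -(I + J)∇U for a constant skew-symmetric matrix J; for constant skew-symmetric J the continuous-time dynamics keep exp(-U) as invariant measure. The sketch below applies that update to a standard Gaussian target in two dimensions. The target, step size h, and the particular J are assumptions for illustration; J here is arbitrary, not the optimized perturbation derived in the paper.

```python
import numpy as np


def grad_U(x):
    """Gradient of U(x) = |x|^2 / 2, i.e. a standard Gaussian target (assumed for illustration)."""
    return x


def perturbed_ula(x0, J, h, n_steps, rng):
    """ULA with a constant irreversible drift: x <- x - h (I + J) grad U(x) + sqrt(2h) xi."""
    d = x0.size
    assert np.allclose(J, -J.T), "J must be skew-symmetric"
    drift = np.eye(d) + J
    x = x0.copy()
    samples = np.empty((n_steps, d))
    for k in range(n_steps):
        x = x - h * drift @ grad_U(x) + np.sqrt(2 * h) * rng.standard_normal(d)
        samples[k] = x
    return samples


rng = np.random.default_rng(1)
J = np.array([[0.0, 2.0], [-2.0, 0.0]])  # arbitrary skew-symmetric perturbation
chain = perturbed_ula(np.array([5.0, -5.0]), J, h=0.05, n_steps=20000, rng=rng)
burned = chain[2000:]  # discard burn-in
print(burned.mean(axis=0), burned.var(axis=0))
```

For this linear target the chain's stationary variance per coordinate is 2h / (1 - r^2) with r^2 = (1 - h)^2 + (2h)^2, about 1.14 here versus about 1.03 with J = 0: a small illustration of how the perturbation can enlarge the discretization bias, which is the trade-off the constrained optimization is meant to balance.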

2026-06-03T18:10:12Z 60 pages, 30 figures, 1 algorithm, 1 table Qianyu Julie Zhu Youssef Marzouk Konstantinos Spiliopoulos Benjamin Zhang http://arxiv.org/abs/2606.05308v1 Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference 2026-06-03T18:01:08Z

With PRECISE, we extend Prediction-Powered Inference (PPI) to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics such as Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking, showing +407 bps in daily sales.
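The basic PPI estimator for a mean, on which PRECISE builds, takes the judge's average over the large unlabeled set and subtracts a bias correction (the "rectifier") estimated on the small human-labeled set. If the labeled items are a random sample, the rectifier's expectation equals the judge's bias, which is why the estimate is unbiased whatever the judge's errors look like. The sketch below is that standard point estimate with a normal-approximation standard error for a simple mean; it does not include the paper's per-query Precision@K reduction, and the synthetic labels and judge are assumptions.

```python
import numpy as np


def ppi_mean(y_labeled, pred_labeled, pred_unlabeled):
    """Prediction-powered estimate of E[Y] with a normal-approximation standard error."""
    rectifier = pred_labeled - y_labeled          # judge error on human-labeled items
    estimate = pred_unlabeled.mean() - rectifier.mean()
    se = np.sqrt(pred_unlabeled.var(ddof=1) / pred_unlabeled.size
                 + rectifier.var(ddof=1) / rectifier.size)
    return estimate, se


rng = np.random.default_rng(2)
truth = rng.binomial(1, 0.6, size=10_030)                          # hypothetical relevance labels
judge = np.where(rng.random(truth.size) < 0.85, truth, 1 - truth)  # noisy stand-in for an LLM judge
est, se = ppi_mean(truth[:30], judge[:30], judge[30:])             # 30 human labels, 10,000 judged
print(round(est, 3), round(se, 3))
```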

2026-06-03T18:01:08Z Accepted at ACL 2026 - GEM Workshop Abhishek Divekar http://arxiv.org/abs/2512.20753v2 Algorithmic Bias in Lending: Evidence from a Fintech Audit 2026-06-03T16:22:45Z

Algorithmic lending has transformed the consumer credit landscape, with machine learning models commonly facilitating underwriting decisions. To comply with fair lending laws, these algorithms exclude legally protected characteristics, such as race and gender. Yet algorithmic underwriting can still inadvertently favor certain groups, prompting concerns about whether lending algorithms exhibit discriminatory behavior. Using proprietary loan-level data from a major U.S. fintech platform, we audit lending decisions across approximately 80,000 personal loans. We find that loans made to men and Black borrowers yielded lower profits than loans to other groups, suggesting that men and Black borrowers benefited from relatively favorable pricing. We trace these disparities to miscalibration in the platform's underwriting model, which overestimates risk for women and underestimates risk for Black borrowers. We then show that one could correct this miscalibration -- and the corresponding disparities -- by including race and gender in underwriting models, illustrating a tension between competing notions of fairness.
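Group-level miscalibration of the kind described above can be checked by comparing a model's average predicted default probability with the observed default rate within each group. A minimal sketch with made-up numbers and hypothetical group labels (the platform's data and model are proprietary); a positive gap means the model overestimates risk for that group, a negative gap that it underestimates it.

```python
from collections import defaultdict


def calibration_gap_by_group(groups, predicted_pd, defaulted):
    """Return {group: mean predicted default probability - observed default rate}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum of predictions, sum of defaults, count]
    for g, p, d in zip(groups, predicted_pd, defaulted):
        s = sums[g]
        s[0] += p
        s[1] += d
        s[2] += 1
    return {g: (s[0] - s[1]) / s[2] for g, s in sums.items()}


groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
predicted = [0.20, 0.20, 0.20, 0.20, 0.10, 0.10, 0.10, 0.10]
observed = [0, 0, 0, 0, 0, 1, 0, 0]
print(calibration_gap_by_group(groups, predicted, observed))
```

Here group A's risk is overestimated by about 0.20 and group B's underestimated by about 0.15; under risk-based pricing, such gaps would show up as the profit differences the audit describes.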

2025-12-23T20:26:38Z Madison Coots Robert Bartlett Julian Nyarko Sharad Goel http://arxiv.org/abs/2606.05258v1 Harnessing Source Heterogeneity for Cluster-Structured Transfer Learning 2026-06-03T16:09:15Z

Transfer learning is a natural strategy when a target population has limited data but multiple related auxiliary sources are available. A central difficulty is source heterogeneity: auxiliary sources may not be equally useful, and their usefulness may vary in a structured, cluster-like fashion. Existing transfer-learning methods often reduce source selection to a binary informative/non-informative decision, overlooking subgroups of sources with differential transferability. Motivated by a suicide-risk study using data from the Connecticut Hospital Information Management Exchange (CHIME), comprising 636,758 patients across 27 hospitals, we propose Trans-GLMC, a cluster-structured transfer-learning procedure for generalized linear models. The CHIME setting illustrates the core challenge: hospital-specific risk models are unstable because suicide attempts are rare at any single facility, whereas indiscriminate pooling across hospitals can obscure facility-level differences in patient mix and risk profiles. Trans-GLMC first constructs a coefficient-based distance among the target and candidate sources to recover latent source clusters. It then combines global fusion, within-cluster refinement, and target debiasing to produce an estimator that adapts to the detected structure. We establish a non-asymptotic error bound that improves over its unclustered counterpart whenever a meaningful target cluster exists and matches the unclustered rate up to constants otherwise. In simulations and in the CHIME study, Trans-GLMC improves facility-specific prediction, identifies interpretable communities of hospitals with mutual transferability, and recovers clinically coherent suicide-risk factors.

2026-06-03T16:09:15Z Xiaohui Yin Jun Jin Shane J. Sacco Robert H. Aseltine Kun Chen http://arxiv.org/abs/2605.20657v2 Cooling Channel Design Optimization for High Power Multi-Chip Packages 2026-06-03T16:06:20Z

Thermal management is a major challenge in next-generation high-performance computing systems, particularly for heterogeneous multi-chip packages such as the NVIDIA GB200 Grace Blackwell Superchip. In this work, a physics-based computational framework is developed to optimize embedded cooling channel layouts for high-power multi-chip modules. The model couples steady-state heat conduction with a porous-media representation of coolant transport and a row-wise coolant energy balance to estimate chip temperature fields within microchannel networks. In contrast to conventional designs, an interdigitated cooling architecture is parameterized by geometric variables, including channel count, width, and expansion over chip regions, enabling systematic design exploration. To enable efficient optimization, a surrogate-based approach is employed to approximate the relationship between geometric parameters and temperature metrics. The resulting surrogate is optimized using a mixed-integer quadratic programming algorithm to minimize a weighted objective based on peak and average chip temperatures. To improve physical relevance, channel placement is further constrained to increase cooling coverage near GPU regions, where thermal loads are highest. The framework is applied to a representative multi-chip configuration based on the NVIDIA GB200 architecture, consisting of two graphics processing units and one central processing unit. The results demonstrate that the optimal design reduces the peak chip temperature by 140.45°C and the average chip temperature by 35.87°C compared with the baseline configuration.

2026-05-20T03:21:57Z 9 pages, 8 figures Michael Acquah Zheng Liu http://arxiv.org/abs/2606.05026v1 Removal of Multivariate Environmental Influences in Structural Health Monitoring through Conditional Covariances and Supervised Learning 2026-06-03T15:55:24Z

In structural health monitoring (SHM) systems, data is collected from a multitude of sensors measuring, for example, vibration or strain in the structure, along with additional features that capture environmental or operational information. It is well known that changes in the measured sensor outputs do not necessarily originate from structural damage but are often induced by environmental changes. One popular approach to account for these effects is to regress the system outputs on the confounding factors, also known as "response surface modeling". The predicted values are then subtracted from the observed ones to obtain corrected data with the environmental effects (supposedly) removed. However, the evaluation of real-world SHM data shows that environmental conditions may affect not only the expected output values but also higher-order statistical moments, particularly the variances of the output quantities (such as eigenfrequencies of different modes or strain readings at different locations) and the covariances and correlations between them. By construction, the (supervised) machine learning techniques commonly used for response surface modeling cannot account for these higher-order effects. To address these issues, we present and discuss several approaches for identifying and quantifying multivariate confounding effects on output covariances and correlations: a nonparametric, kernel-based estimator, a random forest, a semiparametric additive model, and a deep learning approach. Furthermore, we show how the resulting conditional covariance matrices can be used in an SHM pipeline. We compare the competing methods on both artificial data and real-world load test data from the Vahrendorfer Stadtweg bridge in Hamburg, Germany, as well as eigenfrequency data from the railway bridge KW51 near Leuven, Belgium.
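One simple form of a nonparametric, kernel-based conditional covariance estimator is a Nadaraya-Watson-weighted covariance: observations whose environmental covariate (say, temperature) is close to the query value receive larger weights. The sketch assumes a single scalar covariate, a Gaussian kernel, a fixed bandwidth, and synthetic data in which the correlation between two outputs drifts with temperature; the paper's estimator may differ in these details and in how it handles multiple covariates.

```python
import numpy as np


def conditional_covariance(outputs, covariate, z0, bandwidth):
    """Kernel-weighted estimate of Cov(outputs | covariate = z0).

    outputs: (n, p) array of sensor outputs; covariate: (n,) environmental variable.
    """
    w = np.exp(-0.5 * ((covariate - z0) / bandwidth) ** 2)  # Gaussian kernel weights
    w = w / w.sum()
    mean = w @ outputs                                      # kernel-weighted conditional mean
    centered = outputs - mean
    return (centered * w[:, None]).T @ centered


rng = np.random.default_rng(3)
temp = rng.uniform(-10, 30, 5000)
rho = np.clip(0.9 - 0.03 * (temp + 10), -0.9, 0.9)  # correlation drifts with temperature
e1 = rng.standard_normal(temp.size)
e2 = rho * e1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(temp.size)
Y = np.column_stack([e1, e2])
cold = conditional_covariance(Y, temp, -5.0, bandwidth=2.0)
warm = conditional_covariance(Y, temp, 25.0, bandwidth=2.0)
print(cold.round(2), warm.round(2))
```

The two outputs are strongly correlated in cold conditions (about 0.75 by construction) and weakly correlated in warm ones (about -0.15), an effect that a mean-only response surface model would leave in the corrected data.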

2026-06-03T15:55:24Z 25 pages, 8 figures Lizzie Neumann Philipp Wittenberg Jan Gertheiss http://arxiv.org/abs/2605.25934v2 Weighted NPMLE for the Marginal Mean of Recurrent Events with a Competing Terminal Event 2026-06-03T14:37:34Z

Regression modeling of recurrent and terminal events continues to present methodological challenges in survival analysis. Existing approaches either make unverifiable assumptions about the dependency structure between the two event types or rely on the proportional intensity assumption for the marginal mean. We propose a semiparametric regression model based on a novel weighted likelihood function that directly targets the marginal mean of the recurrent event. Our general model captures a large class of semiparametric regression models and accommodates external time-dependent covariate effects on the marginal mean intensity. We establish the consistency and asymptotic normality of the estimators and propose a sandwich estimator of the variance. We also propose a novel simulation procedure that directly targets the marginal mean intensity of the recurrent events. In simulation studies, we demonstrate strong performance of the weighted nonparametric maximum likelihood estimator (NPMLE) under independent right-censoring. The practical utility of the proposed methodology is demonstrated through application to data from the STATCOPE trial, a large randomized clinical trial that investigated the efficacy of simvastatin for COPD exacerbations. We provide personalized predictions for the number of exacerbations and reassess the effect of simvastatin treatment, accounting for death as a competing terminal event for patients with GOLD stage 4.

2026-05-25T15:13:47Z Anna Bellach Michael R. Kosorok http://arxiv.org/abs/2403.00965v2 Binary Gaussian Copula Synthesis: an LLM-powered data augmentation framework for early dialysis prediction in chronic kidney disease 2026-06-03T14:24:58Z

Only a small fraction of patients with chronic kidney disease (CKD) progress to dialysis, creating severe class imbalance that limits the performance of machine learning models for early dialysis prediction. This challenge is compounded by the binary structure of electronic health record (EHR) data, for which most existing augmentation methods were not designed. We propose Binary Gaussian Copula Synthesis (BGCS), a two-stage data augmentation method tailored to binary clinical data. BGCS first generates synthetic minority-class samples using a Gaussian copula framework that explicitly models pairwise dependencies among binary features, then applies a fine-tuned GPT-2 classifier to filter out clinically implausible samples before training. We evaluated BGCS on a real-world EHR dataset of 15,169 patients with CKD from West Virginia collected between 2008 and 2022, benchmarking it against SMOTE, CTGAN, and standard Gaussian Copula across four machine learning classifiers over 25 independent runs. BGCS consistently outperformed all comparison methods, achieving the highest minority-class recall for 90-day dialysis prediction, with median values ranging from 0.78 to 0.87 across classifiers, and the strongest distributional fidelity to real data, with a mean p-value of 0.68 across features. The best-performing BGCS-augmented model was integrated into an interpretable decision tree-based clinical decision support system for dialysis risk stratification, with electrolyte imbalances, cardiovascular comorbidities, and renal monitoring indicators emerging as the most influential predictive features. These findings suggest that augmentation methods designed for the structural properties of binary EHR data can meaningfully improve early dialysis risk prediction and support the development of interpretable clinical decision-support tools for CKD care.
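The core idea of a Gaussian copula for binary features can be sketched as thresholding a correlated latent normal vector so that each feature keeps its observed prevalence, while the latent correlations induce pairwise dependence between features. This sketch covers only the sampling step of the first stage; estimating the latent correlation matrix from data and the GPT-2 filtering stage are not shown, and the prevalences and correlation matrix below are hypothetical inputs.

```python
import numpy as np
from statistics import NormalDist


def sample_binary_copula(prevalence, latent_corr, n, rng):
    """Draw n binary vectors whose j-th margin is Bernoulli(prevalence[j]).

    Latent Z ~ N(0, latent_corr); feature j is 1 when Z_j exceeds the
    (1 - prevalence[j]) standard-normal quantile.
    """
    thresholds = np.array([NormalDist().inv_cdf(1 - p) for p in prevalence])
    z = rng.multivariate_normal(np.zeros(len(prevalence)), latent_corr, size=n)
    return (z > thresholds).astype(int)


rng = np.random.default_rng(4)
prev = [0.30, 0.10, 0.50]                       # hypothetical feature prevalences
corr = np.array([[1.0, 0.6, 0.0],
                 [0.6, 1.0, 0.2],
                 [0.0, 0.2, 1.0]])              # hypothetical latent correlations (positive definite)
synthetic = sample_binary_copula(prev, corr, 20000, rng)
print(synthetic.mean(axis=0).round(3))
```

Because each threshold is set from the target prevalence, the margins are preserved exactly in expectation; the binary features inherit weaker but same-signed associations from the latent correlations.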

2024-03-01T20:32:17Z Hamed Khosravi Milad Khanchi Mobina Noori Srinjoy Das Abdullah Al-Mamun Imtiaz Ahmed http://arxiv.org/abs/2606.04900v1 Multi-objective probabilistic forecast combination for inventory demand 2026-06-03T14:01:13Z

Probabilistic forecasts are essential for inventory management, where decisions depend on the full distribution of future demand. While probabilistic forecast combination is widely used to improve statistical accuracy, most existing approaches optimize statistical loss alone and overlook operational objectives. However, in inventory settings, higher forecast accuracy does not necessarily translate into better decision performance, especially under nonlinear cost structures and multiple, potentially conflicting, decision targets. To address this gap, we propose a multi-objective probabilistic forecast combination framework that simultaneously considers forecast accuracy and inventory decision performance. The framework formulates forecast combination as a multi-objective optimization problem and derives a set of Pareto-optimal combinations, enabling explicit trade-offs between forecasting and operational goals. Empirical studies using Walmart retail data and Royal Air Force spare parts data demonstrate that the proposed approach achieves more balanced and robust performance than individual models, simple averaging, and single-objective optimization. Our results provide a practical and flexible framework for aligning probabilistic forecasting with inventory decision-making.
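Once each candidate combination has been scored on both objectives (for example, a probabilistic accuracy loss and an inventory cost, both to be minimized), the Pareto-optimal set consists of the candidates that no other candidate matches or beats on every objective while strictly beating on at least one. A minimal sketch for a two-model pool evaluated over a grid of weights; the objective values are made up, and the paper's combination scheme and cost model are not reproduced here.

```python
def pareto_front(scores):
    """Indices of non-dominated rows; every objective is to be minimized."""
    front = []
    for i, a in enumerate(scores):
        dominated = any(
            all(b_k <= a_k for b_k, a_k in zip(b, a)) and any(b_k < a_k for b_k, a_k in zip(b, a))
            for j, b in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front


weights = [0.0, 0.25, 0.5, 0.75, 1.0]  # weight on forecast 1 in a two-model pool
scores = [
    (1.00, 50.0),  # (accuracy loss, inventory cost), hypothetical values
    (0.90, 47.0),
    (0.85, 48.0),
    (0.88, 46.0),
    (0.95, 49.0),
]
front = pareto_front(scores)
print([weights[i] for i in front])  # [0.5, 0.75]
```

The weight 0.5 gives the best accuracy and 0.75 the lowest inventory cost; neither dominates the other, so choosing between them is exactly the explicit trade-off the framework exposes to the decision-maker.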

2026-06-03T14:01:13Z Shengjie Wang Yanfei Kang Evangelos Spiliotis Fotios Petropoulos http://arxiv.org/abs/2606.04879v1 Bootstrap-based Hypothesis Test of 2D Contours using Elastic Shape Analysis 2026-06-03T13:43:38Z

Shapes of objects in images are often complex, high-dimensional, and vary in ways not captured by standard Euclidean geometry and statistics. Statistical shape analysis encompasses methods for flexible and interpretable measurement of intrinsic shape and shape variability in geometric objects. Elastic Shape Analysis (ESA) is one such method that measures shape differences between objects, represented by contours, in a way that is invariant to rotation, scale, translation, and parameterization. Although ESA is useful for quantifying the shape of objects in many image applications, formal methods for statistical inference in image-based ESA remain limited. This work introduces a hypothesis testing procedure based on empirical confidence intervals for the elastic shape distance (ESD) between a proposed underlying true shape and an estimated shape. The confidence intervals are created using a bootstrap procedure for non-smooth functionals, which accounts for the non-differentiability of the ESD. The effectiveness of the method is illustrated through both numerical studies and real-world image examples from inertial confinement fusion (ICF).

2026-06-03T13:43:38Z 35 pages, 11 figures Susan Glenn Justin Strait Kelly Moran Chris Danly Matthew P Selwood http://arxiv.org/abs/2606.04637v1 Optimal designs for incomplete stepped wedge trials 2026-06-03T09:09:00Z

Background: Stepped wedge trials are longitudinal randomised evaluations, usually cluster-randomised, in which the experimental intervention is introduced in a staggered fashion. Incomplete stepped wedge designs focus the effort of data collection on particular periods in particular sequences. Methods: We suppose there is a cost for every period in every cluster where we collect data, and that there is a fixed number of individuals, m, with data available in each period in each cluster. If we are willing to pay the cost of data collection in a cluster-period, then we collect data on all m individuals; if not, we collect no data in that cluster-period. We consider the problem of designing a trial to minimise the total number of cluster-periods of data collection needed to achieve a given precision for the treatment effect estimator, or equivalently, to maximise precision for a given number of cluster-periods of data collection. Results: We present the solution for two-period trials, which has two distinct forms, depending on the correlation between two cluster-period means from the same cluster in different periods. We also present a conjecture on the form of the solution for multi-period trials, informed by results from a greedy search of the design space. Conclusions: A real-life stepped wedge design problem will involve trading off the costs of various design elements while also respecting constraints on the scale of data collection. Nevertheless, the solutions to the problem considered here add significantly to our understanding of the optimal design of incomplete stepped wedge trials.
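The precision being traded against data-collection cost can be made concrete with a generalized least squares (GLS) calculation. The sketch below computes the variance of the treatment-effect estimator for a given pattern of collected cluster-periods, assuming fixed period effects, cluster-period means standardized to unit variance, and a single correlation rho between means from the same cluster in different periods (exchangeable across periods). These modeling choices and the example designs are assumptions for illustration; the paper's model may include further structure. A greedy search could, for instance, repeatedly drop the cluster-period whose removal increases this variance least.

```python
import numpy as np


def treatment_effect_variance(design, rho):
    """GLS variance of the treatment effect for an incomplete stepped wedge design.

    design: one list per cluster; entry j is 0 (control), 1 (intervention) or
    None (cluster-period not collected). Cluster-period means are assumed to
    have unit variance and correlation rho between periods of the same cluster.
    """
    n_periods = len(design[0])
    n_params = n_periods + 1                  # period effects + treatment effect
    info = np.zeros((n_params, n_params))
    for cluster in design:
        observed = [j for j, x in enumerate(cluster) if x is not None]
        if not observed:
            continue
        X = np.zeros((len(observed), n_params))
        for row, j in enumerate(observed):
            X[row, j] = 1.0                   # fixed effect for period j
            X[row, -1] = cluster[j]           # treatment indicator
        k = len(observed)
        V = (1 - rho) * np.eye(k) + rho * np.ones((k, k))  # exchangeable within-cluster correlation
        info += X.T @ np.linalg.solve(V, X)
    return np.linalg.inv(info)[-1, -1]


complete = [[0, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 1]]             # hypothetical 3-period design
incomplete = [[None, 1, 1], [None, 1, 1], [0, 0, 1], [0, 0, None]]  # three cluster-periods dropped
v_full = treatment_effect_variance(complete, rho=0.5)
v_inc = treatment_effect_variance(incomplete, rho=0.5)
print(v_full, v_inc)
```

Dropping cluster-periods can never reduce this variance, so the design question is which cells cost the least precision per unit of data-collection cost saved.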

2026-06-03T09:09:00Z Richard Hooper Alan Girling http://arxiv.org/abs/2606.03863v2 Assessing the Impact of Intercurrent Events on Power and Sample Size for Estimands with Time-to-Event Endpoints 2026-06-03T08:21:13Z

The precise definition of a primary estimand, accounting for intercurrent events (IEs) as per the ICH E9(R1) addendum, is fundamental to the design and interpretation of clinical trials. Conventional power and sample size calculations, however, often do not adequately incorporate the impact of IEs and their corresponding handling strategies, creating a risk of over- or under-powered studies. While simulation-based approaches can address this complexity, they are often computationally intensive and may explore only a limited set of scenarios. In this paper, we introduce a set of formulae for calculating power for estimands with time-to-event endpoints, applied to trials with fixed follow-up durations. We focus on estimands that use the treatment policy, hypothetical, or composite strategy, or a combination of these, for handling IEs, under the assumption that IEs occur independently of each other and of the primary endpoint. Validation against simulation-based estimates shows strong agreement, and we explore deviations in power estimates in scenarios where outcomes and IEs are dependent. We illustrate the practical application of our approach through a case study in nasal polyposis, examining the sensitivity of sample size requirements to varying IE rates and their impacts on post-IE outcomes. The proposed formulae facilitate rapid and accurate power and assurance calculations, enabling clinical trial designs to be more closely aligned with the estimand of interest.
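As a rough illustration of how IEs can shift power (not the paper's formulae), the sketch below assumes exponential hazards, a fixed follow-up of length T with no other censoring, 1:1 allocation, and the composite strategy, under which an IE counts as an event. The composite hazard in each arm is then the sum of the endpoint and IE hazards, so IE rates that are similar across arms dilute the hazard ratio toward 1 while increasing the number of events. Power uses Schoenfeld's approximation for the log-rank test with d expected events, Φ(sqrt(d/4) |log HR| - z_{1-α/2}); all parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def composite_power(n_total, hazard_ctrl, hr_endpoint, ie_hazard_ctrl, ie_hazard_trt,
                    follow_up, alpha=0.05):
    """Schoenfeld-approximation power for a composite (endpoint or IE) outcome.

    Assumes exponential hazards, 1:1 allocation, fixed follow-up and no other censoring.
    Returns (power, composite hazard ratio).
    """
    lam_c = hazard_ctrl + ie_hazard_ctrl               # composite hazard, control arm
    lam_t = hazard_ctrl * hr_endpoint + ie_hazard_trt  # composite hazard, treatment arm
    hr = lam_t / lam_c
    # Probability of a composite event by the end of follow-up in each arm.
    p_event_c = 1 - math.exp(-lam_c * follow_up)
    p_event_t = 1 - math.exp(-lam_t * follow_up)
    events = n_total / 2 * (p_event_c + p_event_t)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(math.sqrt(events / 4) * abs(math.log(hr)) - z_alpha), hr


no_ie = composite_power(400, 0.5, 0.7, 0.0, 0.0, follow_up=1.0)
with_ie = composite_power(400, 0.5, 0.7, 0.2, 0.2, follow_up=1.0)
print(no_ie, with_ie)
```

With these inputs, adding an IE hazard of 0.2 in both arms moves the composite hazard ratio from 0.70 to about 0.79 and lowers power from roughly 0.55 to roughly 0.37, despite the extra events.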

2026-06-02T16:37:45Z Daniel J Bratton Fiona Guillard Sunita Rehal Thomas Drury