arXiv API feed https://arxiv.org/api/vMFqEE4mE1g7ZHJ+kRzwAhaFBto, updated 2026-06-13T18:40:18Z

http://arxiv.org/abs/2511.09890v3
A Clustering Approach for Basket Trials Based on Treatment Response Trajectories
Updated: 2026-06-03T23:27:09Z

Heterogeneity in efficacy is sometimes observed across baskets in basket trials. In this study, we propose a model-free clustering framework that groups baskets based on transition probabilities derived from the trajectories of treatment response, rather than relying solely on a single efficacy endpoint such as the objective response rate. The number of clusters is not predetermined but is automatically determined in a data-driven manner based on the similarity structure among baskets. After clustering, baskets within the same cluster are analyzed using a hierarchical Bayesian model. This framework aims to improve the estimation precision of efficacy endpoints and enhance statistical power while maintaining the type I error rate at the nominal level. The performance of the proposed method was evaluated through simulation studies. The results demonstrated that the proposed method can accurately identify cluster structures in heterogeneous settings and, even under such conditions, maintain the type I error rate at the nominal level while improving statistical power.

Published: 2025-11-13T02:51:25Z
Authors: Masahiro Kojima, Keisuke Hanada, Atsuya Sato

http://arxiv.org/abs/2606.05420v1
Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers
Updated: 2026-06-03T20:38:10Z

The rapid proliferation of hyperscale data centers (HDCs) in the US, mainly driven by the adoption of artificial intelligence, has raised concerns about this industry's environmental footprint. We compiled facility-level information on 403 US hyperscale data centers operating between May 2024 and April 2025 and estimated their electricity consumption, electricity sources, and attributable CO2 emissions.
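An attributional estimate of this kind reduces, at its core, to weighting each facility's electricity use by the carbon intensity of the generation attributed to it. The following is a minimal sketch under that assumption. The `Facility` record, its field names and the two facilities are hypothetical and made up for illustration; they are not the paper's schema or data. The sketch also omits how generation is attributed to facilities from EPA eGRID.

```python
from dataclasses import dataclass


@dataclass
class Facility:
    """Hypothetical facility record; fields are illustrative, not the paper's schema."""
    name: str
    energy_twh: float           # annual electricity use attributed to the facility (TWh)
    intensity_g_per_kwh: float  # carbon intensity of its attributed generation (gCO2/kWh)


def emissions_mt(facilities):
    """Attributed CO2 in million metric tons: 1 TWh * 1 g/kWh = 1e9 g = 1e-3 Mt."""
    return sum(f.energy_twh * f.intensity_g_per_kwh for f in facilities) / 1e3


def weighted_intensity(facilities):
    """Electricity-weighted average carbon intensity in gCO2/kWh."""
    total_twh = sum(f.energy_twh for f in facilities)
    return sum(f.energy_twh * f.intensity_g_per_kwh for f in facilities) / total_twh


fleet = [Facility("A", 2.0, 400.0), Facility("B", 1.0, 700.0)]  # made-up numbers
print(emissions_mt(fleet))        # 1.5
print(weighted_intensity(fleet))  # 500.0
```

The same arithmetic gives a quick consistency check on the figures reported in this abstract. At roughly 545 gCO2/kWh, 68 TWh and 99 TWh correspond to about 37 Mt and 54 Mt of CO2, which matches the quoted emissions range.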
Across different facility-load scenarios, these HDCs consumed approximately 68-99 TWh of electricity and were associated with about 37-54 million metric tons of CO2. Under the central scenario, HDC electricity demand corresponded to approximately 1.8% of total US electricity consumption, with roughly 54% of attributed generation supplied by fossil-fuel sources. The HDC electricity-weighted average carbon intensity was approximately 545 gCO2/kWh, about 48% above the contemporaneous US national grid-average carbon intensity of 370 gCO2/kWh. Our approach provides an attributional tool for assessing the environmental footprint of hyperscale data centers using the most recent EPA eGRID plant-level data.

Published: 2026-06-03T20:38:10Z
Authors: Gianluca Guidi, Francesca Dominici, Tiziano Squartini, Callaway Sprinkle, Jonathan Gilmour, Kevin Butler, Eric Bell, Scott Delaney, Falco J. Bargagli-Stoffi

http://arxiv.org/abs/2604.26634v2
Electricity price forecasting across Norway's five bidding zones in the post-crisis era
Updated: 2026-06-03T19:38:26Z

Norway's electricity market is heavily dominated by hydropower, but the 2021-2022 energy crisis and stronger integration with Continental Europe have fundamentally altered price formation, reducing the reliability of forecasting models calibrated on historical data. Despite the critical need for updated models, a unified benchmark evaluating feature contributions across all structurally diverse Norwegian bidding zones remains lacking. Here we present a comprehensive evaluation of one-step-ahead forecasting of the Nord Pool market across all five Norwegian bidding zones. We constructed a multimodal hourly dataset spanning 2019-2025 and evaluated eight forecasting model families, including Light Gradient Boosting Machine (LightGBM), autoregressive models with exogenous variables, and advanced deep learning architectures, using a strictly causal test set.
We implemented robust rolling-origin backtesting, leave-one-group-out feature ablation, and conditional regime analysis to dissect model performance and feature utility. Our results show that LightGBM achieves the best performance in every zone, with mean absolute error ranging from 1.60 to 5.58 euros per megawatt-hour, while a ridge-regularized autoregressive model with exogenous variables remains a highly competitive linear benchmark in northern zones. Feature ablation reveals that models relying solely on lagged prices and calendar variables achieve high accuracy and often match or closely approach the performance of the full multimodal model. However, conditional regime analysis demonstrates that external features like reservoir levels and gas prices remain crucial to stratify forecast errors, which consistently increase under stressed market regimes. This highlights the practical value of model interpretability and regime awareness for decision makers facing structural changes in market dynamics.

Published: 2026-04-29T13:02:02Z
Comment: This version removes variables unavailable at prediction time to eliminate look-ahead leakage, clarifies the forecasting task definition, and updates the results and discussion accordingly. All tables and figures have been recomputed.
Authors: My Thi Diem Phan, Trung Tuyen Truong, Hoai Phuong Ha, Dat Thanh Nguyen

http://arxiv.org/abs/2606.05324v1
Optimizing Irreversible Perturbations of the Unadjusted Langevin Algorithm
Updated: 2026-06-03T18:10:12Z

Irreversible perturbations accelerate the convergence of Langevin dynamics, breaking detailed balance while preserving the invariant measure. The design of optimal irreversible perturbations has been studied in the continuous-time Gaussian setting, but extensions to non-Gaussian target distributions, and the impact of time discretization on the design of optimal perturbations, have not been well understood.
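The unadjusted Langevin algorithm (ULA) for a target π iterates x_{k+1} = x_k + h ∇log π(x_k) + √(2h) ξ_k with standard Gaussian noise ξ_k. A position-independent irreversible perturbation multiplies the drift by (I + J) for a constant antisymmetric matrix J. This leaves π invariant for the continuous-time dynamics, but not exactly for the discretized chain. The NumPy sketch below runs that iteration on a two-dimensional Gaussian target. The target, the step size h = 0.05 and this particular J are illustrative assumptions. The sketch does not compute the optimal J that this entry is about.

```python
import numpy as np


def perturbed_ula(grad_log_pi, x0, J, step, n_steps, rng):
    """ULA with drift (I + J) grad log pi, J constant and antisymmetric; no accept/reject step."""
    if not np.allclose(J, -J.T):
        raise ValueError("J must be antisymmetric")
    d = x0.shape[0]
    drift = np.eye(d) + J
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, d))
    for k in range(n_steps):
        # Euler-Maruyama step of dX = (I + J) grad log pi(X) dt + sqrt(2) dW
        x = x + step * drift @ grad_log_pi(x) + np.sqrt(2.0 * step) * rng.standard_normal(d)
        samples[k] = x
    return samples


# Illustrative target N(0, Sigma), for which grad log pi(x) = -Sigma^{-1} x.
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
J = np.array([[0.0, 1.0], [-1.0, 0.0]])  # an arbitrary antisymmetric choice, not an optimized one

rng = np.random.default_rng(0)
draws = perturbed_ula(lambda x: -Sigma_inv @ x, np.zeros(2), J, step=0.05, n_steps=50_000, rng=rng)
kept = draws[5_000:]  # discard burn-in
print(kept.mean(axis=0))  # close to [0, 0]
print(np.cov(kept.T))     # close to Sigma, up to discretization bias and Monte Carlo error
```

For this target, the acceleration is visible in the linear drift. The slowest eigenvalue of Σ⁻¹ is about 0.56, while the slowest eigenvalue of (I + J)Σ⁻¹ is about 1.31, so the mean relaxes faster. The entry's question is how to choose J so that this speed-up is not paid for with excessive discretization bias.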
Numerical discretizations of Langevin dynamics introduce bias, which is typically exacerbated by irreversible perturbations; handling this interaction demands a joint treatment of acceleration and accuracy. This paper develops a systematic framework for optimizing position-independent irreversible perturbations of the unadjusted Langevin algorithm (ULA). We formulate a constrained optimization problem that simultaneously accounts for mixing efficiency and discretization bias, where the former is characterized by a spectral gap analogue and the latter is quantified via a weighted expected squared jump distance. Within this framework, we derive an explicit characterization of the optimal position-independent irreversible perturbation. Extensive numerical experiments demonstrate that our design yields faster convergence with controlled bias, and improves mean squared estimation errors compared to other choices of irreversible perturbation.

Published: 2026-06-03T18:10:12Z
Comment: 60 pages, 30 figures, 1 algorithm, 1 table
Authors: Qianyu Julie Zhu, Youssef Marzouk, Konstantinos Spiliopoulos, Benjamin Zhang

http://arxiv.org/abs/2606.05308v1
Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
Updated: 2026-06-03T18:01:08Z

With PRECISE, we extend Prediction-Powered Inference (PPI) to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction).
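PRECISE extends the basic prediction-powered estimate of a mean. That estimate takes the LLM judge's average over the large set and adds a "rectifier": the average human-minus-judge difference on the small set labeled by both. The rectifier has expectation E[Y] − E[f], so the sum is unbiased for E[Y] whatever the judge's errors, provided the human-labeled items are a random sample. Below is a minimal sketch of that per-item mean case with made-up 0/1 relevance labels. The function name is hypothetical, and the sketch does not implement PRECISE's per-query Precision@K reduction.

```python
import math
from statistics import mean, variance


def ppi_mean(y_human, judge_on_labeled, judge_on_unlabeled, z=1.96):
    """Prediction-powered estimate of E[Y] with a normal-approximation interval.

    y_human and judge_on_labeled are aligned (same items); judge_on_unlabeled
    holds LLM-judge scores for the large set that has no human labels.
    """
    n, big_n = len(y_human), len(judge_on_unlabeled)
    # Rectifier: human label minus judge score on the doubly labeled items.
    rectifier = [float(y) - float(f) for y, f in zip(y_human, judge_on_labeled)]
    estimate = mean(judge_on_unlabeled) + mean(rectifier)
    # The two samples are independent, so their variance contributions add.
    se = math.sqrt(variance(judge_on_unlabeled) / big_n + variance(rectifier) / n)
    return estimate, se, (estimate - z * se, estimate + z * se)


human = [1, 0, 1, 1, 0, 1]              # made-up human relevance labels
judge_labeled = [1, 1, 1, 0, 0, 1]      # made-up judge scores on the same items
judge_unlabeled = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
est, se, ci = ppi_mean(human, judge_labeled, judge_unlabeled)
print(est, se)  # 0.7 and about 0.3
```

The standard error shrinks as the judge agrees more closely with the human labels, since agreement lowers the rectifier's variance. This is the mechanism behind the reported reduction in standard error.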
In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.

Published: 2026-06-03T18:01:08Z
Comment: Accepted at ACL 2026 - GEM Workshop
Authors: Abhishek Divekar

http://arxiv.org/abs/2512.20753v2
Algorithmic Bias in Lending: Evidence from a Fintech Audit
Updated: 2026-06-03T16:22:45Z

Algorithmic lending has transformed the consumer credit landscape, with machine learning models commonly facilitating underwriting decisions. To comply with fair lending laws, these algorithms exclude legally protected characteristics, such as race and gender. Yet algorithmic underwriting can still inadvertently favor certain groups, prompting concerns about whether lending algorithms exhibit discriminatory behavior. Using proprietary loan-level data from a major U.S. fintech platform, we audit lending decisions across approximately 80,000 personal loans. We find that loans made to men and Black borrowers yielded lower profits than loans to other groups, suggesting that men and Black borrowers benefited from relatively favorable pricing. We trace these disparities to miscalibration in the platform's underwriting model, which overestimates risk for women and underestimates risk for Black borrowers. We then show that one could correct this miscalibration, and the corresponding disparities, by including race and gender in underwriting models, illustrating a tension between competing notions of fairness.

Published: 2025-12-23T20:26:38Z
Authors: Madison Coots, Robert Bartlett, Julian Nyarko, Sharad Goel

http://arxiv.org/abs/2606.05258v1
Harnessing Source Heterogeneity for Cluster-Structured Transfer Learning
Updated: 2026-06-03T16:09:15Z

Transfer learning is a natural strategy when a target population has limited data but multiple related auxiliary sources are available.
A central difficulty is source heterogeneity: auxiliary sources may not be equally useful, and their usefulness may vary in a structured, cluster-like fashion. Existing transfer-learning methods often reduce source selection to a binary informative/non-informative decision, overlooking subgroups of sources with differential transferability. Motivated by a suicide-risk study using data from the Connecticut Hospital Information Management Exchange (CHIME), comprising 636,758 patients across 27 hospitals, we propose Trans-GLMC, a cluster-structured transfer-learning procedure for generalized linear models. The CHIME setting illustrates the core challenge: hospital-specific risk models are unstable because suicide attempts are rare at any single facility, whereas indiscriminate pooling across hospitals can obscure facility-level differences in patient mix and risk profiles. Trans-GLMC first constructs a coefficient-based distance among the target and candidate sources to recover latent source clusters. It then combines global fusion, within-cluster refinement, and target debiasing to produce an estimator that adapts to the detected structure. We establish a non-asymptotic error bound that improves over its unclustered counterpart whenever a meaningful target cluster exists and matches the unclustered rate up to constants otherwise. In simulations and in the CHIME study, Trans-GLMC improves facility-specific prediction, identifies interpretable communities of hospitals with mutual transferability, and recovers clinically coherent suicide-risk factors.

Published: 2026-06-03T16:09:15Z
Authors: Xiaohui Yin, Jun Jin, Shane J. Sacco, Robert H. Aseltine, Kun Chen

http://arxiv.org/abs/2605.20657v2
Cooling Channel Design Optimization for High Power Multi-Chip Packages
Updated: 2026-06-03T16:06:20Z

Thermal management is a major challenge in next-generation high-performance computing systems, particularly for heterogeneous multi-chip packages such as the NVIDIA GB200 Grace Blackwell Superchip.
In this work, a physics-based computational framework is developed to optimize embedded cooling channel layouts for high-power multi-chip modules. The model couples steady-state heat conduction with a porous-media representation of coolant transport and a row-wise coolant energy balance to estimate chip temperature fields within microchannel networks. Unlike conventional designs, an interdigitated cooling architecture is parameterized using geometric variables, including channel count, width, and expansion over chip regions, enabling systematic design exploration. To enable efficient optimization, a surrogate-based approach is employed to approximate the relationship between geometric parameters and temperature metrics. The resulting model is optimized using a mixed-integer quadratic programming algorithm to minimize a weighted objective based on peak and average chip temperatures. To improve physical relevance, channel placement is further constrained to increase cooling coverage near GPU regions, where thermal loads are highest. The framework is applied to a representative multi-chip configuration based on the NVIDIA GB200 architecture, consisting of two graphics processing units and one central processing unit. The results demonstrate that the optimal design reduces the peak chip temperature by 140.45°C and the average chip temperature by 35.87°C compared to the baseline configuration.

Published: 2026-05-20T03:21:57Z
Comment: 9 pages, 8 figures
Authors: Michael Acquah, Zheng Liu

http://arxiv.org/abs/2606.05026v1
Removal of Multivariate Environmental Influences in Structural Health Monitoring through Conditional Covariances and Supervised Learning
Updated: 2026-06-03T15:55:24Z

In structural health monitoring (SHM) systems, data is collected from a multitude of sensors measuring, for example, vibration or strain in the structure, along with additional features that capture environmental or operational information.
It is well known that changes in the measured sensor outputs do not necessarily originate from structural damage but are often induced by environmental changes. One popular approach to account for these effects is regressing the system outputs on the confounding factors, also known as "response surface modeling". Afterward, the predicted values are subtracted from the observed ones to obtain corrected data with the environmental effects (supposedly) removed. However, the evaluation of real-world SHM data shows that environmental conditions may affect not only the expected output values but also higher-order statistical moments, particularly the variances of the output quantities (such as eigenfrequencies of different modes or strain sensors at different locations) and the covariances and correlations between them. By construction, the (supervised) machine learning techniques commonly used for response surface modeling cannot account for those higher-order effects. To address these issues, we present and discuss several approaches for identifying and quantifying multivariate confounding effects on output covariances and correlations: a nonparametric, kernel-based estimator, a random forest, a semiparametric additive model, and a deep learning approach. Furthermore, we show how the resulting conditional covariance matrices can be used in an SHM pipeline. We compare the competing methods on both artificial data and real-world load test data from the Vahrendorfer Stadtweg bridge in Hamburg, Germany, as well as eigenfrequency data from the railway bridge KW51 near Leuven, Belgium.

Published: 2026-06-03T15:55:24Z
Comment: 25 pages, 8 figures
Authors: Lizzie Neumann, Philipp Wittenberg, Jan Gertheiss

http://arxiv.org/abs/2605.25934v2
Weighted NPMLE for the Marginal Mean of Recurrent Events with a Competing Terminal Event
Updated: 2026-06-03T14:37:34Z

Regression modeling of recurrent and terminal events continues to present methodological challenges in survival analysis.
Existing approaches either make unverifiable assumptions about the dependency structure between the two event types or rely on the proportional intensity assumption for the marginal mean. We propose a semiparametric regression model based on a novel weighted likelihood function that directly targets the marginal mean of the recurrent event. Our general model captures a large class of semiparametric regression models and accommodates external time-dependent covariate effects on the marginal mean intensity. We establish the consistency and asymptotic normality of the estimators and propose a sandwich estimator of the variance. We propose a novel simulation procedure that directly targets the marginal mean intensity of the recurrent events. In simulation studies, we demonstrate strong performance of the weighted NPMLE under independent right-censoring. The practical utility of the proposed methodology is demonstrated through application to data from the STATCOPE trial, a large randomized clinical trial that investigated the efficacy of simvastatin for COPD exacerbations. We provide personalized predictions for the number of exacerbations and reassess the effect of simvastatin treatment, accounting for death as a competing terminal event for patients with GOLD stage 4.

Published: 2026-05-25T15:13:47Z
Authors: Anna Bellach, Michael R. Kosorok

http://arxiv.org/abs/2403.00965v2
Binary Gaussian Copula Synthesis: an LLM-powered data augmentation framework for early dialysis prediction in chronic kidney disease
Updated: 2026-06-03T14:24:58Z

Only a small fraction of patients with chronic kidney disease (CKD) progress to dialysis, creating severe class imbalance that limits the performance of machine learning models for early dialysis prediction. This challenge is compounded by the binary structure of electronic health record (EHR) data, for which most existing augmentation methods were not designed.
We propose Binary Gaussian Copula Synthesis (BGCS), a two-stage data augmentation method tailored to binary clinical data. BGCS first generates synthetic minority-class samples using a Gaussian copula framework that explicitly models pairwise dependencies among binary features, then applies a fine-tuned GPT-2 classifier to filter out clinically implausible samples before training. We evaluated BGCS on a real-world EHR dataset of 15,169 patients with CKD from West Virginia collected between 2008 and 2022, benchmarking it against SMOTE, CTGAN, and standard Gaussian Copula across four machine learning classifiers over 25 independent runs. BGCS consistently outperformed all comparison methods, achieving the highest minority-class recall for 90-day dialysis prediction, with median values ranging from 0.78 to 0.87 across classifiers, and the strongest distributional fidelity to real data, with a mean p-value of 0.68 across features. The best-performing BGCS-augmented model was integrated into an interpretable decision tree-based clinical decision support system for dialysis risk stratification, with electrolyte imbalances, cardiovascular comorbidities, and renal monitoring indicators emerging as the most influential predictive features. These findings suggest that augmentation methods designed for the structural properties of binary EHR data can meaningfully improve early dialysis risk prediction and support the development of interpretable clinical decision-support tools for CKD care.

Published: 2024-03-01T20:32:17Z
Authors: Hamed Khosravi, Milad Khanchi, Mobina Noori, Srinjoy Das, Abdullah Al-Mamun, Imtiaz Ahmed

http://arxiv.org/abs/2606.04900v1
Multi-objective probabilistic forecast combination for inventory demand
Updated: 2026-06-03T14:01:13Z

Probabilistic forecasts are essential for inventory management, where decisions depend on the full distribution of future demand.
While probabilistic forecast combination is widely used to improve statistical accuracy, most existing approaches optimize statistical loss alone and overlook operational objectives. However, in inventory settings, higher forecast accuracy does not necessarily translate into better decision performance, especially under nonlinear cost structures and multiple, potentially conflicting, decision targets. To address this gap, we propose a multi-objective probabilistic forecast combination framework that simultaneously considers forecast accuracy and inventory decision performance. The framework formulates forecast combination as a multi-objective optimization problem and derives a set of Pareto-optimal combinations, enabling explicit trade-offs between forecasting and operational goals. Empirical studies using Walmart retail data and Royal Air Force spare parts data demonstrate that the proposed approach achieves more balanced and robust performance than individual models, simple averaging, and single-objective optimization. Our results provide a practical and flexible framework for aligning probabilistic forecasting with inventory decision-making.

Published: 2026-06-03T14:01:13Z
Authors: Shengjie Wang, Yanfei Kang, Evangelos Spiliotis, Fotios Petropoulos

http://arxiv.org/abs/2606.04879v1
Bootstrap-based Hypothesis Test of 2D Contours using Elastic Shape Analysis
Updated: 2026-06-03T13:43:38Z

Shapes of objects in images are often complex, high-dimensional, and vary in ways not captured by standard Euclidean geometry and statistics. Statistical shape analysis encompasses methods for flexible and interpretable measurement of intrinsic shape and shape variability in geometric objects. Elastic Shape Analysis (ESA) is one such method that measures shape differences between objects, represented by contours, in a way that is invariant to rotation, scale, translation, and parameterization.
Although ESA is useful for quantifying the shape of objects in many image applications, formal methods for statistical inference in image-based ESA remain limited. This work introduces a hypothesis test procedure based on empirical confidence intervals for the elastic shape distance (ESD) between a proposed underlying true shape and an estimated shape. The confidence intervals are created using a bootstrap procedure for non-smooth functionals, which accounts for the non-differentiability of the ESD. The effectiveness of the method is illustrated through both numerical studies and real-world image examples from inertial confinement fusion (ICF).

Published: 2026-06-03T13:43:38Z
Comment: 35 pages, 11 figures
Authors: Susan Glenn, Justin Strait, Kelly Moran, Chris Danly, Matthew P Selwood

http://arxiv.org/abs/2606.04637v1
Optimal designs for incomplete stepped wedge trials
Updated: 2026-06-03T09:09:00Z

Background: Stepped wedge trials are longitudinal randomised evaluations, usually cluster-randomised, in which the experimental intervention is introduced in a staggered fashion. Incomplete stepped wedge designs focus the effort of data collection on particular periods in particular sequences. Methods: We suppose there is a cost for every period in every cluster where we collect data, and that there is a fixed number of individuals, m, with data available in each period in each cluster. If we are willing to pay the cost of data collection in that cluster-period then we collect the data on all m individuals, and if we are not willing to pay the cost then we collect no data in that cluster-period. We consider the problem of designing a trial to minimise the total number of cluster-periods of data collection needed to achieve a given precision for the treatment effect estimator, or equivalently, to maximise precision for a given number of cluster-periods of data collection.
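A generalized least squares (GLS) calculation can make the precision criterion concrete. The sketch below scores a candidate incomplete design by the variance of its treatment effect estimator. It assumes a model for cluster-period means with one fixed effect per period and a constant treatment effect. It also assumes the same correlation ρ between any two cluster-period means from the same cluster, and independence across clusters. Variances are expressed in units of the variance of a single cluster-period mean, which would depend on m. The design encoding and function name are hypothetical. This evaluates a given design; it is not the authors' optimal solution.

```python
import numpy as np


def treatment_effect_variance(design, rho, n_periods):
    """GLS variance of the treatment effect, in units of Var(cluster-period mean).

    design: list of (n_clusters, cells) pairs, one per sequence; cells[j] is None
    when period j is not collected, else 0 (control) or 1 (intervention).
    Assumed model: mean = period effect + treatment effect * indicator, with
    exchangeable correlation rho between a cluster's cluster-period means.
    """
    p = n_periods + 1  # one fixed effect per period, plus the treatment effect
    info = np.zeros((p, p))
    for n_clusters, cells in design:
        observed = [j for j, c in enumerate(cells) if c is not None]
        n_obs = len(observed)
        if n_obs == 0:
            continue  # a sequence with no collected periods adds no information
        X = np.zeros((n_obs, p))
        for row, j in enumerate(observed):
            X[row, j] = 1.0        # period fixed effect
            X[row, -1] = cells[j]  # treatment indicator
        V = (1 - rho) * np.eye(n_obs) + rho * np.ones((n_obs, n_obs))
        info += n_clusters * X.T @ np.linalg.solve(V, X)
    return np.linalg.inv(info)[-1, -1]


# Two periods: 5 clusters switch to the intervention, 5 stay in control.
complete = [(5, [0, 1]), (5, [0, 0])]
# Incomplete variant: skip first-period data collection in the control sequence.
incomplete = [(5, [0, 1]), (5, [None, 0])]

print(treatment_effect_variance(complete, rho=0.5, n_periods=2))    # about 0.3 = 2 * (1 - 0.5**2) / 5
print(treatment_effect_variance(incomplete, rho=0.5, n_periods=2))  # about 0.4 = 2 / 5
```

For the complete two-period design, the result matches the familiar baseline-adjusted value 2(1 − ρ²)/n with n clusters per sequence. Dropping the control sequence's first period removes that adjustment and gives 2/n. A design search, such as the greedy search mentioned in this abstract, could compare candidate designs using a function like this one.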
Results: We present the solution for two-period trials, which has two distinct forms, depending on the correlation between two cluster-period means from the same cluster in different periods. We also present a conjecture on the form of the solution for multi-period trials, informed by results from a greedy search of the design space. Conclusions: A real-life stepped wedge design problem will involve trading off the costs of various design elements subject also to constraints on the scale of data collection. Nevertheless, the solutions to the problem considered here add significantly to our understanding of the optimal design of incomplete stepped wedge trials.

Published: 2026-06-03T09:09:00Z
Authors: Richard Hooper, Alan Girling

http://arxiv.org/abs/2606.03863v2
Assessing the Impact of Intercurrent Events on Power and Sample Size for Estimands with Time-to-Event Endpoints
Updated: 2026-06-03T08:21:13Z

The precise definition of a primary estimand, accounting for intercurrent events (IEs) as per the ICH E9(R1) addendum, is fundamental to the design and interpretation of clinical trials. Conventional power and sample size calculations, however, often do not adequately incorporate the impact of IEs and their corresponding handling strategies, creating a risk of over- or under-powered studies. While simulation-based approaches can address this complexity, they are often computationally intensive and may only explore a limited set of scenarios. In this paper, we introduce a set of formulae for calculating power for estimands with time-to-event endpoints, applied to trials with fixed follow-up durations. We focus on estimands that use treatment policy, hypothetical, composite, or a combination of strategies for handling IEs, under the assumption that IEs occur independently of each other and the primary endpoint. Validation against simulation-based estimates shows strong agreement, and we explore deviations in power estimates in scenarios where outcomes and IEs are dependent.
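The sketch below illustrates how an IE strategy feeds into a power calculation. It uses the textbook Schoenfeld approximation for a two-sided log-rank test, not the paper's formulae. It assumes exponential times to the outcome and to a single IE, independent of each other, with a fixed follow-up. Under the composite strategy the IE counts as an event, so each arm's hazard becomes λ + μ. Under the hypothetical strategy follow-up is censored at the IE, so the hazard ratio stays λ₁/λ₀ but fewer events are observed. The treatment-policy strategy is left out because post-IE changes in the outcome hazard would break proportional hazards. All function names and inputs are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def counted_event_prob(outcome_hazard, ie_hazard, follow_up, strategy):
    """P(an event counted under the strategy occurs within fixed follow-up).

    Assumes exponential, mutually independent times to the outcome and to the IE.
    """
    total = outcome_hazard + ie_hazard
    p_first_event = 1.0 - math.exp(-total * follow_up)  # outcome or IE, whichever comes first
    if strategy == "composite":      # the IE itself counts as an event
        return p_first_event
    if strategy == "hypothetical":   # follow-up is censored at the IE
        return outcome_hazard / total * p_first_event
    raise ValueError(f"unsupported strategy: {strategy}")


def schoenfeld_power(n_total, hazards, ie_hazards, follow_up, strategy, alloc=0.5, alpha=0.05):
    """Approximate two-sided log-rank power; hazards are (control, treatment) pairs."""
    (lam0, lam1), (mu0, mu1) = hazards, ie_hazards
    hr = (lam1 + mu1) / (lam0 + mu0) if strategy == "composite" else lam1 / lam0
    events = n_total * (alloc * counted_event_prob(lam1, mu1, follow_up, strategy)
                        + (1 - alloc) * counted_event_prob(lam0, mu0, follow_up, strategy))
    z = math.sqrt(events * alloc * (1 - alloc)) * abs(math.log(hr)) - STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(z)


base = dict(n_total=400, hazards=(0.5, 0.35), ie_hazards=(0.1, 0.1), follow_up=1.0)
print(schoenfeld_power(**base, strategy="composite"))
print(schoenfeld_power(**base, strategy="hypothetical"))
```

With these illustrative inputs, the composite strategy has lower power, roughly 0.45 against about 0.53 for the hypothetical strategy. Equal IE rates in both arms dilute the composite hazard ratio from 0.70 to 0.75, and this outweighs the extra events. When the IE hazards are zero, the two strategies coincide.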
We illustrate the practical application of our approach through a case study in nasal polyposis, examining the sensitivity of sample size requirements to varying IE rates and their impacts on post-IE outcomes. The proposed formulae facilitate rapid and accurate power and assurance calculations, enabling clinical trial designs to be more closely aligned with the estimand of interest.

Published: 2026-06-02T16:37:45Z
Authors: Daniel J Bratton, Fiona Guillard, Sunita Rehal, Thomas Drury