Multi-output Extreme Spatial Model for Complex Aircraft Production Systems

2026-04-24T13:38:08Z

Problem definition: Data-driven models in machine learning have enabled efficient management of production systems. However, a majority of machine learning models are devoted to modeling the mean response or average pattern, which is inappropriate for studying abnormal extreme events that are often of primary interest in aircraft manufacturing. Since extreme events from heavy-tailed distributions give rise to prohibitive expenditures in system management, sophisticated extreme models are urgently needed to analyze complex extreme risks. Engineering applications of extreme models usually focus on individual extreme events, which is insufficient for complex systems with correlations. Methodology/results: We introduce an extreme spatial model for multi-output response control systems that efficiently captures the dynamics using a bilinear function on two spatial domains for control variables and measurement locations. Marginal parameter modeling and extremal dependence have been investigated. In addition, an efficient graph-assisted composite likelihood estimation and corresponding computational algorithms are developed to cope with high-dimensional outputs. The application to composite aircraft production shows that the proposed model enables comprehensive analyses with superior predictive performance on extreme events compared to canonical methods. Managerial implications: Our method shows how to use an extreme spatial model for predicting extreme events and managing extreme risks in complex production systems such as aircraft. This can help achieve better quality management and operation safety in aircraft production systems and beyond.

Online Distributional Regression

2026-04-24T13:10:33Z

Large-scale streaming data are common in modern machine learning applications and have led to the development of online learning algorithms. Many fields, such as supply chain management, weather and meteorology, energy markets, and finance, have pivoted toward probabilistic forecasting. This results in the need not only for accurate learning of the expected value but also for learning the conditional heteroskedasticity and conditional moments. Against this backdrop, we present a methodology for online estimation of regularized, linear distributional models. The proposed algorithm combines recent developments in online estimation of LASSO models with the well-known GAMLSS framework. We provide a case study on day-ahead electricity price forecasting, in which we show the competitive performance of the incremental estimation combined with strongly reduced computational effort. Our algorithms are implemented in a computationally efficient Python package ondil.
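
The incremental-estimation idea can be illustrated with a minimal sketch of an online LASSO for the mean only. It assumes a recursively updated Gram matrix and a few warm-started coordinate-descent sweeps per observation. The class `OnlineLasso` and its parameters are hypothetical, this is not the `ondil` API, and the distributional (GAMLSS) part is left out.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by LASSO coordinate descent."""
    return np.sign(z) * max(abs(z) - t, 0.0)


class OnlineLasso:
    """Hypothetical online LASSO: minimises (1/2n)||y - Xb||^2 + lam * ||b||_1."""

    def __init__(self, n_features, lam, sweeps=2):
        self.gram = np.zeros((n_features, n_features))  # running X^T X
        self.xty = np.zeros(n_features)                 # running X^T y
        self.beta = np.zeros(n_features)                # warm-started coefficients
        self.n = 0
        self.lam = lam
        self.sweeps = sweeps

    def update(self, x, y):
        """Absorb one observation, then refine beta with a few sweeps."""
        x = np.asarray(x, dtype=float)
        self.gram += np.outer(x, x)
        self.xty += x * y
        self.n += 1
        for _ in range(self.sweeps):
            for j in range(len(self.beta)):
                if self.gram[j, j] <= 0.0:
                    continue
                # Correlation of feature j with the residual that excludes feature j.
                rho = self.xty[j] - self.gram[j] @ self.beta + self.gram[j, j] * self.beta[j]
                self.beta[j] = soft_threshold(rho, self.n * self.lam) / self.gram[j, j]
        return self.beta


# Simulated stream (assumed data, for illustration only).
rng = np.random.default_rng(0)
true_beta = np.array([2.0, -1.0, 0.0])
model = OnlineLasso(n_features=3, lam=0.05)
for _ in range(200):
    x = rng.normal(size=3)
    y = x @ true_beta + 0.1 * rng.normal()
    model.update(x, y)
```

Each update costs a rank-one Gram update plus a fixed number of sweeps, so the full data set is never refitted from scratch.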

An 'Inverse' Experimental Framework to Estimate Market Efficiency

2026-04-24T12:54:07Z

Digital marketplaces processing billions of dollars annually represent critical infrastructure in sociotechnical ecosystems, yet their performance optimization lacks principled measurement frameworks that can inform algorithmic governance decisions regarding market efficiency and fairness from complex market data. Because bids and asks do not represent true maximum willingness to buy and true minimum willingness to sell, an economist looking at orderbook data from double auction markets alone can say little about the market's actual performance in terms of allocative efficiency. We turn to experimental data to address this issue, 'inverting' the standard induced value approach of double auction experiments. Our aim is to predict key market features relevant to market efficiency, particularly allocative efficiency, as early as possible using orderbook data only (specifically bids, asks and price realizations, but not the induced reservation values). Since there is no established model of strategically optimal behavior in these markets, and because orderbook data is highly unstructured, non-stationary and non-linear, we propose quantile-based normalization techniques that help us build general predictive models. We develop and train several models, including linear regressions and gradient boosting trees, leveraging quantile-based input from the underlying supply-demand model. Our models can predict allocative efficiency with reasonable accuracy from the earliest bids and asks, and these predictions improve with additional realized price data. The performance of the prediction techniques varies by target and market type. Our framework holds significant potential for application to real-world market data, offering valuable insights into market efficiency and performance, even prior to any trade realizations.

Tail-Greedy Unbalanced Haar Wavelet Segmentation for Copy Number Alteration Data

2026-04-24T08:54:40Z

Detecting copy number alterations (CNAs) from next-generation sequencing data remains challenging, particularly for short segments under noisy conditions. Existing segmentation methods often suffer from high false positive rates or fail to reliably detect short aberrations, especially in low-coverage data. In this study, we propose a modified tail-greedy unbalanced Haar (TGUHm) method that introduces a dual-thresholding strategy to improve segmentation accuracy. The proposed approach effectively suppresses spurious spikes while preserving sensitivity to both short and long CNA segments. Extensive simulation studies under Gaussian and heavy-tailed noise demonstrate that TGUHm consistently achieves higher true positive rates and lower false positive rates compared to state-of-the-art methods, including CBS, HaarSeg, and FDRSeg. In particular, the proposed method improves detection accuracy for short segments while maintaining competitive overall performance. Application to real cancer genomic data further confirms the practical utility of the method, revealing biologically meaningful CNAs associated with known cancer-related genes. These results suggest that TGUHm provides a robust and effective framework for CNA detection in challenging sequencing settings.
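
As a rough illustration of the building block behind unbalanced Haar segmentation, the sketch below computes the detail coefficient for two adjacent segments of possibly unequal length. It uses the standard unbalanced Haar weights. The function name is hypothetical, and the tail-greedy merging order and the dual-thresholding rule of TGUHm are not reproduced because the abstract does not specify them.

```python
import math


def unbalanced_haar_detail(left, right):
    """Detail coefficient of the unbalanced Haar vector spanning two adjacent segments.

    The weights make the basis vector unit-norm and orthogonal to constants,
    so a flat signal yields a coefficient of zero.
    """
    n_left, n_right = len(left), len(right)
    n = n_left + n_right
    w_left = math.sqrt(n_right / (n * n_left))
    w_right = math.sqrt(n_left / (n * n_right))
    return w_left * sum(left) - w_right * sum(right)


flat = unbalanced_haar_detail([1.0, 1.0, 1.0], [1.0, 1.0])
jump = unbalanced_haar_detail([0.0, 0.0], [2.0, 2.0])
```

A small absolute coefficient suggests the two segments share a mean and can be merged; a large one suggests a change point between them.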

Finite element model updating of building structures under seismic excitation: A parallelized latent space-based Bayesian framework

2026-04-24T07:39:07Z

Enhancing seismic fragility and risk assessment of nuclear power plants relies on accurate prediction of reactor building responses to seismic hazards, which can be further improved through dynamic analysis of high-fidelity finite element (FE) models. However, FE models often exhibit non-negligible discrepancies from actual structures due to various sources of uncertainty, necessitating FE model updating with rigorous quantification of associated uncertainties. This paper presents a GPU-accelerated latent space-based Bayesian framework for FE model updating of building structures. In the proposed framework, high-dimensional structural response data (e.g., time histories or frequency response functions) are projected into a low-dimensional latent space using a multimodal variational autoencoder (MVAE), thereby enabling efficient and tractable likelihood evaluation without explicit modeling in the original observation space. Once trained, the surrogate enables amortized inference, allowing posterior sampling to be performed without additional simulator evaluations. We specifically employ a sequential Monte Carlo (SMC) sampler, whose population-based formulation allows parallel evaluation of the approximate likelihood on GPUs, resulting in computational efficiency and robustness against multimodal and complex posterior distributions. The proposed framework is validated through both numerical benchmarking and experimental data from a shaking table test of a reinforced concrete building structure. The results demonstrate that the method accurately estimates structural parameters with well-quantified uncertainties, while achieving fast and efficient inference through GPU-based parallelization, and enabling robust inference even in the presence of sparse observations that induce multimodal and highly complex posterior distributions.

Hierarchical Bayesian model updating using Dirichlet process mixtures for structural damage localization

2026-04-24T07:29:47Z

Bayesian model updating provides a rigorous probabilistic framework for calibrating finite element (FE) models with quantified uncertainties, thereby enhancing damage assessment, response prediction, and performance evaluation of engineering structures. Recent advances in hierarchical Bayesian model updating (HBMU) enable robust parameter estimation under ill-posed/ill-conditioned settings and in the presence of inherent variability in structural parameters due to environmental and operational conditions. However, most HBMU approaches overlook multimodality in structural parameters that often arises when a structure experiences multiple damage states over its service life. This paper presents an HBMU framework that employs a Dirichlet process (DP) mixture prior on structural parameters (DP-HBMU). DP mixtures are nonparametric Bayesian models that perform clustering without pre-specifying the number of clusters, incorporating damage state classification into FE model updating. We formulate the DP-HBMU framework and devise a Metropolis-within-Gibbs sampler that draws samples from the posterior by embedding Metropolis updates for intractable conditionals due to the FE simulator. The applicability of DP-HBMU to damage localization is demonstrated through both numerical and experimental examples. We consider moment-resisting frame structures with beam-end fractures and apply the method to datasets spanning multiple damage states, from an intact state to moderate or severe damage state. The clusters inferred by DP-HBMU align closely with the assumed or observed damage states. The posterior distributions of stiffness parameters agree with ground truth values or observed fractures while exhibiting substantially reduced uncertainty relative to a non-hierarchical baseline. These results demonstrate the effectiveness of the proposed method in damage localization.
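
The property that a DP mixture clusters without a pre-specified number of clusters can be illustrated by drawing a partition from the Chinese restaurant process, the partition prior implied by a Dirichlet process. This is a sketch of the prior only. The function name and the concentration value are assumptions, and the paper's Metropolis-within-Gibbs sampler and FE simulator are not represented.

```python
import random


def sample_crp(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    Item i joins existing cluster k with probability count_k / (i + alpha)
    and opens a new cluster with probability alpha / (i + alpha).
    """
    rng = random.Random(seed)
    assignments = []
    counts = []
    for _ in range(n_items):
        weights = counts + [alpha]  # last entry: weight of opening a new cluster
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


labels, sizes = sample_crp(n_items=50, alpha=1.0, seed=42)
```

The number of clusters is an outcome of the draw rather than an input, which is what allows the number of damage states to be inferred from data.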

Cross-Domain Offshore Wind Power Forecasting: Transfer Learning Through Meteorological Clusters

2026-04-24T07:12:56Z

Ambitious decarbonisation targets are rapidly increasing the commissioning of new offshore wind farms. For these newly commissioned plants to run, accurate power forecasts are needed from the outset. These forecasts support grid stability, good reserve management and efficient energy trading. Although machine learning models perform strongly, they tend to require large volumes of site-specific data that new farms do not yet have. To overcome this data scarcity, we propose a novel transfer learning framework that clusters power output according to covariate meteorological features. Rather than training a single, general-purpose model, we thus forecast with an ensemble of expert models, each trained on a cluster. As these pre-trained models each specialise in a distinct weather pattern, they adapt efficiently to new sites and capture transferable, climate-dependent dynamics. Our contributions are two-fold: we propose this novel framework and comprehensively evaluate it on eight offshore wind farms, achieving accurate cross-domain forecasting with under five months of site-specific data. Our experiments achieve an MAE of 3.52%, providing empirical verification that reliable forecasts do not require a full annual cycle. Beyond power forecasting, this climate-aware transfer learning method opens new opportunities for offshore wind applications such as early-stage wind resource assessment, where reducing data requirements can significantly accelerate project development whilst effectively mitigating its inherent risks.

From specific-source feature-based to common-source score-based likelihood-ratio systems: ranking the stars

2026-04-24T07:04:26Z

This paper studies expected performance and practical feasibility of the most commonly used classes of source-level likelihood-ratio (LR) systems when applied to a trace-reference comparison problem. The paper compares performance of these classes of LR systems (used to update prior odds) to each other and to the use of prior odds only, using strictly proper scoring rules as performance measures. It also explores practical feasibility of the classes of LR systems. The present analysis allows for a ranking of these classes of LR systems: from specific-source feature-based to common-source anchored or non-anchored score-based. A trade-off between performance and practical feasibility is observed, meaning that the best performing class of LR systems is the hardest to realise in practice, while the least performing class is the easiest to realise in practice. The other classes of LR systems are in between the two extremes. The one positive exception is a common-source feature-based LR system, with good performance and relatively low experimental demands. The paper also argues against the claim that some classes of LR systems should not be used, by showing that all systems have merit (when updating prior odds) over just using the prior odds (i.e. not using the LR system).
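
The updating step the abstract refers to (prior odds multiplied by an LR) and the scoring of the result can be sketched as follows. The logarithmic score is used here only as one example of a strictly proper scoring rule; the abstract does not say which rules the paper uses, and the function names and numbers are hypothetical.

```python
import math


def posterior_probability(prior_probability, likelihood_ratio):
    """Update a prior same-source probability with an LR via the odds form of Bayes' rule."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


def log_score(probability_same_source, same_source):
    """Logarithmic score (lower is better) of a stated same-source probability."""
    p = probability_same_source if same_source else 1.0 - probability_same_source
    return -math.log(p)


prior = 0.5
updated = posterior_probability(prior, likelihood_ratio=9.0)
```

With an LR of 9 and even prior odds, the updated probability is 0.9: the score improves over the prior when the trace and reference do share a source and worsens when they do not, which is why performance is assessed in expectation over both cases.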

Modeling Physical Activity Change as Smooth Transformations: Temporal and Amplitude Patterns Associated with Physical Function in Older Women

2026-04-23T23:57:53Z

Background: Minute-level accelerometer data capture rich diurnal physical activity (PA) patterns, but conventional summary metrics obscure clinically meaningful changes accumulated across a day. Building on a Riemannian framework, we integrate multivariate functional principal component analysis (MFPCA) to identify main modes of PA change in older women and examine associations with physical function (PF). Method: A subset of participants with OPACH as baseline and two WHISH follow-ups (W1, W2) yielded three accelerometer measurements; each participant's diurnal PA at each visit was represented as a smooth curve. Change between consecutive visits (defined as periods: baseline-W1, W1-W2) was modeled as a Riemannian deformation (RD) jointly capturing changes in PA timing and magnitude. Deformations were parameterized by initial momenta and summarized using MFPCA; participant-level changes were characterized by principal component (PC) scores and deformation energy (DE), a metric of overall pattern change. Associations with PF were assessed using linear mixed models. Results: Mean deformation in both periods showed overall downward shifts in PA magnitude with temporal redistribution between 10am and 7pm. The top 15 PCs explained >= 90% of variability in both periods; PC1 represented a pattern of PA increase/decrease throughout the day, explaining 22.4% (baseline-W1) and 20.8% (W1-W2). Among participants with complete data (N=1157), an increase in PA in the mode of PC1 was positively associated with PF (p < 0.0001). The interaction between DE and period was significantly associated with PF (p = 0.003). Conclusions: Modeling longitudinal PA change as RDs and summarizing variability via MFPCA produced clinically interpretable phenotypes of diurnal PA change beyond standard metrics. The leading deformation mode was significantly associated with PF, and DE showed a stronger association with PF in the later period.

Zero-inflated modeling with smoothing on counting tensors

2026-04-23T21:40:43Z

We propose a unified probabilistic framework for sparse count tensors with excess zeros, motivated by single-cell Hi-C data. The observed data are naturally represented as a three-way tensor indexed by genomic loci pairs and cells, exhibiting pronounced sparsity, zero inflation, and cell-to-cell heterogeneity. We introduce a zero-inflated Poisson tensor model that integrates low-rank CP structure, cluster-specific latent embeddings, and smoothness along ordered genomic loci, thereby jointly capturing multiway dependence, heterogeneity, and structured variation. We develop a Bayes-optimal procedure for distinguishing structural from technical zeros, enabling principled inference and uncertainty quantification. We establish identifiability of the model parameters and derive consistency rates for the proposed estimators in a high-dimensional regime. Simulation studies and analyses of single-cell Hi-C data demonstrate improved performance in zero detection, latent structure recovery, and downstream tasks such as clustering and 3D chromatin organization inference. The proposed framework provides a flexible approach for multiway count data with excess zeros and structured dependencies, and suggests several directions for future work, including mixture-based modeling of cell populations and scalable computation for large-scale applications.
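
For a plain zero-inflated Poisson model, the Bayes rule for labelling an observed zero can be sketched as below. This assumes a scalar inflation probability `pi` and Poisson mean `lam`; in the tensor model these would vary by entry. Which mixture component corresponds to "structural" and which to "technical" zeros depends on the paper's formulation, which the abstract does not spell out, so the function names are hypothetical and neutral on that point.

```python
import math


def inflation_zero_posterior(pi, lam):
    """P(zero came from the inflation component | observed count is 0) under a ZIP model.

    pi  : prior probability of the zero-inflation component
    lam : mean of the Poisson count component
    """
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    if lam < 0.0:
        raise ValueError("lam must be non-negative")
    if pi == 0.0:
        return 0.0
    poisson_zero = (1.0 - pi) * math.exp(-lam)  # probability of a zero from the count component
    return pi / (pi + poisson_zero)


def from_inflation_component(pi, lam, threshold=0.5):
    """Classification that minimises 0-1 loss when threshold is 0.5."""
    return inflation_zero_posterior(pi, lam) > threshold
```

A zero at an entry with a large Poisson mean is very unlikely under the count component, so the posterior attributes it almost entirely to the inflation component.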

Hierarchical Probabilistic Principal Component Analysis of Longitudinal Data

2026-04-23T19:08:12Z

In many longitudinal studies, a large number of variables are measured repeatedly over time, with substantial missing data. Existing methods, such as probabilistic principal component analysis (PPCA), are ill-equipped to handle such incomplete, high-dimensional longitudinal data, as they fail to account for the nested sources of variation and temporal dependency inherent in repeated measures. We introduce hierarchical probabilistic principal component analysis (HPPCA), a two-level probabilistic factor model that explicitly separates between-subject variance from time-varying within-subject dynamics. The within-subject latent factors are modeled by a Gaussian process. We develop an EM algorithm to handle missing data and flexible covariance kernels, accelerated by computationally efficient initializers. Simulation studies demonstrated that HPPCA robustly recovers model parameter subspaces and substantially outperforms both standard PPCA and multivariate functional PCA in imputation accuracy, even under heavy missingness and model misspecification. An application to the long COVID symptoms in the Researching COVID to Enhance Recovery adult cohort revealed that HPPCA effectively captured the data's hierarchical structure and its learned features significantly improved the prediction of clinical outcomes and the recovery of masked clinical records compared to existing methods.

Contrast-Space Projection for Network Meta-Analysis: An Exact and Invariant Study-Based Decomposition of Direct and Indirect Contributions

2026-04-23T18:23:32Z

Network meta-analysis (NMA) combines direct and indirect comparisons across a connected treatment network to estimate relative treatment effects. However, there is a lack of exact contribution decompositions that reproduce NMA estimates, particularly in the presence of multi-arm trials that induce within-study correlations. We address this reproducibility gap by developing a contrast-space projection formulation of NMA. Working in the space of all estimable pairwise treatment contrasts, we express the NMA estimator as an explicit linear mapping of the observed contrasts onto the consistency-constrained contrast space induced by orthogonal projection. Building on this representation, we introduce a rigorous study-based definition of direct and indirect evidence through a canonical within-study reduction that removes algebraic redundancy and yields a unique, invariant decomposition. This leads to exact covariance-aware decompositions of the NMA estimator into study-level direct and indirect contributions, with indirect evidence further resolved into path-level components. The resulting weights are directly analogous to inverse-variance weights in pairwise meta-analysis and enable, to our knowledge, the first forest-plot representation that exactly reconstructs the NMA estimator. The framework also yields projection-based diagnostic and graphical tools, including forest plots, tension plots, and path-based visualizations. Applications to empirical datasets demonstrate how the proposed approach provides a reproducible and interpretable framework for understanding evidence contributions in network meta-analysis, supporting transparent interpretation and reporting.
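
The idea of writing the NMA estimator as a linear projection of observed contrasts onto the consistency-constrained space can be illustrated on a toy three-treatment network. The sketch assumes one two-arm study per contrast with independent errors, so the weight matrix is diagonal; multi-arm trials would need a full covariance matrix. All numbers are invented, and the paper's canonical within-study reduction and path-level decomposition are not reproduced.

```python
import numpy as np

# Rows: contrasts AB, AC, BC. Columns: basic parameters d_AB, d_AC.
# Consistency means d_BC = d_AC - d_AB, hence the last row.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 1.0]])

y = np.array([0.5, 1.2, 0.4])             # observed contrasts (hypothetical)
variances = np.array([0.04, 0.09, 0.05])  # their variances (hypothetical)
W = np.diag(1.0 / variances)              # inverse-variance weights

# Projection onto the consistency space, orthogonal in the W inner product.
XtW = X.T @ W
H = X @ np.linalg.solve(XtW @ X, XtW)

fitted = H @ y  # consistent network estimates of AB, AC, BC
```

Each row of `H` gives the linear coefficients with which the observed contrasts enter one network estimate, so every estimate can be reconstructed exactly from the inputs.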

Compositional regression using principal nested spheres

2026-04-23T16:47:18Z

Regression with compositional responses is challenging due to the nonlinear geometry of the simplex and the limitations of Euclidean methods. We propose a regression framework for manifold-valued data based on mappings to statistically tractable intermediate spaces. For compositional data, responses are embedded in the positive orthant of the sphere and analysed using Principal Nested Spheres (PNS), yielding a cylindrical intermediate space with a circular leading score and Euclidean higher-order scores. Regression is performed in this intermediate space and fitted values are mapped back to the simplex. A simulation study demonstrates good performance of PNS-based regression. An application to environmental chemical exposure data illustrates the interpretability and practical utility of the method.
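
The embedding step can be sketched with the square-root transform, a standard way to send a composition to the positive orthant of the unit sphere. Whether the paper uses exactly this map is an assumption, and the PNS fitting and regression steps are not shown. The function names are hypothetical.

```python
import numpy as np


def simplex_to_sphere(compositions):
    """Square-root map: rows summing to 1 become unit-norm vectors with non-negative entries."""
    return np.sqrt(np.asarray(compositions, dtype=float))


def sphere_to_simplex(points):
    """Inverse map: square the coordinates, then renormalise to guard against rounding."""
    squared = np.asarray(points, dtype=float) ** 2
    return squared / squared.sum(axis=-1, keepdims=True)


compositions = np.array([[0.2, 0.3, 0.5],
                         [0.6, 0.1, 0.3]])
on_sphere = simplex_to_sphere(compositions)
norms = np.linalg.norm(on_sphere, axis=1)
recovered = sphere_to_simplex(on_sphere)
```

Because the squared coordinates sum to one, every embedded point has unit norm, and squaring maps fitted values back to the simplex.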

Bayesian Sparsity Modeling of Shared Neural Response in Functional Magnetic Resonance Imaging Data

2026-04-23T13:42:44Z

Detecting shared neural activity from functional magnetic resonance imaging (fMRI) across individuals exposed to the same stimulus can reveal synchronous brain responses, functional roles of regions, and potential clinical biomarkers. Intersubject correlation (ISC) is the main method for identifying voxelwise shared responses and per-subject variability, but it relies on heavy data summarization and thousands of regional tests, leading to poor uncertainty quantification and multiple testing issues. ISC also does not directly estimate a shared neural response (SNR) function. We propose a model-based alternative applicable to both task-based and naturalistic fMRI that simultaneously identifies spatial regions of shared activity and estimates the SNR function. The model combines sparse Gaussian process estimation of the response function with a Bayesian sparsity prior inspired by the horseshoe prior to detect voxel activation. A spatially structured extension encourages neighboring voxels to exhibit similar activation patterns. We examine the model's properties, evaluate performance via simulations, and analyze two real-world fMRI datasets, including one task-based and one naturalistic dataset. The Bayesian framework provides principled uncertainty quantification for the shared response function and shows improved activation detection and response estimation compared to standard approaches. Model fits demonstrate comparable or superior performance relative to ISC, while the framework opens avenues for clinical applications.

Integrating opportunities and parametrized signatures for improved mutational processes estimation in extended sequence contexts

2026-04-23T13:24:56Z

Mutational signatures describe the pattern of mutations over the different mutation types. Each mutation type is determined by a base substitution and the flanking nucleotides to the left and right of that base substitution. Due to the widespread interest in mutational signatures, several efforts have been devoted to the development of methods for robust and stable signature estimation. Here, we combine various extensions of the standard framework to estimate mutational signatures. These extensions include (a) incorporating opportunities into the analysis, (b) allowing for extended sequence contexts, (c) using the Negative Binomial model, and (d) parametrizing the signatures. We show that the combination of these four extensions gives very robust and reliable mutational signatures. In particular, we highlight the importance of including mutational opportunities and parametrizing the signatures when the mutation types describe an extended sequence context with two or three flanking nucleotides to each side of the base substitution.