https://arxiv.org/api/K0K9owKFuE+29OXjgc3641sYzAk 2026-06-13T19:55:21Z 23522 120 15

http://arxiv.org/abs/2606.04546v1 Bivariate inverse Gaussian degradation processes with shared random effects and an application to fatigue cracks 2026-06-03T07:28:08Z

The inverse Gaussian (IG) process is a widely used model for univariate degradation data. For bivariate degradation data involving two performance characteristics (PCs), dependence is often introduced through an unobserved shared frailty factor combined with IG processes. Previous studies typically assume a specific frailty distribution, such as normal or gamma, although such choices are difficult to justify because the frailty is unobserved. This paper proposes a general IG-GG framework for modeling bivariate degradation data with dependent PCs. Each degradation process is modeled using an IG process, while the shared frailty follows the generalized gamma (GG) family, which includes exponential, gamma, Weibull, and lognormal distributions as special cases. The proposed framework allows flexible selection of an appropriate frailty distribution within the GG family, leading to improved model fitting. Convenient parameter estimation procedures are developed and evaluated through simulation studies, demonstrating satisfactory performance. The proposed model is applied to fatigue crack data and compared with several existing frailty-based and copula-based models. Results show that the IG-GG model provides a superior fit. System reliability estimation under the IG-GG framework is also discussed.
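As a rough illustration of the kind of model described above, the following sketch simulates two IG degradation paths that share one generalized gamma frailty draw. It is not the authors' specification: the way the frailty `z` enters (scaling the mean of each increment), the increment parameterization, and all function and parameter names are assumptions made for this example. The GG draw uses the standard gamma transform, and the IG draw uses the Michael-Schucany-Haas transformation method.

```python
import math
import random


def sample_gg(scale, d, p, rng):
    """Draw from a generalized gamma GG(scale, d, p).

    Uses the fact that (X / scale) ** p follows Gamma(d / p, 1).
    p = 1 gives a gamma law, d = p a Weibull law, d = p = 1 an exponential law.
    """
    return scale * rng.gammavariate(d / p, 1.0) ** (1.0 / p)


def sample_ig(mean, shape, rng):
    """Draw from an inverse Gaussian IG(mean, shape) (Michael-Schucany-Haas)."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (mean
         + mean * mean * y / (2.0 * shape)
         - mean / (2.0 * shape) * math.sqrt(4.0 * mean * shape * y + (mean * y) ** 2))
    # Accept the smaller root with probability mean / (mean + x), else the larger one.
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x


def simulate_pair(mu, lam, gg_params, n_steps, dt, rng):
    """Simulate two IG paths on a regular grid with one shared frailty draw.

    ASSUMPTION: the frailty z multiplies the mean of every increment of both
    processes; the increment over a step dt is IG(z * mu_k * dt, lam_k * dt**2).
    """
    z = sample_gg(*gg_params, rng)
    paths = []
    for mu_k, lam_k in zip(mu, lam):
        level, path = 0.0, [0.0]
        for _ in range(n_steps):
            level += sample_ig(z * mu_k * dt, lam_k * dt ** 2, rng)
            path.append(level)
        paths.append(path)
    return z, paths[0], paths[1]


rng = random.Random(1)
z, pc1, pc2 = simulate_pair((1.0, 2.0), (4.0, 9.0), (1.0, 2.0, 2.0), 50, 0.1, rng)
```

Because both paths are driven by the same `z`, a unit with a large frailty draw degrades faster on both PCs, which is the source of dependence in this sketch.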

2026-06-03T07:28:08Z Yuvraj Dutta Sandip Barui Debanjan Mitra Narayanaswamy Balakrishnan

http://arxiv.org/abs/2606.04416v1 Powerful Multivariate Sensitivity Analysis via Sample Splitting in an Observational Study of the Effects of Poverty on Cardiovascular Disease Risk Factors 2026-06-03T03:50:09Z

When assessing the causal effect of an exposure on two or more outcomes in an observational study, a linear combination of outcomes may lessen the sensitivity of a test of the global null hypothesis to potential unmeasured biases. While all linear combinations of scored outcomes can be considered using Scheffe projections or constrained variants thereof, finding the combination that minimizes sensitivity to unmeasured biases requires corrections for multiple testing, which can erode power, especially when many outcomes are of interest. To mitigate this issue, we propose splitting the sample into a planning sample to identify an optimal linear combination and an analysis sample to conduct inference. We provide a novel characterization of the set of linear combinations for which this approach is guaranteed to achieve the same asymptotic power as full-sample alternatives and conduct extensive simulation studies that demonstrate enhanced power in finite samples. Finally, we apply our method to investigate the effects of poverty on the emergence of cardiovascular disease risk factors in children and adolescents. We discover adverse consequences on outcomes related to body composition, physical activity, and tobacco exposure. Although the impact of poverty on elevated tobacco exposure shows some robustness to unmeasured confounding, the other findings remain sensitive to potential biases.

2026-06-03T03:50:09Z William Bekerman Anurag Mehta Rebecca E. Hasson Leah E. Robinson Dylan S. Small Colin B. Fogarty

http://arxiv.org/abs/2606.04334v1 Hybrid Particle Gaussian Mixture (H-PGM) Solution for Cislunar Target Tracking 2026-06-03T01:24:30Z

Gauss's method of orbit determination (OD) is one of the most popular minimal-assumption target tracking techniques in astrodynamics, especially for generating an initial state estimate. However, because Gauss's method assumes Keplerian motion (part of the larger two-body problem), it cannot be applied in a cislunar environment, where three-body, non-planar effects dominate. In this work, we showcase a hybrid Particle Gaussian Mixture (H-PGM) filtering method, a purely recursive probabilistic OD framework that relies upon a sequential combination of the Markov Chain Monte Carlo (MCMC)-based Particle Gaussian Mixture-II (PGM-II) and Kalman-update-based Particle Gaussian Mixture-I (PGM-I) filters. This method allows us to fuse probabilistic information with angles-only observations from terrestrial telescopes for short- and long-term cislunar target tracking. It also allows us to fuse other a priori target information in an effort to reduce target uncertainty in the short term. This hybrid filtering technique is demonstrated for several popular and important cislunar orbit regimes and compared with several homogeneous and hybrid filtering frameworks.

2026-06-03T01:24:30Z 38 pages, 14 figures, to be submitted to the Journal of Astronautical Sciences Ishan Paranjape Tarun Hejmadi Utkarsh Ranjan Mishra Suman Chakravorty

http://arxiv.org/abs/2602.20651v3 Sparse Bayesian Deep Functional Learning with Structured Region Selection 2026-06-03T01:08:04Z

In modern applications such as ECG monitoring, neuroimaging, wearable sensing, and industrial equipment diagnostics, complex and continuously structured data are ubiquitous, presenting both challenges and opportunities for functional data analysis. However, existing methods face a critical trade-off: conventional functional models are limited by linearity, whereas deep learning approaches lack interpretable region selection for sparse effects. To bridge these gaps, we propose a sparse Bayesian functional deep neural network (sBayFDNN). It learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, while a structured prior enables interpretable, region-wise selection of influential domains with quantified uncertainty. Theoretically, we establish rigorous approximation error bounds, posterior consistency, and region selection consistency. These results provide the first theoretical guarantees for a Bayesian deep functional model, ensuring its reliability and statistical rigor. Empirically, comprehensive simulations and real-world studies confirm the effectiveness and superiority of sBayFDNN. Crucially, sBayFDNN excels in recognizing intricate dependencies for accurate predictions and more precisely identifies functionally meaningful regions, capabilities fundamentally beyond existing approaches.

2026-02-24T07:53:59Z Xiaoxian Zhu Yingmeng Li Shuangge Ma Mengyun Wu

http://arxiv.org/abs/2601.03569v3 Local Intrinsic Dimensionality of Ground Motion Data for Early Detection of Catastrophic Slope Failure 2026-06-03T00:37:23Z

Local Intrinsic Dimensionality (LID) has shown strong potential for anomaly detection in high-dimensional data, including landslide failure detection in granular media, where early and accurate identification of failure zones is crucial for effective geohazard mitigation. However, this task is still challenging due to the spatial correlations and temporal dynamics that are inherently present in surface displacement data. To address this gap, we propose a novel unsupervised framework called spatiotemporal LID (st-LID) that generalizes the LID for robust failure detection in landslide monitoring networks. Our approach introduces three key innovations: (1) Kinematic enhancement, incorporating velocity into the LID computation to capture instantaneous deformation rates and short-term temporal dynamics; (2) Bayesian spatial fusion, which aggregates LID values across spatial neighborhoods via Bayesian estimation, to embed spatial correlations and account for localized noise; and (3) Temporal modeling (t-LID), a new variant that characterizes long-term dynamics of displacement data, providing a robust temporal representation of displacement behavior. By unifying these components, st-LID identifies complex, multi-stage failure zones often overlooked by existing methods. Extensive experiments show that st-LID consistently outperforms state-of-the-art unsupervised baselines in detection precision and lead-time, providing a robust foundation for landslide early warning systems and targeted risk intervention to enhance community resilience and preparedness strategies.
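For readers unfamiliar with LID, the sketch below shows the widely used maximum-likelihood estimator of LID from k-nearest-neighbor distances, which is the plain quantity that a method such as st-LID generalizes. It does not implement the kinematic, Bayesian spatial fusion, or temporal components described above; the function names and the toy data are assumptions for illustration only.

```python
import math


def knn_distances(query, points, k):
    """Return the k smallest positive Euclidean distances from `query` to `points`."""
    dists = sorted(math.dist(query, p) for p in points)
    return [d for d in dists if d > 0.0][:k]


def lid_mle(distances):
    """Maximum-likelihood LID estimate from neighbor distances.

    LID = -1 / mean(log(r_i / r_max)), where r_max is the largest distance.
    Raises ZeroDivisionError if all distances are equal.
    """
    r_max = max(distances)
    mean_log_ratio = sum(math.log(r / r_max) for r in distances) / len(distances)
    return -1.0 / mean_log_ratio


# Toy data: 100 points on a straight line embedded in 3-D space.
line_points = [(float(i), 2.0 * i, 0.0) for i in range(1, 101)]
estimate = lid_mle(knn_distances((0.0, 0.0, 0.0), line_points, k=100))
```

On the toy line the estimate is close to 1 even though the ambient space is 3-D, which is the property that makes LID useful for spotting points whose local structure differs from that of their neighbors.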

2026-01-07T04:29:05Z 20 pages, 9 figures. ECML-PKDD 2026 Yuansan Liu James Bailey Antoinette Tordesillas

http://arxiv.org/abs/2603.27095v3 Socioeconomic Drivers of Physical Morbidity Across U.S. Counties: A Spatial Causal Inference Approach 2026-06-03T00:14:26Z

Identifying the causal effects of socioeconomic determinants on population health is of great interest to statistical methodologists, public health practitioners, and policymakers alike. The statistical side of the problem must address several challenges: spatial autocorrelation in both exposures and outcomes, confounding between treatments and covariates, and the need for geographically coherent inference. We address these jointly by using spectral basis functions (Moran Eigenvector Maps and ICAR precision matrix eigenvectors) within a doubly robust generalized propensity score estimator for continuous treatments. Applied to 2022 county health data across U.S. counties, the framework identifies the effects of six chosen predictors on the average number of physically unhealthy days per month. Possible further applications and methodological extensions are also discussed as future directions.

2026-03-28T02:30:25Z Ranadeep Daw Hunter N. Evans Indrabati Bhattacharya

http://arxiv.org/abs/2605.28349v2 Robust Inference for Dyadic Data with Dependent Ordered Nodes 2026-06-02T21:52:20Z

Dyadic regression models are commonly analyzed under the conventional dyadic dependence framework, where two observations may be dependent only if the corresponding dyads share a node. This paper studies inference when nodes are ordered and nearby nodes are exposed to common latent shocks, so that dyads with no shared endpoint may still be dependent. Although each additional covariance term may be weak, the number of nearby-node dyad pairs grows with the sample size, making their aggregate contribution asymptotically non-negligible. We develop an inferential framework for dyadic arrays with ordered-node dependence and propose two variance estimators: a dependent-node dyadic cluster-robust variance estimator that retains covariance terms between dyads with nearby endpoints, and a row-column moving-block jackknife method that deletes adjacent blocks of nodes together with all dyads touching those nodes. We establish the asymptotic validity of both procedures under weak dependence along the ordered node index. Monte Carlo evidence shows improvements in size control, with the jackknife procedure displaying comparatively stable finite-sample performance. An application to international trade gravity regressions shows that accounting for ordered-node dependence substantially weakens the statistical evidence for free trade agreement effects.

2026-05-27T11:52:37Z Ulrich Hounyo Jiahao Lin Xiaojun Song

http://arxiv.org/abs/2606.04215v1 Contextual Geospatial Features for Identifying Informal Environmental-Health Hazards Undetectable from Satellites: A ULAB Case Study 2026-06-02T21:02:14Z

Reliable, scalable detection of informal, small-scale environmental-health hazards (used lead-acid battery (ULAB) recycling, household-scale e-waste burning, indoor mercury amalgamation, brick kilns, small tanneries) remains an unsolved problem. These operations are invisible to satellites and absent from formal registries, yet disproportionately harm low-income populations in low- and middle-income countries. This paper articulates the problem class and explores a possible response: contextual geospatial features, with case-specific feature design informed by domain expertise. We use ULAB recycling as a demonstration case, drawing on 164 verified sites in Bangladesh and India from Pure Earth's Toxic Sites Identification Programme. At this sample size, five-fold cross-validation on the training set cannot statistically distinguish the engineered contextual features from a simple two-feature socio-demographic baseline. The added value only becomes visible when we evaluate outside the training set. On 172 held-out informal-recycling sites in non-NCR India and Bangladesh, the model assigns scores several times higher than to matched random urban controls; and on an independent set of 131 regulatory-confirmed formal recyclers, informal sites score materially higher than formal ones in non-NCR India, indicating that the model is picking up informal-recycler-specific structure rather than generic industrial signal. We frame these results as exploratory rather than confirmatory: label sparsity, gaps in point-of-interest coverage, and untested transfer beyond South Asia all remain open. We close with seven open problems and invite the environmental-health and geospatial machine-learning communities to engage with informal-hazard detection as a class of problems worth solving.

2026-06-02T21:02:14Z Naia Ormaza-Zulueta Zia Mehrabi

http://arxiv.org/abs/2606.04175v1 Inferring cellular heterogeneity with mixture models for DNA methylation rates 2026-06-02T19:40:10Z

Cellular heterogeneity is a hallmark of biological tissues and plays a central role in disease progression, diagnosis, and prognosis. Yet, accurately characterizing this heterogeneity from bulk molecular profiles remains challenging because observed signals arise from mixtures of multiple cell populations. Cell deconvolution aims to recover the relative abundance of constituent cell types from such heterogeneous measurements, but most existing approaches implicitly rely on restrictive assumptions on residual errors, including independence, homoscedasticity, and normality. These assumptions are rarely satisfied in omics data, which are inherently bounded and overdispersed. In this work, we show that whole-genome cell-type specific DNA methylation profiles exhibit latent group structures that can substantially impair deconvolution accuracy when ignored. We therefore propose a mixture of non-negative Beta regression models estimated through an Expectation-Maximization algorithm for DNA methylation rates. Our framework naturally incorporates a feature selection mechanism through mixture component identification, making component selection a critical step of the inference procedure. We further propose a dedicated criterion for component selection and assess the performance of the approach through an extensive comparative study across several in vitro benchmark datasets. Our results demonstrate that deconvolution accuracy is highly sensitive to latent component structure and show that explicitly modeling this heterogeneity yields substantial improvements over standard whole-genome deconvolution strategies. Altogether, this work establishes mixture modeling of DNA methylation data as a powerful new direction for robust and accurate cell deconvolution.

2026-06-02T19:40:10Z Hugo Barbot (IRMAR) Yuna Blum (IGDR) Magali Richard (APTIKAL, LIG) David Causeur (IRMAR)

http://arxiv.org/abs/2107.01629v3 From Live to Recording: Consumer Demand and Response to Price Across the Livestreaming Lifecycle 2026-06-02T19:38:43Z

Livestreaming has evolved into a thriving industry where creators can directly monetize and engage with their audiences and followers. In practice, creators and platforms typically concentrate their marketing efforts on the period leading up to the livestream. However, livestreaming events naturally transition into recorded formats once the event concludes, creating potential "residual" opportunities for monetization. This study systematically examines consumer demand for live events throughout the entire livestream life-cycle, using data from a large livestreaming platform that allows consumers to purchase the recorded version of a paid live event after the livestream ends. We find that the demand is surprisingly more price-sensitive during the pre-livestream period compared to the post-period. This is partly driven by two mechanisms: consumer self-selection (infrequent consumers who may have missed the live events exhibit a higher willingness to pay for recorded versions) and quality uncertainty (consumers face higher uncertainty in event quality during the pre-period than in the post-period). Our findings generate implications for the pricing and targeting strategies in livestreaming markets.

2021-07-04T13:50:54Z An earlier version of this paper was distributed under the title "The Role of 'Live' in Livestreaming Markets: Evidence Using Orthogonal Random Forest." Ziwei Cong Jia Liu Puneet Manchanda

http://arxiv.org/abs/2606.04170v1 A Retrospective Benchmark of Spatiotemporal Covariates for Daily Active-Fire Detection in Cerrado Conservation Units 2026-06-02T19:34:01Z

Wildfires threaten biodiversity, carbon stocks, and management capacity in the Brazilian Cerrado, where Conservation Units and their official buffer zones must allocate prevention resources under a strong dry-season fire regime. This work develops a retrospective daily active-fire detection benchmark for the Cerrado portion of Minas Gerais, Brazil, using INPE BDQueimadas reference satellite labels (AQUA_M-T), constrained pseudo absences with same-year MapBiomas Collection 9 land-cover filtering, and four nested covariate stages extracted through Google Earth Engine. Logistic Regression, Random Forest, and XGBoost are evaluated under five-fold time-series cross-validation on a global training base and on independent imbalanced test sets spatially held out to Parque Estadual do Pau Furado and Parque Estadual da Serra do Cabral with their official buffer zones. AUC-PR is the primary metric, with AUC-ROC, threshold precision and recall, SHAP explanations, and retrospective score maps used as complementary diagnostics. Temporal cross-validation showed the highest mean AUC-PR at the complete temporal-memory stage for all three model families. Held-out AOI tests were weaker under the stricter 1:100 prevalence design: Random Forest peaked at Stage 3 in both AOIs, while XGBoost maps exposed high-recall, high-warning-volume behavior. The resulting baseline provides a reproducible reference for comparing atmospheric, surface, static spatial, and short-term memory covariates in daily CU-scale active-fire detection ranking. Because several stages use same-day covariates, the study is a retrospective classification benchmark rather than a prospective forecast.

2026-06-02T19:34:01Z 26 pages, 19 figures, 7 tables Juliano Eleno Silva Pádua Alexandre Luis Magalhães Levada Fredy João Valente

http://arxiv.org/abs/2512.06553v2 A Latent Variable Framework for Scaling Laws in Large Language Models 2026-06-02T19:03:16Z

We propose a statistical framework built on latent variable modeling for scaling laws of large language models (LLMs). Our work is motivated by the rapid emergence of numerous new LLM families with distinct architectures and training strategies, evaluated on an increasing number of benchmarks. This heterogeneity makes a single global scaling curve inadequate for capturing how performance varies across families and benchmarks. To address this, we propose a latent variable modeling framework in which each LLM family is associated with a latent variable that captures the common underlying features in that family. An LLM's performance on different benchmarks is then driven by its latent skills, which are jointly determined by the latent variable and the model's own observable features. We develop an estimation procedure for this latent variable model and establish its statistical properties. We also design efficient numerical algorithms that support estimation and various downstream tasks. Empirically, we evaluate the approach on 12 widely used benchmarks from the Open LLM Leaderboard (v1/v2).

2025-12-06T19:49:31Z Peiyao Cai Chengyu Cui Felipe Maia Polo Seamus Somerstep Leshem Choshen Mikhail Yurochkin Yuekai Sun Kean Ming Tan Gongjun Xu

http://arxiv.org/abs/2606.03961v1 A Neural Estimation Framework for Aggregated Relational Data under Intractable Likelihoods 2026-06-02T17:49:58Z

Aggregated relational data (ARD) consists of survey responses to questions of the form "how many people do you know who X?" and is widely used in survey statistics for indirect inference about populations and social networks. The dominant ARD inference target is hidden-population size estimation via the Network Scale-Up Method (NSUM), but ARD is also used for personal-network-size estimation, mixing-pattern recovery, and inference about latent network structure. Bayesian inference for ARD almost universally assumes that, conditional on a respondent's degree, the counts reported for different subpopulations are independent. There are, however, reasons to question this assumption, as homophily, latent-space clustering, and imperfect recall may all induce cross-population dependence. We develop a simulation-based neural estimation framework for ARD which requires only a simulator, so it can be applied to generative models whose likelihood cannot be written down or efficiently evaluated. The framework trains a permutation-invariant neural Bayes estimator that returns, for each marginal parameter, a posterior median and a 95% credible interval, by minimising a multi-quantile pinball loss with a cumulative-gap construction that rules out quantile crossing by design. We demonstrate the framework on three structurally distinct intractable extensions of NSUM-style ARD inference: a stochastic block model, a latent-space model, and a recall-subset model. We apply the framework to ARD from a household survey collected in Rwanda. The framework provides inference on any new survey drawn from the training distribution, and extends the reach of ARD modelling to network-structure and cognitive-process assumptions beyond those currently accessible to likelihood-based inference.
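The multi-quantile pinball loss with a cumulative-gap construction can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the use of a softplus to make the gaps positive, the choice of quantile levels 0.025, 0.5, and 0.975, and all names are assumptions. The idea is that the network outputs a base value and unconstrained gaps, and each higher quantile is the previous one plus a positive gap, so the quantiles cannot cross.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); positive for finite x."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def monotone_quantiles(raw):
    """Map unconstrained outputs (base, gap_1, gap_2, ...) to ordered quantiles.

    ASSUMPTION: gaps are made positive with a softplus; other positive
    transforms would serve the same purpose.
    """
    quantiles = [raw[0]]
    for gap in raw[1:]:
        quantiles.append(quantiles[-1] + softplus(gap))
    return quantiles


def pinball_loss(y, q, tau):
    """Pinball (quantile) loss of prediction q for target y at level tau."""
    diff = y - q
    return max(tau * diff, (tau - 1.0) * diff)


def multi_quantile_loss(y, raw, taus=(0.025, 0.5, 0.975)):
    """Sum of pinball losses over the ordered quantiles built from `raw`."""
    quantiles = monotone_quantiles(raw)
    return sum(pinball_loss(y, q, tau) for q, tau in zip(quantiles, taus))


# Raw network outputs for one parameter: base value and two unconstrained gaps.
lower, median, upper = monotone_quantiles([0.0, -3.0, 2.0])
loss = multi_quantile_loss(0.5, [0.0, -3.0, 2.0])
```

With this parameterization, the lower bound, median, and upper bound stay ordered for any raw outputs, so no post hoc sorting of quantiles is needed.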

2026-06-02T17:49:58Z 33 pages, 3 figures, 2 tables Rowland G Seymour Joseph Marsh

http://arxiv.org/abs/2501.01324v4 Fast data inversion for high-dimensional Ornstein-Uhlenbeck processes from noisy measurements 2026-06-02T16:49:03Z

In this work, we develop a scalable approach for a flexible latent factor model for high-dimensional dynamical systems. Each latent factor process has its own correlation and variance parameters, and the orthogonal factor loading matrix can be either fixed or estimated. We utilize an orthogonal factor loading matrix that avoids inverting the posterior covariance matrix at each time step of the Kalman filter, and derive closed-form expressions in an expectation-maximization algorithm for parameter estimation, which substantially reduces the computational complexity without approximation. Our approach has several applications, including noise filtering for high-dimensional time series, estimating nonseparable covariance structure between different time series, and estimating latent physical processes from real-world measurements. Extensive simulation studies illustrate the higher accuracy and scalability of our approach compared to alternatives. Furthermore, when our method is applied to geodetic measurements to estimate slow slip events in the Cascadia region, the estimated slip agrees better with independently measured seismic data of tremor events. The substantial acceleration from our method enables the use of massive noisy data for geological hazard quantification and other applications.
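The computational benefit of an orthogonal loading matrix can be seen in a small sketch. Assume, for illustration only, the model y_t = A z_t + e_t with orthonormal columns in A and independent noise of equal variance. Then A^T y_t = z_t + A^T e_t, and the projected noise keeps the same variance, so each latent factor can be filtered by its own scalar Kalman filter with no matrix inversion. The AR(1) form used for each factor (a discretized Ornstein-Uhlenbeck process), the function names, and the parameter values are assumptions, and the EM parameter updates from the abstract are not shown.

```python
def project(loadings, y):
    """Return A^T y, where `loadings` lists the orthonormal columns of A."""
    return [sum(a_i * y_i for a_i, y_i in zip(col, y)) for col in loadings]


def kalman_filter_ar1(obs, rho, sigma2, noise2):
    """Scalar Kalman filter for z_t = rho * z_{t-1} + w_t observed with noise.

    w_t has variance sigma2 * (1 - rho**2), so z_t has stationary variance sigma2.
    Returns a list of (filtered mean, filtered variance) pairs.
    """
    mean, var = 0.0, sigma2  # stationary prior for the first time point
    filtered = []
    for t, y in enumerate(obs):
        if t > 0:  # predict step
            mean = rho * mean
            var = rho * rho * var + sigma2 * (1.0 - rho * rho)
        gain = var / (var + noise2)  # update step: scalar division only
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        filtered.append((mean, var))
    return filtered


# Two orthonormal loading columns in a 2-D observation space (hypothetical values).
s = 2.0 ** -0.5
loadings = [(s, s), (s, -s)]
observations = [(1.0, 0.8), (1.2, 1.1), (0.9, 1.3)]

# Project each observation, then filter every factor independently.
projected = [project(loadings, y) for y in observations]
factor_tracks = [
    kalman_filter_ar1([p[j] for p in projected], rho=0.9, sigma2=1.0, noise2=0.1)
    for j in range(len(loadings))
]
```

The cost per time step is linear in the number of factors after the projection, which is the kind of saving the abstract attributes to the orthogonal loading structure.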

2025-01-02T16:25:57Z Yizi Lin Xubo Liu Paul Segall Mengyang Gu

http://arxiv.org/abs/2606.03880v1 Principal Components Decomposition of Fraction of Variance Explained in High Dimensional Linear Models with Strong Correlation 2026-06-02T16:47:00Z

The fraction of variance explained (FVE) in a linear model quantifies the extent to which predictors account for outcome variability. In high-dimensional settings, where traditional FVE estimators do not apply, modern FVE estimators such as GWASH or the linear mixed-effects model estimated through restricted maximum likelihood (LMM-REML) struggle with strong correlation among predictors, as is often found, for example, in brain imaging data. We propose a decomposition framework that partitions the FVE into two components: a low-dimensional component capturing the strong correlation, estimable by low-dimensional methods, and a high-dimensional component with the remaining weak correlation, estimable by high-dimensional methods. Simulations demonstrate that separating out the dominant principal components (PCs) and estimating the high-dimensional FVE using GWASH or LMM-REML reduces bias compared with directly applying GWASH or LMM-REML. Our method shows consistent performance asymptotically as both the number of predictors and the number of samples increase. We illustrate the method in an analysis of the Adolescent Brain Cognitive Development (ABCD) brain imaging dataset, capturing nuanced heritability signals in the FVE of cognitive measures predicted by high-resolution brain imaging data.

2026-06-02T16:47:00Z Man Luo Chun Chieh Fan David Azriel Armin Schwartzman