https://arxiv.org/api/K0K9owKFuE+29OXjgc3641sYzAk2026-06-13T19:55:21Z2352212015http://arxiv.org/abs/2606.04546v1Bivariate inverse Gaussian degradation processes with shared random effects and an application to fatigue cracks2026-06-03T07:28:08ZThe inverse Gaussian (IG) process is a widely used model for univariate degradation data. For bivariate degradation data involving two performance characteristics (PCs), dependence is often introduced through an unobserved shared frailty factor combined with IG processes. Previous studies typically assume a specific frailty distribution, such as normal or gamma, although such choices are difficult to justify because the frailty is unobserved. This paper proposes a general IG GG framework for modeling bivariate degradation data with dependent PCs. Each degradation process is modeled using an IG process, while the shared frailty follows the generalized gamma (GG) family, which includes exponential, gamma, Weibull, and lognormal distributions as special cases. The proposed framework allows flexible selection of an appropriate frailty distribution within the GG family, leading to improved model fitting. Convenient parameter estimation procedures are developed and evaluated through simulation studies, demonstrating satisfactory performance. The proposed model is applied to fatigue crack data and compared with several existing frailty-based and copula-based models. Results show that the IG GG model provides a superior fit. 
System reliability estimation under the IG GG framework is also discussed.2026-06-03T07:28:08ZYuvraj DuttaSandip BaruiDebanjan MitraNarayanaswamy Balakrishnanhttp://arxiv.org/abs/2606.04416v1Powerful Multivariate Sensitivity Analysis via Sample Splitting in an Observational Study of the Effects of Poverty on Cardiovascular Disease Risk Factors2026-06-03T03:50:09ZWhen assessing the causal effect of an exposure on two or more outcomes in an observational study, a linear combination of outcomes may lessen the sensitivity of a test of the global null hypothesis to potential unmeasured biases. While all linear combinations of scored outcomes can be considered using Scheffe projections or constrained variants thereof, finding the combination that minimizes sensitivity to unmeasured biases requires corrections for multiple testing, which can erode power, especially when many outcomes are of interest. To mitigate this issue, we propose splitting the sample into a planning sample to identify an optimal linear combination and an analysis sample to conduct inference. We provide a novel characterization of the set of linear combinations for which this approach is guaranteed to achieve the same asymptotic power as full-sample alternatives and conduct extensive simulation studies that demonstrate enhanced power in finite samples. Finally, we apply our method to investigate the effects of poverty on the emergence of cardiovascular disease risk factors in children and adolescents. We discover adverse consequences on outcomes related to body composition, physical activity, and tobacco exposure. Although the impact of poverty on elevated tobacco exposure shows some robustness to unmeasured confounding, the other findings remain sensitive to potential biases.2026-06-03T03:50:09ZWilliam BekermanAnurag MehtaRebecca E. HassonLeah E. RobinsonDylan S. SmallColin B. 
Fogartyhttp://arxiv.org/abs/2606.04334v1Hybrid Particle Gaussian Mixture (H-PGM) Solution for Cislunar Target Tracking2026-06-03T01:24:30ZGauss's method of orbit determination (OD) is one of the most popular, minimal assumption target tracking techniques in astrodynamics, especially for generating an initial state estimate. However, due to Gauss's method's assumption of Keplerian motion (part of the larger two-body problem), this method cannot be applied in a cislunar environment, where three body, non-planar effects dominate. In this work, we showcase a hybrid Particle Gaussian Mixture (H-PGM) filtering method, a purely recursive probabilistic OD framework that relies upon a sequential combination of the Markov Chain Monte Carlo (MCMC) based Particle Gaussian Mixture-II (PGM-II) and Kalman update based Particle Gaussian Mixture-I (PGM-I) filters. This method allows us to fuse probabilistic information with angles-only observations from terrestrial telescopes for short- and long-term cislunar target tracking. This method also allows us to fuse other target \textit{a priori} information in an effort to reduce target uncertainty in the short term. This hybrid filtering technique is demonstrated for several popular and important cislunar orbit regimes and compared with several homogeneous and hybrid filtering frameworks.2026-06-03T01:24:30Z38 pages, 14 figures, to be submitted to the Journal of Astronautical SciencesIshan ParanjapeTarun HejmadiUtkarsh Ranjan MishraSuman Chakravortyhttp://arxiv.org/abs/2602.20651v3Sparse Bayesian Deep Functional Learning with Structured Region Selection2026-06-03T01:08:04ZIn modern applications such as ECG monitoring, neuroimaging, wearable sensing, and industrial equipment diagnostics, complex and continuously structured data are ubiquitous, presenting both challenges and opportunities for functional data analysis. 
However, existing methods face a critical trade-off: conventional functional models are limited by linearity, whereas deep learning approaches lack interpretable region selection for sparse effects. To bridge these gaps, we propose a sparse Bayesian functional deep neural network (sBayFDNN). It learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, while a structured prior enables interpretable, region-wise selection of influential domains with quantified uncertainty. Theoretically, we establish rigorous approximation error bounds, posterior consistency, and region selection consistency. These results provide the first theoretical guarantees for a Bayesian deep functional model, ensuring its reliability and statistical rigor. Empirically, comprehensive simulations and real-world studies confirm the effectiveness and superiority of sBayFDNN. Crucially, sBayFDNN excels in recognizing intricate dependencies for accurate predictions and more precisely identifies functionally meaningful regions, capabilities fundamentally beyond existing approaches.2026-02-24T07:53:59ZXiaoxian ZhuYingmeng LiShuangge MaMengyun Wuhttp://arxiv.org/abs/2601.03569v3Local Intrinsic Dimensionality of Ground Motion Data for Early Detection of Catastrophic Slope Failure2026-06-03T00:37:23ZLocal Intrinsic Dimensionality (LID) has shown strong potential for anomaly detection in high-dimensional data, including landslide failure detection in granular media, where early and accurate identification of failure zones is crucial for effective geohazard mitigation. However, this task is still challenging due to the spatial correlations and temporal dynamics that are inherently present in surface displacement data. To address this gap, we propose a novel unsupervised framework called spatiotemporal LID (st-LID) that generalizes the LID for robust failure detection in landslide monitoring networks. 
Our approach introduces three key innovations: (1) Kinematic enhancement, incorporating velocity into the LID computation to capture instantaneous deformation rates and short-term temporal dynamics; (2) Bayesian spatial fusion, which aggregates LID values across spatial neighborhoods via Bayesian estimation, to embed spatial correlations and account for localized noise; and (3) Temporal modeling (t-LID), a new variant that characterizes long-term dynamics of displacement data, providing a robust temporal representation of displacement behavior. By unifying these components, st-LID identifies complex, multi-stage failure zones often overlooked by existing methods. Extensive experiments show that st-LID consistently outperforms state-of-the-art unsupervised baselines in detection precision and lead-time, providing a robust foundation for landslide early warning systems and targeted risk intervention to enhance community resilience and preparedness strategies.2026-01-07T04:29:05Z20 pages, 9 figures. ECML-PKDD 2026Yuansan LiuJames BaileyAntoinette Tordesillashttp://arxiv.org/abs/2603.27095v3Socioeconomic Drivers of Physical Morbidity Across U.S. Counties: A Spatial Causal Inference Approach2026-06-03T00:14:26ZIdentifying the causal effects of socioeconomic determinants on population health is of great interest to statistical methodologists, public health practitioners, and policymakers alike. The statistical side of the problem must address several issues: spatial autocorrelation in both exposures and outcomes, confounding between treatments and covariates, and the need for geographically logical inference. We address these jointly by using spectral basis functions (Moran Eigenvector Maps and ICAR precision matrix eigenvectors) within a doubly robust generalized propensity score estimator for continuous treatments. 
Applied to 2022 county health data for U.S. counties, the framework identifies the effects of six chosen predictors on the average number of physically unhealthy days per month. Further applications and methodological extensions are also discussed as future directions.2026-03-28T02:30:25ZRanadeep DawHunter N. EvansIndrabati Bhattacharyahttp://arxiv.org/abs/2605.28349v2Robust Inference for Dyadic Data with Dependent Ordered Nodes2026-06-02T21:52:20ZDyadic regression models are commonly analyzed under the conventional dyadic dependence framework, where two observations may be dependent only if the corresponding dyads share a node. This paper studies inference when nodes are ordered and nearby nodes are exposed to common latent shocks, so that dyads with no shared endpoint may still be dependent. Although each additional covariance term may be weak, the number of nearby-node dyad pairs grows with the sample size, making their aggregate contribution asymptotically non-negligible. We develop an inferential framework for dyadic arrays with ordered-node dependence and propose two variance estimators: a dependent-node dyadic cluster-robust variance estimator that retains covariance terms between dyads with nearby endpoints, and a row-column moving-block jackknife method that deletes adjacent blocks of nodes together with all dyads touching those nodes. We establish the asymptotic validity of both procedures under weak dependence along the ordered node index. Monte Carlo evidence shows improvements in size control, with the jackknife procedure displaying comparatively stable finite-sample performance. 
An application to international trade gravity regressions shows that accounting for ordered-node dependence substantially weakens the statistical evidence for free trade agreement effects.2026-05-27T11:52:37ZUlrich HounyoJiahao LinXiaojun Songhttp://arxiv.org/abs/2606.04215v1Contextual Geospatial Features for Identifying Informal Environmental-Health Hazards Undetectable from Satellites: A ULAB Case Study2026-06-02T21:02:14ZReliable, scalable detection of informal, small-scale environmental-health hazards (used lead-acid battery (ULAB) recycling, household-scale e-waste burning, indoor mercury amalgamation, brick kilns, small tanneries) remains an unsolved problem. These operations are invisible to satellites and absent from formal registries, yet disproportionately harm low-income populations in low- and middle-income countries. This paper articulates the problem class and explores a possible response: contextual geospatial features, with case-specific feature design informed by domain expertise. We use ULAB recycling as a demonstration case, drawing on 164 verified sites in Bangladesh and India from Pure Earth's Toxic Sites Identification Programme. At this sample size, five-fold cross-validation on the training set cannot statistically distinguish the engineered contextual features from a simple two-feature socio-demographic baseline. The added value only becomes visible when we evaluate outside the training set. On 172 held-out informal-recycling sites in non-NCR India and Bangladesh, the model assigns scores several times higher than to matched random urban controls; and on an independent set of 131 regulatory-confirmed formal recyclers, informal sites score materially higher than formal ones in non-NCR India, indicating that the model is picking up informal-recycler-specific structure rather than generic industrial signal. 
We frame these results as exploratory rather than confirmatory: label sparsity, gaps in point-of-interest coverage, and untested transfer beyond South Asia all remain open. We close with seven open problems and invite the environmental-health and geospatial machine-learning communities to engage with informal-hazard detection as a class of problems worth solving.2026-06-02T21:02:14ZNaia Ormaza-ZuluetaZia Mehrabihttp://arxiv.org/abs/2606.04175v1Inferring cellular heterogeneity with mixture models for DNA methylation rates2026-06-02T19:40:10ZCellular heterogeneity is a hallmark of biological tissues and plays a central role in disease progression, diagnosis, and prognosis. Yet, accurately characterizing this heterogeneity from bulk molecular profiles remains challenging because observed signals arise from mixtures of multiple cell populations. Cell deconvolution aims to recover the relative abundance of constituent cell types from such heterogeneous measurements, but most existing approaches implicitly rely on restrictive assumptions on residual errors, including independence, homoscedasticity, and normality. These assumptions are rarely satisfied in omics data, which are inherently bounded and overdispersed. In this work, we show that whole-genome cell-type specific DNA methylation profiles exhibit latent group structures that can substantially impair deconvolution accuracy when ignored. We therefore propose a mixture of non-negative Beta regression models estimated through an Expectation-Maximization algorithm for DNA methylation rates. Our framework naturally incorporates a feature selection mechanism through mixture component identification, making component selection a critical step of the inference procedure. We further propose a dedicated criterion for component selection and assess the performance of the approach through an extensive comparative study across several in vitro benchmark datasets. 
Our results demonstrate that deconvolution accuracy is highly sensitive to latent component structure and show that explicitly modeling this heterogeneity yields substantial improvements over standard whole-genome deconvolution strategies. Altogether, this work establishes mixture modeling of DNA methylation data as a powerful new direction for robust and accurate cell deconvolution.2026-06-02T19:40:10ZHugo BarbotIRMARYuna BlumIGDRMagali RichardAPTIKAL, LIGDavid CauseurIRMARhttp://arxiv.org/abs/2107.01629v3From Live to Recording: Consumer Demand and Response to Price Across the Livestreaming Lifecycle2026-06-02T19:38:43ZLivestreaming has evolved into a thriving industry where creators can directly monetize and engage with their audiences and followers. In practice, creators and platforms typically concentrate their marketing efforts on the period leading up to the livestream. However, livestreaming events naturally transition into recorded formats once the event concludes, creating potential "residual" opportunities for monetization. This study systematically examines consumer demand for live events throughout the entire livestream life-cycle, using data from a large livestreaming platform that allows consumers to purchase the recorded version of a paid live event after the livestream ends. We find that the demand is surprisingly more price-sensitive during the pre-livestream period compared to the post-period. This is partly driven by two mechanisms: consumer self-selection (infrequent consumers who may have missed the live events exhibit a higher willingness to pay for recorded versions) and quality uncertainty (consumers face higher uncertainty in event quality during the pre-period than in the post-period). 
Our findings generate implications for the pricing and targeting strategies in livestreaming markets.2021-07-04T13:50:54ZAn earlier version of this paper was distributed under the title "The Role of 'Live' in Livestreaming Markets: Evidence Using Orthogonal Random Forest."Ziwei CongJia LiuPuneet Manchandahttp://arxiv.org/abs/2606.04170v1A Retrospective Benchmark of Spatiotemporal Covariates for Daily Active-Fire Detection in Cerrado Conservation Units2026-06-02T19:34:01ZWildfires threaten biodiversity, carbon stocks, and management capacity in the Brazilian Cerrado, where Conservation Units and their official buffer zones must allocate prevention resources under a strong dry-season fire regime. This work develops a retrospective daily active-fire detection benchmark for the Cerrado portion of Minas Gerais, Brazil, using INPE BDQueimadas reference satellite labels (AQUA_M-T), constrained pseudo absences with same-year MapBiomas Collection 9 land-cover filtering, and four nested covariate stages extracted through Google Earth Engine. Logistic Regression, Random Forest, and XGBoost are evaluated under five-fold time-series cross-validation on a global training base and on independent imbalanced test sets spatially held out to Parque Estadual do Pau Furado and Parque Estadual da Serra do Cabral with their official buffer zones. AUC-PR is the primary metric, with AUC-ROC, threshold precision and recall, SHAP explanations, and retrospective score maps used as complementary diagnostics. Temporal cross-validation showed the highest mean AUC-PR at the complete temporal-memory stage for all three model families. Held-out AOI tests were weaker under the stricter 1:100 prevalence design: Random Forest peaked at Stage 3 in both AOIs, while XGBoost maps exposed high-recall, high-warning-volume behavior. 
The resulting baseline provides a reproducible reference for comparing atmospheric, surface, static spatial, and short-term memory covariates in daily CU-scale active-fire detection ranking. Because several stages use same-day covariates, the study is a retrospective classification benchmark rather than a prospective forecast.2026-06-02T19:34:01Z26 pages, 19 figures, 7 tablesJuliano Eleno Silva PáduaAlexandre Luis Magalhães LevadaFredy João Valentehttp://arxiv.org/abs/2512.06553v2A Latent Variable Framework for Scaling Laws in Large Language Models2026-06-02T19:03:16ZWe propose a statistical framework built on latent variable modeling for scaling laws of large language models (LLMs). Our work is motivated by the rapid emergence of numerous new LLM families with distinct architectures and training strategies, evaluated on an increasing number of benchmarks. This heterogeneity makes a single global scaling curve inadequate for capturing how performance varies across families and benchmarks. To address this, we propose a latent variable modeling framework in which each LLM family is associated with a latent variable that captures the common underlying features in that family. An LLM's performance on different benchmarks is then driven by its latent skills, which are jointly determined by the latent variable and the model's own observable features. We develop an estimation procedure for this latent variable model and establish its statistical properties. We also design efficient numerical algorithms that support estimation and various downstream tasks. 
Empirically, we evaluate the approach on 12 widely used benchmarks from the Open LLM Leaderboard (v1/v2).2025-12-06T19:49:31ZPeiyao CaiChengyu CuiFelipe Maia PoloSeamus SomerstepLeshem ChoshenMikhail YurochkinYuekai SunKean Ming TanGongjun Xuhttp://arxiv.org/abs/2606.03961v1A Neural Estimation Framework for Aggregated Relational Data under Intractable Likelihoods2026-06-02T17:49:58ZAggregated relational data (ARD) consists of survey responses to questions of the form ``how many people do you know who~$X$?'' and is widely used in survey statistics for indirect inference about populations and social networks. The dominant ARD inference target is hidden-population size estimation via the Network Scale-Up Method (NSUM), but ARD is also used for personal-network-size estimation, mixing-pattern recovery, and inference about latent network structure. Bayesian inference for ARD almost universally assumes that, conditional on a respondent's degree, the counts reported for different subpopulations are independent. There are, however, reasons to question this assumption, as homophily, latent-space clustering, and imperfect recall may all induce cross-population dependence. We develop a simulation-based neural estimation framework for ARD which requires only a simulator, so it can be applied to generative models whose likelihood cannot be written down or efficiently evaluated. The framework trains a permutation-invariant neural Bayes estimator that returns, for each marginal parameter, a posterior median and a 95% credible interval, by minimising a multi-quantile pinball loss with a cumulative-gap construction that rules out quantile crossing by design. We demonstrate the framework on three structurally distinct intractable extensions of NSUM-style ARD inference: a stochastic block model, a latent-space model, and a recall-subset model. We apply the framework to an ARD household survey collected in Rwanda. 
The framework provides inference on any new survey drawn from the training distribution, and extends the reach of ARD modelling to network-structure and cognitive-process assumptions beyond those currently accessible to likelihood-based inference.2026-06-02T17:49:58Z33 pages, 3 figures, 2 tablesRowland G SeymourJoseph Marshhttp://arxiv.org/abs/2501.01324v4Fast data inversion for high-dimensional Ornstein-Uhlenbeck processes from noisy measurements2026-06-02T16:49:03ZIn this work, we develop a scalable approach for a flexible latent factor model for high-dimensional dynamical systems. Each latent factor process has its own correlation and variance parameters, and the orthogonal factor loading matrix can be either fixed or estimated. We utilize an orthogonal factor loading matrix that avoids inverting the posterior covariance matrix at each time step of the Kalman filter, and derive closed-form expressions in an expectation-maximization algorithm for parameter estimation, which substantially reduces the computational complexity without approximation. Our approach has several applications, including noise filtering for high-dimensional time series, estimating nonseparable covariance structure between different time series, and estimating latent physical processes from real-world measurements. Extensive simulation studies illustrate the higher accuracy and scalability of our approach compared to alternatives. Furthermore, when we apply our method to geodetic measurements from the Cascadia region to estimate slow slip events, the estimated slip agrees better with independently measured seismic data of tremor events. 
The substantial acceleration from our method enables the use of massive noisy data for geological hazard quantification and other applications.2025-01-02T16:25:57ZYizi LinXubo LiuPaul SegallMengyang Guhttp://arxiv.org/abs/2606.03880v1Principal Components Decomposition of Fraction of Variance Explained in High Dimensional Linear Models with Strong Correlation2026-06-02T16:47:00ZThe fraction of variance explained (FVE) in a linear model quantifies the extent to which predictors account for outcome variability. In high-dimensional settings, where traditional FVE estimators do not apply, modern FVE estimators such as GWASH or the linear mixed-effects model estimated through restricted maximum likelihood (LMM-REML) struggle with strong correlation among predictors, often found, for example, in brain imaging data. We propose a decomposition framework that partitions the FVE into two components: a low-dimensional component capturing the strong correlation, estimable by low-dimensional methods, and a high-dimensional component with the remaining weak correlation, estimable by high-dimensional methods. Simulations demonstrate that separating out the dominant principal components (PCs) and estimating the high-dimensional FVE using GWASH or LMM-REML reduces bias compared with directly applying GWASH or LMM-REML. Our method shows consistent performance asymptotically as both the number of predictors and the number of samples increase. We illustrate the method in an analysis of the Adolescent Brain Cognitive Development (ABCD) brain imaging dataset, capturing nuanced heritability signals in the FVE of cognitive measures predicted by high-resolution brain imaging data.2026-06-02T16:47:00ZMan LuoChun Chieh FanDavid AzrielArmin Schwartzman