Exponential thermalisation of viscous fluids on negatively curved manifolds

2026-06-01T14:08:51Z

The deterministic incompressible Navier-Stokes equations are physically incomplete: any viscous fluid at finite temperature must exhibit thermal fluctuations whose form is dictated by the fluctuation-dissipation relation. We formulate the stochastic Navier-Stokes equations with the kinematically selected deformation Laplacian on compact Riemannian manifolds with strictly negative Ricci curvature. The fluctuation-dissipation relation, derived from a topological (Poincaré lemma) argument, uniquely determines the noise from the viscous operator. For the spectrally truncated system, we prove that the unique stationary distribution is the Gibbs measure (Gaussian in the mode amplitudes, because the nonlinear convective terms preserve energy), and that convergence to equilibrium is exponentially fast with rate at least $2νλ_\Def$, where $ν$ is the kinematic viscosity and $λ_\Def$ is the spectral gap of the deformation Laplacian. The spectral gap satisfies $λ_\Def \geq κ^2$ when $\Ric \leq -κ^2 g$, and is independent of the volume of the domain. On flat space, the analogous thermalisation rate vanishes in the infinite-volume limit. The equilibrium velocity-velocity correlation function decays exponentially in geodesic distance, in contrast to the algebraic decay on flat space. These results provide a rigorous statistical-mechanical foundation for viscous fluids on negatively curved manifolds and illustrate how the geometry of the domain controls not only the deterministic dynamics but also the approach to thermal equilibrium.

It does what it says on the tin: safe synthetic data from coarsened margins

2026-06-01T11:32:10Z

This paper proposes a method of creating synthetic data (SD) that will have two important advantages for the user compared to other methods currently available. The first is transparency; unlike other methods, the person in receipt of the SD will know which of the relationships between variables in the original data will be approximately maintained in the SD. The second is a guarantee that the SD is derived from information that has already been judged to be free of disclosure risk. This is achieved by first defining and calculating the margins where relationships between variables will be maintained in the SD. Each margin will then be subject to statistical disclosure control (SDC) to the standards defined by the data custodian, e.g. top-coding and bottom-coding, combination of small categories and/or modifying small counts. Further adjustment of the curated margins is advised by coarsening all counts in the table to multiples of the disclosure limit. These adjusted margins are used to create SD by the Iterative Proportional Fitting (IPF) algorithm. The practical steps involved in creating such SD are illustrated using data from the 1901 Census of Scotland.
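
The coarsen-then-fit step described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes only two one-way margins (the paper's margins may be multi-way), a uniform seed table, and a disclosure limit of 5. The function names and example counts are hypothetical.

```python
def coarsen(counts, limit):
    """Round each non-negative integer count to the nearest multiple of `limit`."""
    return [limit * ((c + limit // 2) // limit) for c in counts]


def ipf_two_way(row_margin, col_margin, max_iter=100, tol=1e-9):
    """Fit a two-way table to the given margins by Iterative Proportional Fitting.

    Starts from a uniform seed table and alternately rescales rows and columns.
    Assumes both margins have the same total.
    """
    if abs(sum(row_margin) - sum(col_margin)) > tol:
        raise ValueError("row and column margins must have the same total")
    n_rows, n_cols = len(row_margin), len(col_margin)
    table = [[1.0] * n_cols for _ in range(n_rows)]
    for _ in range(max_iter):
        # Rescale each row so that it sums to its target margin.
        for i in range(n_rows):
            s = sum(table[i])
            if s > 0:
                table[i] = [x * row_margin[i] / s for x in table[i]]
        # Rescale each column so that it sums to its target margin.
        for j in range(n_cols):
            s = sum(table[i][j] for i in range(n_rows))
            if s > 0:
                for i in range(n_rows):
                    table[i][j] *= col_margin[j] / s
        # Columns now match exactly; stop once the rows match as well.
        err = max(abs(sum(table[i]) - row_margin[i]) for i in range(n_rows))
        if err < tol:
            break
    return table


# Hypothetical raw margins, coarsened to multiples of a disclosure limit of 5.
rows = coarsen([23, 47, 8], 5)   # [25, 45, 10]
cols = coarsen([43, 35], 5)      # [45, 35]
table = ipf_two_way(rows, cols)
```

Coarsening each margin separately can leave the margins with different totals, so a practical workflow would need a reconciliation step before fitting; the sketch simply rejects such input.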

General Seemingly Unrelated Local Projections

2026-06-01T10:27:12Z

We develop a flexible framework for Bayesian estimation of impulse responses using Local Projections (LPs) with instrumental variables. It accommodates multiple shocks and instruments, accounts for autocorrelation in multi-step forecasts by jointly modeling all LPs as a seemingly unrelated system of equations, defines a flexible yet parsimonious joint prior for impulse responses based on a Gaussian Process, and allows for joint inference about the entire vector of impulse responses. We show via Monte Carlo simulations that our approach delivers more accurate point and uncertainty estimates than standard methods. To address misspecification, we propose an optional robustification step based on power posteriors.

Simultaneous estimation of the effective reproduction number and the time series of daily infections: Application to Covid-19

2026-06-01T10:21:21Z

The time-varying effective reproduction number is an important parameter for communication and policy decisions during an epidemic. In this paper, we present new statistical methods for estimating the reproduction number based on the popular model of \citet{cori2013new}, which defines the effective reproduction number based on self-exciting dynamics of new infections. Such a model is conceptually simple and less susceptible to misspecification than more complicated multi-compartment models. However, statistical inference is challenging, and the previous literature has relied on proxy data and/or a two-step approach in which the number of infections is first estimated. In contrast, we present a coherent Bayesian method that approximates the joint posterior of daily new infections and reproduction numbers using a novel Markov chain Monte Carlo (MCMC) algorithm. Comparing our method to the state-of-the-art three-step estimation procedure of \citet{huisman2022estimation}, both using daily confirmed cases from Switzerland in the Covid-19 epidemic and simulated data, we find that our method is more accurate in terms of point estimates and uncertainty quantification, especially near the beginning and end of an observation period.
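
The self-exciting dynamics mentioned above are usually written as a renewal equation, in which expected new infections on day t equal R_t times a weighted sum of past infections. The sketch below illustrates only that relationship and a naive plug-in ratio. It is not the MCMC method of the abstract. The generation-interval weights `w` and all function names are hypothetical, and infections are treated as known and noise-free.

```python
def infection_pressure(infections, t, w):
    """Weighted sum of past infections: sum over s >= 1 of w[s-1] * I[t-s]."""
    return sum(w[s - 1] * infections[t - s] for s in range(1, min(t, len(w)) + 1))


def naive_r(infections, w):
    """Plug-in ratio R_t = I_t / infection pressure; None where the pressure is zero."""
    estimates = []
    for t in range(len(infections)):
        lam = infection_pressure(infections, t, w)
        estimates.append(infections[t] / lam if lam > 0 else None)
    return estimates


# Hypothetical generation-interval weights (summing to 1) and a constant R of 1.5.
w = [0.2, 0.5, 0.3]
infections = [10.0]
for t in range(1, 15):
    # Deterministic renewal step: expected infections, without observation noise.
    infections.append(1.5 * infection_pressure(infections, t, w))

estimates = naive_r(infections, w)
```

On real data the infections are not observed directly and are noisy, which is why this ratio is unstable near the start and end of a series and why a joint treatment of infections and reproduction numbers is attractive.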

PliableBVS: A flexible Bayesian variable selection method for modeling interactions with mandatory modifying variables

2026-06-01T10:07:53Z

High-dimensional interaction models are useful for studying, for example, how a large set of variables of interest, such as gene expression or other omics features, interact with a smaller set of modifying variables, such as clinical covariates. In this context, the pliable lasso has recently been proposed as an efficient method for screening large numbers of potential interaction terms under an asymmetric weak hierarchical constraint. In this work, we extend this framework by introducing PliableBVS, a Bayesian variable selection approach that preserves the hierarchical structure of the pliable lasso while inducing sparsity through spike-and-slab priors. The proposed model combines the continuous shrinkage effect of the Bayesian lasso with a hierarchical spike-and-slab prior formulation that has two layers of decision variables: one governing the inclusion of main effects and another controlling the inclusion of interaction effects, conditional on the inclusion of the corresponding main effects. This structure enables simultaneous selection of high-dimensional main and interaction effects within a coherent probabilistic framework. In simulation studies, the proposed method outperforms the original pliable lasso in identifying active main and interaction effects, reducing false discoveries, and improving prediction accuracy in most scenarios. Applications with data from a labor onset study and a preeclampsia study demonstrate that PliableBVS selects biologically meaningful features and interactions.

Spatial Capture-Recapture With Penalized Regression Splines to Flexibly Model Wildlife Density and Distribution

2026-06-01T09:00:41Z

Spatial capture-recapture models are routinely used to estimate the abundance and distribution of wild animal populations and involve a latent spatial point process of animal activity centres that describes the spatial distribution of individuals. While traditional spatial capture-recapture models use a Poisson process, the assumption of conditional independence between points is often violated in practice due to factors not included in the point process, such as social clustering, territoriality, or preferential selection of habitat due to unobserved covariates. Log-Gaussian Cox processes are commonly used in spatial statistics to overcome weaknesses of Poisson processes, but methods to fit them within spatial capture-recapture do not currently exist. Here, we present a spatial capture-recapture framework that allows for the use of penalized regression splines to describe the activity centre distribution, with model fitting via a Laplace-approximate penalized marginal maximum likelihood approach. Our method approximates a log-Gaussian Cox process for activity centres and allows flexible modelling of nonlinear effects of covariates on density. We illustrate the use of our method with a simulation study and two case studies. We demonstrate that, while population size estimates of traditional approaches are robust to density model misspecification, our approach substantially improves the estimation of spatial animal distributions.

A longitudinal Bayesian framework for estimating causal dose-response relationships

2026-06-01T08:23:36Z

Existing causal methods for time-varying exposure and time-varying confounding focus on estimating the average causal effect of a time-varying binary treatment on an end-of-study outcome, offering limited tools for characterizing marginal causal dose-response relationships under continuous exposures. We propose a scalable, nonparametric Bayesian framework for estimating marginal longitudinal causal dose-response functions with repeated outcome measurements. Our approach targets the average potential outcome at any fixed dose level and accommodates time-varying confounding through the generalized propensity score. The proposed approach embeds a Dirichlet process specification within a generalized estimating equations structure, capturing temporal correlation while making minimal assumptions about the functional form of the continuous exposure. We apply the proposed methods to monthly metro ridership and COVID-19 case data from major international cities, identifying causal relationships and the dose-response patterns between higher ridership and increased case counts.

Mapping the Storm: Geospatial Impacts of Severe Weather on LEO Network Performance

2026-06-01T05:41:55Z

LEO satellite constellations, led by deployments such as Starlink, are playing an increasingly pivotal role in enabling global broadband connectivity. However, the reliability and performance of these space-based networks are highly sensitive to environmental dynamics, particularly localized weather phenomena that exhibit strong spatio-temporal variability. In this study, we present a continental-scale geospatial analysis of weather-induced performance degradation in the Starlink LEO network, with a focus on the contiguous United States. Leveraging a unique dataset comprising more than 870,000 terminal hours of minute-level telemetry from 1,292 Starlink terminals, we integrate high-resolution localized weather observations to quantify the impact of various meteorological conditions. We evaluate key performance indicators (KPIs), including ping latency, ping drop rate, and signal quality, using spatial join techniques and time-aligned correlation with classified weather events. Our analysis reveals that severe weather events, such as thunderstorms with heavy rain or snow, have a pronounced effect on network performance. In particular, more than 55% of affected terminals experienced substantial degradation. Temporal continuity analysis at the minute level shows that such degradation can lead to sustained impairments or full service outages lasting from several minutes to multiple hours. This work contributes the first large-scale empirical study linking LEO satellite Internet performance with fine-grained weather data in both space and time. Our findings offer actionable insights for geospatial predictive modeling, weather-aware network provisioning, and resilient satellite communication system design. We also propose a framework for incorporating weather-inferred performance variability into future geospatial planning and service-level forecasting tools for LEO-based Internet systems.

Geometry-preserving and interpretable dimension reduction for compositional data

2026-06-01T01:59:02Z

High-dimensional compositional data pose unique statistical challenges due to the simplex constraint and excess zeros. While dimension reduction is indispensable for analyzing such data, conventional approaches often rely on log-ratio transformations that compromise interpretability and distort the data through ad hoc zero replacements. To address these issues, we introduce a geometry-preserving framework for dimension reduction of compositional data, mapping high-dimensional compositions directly to a lower-dimensional simplex. This framework is interpretable as a softened amalgamation of compositions and enables dual visualization -- showing both projected data and how variables contribute to reduced components -- for at-a-glance interpretation. Within this geometry, we define a new sufficient dimension reduction (SDR) approach for compositional predictors, whose identifiable object, termed the central compositional subspace, differs from the classical central subspace in Euclidean SDR. For estimation, we propose a kernel-based method that yields sparse solutions and comes with an intrinsic predictive model for direct downstream analyses. We prove consistency through a new subspace-comparison argument that allows the estimated and target subspaces to have different dimensions. Applications to real microbiome datasets demonstrate that our approach provides a powerful graphical exploration tool for uncovering meaningful biological patterns in high-dimensional compositional data.

Multiview Graph Fusion with Covariates

2026-06-01T01:25:56Z

Joint modeling of multiview graphs with a common set of nodes between views and auxiliary predictors is an essential, yet less explored, area in statistical methodology. Traditional approaches often treat graphs in different views as independent or fail to adequately incorporate predictors, potentially missing complex dependencies within and across graph views and leading to reduced inferential accuracy. Motivated by such methodological shortcomings, we introduce an integrative Bayesian approach for joint learning of a multiview graph with vector-valued predictors. Our modeling framework assumes a common set of nodes for each graph view while allowing for diverse interconnections or edge weights between nodes across graph views, accommodating both binary and continuous valued edge weights. By adopting a hierarchical Bayesian modeling approach, our framework seamlessly integrates information from diverse graphs through carefully designed prior distributions on model parameters. This approach enables the estimation of crucial model parameters defining the relationship between these graph views and predictors, as well as offers predictive inference of the graph views. Crucially, the approach provides uncertainty quantification in all such inferences. Theoretical analysis establishes that the posterior predictive density for our model asymptotically converges to the true data-generating density, under mild assumptions on the true data-generating density and the growth of the number of graph nodes relative to the sample size. Simulation studies validate the inferential advantages of our approach over predictor-dependent tensor learning and independent learning of different graph views with predictors. We further illustrate model utility by analyzing functional connectivity graphs in neuroscience under cognitive control tasks, relating task-related brain connectivity with phenotypic measures.

The Information Content of Quasar Variability Light Curves: How Well Can we Infer Stochastic Model Parameters?

2026-05-31T23:28:30Z

Quasar variability, driven by multi-scale physical processes within a relativistic accretion disk, is commonly modelled with stochastic time series models. The simplest of these is the Damped Random Walk (DRW), also known as the Ornstein-Uhlenbeck (OU) process. Here, we demonstrate that, when fitting such a model to quasar light curve data, the mean of the light curve, $μ$, should not be fixed (which is the typical approach), as this leads to overconfident inferences about the variability timescale $τ$, with substantially underestimated uncertainties. However, the short-term volatility parameter $η$ is typically very well constrained from short light curves. Through simulations, we compute information-theoretic quantities such as the conditional entropy and the mutual information, confirming that light curves provide much more information about $η$ than about $τ$. As a result, we recommend that future quasar variability studies focus on $η$ rather than $τ$. To demonstrate this approach, we fit a hierarchical Bayesian regression model for $η$ as a function of bolometric luminosity and rest wavelength to a dataset of 570 light curves measured over decades. We perform the fit using a likelihood function that uses the light curves directly, rather than using intermediate $η$ values from individual light curve fits. We find that volatility decreases as a function of both bolometric luminosity and rest wavelength. The volatility also decreases more steeply with redshift than time dilation alone would suggest, pointing to an increase in intrinsic volatility as quasars evolve over cosmic time.
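
A small simulation sketch helps show why $η$ can be better constrained than $τ$. It assumes one common parameterisation of the OU process, dX = -(X - μ)/τ dt + η dW, with stationary variance η²τ/2. The paper's exact definition of $η$ may differ, and the function names are hypothetical. The simulator uses the exact transition density, so irregular sampling times are allowed.

```python
import math
import random


def simulate_drw(times, mu, tau, eta, seed=0):
    """Simulate a damped random walk at the given increasing times.

    Assumed parameterisation: dX = -(X - mu)/tau dt + eta dW,
    so the stationary variance is eta**2 * tau / 2.
    """
    rng = random.Random(seed)
    var = eta ** 2 * tau / 2.0
    # Draw the first point from the stationary distribution.
    x = [mu + math.sqrt(var) * rng.gauss(0.0, 1.0)]
    for t_prev, t_next in zip(times, times[1:]):
        decay = math.exp(-(t_next - t_prev) / tau)
        # Exact conditional standard deviation over the gap.
        sd = math.sqrt(var * (1.0 - decay * decay))
        x.append(mu + decay * (x[-1] - mu) + sd * rng.gauss(0.0, 1.0))
    return x


def increment_variance(dt, tau, eta):
    """Stationary variance of X(t + dt) - X(t) under the same parameterisation."""
    return eta ** 2 * tau * (1.0 - math.exp(-dt / tau))


times = [0.0, 1.0, 2.5, 7.0, 30.0]
curve = simulate_drw(times, mu=19.0, tau=200.0, eta=0.1, seed=1)
```

For gaps much shorter than $τ$, `increment_variance(dt, tau, eta)` is close to `eta**2 * dt` whatever the value of `tau`. Under this parameterisation, short-lag differences therefore carry information about $η$ but very little about $τ$.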

Model complexity in econometrics - a combinatorial analysis

2026-05-31T23:14:18Z

Regression models and Vector Autoregressive Models (VARs) play crucial roles in econometrics by allowing the analysis of multiple variables simultaneously. Despite their utility, these models face challenges like underfitting and overfitting, especially when determining the optimal model specification, which can lead to significant computational costs. To address these challenges, econometricians often rely on widely adopted model selection criteria such as the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). These criteria help balance model complexity and goodness of fit, aiding in the selection of the most suitable model specification for the given data. Nonetheless, there is a notable gap in existing research concerning the correct specification of these models, particularly in determining the optimal number of states a system can assume. Addressing this gap, we introduce a combinatorial framework designed to calculate the potential number of states in such econometric models. Our approach involves delineating four distinct stages in model development, each offering a range of specifications. This method enables a comprehensive combinatorial calculation of all possible states. The aim of this paper is to highlight this overlooked aspect of model specification and to spark a constructive dialogue within the empirical research community. By doing so, we hope to inspire further research that enhances the precision and applicability of econometric models. A theoretical complexity criterion is necessary to elucidate fundamental limitations and propose new objectives to pursue.

Logistic regression is not enough: The need for Bayesian nonparametric modelling for causal inference using observational data, exemplified by the 'gateway' effect

2026-05-31T22:41:50Z

Introduction: Logistic regression (LR)-type model limitations for causal inference are explained theoretically and empirically through the lens of the purported gateway effect from e-cigarette use to smoking. Previous studies have reported that baseline e-cigarette use quadruples odds of follow-up smoking (binarized) in LR-type models of adolescent longitudinal cohorts (LCs), such that increased e-cigarette use would counteract smoking declines. However, US population-level trends show accelerated smoking declines to record lows when e-cigarette use increased, presenting an apparent paradox. Methods: Population Assessment of Tobacco and Health (USA) Youth Waves 3 to 4 were analyzed with Bayesian Additive Regression Trees (BART) to model baseline e-cigarette use (treatment) and change in number of days smoking from baseline to follow-up (numerical response) among never- and ever-smoking respondents (group effects), adjusting for confounding risk factors (socio-demographic, intra-individual, behavioural, peer influence, and family background). Unlike LR-type models, BART provides nonlinear, nonparametric modelling with counterfactuals and provides causal effect estimates with principled uncertainty estimation. Results: The average effect of e-cigarette use on smoking was both clinically and statistically significant among ever-smoking adolescents (-2 days smoking [diversionary effect; opposite to gateway]) and was not clinically significant among never-smoking adolescents (<1-day absolute change in days smoking [null effect]). Conclusions: When LC data are analyzed with causal inference techniques, the gateway effect disappears, consistent with population-level trends. This likely explains why gateway effects predicted in previous LR-type studies have not materialized in a population-level reversal/unexpected slowing of the US adolescent smoking decline, resolving the paradox.

Quantifying Evidential Rigor in Meta-Analytic Corpora: A Simulation-Characterized, Bias-Robust Bayesian Workflow with a Nutrition Case Study

2026-05-31T19:56:12Z

Conventional meta-analysis summarizes evidence through pooled estimates, intervals, and p-values, but these outputs do not directly measure evidence for an effect, evidence for no effect, or the degree to which conclusions depend on publication selection or small-study effects. We introduce a corpus-scale Bayesian evidential-audit workflow for meta-analytic corpora. The workflow reconstructs or accepts study-level effects and standard errors, harmonizes directions, fits a matched Bayesian random-effects baseline and a bias-aware model-averaged ensemble, and reports paired estimates with component and joint model-family evidence. The central estimand is rigor: a joint Bayes-factor summary combining resolved effect/no-effect evidence with absence of an explicit bias component in the fitted ensemble. Rigor is not a positive-finding score; no-effect evidence can score highly, whereas inconclusive or bias-dependent evidence scores poorly. We characterize the workflow using an ADEMP-framed simulation/resampling design with known-cell synthetic simulation, empirical registry resampling, and empirical fitted-profile-weighted synthetic sampling. A nutrition intervention corpus provides the worked case study, where bias-aware fitting often attenuates conventional estimates and many nominally meaningful effects lose clean evidential support. A public companion repository provides empirical inputs, generated artifacts, simulation source/design files, and documentation for reproducing and adapting the audit.

Domain-Shift-Aware Conformal Prediction for Large Language Models

2026-05-31T19:40:48Z

Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real-world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.
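
The idea of reweighting calibration samples by proximity to the test prompt can be sketched with a generic weighted conformal quantile. This illustrates the general technique only and is not the DS-CP procedure itself. The exponential kernel, the `bandwidth` parameter, the distance values, and all function names are assumptions made for the example.

```python
import math


def proximity_weights(distances, bandwidth):
    """Hypothetical kernel: calibration points nearer the test prompt weigh more."""
    return [math.exp(-d / bandwidth) for d in distances]


def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.

    The test point carries weight `test_weight` placed at +infinity, as in
    weighted conformal prediction; if the calibration mass cannot reach the
    required level, the threshold is infinite.
    """
    total = sum(weights) + test_weight
    needed = (1.0 - alpha) * total
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= needed:
            return score
    return math.inf


# Hypothetical calibration scores and embedding distances to the test prompt.
scores = [0.10, 0.35, 0.20, 0.80, 0.55]
distances = [0.2, 1.5, 0.4, 3.0, 0.9]
weights = proximity_weights(distances, bandwidth=1.0)
threshold = weighted_conformal_threshold(scores, weights, test_weight=1.0, alpha=0.25)
```

A prediction set would then keep every candidate answer whose nonconformity score is at most `threshold`. With equal weights this reduces to the usual split-conformal quantile.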