Assessing (im)balance in signed brain networks

2026-05-26T15:40:08Z

Many complex systems, be they financial, natural, or social, are composed of units (such as stocks, neurons, or agents) whose joint activity can be represented as a multivariate time series. An issue of both practical and theoretical importance is whether a static relationship between any two units can be inferred solely from their dynamic state. The present contribution tackles this issue within the framework of traditional hypothesis testing: briefly, we suggest linking any two units if they behave in a sufficiently similar way. To this end, we project a multivariate time series onto a signed graph by i) comparing the empirical properties of the former with those expected under a suitable benchmark and ii) linking any two units with a positive (negative) edge if the corresponding series share a significantly large number of concordant (discordant) values. To define our benchmarks, we adopt an information-theoretic approach rooted in the constrained maximisation of Shannon entropy, a procedure inducing an ensemble of multivariate time series that preserves some of the empirical properties on average while randomising everything else. We showcase the possible applications of our method by addressing one of the most timely issues in neuroscience, i.e. that of determining whether brain networks are frustrated and, if so, to what extent. Our results suggest that they are indeed frustrated, with the major contribution to the underlying negative subgraph coming from the subcortical structures (and, to a lesser extent, from the limbic regions). At the mesoscopic level, the minimisation of the Bayesian Information Criterion, instantiated with the Signed Stochastic Block Model, reveals that brain areas gather into modules aligning with the statistical variant of the Relaxed Balance Theory.
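
The projection step can be illustrated with a minimal sketch. It is not the authors' procedure: the maximum-entropy benchmark is replaced here by a much cruder null (each pair of binarised values is concordant with probability 1/2, independently over time), and the function names, the mean-based binarisation, and the threshold `alpha` are assumptions made for illustration. No multiple-testing correction is applied.

```python
import math

def binarize(series):
    """Map each value to +1 / -1 depending on whether it exceeds the series mean."""
    mean = sum(series) / len(series)
    return [1 if x > mean else -1 for x in series]

def binom_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def signed_edge(x, y, alpha=0.05):
    """Return +1, -1, or 0 (no edge) for the pair of series (x, y).

    Assumed null (NOT the paper's benchmark): concordance at each time step
    is a fair coin flip, independent across time steps.
    """
    bx, by = binarize(x), binarize(y)
    n = len(bx)
    concordant = sum(1 for a, b in zip(bx, by) if a == b)
    discordant = n - concordant
    if binom_upper_tail(concordant, n) < alpha:
        return 1    # significantly many concordant values
    if binom_upper_tail(discordant, n) < alpha:
        return -1   # significantly many discordant values
    return 0
```

Applying `signed_edge` to every pair of units would yield a signed adjacency matrix; a benchmark that preserves empirical properties of each series, as described in the abstract, would replace the fair-coin null.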

Posterior Quantification of Borrowing from Multiple Historical Control Data in Bayesian Dynamic Borrowing Methods: A Scoping Review

2026-05-26T15:36:28Z

Bayesian dynamic borrowing methods incorporate historical control data into current clinical trial analyses while allowing the degree of borrowing to depend on the compatibility between historical and current data. Although many methods have been proposed, the degree of borrowing is often difficult to interpret, especially when multiple historical control sources are available. This scoping review focuses on posterior quantification of borrowing from multiple historical controls. We discuss overall borrowing summaries based on effective historical sample size, together with method-specific source-level summaries of borrowing, information contribution, or compatibility arising from power priors, unit information priors, multisource exchangeability models, Dirichlet process mixture models, and potential bias models. We distinguish posterior borrowing measures from quantities describing prior information allocation or source-specific conflict. Two case studies, one with a binary endpoint and one with a continuous endpoint, illustrate that methods with broadly similar posterior treatment effect estimates may differ in both the overall amount and source-specific pattern of borrowing. These examples show that large overall borrowing may reflect selective borrowing from compatible historical sources rather than uniform borrowing from all sources. We recommend reporting treatment effect estimates together with overall and source-specific borrowing summaries, when available, to improve transparency in posterior inference.
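
As a rough illustration of how overall and source-level borrowing summaries relate, the sketch below uses a power prior with fixed, user-supplied weights for a binary endpoint and a conjugate Beta initial prior. This is a simplifying assumption: dynamic borrowing methods learn the weights from the data, and the function name and the weight values are hypothetical. Under this fixed-weight setup each source contributes `weight * n_h` effective historical subjects.

```python
def power_prior_posterior(y, n, historical, weights, a=1.0, b=1.0):
    """Beta posterior for a response rate under a fixed-weight power prior.

    y, n        : responders and sample size in the current control arm
    historical  : list of (y_h, n_h) pairs, one per historical source
    weights     : list of power-prior weights in [0, 1], one per source (assumed fixed)
    a, b        : parameters of the initial Beta(a, b) prior
    Returns (post_a, post_b, per-source effective historical sample sizes).
    """
    post_a = a + y
    post_b = b + (n - y)
    source_ehss = []
    for (y_h, n_h), w in zip(historical, weights):
        post_a += w * y_h             # discounted historical responders
        post_b += w * (n_h - y_h)     # discounted historical non-responders
        source_ehss.append(w * n_h)   # source-level borrowing summary
    return post_a, post_b, source_ehss
```

The sum of the per-source values gives an overall effective historical sample size, and the individual values show whether that total comes from one compatible source or from all sources uniformly.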

Bernstein-von Mises Theorem for Sparse Generalized Linear Model

2026-05-26T15:06:11Z

We study spike-and-slab priors for generalized linear models with possible grouped sparsity. The main result is an oracle Bernstein-von Mises theorem for the fractional posterior under supportwise likelihood assumptions. The proof develops sparse local asymptotic normality and Laplace approximation around support-specific pseudo-true centers, and combines them with fixed-prior mass, support penalization, recovery geometry, and beta-min separation to obtain contraction, support recovery, Gaussian mixture approximation, and collapse to the oracle Gaussian law. Model-entry verifications are given for Gaussian regression and for logistic, Poisson, probit, Gamma log-link, and negative-binomial log-link regression under stated sufficient conditions. The ordinary posterior is treated only through restricted Gaussian and canonical-link extensions, with coverage under additional active-dimension and moment conditions.

Copula and spatial-regularized variational autoencoder for mapping disease comorbidity in West Africa

2026-05-26T14:55:51Z

Geospatial health disproportionality remains a critical public health concern, as communities face heterogeneous illness risks due to varying exposures to adverse socioeconomic and environmental conditions. While statistical models have been adopted to identify risk factors, studies that account for the complex, non-linear dependencies and spatial regularities inherent in comorbid disease patterns are underdeveloped. In this work, we propose a novel spatially regularized variational autoencoder (VAE) to characterize and map the geospatial disproportion of childhood comorbidity in West Africa, focusing on diarrhea, fever, and acute respiratory infection (ARI). To model dependence between these conditions, this study integrates a bivariate Gumbel copula into the VAE framework, enabling flexible modeling of asymmetric dependence and quantification of joint and conditional morbidity risks. Additionally, covariate effects within the framework were quantified to facilitate epidemiological interpretation of risk factors. The proposed method was benchmarked against commonly used methods and applied to characterize comorbidity in West Africa using the Demographic and Health Survey data. Findings reveal pronounced spatial heterogeneity in the likelihood of comorbidity among West African children, with the strongest co-occurrence observed between fever and ARI. Household wealth, maternal education, and access to improved water sources were associated with the likelihood of comorbidity. These patterns highlight high-risk areas and underscore the need for targeted, location-specific public health interventions.
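
For readers unfamiliar with the Gumbel copula, the sketch below evaluates its standard bivariate distribution function and upper-tail dependence coefficient, and uses them to compute a joint exceedance probability. It covers the copula only, not the VAE or the spatial regularisation; the function names are illustrative.

```python
import math

def gumbel_copula_cdf(u, v, theta):
    """Bivariate Gumbel copula C(u, v) for u, v in (0, 1] and theta >= 1."""
    if theta < 1:
        raise ValueError("theta must be >= 1")
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))

def upper_tail_dependence(theta):
    """Upper-tail dependence coefficient of the Gumbel copula: 2 - 2**(1/theta)."""
    return 2.0 - 2.0 ** (1.0 / theta)

def joint_exceedance(u, v, theta):
    """P(U > u, V > v) when (U, V) has uniform margins and a Gumbel copula."""
    return 1.0 - u - v + gumbel_copula_cdf(u, v, theta)
```

With `theta = 1` the copula reduces to independence, and larger values of `theta` concentrate dependence in the upper tail, which is the asymmetric behaviour the abstract refers to.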

Estimation and Inference for Win Measures with Multiple Ordinal Endpoints Subject to Missingness

2026-05-26T14:34:45Z

Win measures, including the win ratio (WR), win odds (WO), net benefit (NB), and desirability of outcome ranking (DOOR), are increasingly used in randomized clinical trials with multiple hierarchical ordinal endpoints. In practice, however, one or more component endpoints may have missing data. The standard pairwise-comparison approach, which treats pairs with missing outcomes as ties, can produce biased estimates, even if the data are missing completely at random (MCAR). Although inverse probability of censoring weighting (IPCW) methods have been developed for censored survival endpoints, corresponding methods for addressing missing hierarchical ordinal endpoints are not yet available. To address this gap, we develop inverse probability weighting (IPW) and augmented IPW (AIPW) estimators for win measures with hierarchical ordinal endpoints subject to missing data, allowing missingness to depend on treatment assignment and baseline covariates. The IPW estimator corrects bias by reweighting completely observed outcomes using the joint non-missingness probabilities involved in estimating the joint cell probabilities that define the win measures. The AIPW estimator additionally incorporates outcome modeling, improving efficiency and achieving double robustness. For inference, we derive closed-form variance estimators for both methods based on influence functions. Simulation studies show that the standard approach can be substantially biased, whereas the proposed IPW and AIPW estimators remain consistent with near-nominal coverage. Furthermore, the AIPW estimator is generally more efficient than the IPW estimator. Applications to the SCOUT-CAP and ACTT-1 trials illustrate the practical utility of the proposed methods. An R package, WinMO, is provided for implementation.
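
The standard pairwise-comparison approach that the abstract criticises can be sketched as follows. This is not the WinMO package or the proposed IPW/AIPW estimators. It assumes that higher ordinal values are better, that endpoints are listed in priority order, and that a missing value (`None`) at the level being compared makes the whole pair a tie; the last point is one possible reading of "treats pairs with missing outcomes as ties".

```python
def compare(t, c):
    """Compare two hierarchical outcome tuples: +1 win for t, -1 loss, 0 tie."""
    for a, b in zip(t, c):
        if a is None or b is None:
            return 0          # assumed convention: missing at this level => tie
        if a > b:
            return 1
        if a < b:
            return -1
    return 0                  # tied on every endpoint

def win_measures(treatment, control):
    """Win ratio, win odds, and net benefit over all treatment-control pairs."""
    wins = losses = ties = 0
    for t in treatment:
        for c in control:
            r = compare(t, c)
            if r > 0:
                wins += 1
            elif r < 0:
                losses += 1
            else:
                ties += 1
    total = wins + losses + ties
    return {
        "win_ratio": wins / losses,                       # undefined if losses == 0
        "win_odds": (wins + 0.5 * ties) / (losses + 0.5 * ties),
        "net_benefit": (wins - losses) / total,
    }
```

Because pairs with missing values are pushed into the tie count, the win odds and net benefit are pulled towards their null values, and the win ratio is distorted whenever missingness affects wins and losses unequally.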

Causal Representation Learning for Generalisable Recommendation

2026-05-26T13:58:36Z

Predictive models trained on observational data often fail to generalise to the distributions they encounter when deployed, especially when the training data is a product of the system being optimised. Recommender systems are a canonical example: they are trained on interaction logs confounded by the deployed policy, past user behaviour, and platform filtering. As a result, the training distribution differs substantially from the candidate distribution scored at serving time, a gap that makes offline metrics unreliable predictors of online performance. We address the distribution shift problem with a method motivated by causal representation learning (CRL). We propose an information-theoretic disentanglement criterion and prove that its optimum depends only on the causal components of the input. We then derive a tractable variational lower bound that makes the criterion optimisable from finite observational data alone. The scope of our method is narrower than that of much of the CRL literature, in that we target better generalisation under distribution shift, not full identification of all latent causal factors. This narrower target is what makes the method practical, requiring only the existing confounded logs, applying to any standard supervised model, and adding no inference-time cost. Our headline evaluation is an A/B test with millions of users on Spotify, applied to a production ranker for personalised playlist generation. A capacity-matched CRL variant performed on par offline but delivered substantial online gains in listener engagement. Complementary evidence on the public KuaiRand recommendation dataset and a synthetic benchmark with known causal structure shows the same pattern: offline parity with baseline, gains under distribution shift. Across all three settings, adding our causal disentanglement objective yields meaningfully better out-of-distribution generalisation.

Conformalized Large-Scale Selective Inference with Informative and Trustworthy Prediction Sets

2026-05-26T13:29:40Z

In large-scale prediction problems, exhaustively following up on all test units is often impractical and inefficient, motivating a selective reporting strategy that fulfills the dual requirements of informativeness and trustworthiness. Within the InfoFCR (Informative prediction with False Coverage Rate control) framework, we propose SCIP (Selective Conformal Inference for Informative Predictions), a procedure built on three key components: (i) an informative set constructor that tailors prediction sets to individual test units according to user-specified informativeness constraints; (ii) a trust score that provides a principled quantification of the trustworthiness of candidate informative sets; and (iii) generalized conformal p-values that are used to perform FCR analysis for selecting the most promising candidates. We establish that SCIP guarantees finite-sample FCR control and is asymptotically anti-conservative, achieving higher statistical power than existing methods. The framework is highly versatile, accommodating a wide range of error metrics across both regression and classification tasks. Extensive numerical experiments on simulated and real data demonstrate the effectiveness of our approach.
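
A minimal sketch of the selection idea, under simplifying assumptions: it uses ordinary split-conformal p-values rather than the generalized ones of the paper, and a plain Benjamini-Hochberg step stands in for the FCR analysis. The nonconformity scores are taken as given, and all function names are illustrative.

```python
def conformal_p_value(calibration_scores, test_score):
    """Split-conformal p-value: (1 + #{calibration scores >= test score}) / (n + 1)."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least) / (n + 1)

def benjamini_hochberg(p_values, q=0.1):
    """Indices selected by the Benjamini-Hochberg procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank          # largest rank whose p-value clears its threshold
    return sorted(order[:k])
```

In a selective-reporting workflow, one p-value would be computed per test unit and only the selected units would have prediction sets reported.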

Towards Continuous-time Causal Foundation Models

2026-05-26T12:06:04Z

Extending discrete-time causal Prior-data Fitted Networks for time series to continuous time invites writing the mechanism as a stochastic differential equation (SDE) -- but if the SDE is integrated \emph{once per observation gap}, the trajectory law depends on when it is observed, and the prior remains a discrete-time Markov model in SDE clothing. We propose a precise continuity criterion -- trajectory-law invariance to the observation schedule -- together with a three-tier taxonomy (discrete; naive observation-grid integration; fine-grid integration with decoupled observation) and a construction realising the top tier on a random DAG with OU or small-MLP nonlinear drifts, irregular observation schedules, and hard / soft / time-varying interventions. A $2 \times 2$ encoder $\times$ integrator ablation, run independently on a linear and a nonlinear prior, finds fine-grid integration beats naive on 8/8 cells (sign-consistency $p < 1/256$) with the gap growing as the eval grid refines; the encoder axis is null with fine integration but time-aware-leading with naive. We release the prior and a preliminary zero-shot protocol on pharmacokinetic and physical-system data.
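
The dependence of the trajectory law on the observation schedule can be seen in a small deterministic calculation. The sketch below assumes a scalar OU process dX = -theta X dt + sigma dW integrated with Euler-Maruyama, for which the mean and variance obey a closed recursion, so no random sampling is needed. The parameter values and function names are illustrative and are not taken from the paper.

```python
def euler_moments(m, v, theta, sigma, gap, step):
    """Propagate the mean m and variance v of Euler-Maruyama for the OU SDE
    dX = -theta * X dt + sigma dW across one observation gap, using sub-steps
    of (approximately) size `step`."""
    n_steps = max(1, round(gap / step))
    h = gap / n_steps
    for _ in range(n_steps):
        m = m * (1.0 - theta * h)
        v = v * (1.0 - theta * h) ** 2 + sigma ** 2 * h
    return m, v

def law_at_end(schedule, theta=1.0, sigma=1.0, fine_step=None):
    """Mean and variance at the final time, starting from X_0 = 1.

    fine_step=None  : naive scheme, one Euler step per observation gap
    fine_step=value : fine-grid scheme, decoupled from the observation schedule
    """
    m, v = 1.0, 0.0
    for gap in schedule:
        step = gap if fine_step is None else fine_step
        m, v = euler_moments(m, v, theta, sigma, gap, step)
    return m, v
```

With the naive scheme, `law_at_end([1.0])` and `law_at_end([0.5, 0.5])` describe the same physical time but give different laws, whereas passing `fine_step=0.001` makes the two schedules agree and brings the mean close to exp(-1).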

Robust ensemble Kalman filtering under observation noise misspecification via diffusion score matching

2026-05-26T11:42:55Z

We address the problem of observation noise misspecification in Bayesian filtering of dynamical systems via recent advances in generalised Bayesian inference. A mismatch in tail decay between the true data-generating process and an assumed observation model, which often shows up as frequent outliers, can strongly impact Bayesian updates and the analysis in Kalman filtering. Existing approaches often employ detect-and-delete schemes or covariance inflation to avoid assimilating influential instances of misspecification. In challenging settings where the analysis updates are barely sufficient to counteract the induced forecast uncertainty, these strategies may destabilize the filter or struggle to provide reliable uncertainty quantification. We consider a novel Kalman filter that adjusts information processing in the analysis step by employing diffusion score matching for inference, obtaining robustness while maintaining well-quantified uncertainties. We provide theoretical properties of the diffusion score matching Kalman filter in linear Gaussian state space systems, covering conjugacy and closed-form parameter updates in the analysis step, robustness, covariance stability, and tuning, as well as high-dimensional consistency. We derive ensemble approximations via stochastic and deterministic coupling and implement localization to obtain EnKF, ESRF and LETKF varieties. We evaluate the methods in simulation studies on target tracking, the chaotic Lorenz 63 system, and the Lorenz 96 system in 40 dimensions. Our insights highlight a critical trade-off between robustness and stability in Bayesian filtering. Methods employing generalised Bayesian inference can navigate this balance and improve data assimilation in challenging environments combining non-linear dynamics and potentially non-Gaussian observation noise.
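
To make the sensitivity to outliers concrete, the sketch below implements the standard scalar Kalman analysis step, not the diffusion score matching update proposed in the paper. The function name and the numbers in the usage note are illustrative assumptions.

```python
def kalman_update(x_f, p_f, y, r, h=1.0):
    """Standard scalar Kalman analysis step.

    x_f, p_f : forecast mean and variance
    y, r     : observation and assumed observation-noise variance
    h        : scalar observation operator
    Returns the analysis mean and variance.
    """
    s = h * p_f * h + r              # innovation variance
    k = p_f * h / s                  # Kalman gain
    x_a = x_f + k * (y - h * x_f)    # mean moves linearly with the innovation
    p_a = (1.0 - k * h) * p_f        # variance does not depend on y at all
    return x_a, p_a
```

With `x_f = 0`, `p_f = 1`, and `r = 1`, an observation of 2 moves the mean to 1, while an outlier of 100 moves it to 50 and leaves the analysis variance unchanged at 0.5. This unbounded, linear influence of a single observation is what robust variants aim to limit.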

A warning system for risk prediction of metabolic syndrome in a healthy population of blood donors

2026-05-26T10:56:30Z

Metabolic syndrome is a complex clinical condition characterized by the simultaneous presence of multiple metabolic risk factors and represents a major public health concern. The syndrome develops silently and may remain undiagnosed for long periods, highlighting the importance of investigating early metabolic alterations before overt disease onset. Longitudinal monitoring of predominantly healthy individuals may help identify metabolic risk early. The paper proposes a Bayesian statistical model to estimate the probability of metabolic syndrome among blood donors during pre-donation screening, incorporating information collected at previous visits. Using longitudinal data from one of the main blood donor associations in Italy, AVIS Milan, we analyze repeated clinical and lifestyle measurements from a predominantly healthy population of donors. In particular, we fit a Bayesian multivariate model that jointly represents the logarithm of the five diagnostic components of metabolic syndrome. The model accounts for within-donor dependence across repeated visits and provides probabilistic estimates of individual risk. Our framework aims to provide clinicians at AVIS Milan with an interpretable traffic-light warning system (low, intermediate, high risk) during pre-donation screening to facilitate the identification of individuals at risk of metabolic syndrome at future visits and to support targeted preventive interventions during routine donor assessment, ultimately contributing to a long-term reduction in healthcare costs for the Italian national healthcare system.

Tweedie's Formula and Score-Driven Updating

2026-05-26T09:55:03Z

Score-driven models update time-varying parameters using conditional likelihood scores. This paper develops a Bayesian interpretation of such updates through Tweedie's formula, which connects posterior mean corrections with marginal scores. In Gaussian signal extraction, this gives an exact posterior-correction identity. For natural exponential families, related identities characterize posterior means in natural- and expectation-parameter spaces. Building on these identities, we show that conjugate Bayesian filtering in expectation space coincides exactly with an inverse-Fisher-scaled conditional score update under local precision discounting. For general conditional densities, the exact Bayesian correction involves a generally unavailable predictive-marginal score. A local Gaussian approximation shows that the conditional likelihood score provides the leading approximation to this posterior correction; under local precision discounting, the predictive covariance becomes proportional to inverse Fisher information, yielding the familiar inverse-Fisher-scaled score recursion. The results clarify when score-driven updates are exact Bayesian filters and when they should instead be viewed as tractable local approximations.
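
The exact Gaussian identity can be checked numerically. The sketch below assumes the textbook setup y | theta ~ N(theta, sigma2) with prior theta ~ N(mu, tau2), so the marginal of y is N(mu, tau2 + sigma2); the function names are illustrative. Tweedie's formula, E[theta | y] = y + sigma2 * d/dy log m(y), is compared with the usual conjugate posterior mean.

```python
def tweedie_posterior_mean(y, mu, tau2, sigma2):
    """Posterior mean via Tweedie's formula: y + sigma2 * (marginal score at y)."""
    marginal_score = -(y - mu) / (tau2 + sigma2)   # d/dy log N(y; mu, tau2 + sigma2)
    return y + sigma2 * marginal_score

def conjugate_posterior_mean(y, mu, tau2, sigma2):
    """Posterior mean from the standard normal-normal conjugate update."""
    return (tau2 * y + sigma2 * mu) / (tau2 + sigma2)
```

Both functions return the same value for any inputs with positive variances, which is the sense in which the score-based correction is an exact Bayesian update in this Gaussian case.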

Marginal likelihoods for finite-support Huber contamination

2026-05-26T09:03:09Z

For Huber contamination on a known finite sample space, the unrestricted contaminating law is a probability vector on the support atoms, and domination over all measurable subsets reduces to atomwise inequalities. Placing a Dirichlet prior on this probability vector and a Beta prior on the contamination proportion gives an exact marginal likelihood for the structural parameter after analytic integration of both nuisance quantities. The likelihood is a finite weighted sum over allocations of the observed counts between the structural and contaminating components. For fixed support size, this sum and its score can be evaluated by a dynamic program with quadratic cost in the sample size, enabling gradient-based posterior sampling.

Statistical Inference and Stability Boundaries of Multi-cellular Interaction Hypergraphs from Asynchronous Event Streams

2026-05-26T06:45:04Z

We introduce the Hyperedge-triggered Hawkes (HTH) process for inferring higher-order interaction structure in multi-cellular systems from asynchronous event-time data. Beyond standard pairwise excitation, the HTH intensity includes a term activated by the simultaneous co-firing of a cell group within a temporal window. We derive a closed-form Expectation-Maximisation algorithm whose key ingredient is a piecewise compensator that eliminates the systematic bias present in the naive integral formulation. A CP tensor decomposition reduces the hyperedge parameter count from O(N^K) to O(NR). Across eleven synthetic experiments the framework achieves pairwise recovery error below 5%, while revealing a systematic -22% bias on hyperedge weights that is non-monotonic in the kernel decay rate, ruling out a simple temporal-overlap explanation and motivating adaptive kernel methods. On multi-electrode recordings of mouse retinal ganglion cells, the model yields a +20.6 nat likelihood gain over the pairwise baseline, providing suggestive but not decisive evidence for higher-order interactions. Code and all experiments are publicly available at https://github.com/Hanii0210/hypergraph-hawkes.

Using Transcripts for Nonparametric Monitoring of Serial Dependence

2026-05-26T05:44:37Z

Control charts for process monitoring are widely used in practice. Most control charts require the monitored (residual) process to be serially independent (and to satisfy specified distributional assumptions), whereas undetected dependence (or violations of distributional assumptions) may severely affect the charts' performance. Therefore, (distribution-free) control charts for monitoring serial dependence are of utmost practical relevance. Recently, various nonparametric control charts based on ordinal patterns have been proposed for this purpose, and they have shown appealing performance in detecting different types of serial dependence. In this research, we progress further in this direction and develop novel nonparametric control charts based on transcripts and algebraic distances (as derived from ordinal patterns). The performance of the newly proposed control charts is evaluated in a simulation study, and their practical application is illustrated with a real-world data example from the chemical industry.
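
The building block of these charts, the ordinal pattern, can be sketched as follows. The code covers only the extraction of patterns and their relative frequencies; the transcripts, the algebraic distances, and the chart design are not reproduced. Ties are broken by position, which is an assumption, and the function names are illustrative.

```python
from collections import Counter

def ordinal_pattern(window):
    """Ranks of the entries of `window` (0 = smallest); ties broken by position."""
    order = sorted(range(len(window)), key=lambda i: (window[i], i))
    ranks = [0] * len(window)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return tuple(ranks)

def pattern_frequencies(series, m=3):
    """Relative frequencies of ordinal patterns over sliding windows of length m."""
    counts = Counter(
        ordinal_pattern(series[t:t + m]) for t in range(len(series) - m + 1)
    )
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}
```

For a serially independent process with a continuous distribution, all m! patterns are equally likely, so a marked departure of the observed frequencies from 1/m! indicates serial dependence without any distributional assumption.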

Target-Oriented Statistical Compression: Sufficiency, Reverse Martingales, and Sequential Monitoring

2026-05-26T05:37:10Z

Statistical procedures rarely retain all features of the observed data. A sufficient statistic removes information irrelevant to a parameter; a maximum likelihood estimate compresses an empirical objective into an optimizing point; and a hidden state in a sequential model compresses past observations into a learned representation. This article develops these practices under the unified notion of \emph{target-oriented statistical compression}: a useful summary preserves what matters for an inferential, predictive, or decision-relevant target, rather than every detail of the realized data path. The central object is the conditional target process $M_n=\mathbb{E}(Z\mid\mathcal{G}_n)$, where $Z$ is the target and $\mathcal{G}_n=\sigma(T_n)$ is the information retained by the compression map $T_n$. When $(\mathcal{G}_n)$ is a decreasing filtration, $(M_n)$ is a reverse martingale with limit $M_\infty=\mathbb{E}(Z\mid\mathcal{G}_\infty)$. Exact sufficiency corresponds to lossless compression, while approximate summaries such as penalized estimators, principal components, and neural-network hidden states produce reverse quasi-martingale defects measuring coherence loss across compression levels. The diagnostic $r_n=|M_n-M_{n-1}|$ is treated as an observable stability proxy, not as an unbiased estimator of the theoretical defect. Boundary degeneracy in sequential binary problems is developed as a central application. Practical boundary claims require joint assessment of boundary closeness, uncertainty control, and trajectory stability. The companion paper \citet{chang2025rm} develops the corresponding stopping procedures, finite-sample bounds, and numerical evidence; the present paper provides the broader theoretical infrastructure and extends the framework to Gaussian, Poisson, and quasi-martingale monitoring problems.
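
A textbook instance of the reverse-martingale structure, offered as an illustration and not taken from the paper: for i.i.d. integrable observations, the running sample mean equals the conditional expectation of the first observation given the running sum and all later observations, and it forms a reverse martingale. The sketch below computes this path together with the stability diagnostic r_n = |M_n - M_{n-1}|; the function names are illustrative.

```python
def running_means(data):
    """M_n = (x_1 + ... + x_n) / n for n = 1, ..., len(data)."""
    means, total = [], 0.0
    for n, x in enumerate(data, start=1):
        total += x
        means.append(total / n)
    return means

def stability_proxy(means):
    """r_n = |M_n - M_{n-1}| for n = 2, ..., len(means)."""
    return [abs(curr - prev) for prev, curr in zip(means, means[1:])]
```

For binary data such as `[1, 0, 1, 1]` the increments shrink as n grows, which is the kind of trajectory stability that the diagnostic is meant to track alongside boundary closeness and uncertainty control.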