http://arxiv.org/abs/2606.09638v1 Data-driven discovery of governing differential equations across physical systems 2026-06-08T15:35:06Z

Differential equations play a critical role in scientific discovery because they provide a mathematical framework to describe the behaviour of physical phenomena. As a promising alternative to traditional first principles, data-driven differential equation discovery has attracted increasing attention for its ability to infer governing laws directly from experimental or simulated data, especially when the underlying physics is unclear. However, the field has expanded rapidly along diverse methodological directions, particularly with the emergence of AI-based approaches, and still lacks a clear organizing perspective. In this Review, we propose a problem-oriented perspective on data-driven differential equation discovery. We first introduce a two-dimensional phase diagram of equation discoverability, where discovery problems are organized according to structural complexity and coefficient complexity. This phase diagram shows how the field has moved from the discovery of sparse equations with simple coefficients toward more complex governing laws with richer structures and more flexible parameterizations. It also clarifies why different methodological families succeed or fail in different problem settings. We then present the representation-evaluation-optimization (REO) framework as a fundamental abstraction of the discovery process. By identifying the core problems of equation discovery that persist across algorithmic variations, REO shifts the discussion from individual algorithms to the fundamental principles that determine discoverability. We connect these perspectives to applications across physics and adjacent sciences, and argue that the next challenge is not merely recovering equations, but using them to revise existing theories, distil mechanisms and form new scientific concepts.

2026-06-08T15:35:06Z Siyu Lou Hao Xu Wenguan Wang Lu Lu Hao Sun Yang Liu Linfeng Zhang Dongxiao Zhang Yuntian Chen http://arxiv.org/abs/2504.01148v5 Methodological insights in Bayesian Age-Period-Cohort analysis: an application to the case of Puerto Rico's fertility decline 2026-06-08T14:04:12Z

Age-Period-Cohort (APC) models are of special importance in Demography and Epidemiology for analyzing panel data according to three different factors: biological (age), technological (period) and cultural (cohort). The main goal of APC modeling is to separate the contributions of period and cohort effects to the phenomenon under study. The objective of this paper is to develop a Bayesian Age-Period-Cohort framework that can model a wide range of demographic and epidemiological phenomena and improve upon existing statistical methodologies. The APC framework addresses three main challenges: (1) the identification problem shared by all APC models, usually managed by imposing constraints on effect groups, (2) incorporating expert knowledge in the model definition, and (3) resolving computational issues efficiently. By allowing full parameter uncertainty, the use of robust priors, and an efficient computational implementation, a Bayesian methodology manages these concerns. Bayesian models also produce results that allow intuitive implementation and support theoretical knowledge. Our original methodology consists of (i) a Scaled Beta2 prior distribution for the scale parameters, (ii) imposing different period and cohort constraints and comparing them, (iii) a user-friendly implementation that can be easily adapted to the event, and (iv) various model comparison criteria that lead to a reasonable interpretation of APC effects. We examine the dramatic collapse of fertility in Puerto Rico, an application that is difficult to model due to the accelerated changes and has interesting demographic implications that challenge the predominance of period effects in lowest-low fertility countries, emphasizing the cohort (cultural) momentum. The scope of the methodology introduced here is wide, including applications to obesity or smoking studies, for example.

2025-04-01T19:30:01Z Jomarie Jiménez González Angélica M. Rosario Santos Luis R. Pericchi Guerra Hernando Mattei http://arxiv.org/abs/2507.18683v2 Bayesian Deep Gaussian Processes for Correlated Functional Data: A Case Study in Cosmological Matter Power Spectra 2026-06-08T13:08:43Z

Understanding the structure of our universe and the distribution of matter is an area of active research. As cosmological surveys grow in complexity, the development of emulators to efficiently and effectively predict matter power spectra is essential. We are particularly motivated by the Mira-Titan Universe simulation suite that, for a specified cosmological parameterization (termed a "cosmology"), provides multiple response curves of various fidelities, including correlated functional realizations. Our objective is two-fold. First, we estimate the underlying matter power spectra, with appropriate uncertainty quantification (UQ), from all of the provided curves. To this end, we propose a novel Bayesian deep Gaussian process (DGP) hierarchical model which synthesizes all the simulation information to estimate the underlying matter power spectra while providing effective UQ. Our model extends previous work on Bayesian DGPs from scalar responses to correlated functional outputs. Second, we leverage our predicted power spectra from various cosmologies in order to accurately predict the entire matter power spectra for an unobserved cosmology. For this task, we use basis function representations of the functional spectra to train a separate Gaussian process emulator. Our method performs well in synthetic exercises and against the benchmark cosmological emulator (Cosmic Emu).

2025-07-24T16:48:38Z 22 pages, 14 figures. Revised and accepted version for publication in Data Science in Science Stephen A. Walsh Annie S. Booth David Higdon Jared Clark Kelly R. Moran Katrin Heitmann http://arxiv.org/abs/2602.05483v2 Toward Operationalizing Rasmussen: Drift Observability on the Simplex for Evolving Systems 2026-06-08T12:39:21Z

Software operations increasingly rely on SLOs, traces, deployment specifications, and change events, yet dashboards and thresholding practices often expose share-like operational signals as separate scalar panels or baseline distances. This can create false alarms under benign redistribution and miss movement toward policy boundaries. Rasmussen's dynamic safety model motivates drift under competing pressures, but operationalizing it for software is difficult because relevant state variables (remaining margin, engineering effort, and risk/impact) are often compositional and their parts evolve. We formulate an automated, artifact-derived drift-monitor design that maps changing software artifacts into a stable compositional monitoring state: it extracts a current part inventory and policy constraints, maps telemetry to a positive composition, stabilizes splits, merges, and renames through lineage-aware canonical groups, and analyzes boundary-directed drift in log-ratio coordinates. The proposed monitor would report drift direction, step-to-boundary, balance-level attribution, and model-health indicators under architectural churn. We specify the approach, identify its zero/noise/lineage assumptions, and report a reproducible synthetic sanity check of boundary-aware drift and controlled part churn.

2026-02-05T09:41:49Z Anatoly A. Krasnovsky http://arxiv.org/abs/2504.05912v3 Financial resilience of agricultural and food production companies in Spain: A compositional cluster analysis of the impact of the Ukraine-Russia war (2021-2023) 2026-06-08T11:48:47Z

This study analyses the financial resilience of agricultural and food production companies in Spain amid the Ukraine-Russia war using cluster analysis based on financial ratios. The financial ratios are transformed with centred log-ratios for compositional data analysis. The dataset comprises financial information from 1197 firms in Spain's agricultural and food sectors over the period 2021-2023. The analysis reveals distinct clusters of firms with varying financial performance, characterized by metrics of solvency and profitability. The results highlight an increase in resilient firms by 2023, underscoring sectoral adaptation to the conflict's economic challenges. Together, these findings provide insights for stakeholders and policymakers to improve sectoral stability and strategic planning.
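For readers unfamiliar with the transform named in the abstract above, the following is a minimal sketch of a centred log-ratio (CLR) computation. The account values are hypothetical and are not taken from the study; the study's actual choice of accounting parts is not stated in this listing.

```python
import math

def clr(parts):
    """Centred log-ratio: log of each part minus the mean of the logs."""
    if any(p <= 0 for p in parts):
        raise ValueError("CLR requires strictly positive parts")
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical positive accounting parts for one firm (illustrative only)
firm = [120.0, 80.0, 150.0, 50.0]
coords = clr(firm)

# CLR coordinates sum to zero and do not change if every part is rescaled
scaled_coords = clr([2 * x for x in firm])
```

Because the coordinates depend only on ratios between parts, rescaling all parts by the same factor leaves them unchanged, which is why such coordinates are a common input to distance-based methods such as clustering.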

2025-04-08T11:08:51Z European Accounting and Management Review, 11, 1 (2025), 55-80 Mike Hernandez-Romero Germà Coenders 10.26595/eamr.v11i1.03 http://arxiv.org/abs/2603.24215v3 Adapting Altman's bankruptcy prediction model to the compositional data methodology 2026-06-08T11:43:46Z

Using standard financial ratios as variables in statistical analyses has been related to several serious problems, such as extreme outliers, asymmetry, non-normality, and non-linearity. The compositional-data methodology has been successfully applied to solve these problems and has always yielded substantially different results when compared to standard financial ratios. An under-researched area is the use of financial log-ratios computed with the compositional-data methodology to predict bankruptcy or the related terms of business default, insolvency or failure. Another under-researched area is the use of machine learning methods in combination with compositional log-ratios. The present article adapts the classical Altman bankruptcy prediction model and some of its extensions to the compositional methodology with pairwise log-ratios and three common statistical and machine learning tools (logistic regression models, k-nearest neighbours, and random forests) and compares the results with standard financial ratios. Data from the sector in the Spanish economy with the largest number of bankrupt firms according to the first two digits of the NACE code (46XX "wholesale trade, except of motor vehicles and motorcycles") were obtained from the Iberian Balance sheet Analysis System. The sample (31,131 firms, of which 97 were bankrupt) was divided into a training and a validation dataset. The training dataset was downsampled to one healthy firm for each bankrupt firm. No outliers were removed. Focusing on predictive performance, the results show that compositional methods are better than standard ratios in terms of sensitivity (recall), with mixed results regarding specificity; compositional random forests and compositional logistic regression perform best.

2026-03-25T11:44:20Z 22 pages, 2 figures Fatemeh Keivani Universitat de Girona Germà Coenders Universitat de Girona Geòrgia Escaramís Universitat de Girona CEEISCAT. Department of Health. Government of Catalonia http://arxiv.org/abs/2606.09313v1 Machine-Learning Emulation of Satellite Greenhouse Gas Retrievals: Stability over Time 2026-06-08T10:19:11Z

Retrieval algorithms are used to estimate atmospheric concentrations of greenhouse gases (GHGs), such as carbon dioxide (CO2) and methane (CH4), by solving inverse problems from high-spectral-resolution satellite radiance measurements. However, these algorithms are computationally expensive, which makes real-time estimation at scale difficult. Machine-learning models have therefore been proposed as fast emulators of retrieval algorithms. Most existing studies, however, evaluate them only on test data from the same period as the training data. We study the stability over time of such emulators using data from the Greenhouse Gases Observing SATellite (GOSAT). We show that prediction accuracy generally deteriorates when the test period moves away from the training period. We also show that including time as an input feature substantially improves XCH4 prediction for Lasso and neural-network models. Among the methods considered, a simple Lasso model performs as well as or better than more complex methods such as neural networks, and yields more stable predictions over time. We further validate the results using the Total Carbon Column Observing Network (TCCON), a ground-based observation network. On the TCCON-matched dataset, the time-augmented Lasso achieves errors against TCCON that are comparable to the disagreement between GOSAT and TCCON for both XCO2 and XCH4.

2026-06-08T10:19:11Z 48 pages, 9 figures, 15 tables Nugzar Gognadze Motonobu Kanagawa Yu Someya Hisashi Yashiro http://arxiv.org/abs/2606.09307v1 Robust high-dimensional Bayesian regression with non-Gaussian errors under global--local shrinkage priors 2026-06-08T10:15:23Z

Multivariate regression with many correlated responses and predictors commonly violates Gaussian error assumptions due to heavy tails, outliers, and asymmetry. Gaussian procedures then lose efficiency in coefficient estimation and produce biased estimates of conditional dependence graphs. We develop a robust Bayesian framework using a scale-location mixture error distribution and horseshoe+ global-local priors on both the regression coefficients and off-diagonals of the error precision matrix, coupling sparsity in the regression map with sparsity in the residual dependence structure. Theoretical contributions include joint posterior contraction, selection consistency for both supports, a Kullback-Leibler risk bound showing the dominance of horseshoe+ over horseshoe, and bounded sensitivity, ensuring that a single large outlier has vanishing influence under t errors. Simulations across four error regimes, contamination, and varying dimensions show that our estimator matches Gaussian procedures under normality and dominates them under heavy tails and skewness. Applications to FRED-MD macroeconomic data and S&P 500 daily returns recover interpretable sparse coefficient maps and residual dependence graphs while automatically down-weighting crisis-period observations.

2026-06-08T10:15:23Z 21 pages, 9 figures, 6 tables Mohammad Arashi http://arxiv.org/abs/2606.09283v1 Towards personalised intervention: A causal-dynamical framework to determine psychological treatment trajectories 2026-06-08T09:51:31Z

For approximately half of the individuals receiving mental health care, the results are suboptimal, even when treatments align with evidence-based guidelines. These limited effects may partly stem from how clinical decisions on treatment focus are made in mental health care. Typically, treatment strategy is guided by the diagnostic classification combined with the individualized case conceptualization. While standard, this approach may fall short for several reasons such as biases on the part of both the patient and therapist, and treatment guidelines being based on average effects that may not (exactly) suit the individual patient. To address these challenges, we propose a novel framework that reduces biases in clinical decision-making and makes it genuinely possible to tailor treatment focus to the individual patient. This framework involves (a) constructing causal graphs and estimating causal effects from intensively collected, longitudinal patient data, (b) simulating new time series based upon the causal relationships, and (c) using these simulations to identify the most effective treatment focus for the individual patient. By simulating and comparing different intervention strategies and examining both the estimated individual's responsiveness and its long-term effectiveness, this approach may generate useful insights to guide treatment focus and strategy, which can lead to a significant improvement of treatment outcomes in mental health care.

2026-06-08T09:51:31Z Lourens Waldorp Titus Mürtz Anita Jansen Jonas Haslbeck http://arxiv.org/abs/2410.23786v3 Conformal inference for cell type annotation with graph-structured constraints 2026-06-08T08:19:12Z

Conformal prediction is a framework for constructing prediction sets for machine learning models, relying solely on the exchangeability of training and test data and without requiring a parametric distribution to be specified. Despite its wide applicability and popularity, its application in single-cell transcriptomics remains underexplored. This paper addresses this gap by developing an approach that leverages the rich information about cell-type relations, encoded in the graph structure of cell ontologies, to enhance the interpretability of reference-based cell-type annotation. Leveraging conformal risk control, we develop a novel conformal algorithm for graph-structured predictions and we demonstrate how incorporating graph constraints can improve the interpretation of cell-type predictions. This approach aims to generate more coherent conformal sets that align with the inherent relationships among classes, facilitating clearer and more intuitive interpretations of model predictions. Additionally, we provide a technique to address non-exchangeability, particularly when the cell-type distribution changes between training and test datasets. We implemented our method in the open-source R package scConform, available at https://bioconductor.posit.co/packages/release/bioc/html/scConform.html.
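As background for the abstract above, here is a minimal sketch of plain split conformal prediction for classification. It is not the graph-structured algorithm of the paper and does not use the scConform package; the calibration scores, class probabilities, and function names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label is kept
        return float("inf")
    return sorted(scores)[k - 1]

def prediction_set(probs, qhat):
    """Keep labels whose nonconformity score 1 - p does not exceed qhat."""
    return [label for label, p in probs.items() if 1.0 - p <= qhat]

# Hypothetical calibration scores: 1 - predicted probability of the true label
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70]
qhat = conformal_quantile(cal_scores, alpha=0.25)

# Hypothetical predicted probabilities for one test cell
test_probs = {"T cell": 0.62, "B cell": 0.30, "NK cell": 0.08}
labels = prediction_set(test_probs, qhat)
```

Under exchangeability of calibration and test points, sets built this way contain the true label with probability at least 1 - alpha. The sets are unstructured lists of labels, which is the limitation the graph constraints described in the abstract aim to address.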

2024-10-31T10:00:40Z Daniela Corbetta Livio Finos Ludwig Geistlinger Davide Risso http://arxiv.org/abs/2512.10250v2 Time-Averaged Drift Approximations are Inconsistent for Inference in Drift Diffusion Models 2026-06-08T03:22:53Z

Drift diffusion models (DDMs) have found widespread use in computational neuroscience, cognitive science, mathematical psychology as well as other fields. They model evidence accumulation in simple decision tasks as a stochastic process drifting towards decision barriers. In models where the drift is both time-varying within a trial and variable across trials, the high computational cost for accurate likelihood evaluation has often led to the use of a computationally convenient surrogate for parameter inference, the time-averaged drift approximation (TADA). In each trial, TADA assumes that the time-varying drift rate can be replaced by its temporal average throughout the trial. This approach enables fast parameter inference using analytical likelihood formulas for DDMs with constant drift. In this work, we show that such an estimator is inconsistent: it does not converge to the true drift, posing a risk of biasing scientific conclusions when parameter estimates are obtained by TADA and similar approximations. We provide an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a Brownian motion with piecewise constant drift hitting a one-sided upper boundary. Furthermore, numerical examples based on an attentional DDM (aDDM) show that using TADA leads to systematic misestimation of attentional effects in decision making and can lead to false conclusions in scientific hypothesis testing.
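The substitution that TADA makes can be sketched for the simple setting the abstract mentions. This is an illustration under assumptions not given in the listing: unit diffusion variance, a start at zero, a single switch time, and hypothetical function and parameter names.

```python
import math

def averaged_drift(mu1, mu2, switch, t):
    """Temporal average over [0, t] of a drift equal to mu1 before
    `switch` and mu2 afterwards."""
    early = min(t, switch)
    late = max(t - switch, 0.0)
    return (mu1 * early + mu2 * late) / t

def fpt_density(t, mu, a):
    """First-passage density at time t of unit-variance Brownian motion
    with constant drift mu, started at 0, to an upper boundary a > 0."""
    return a / math.sqrt(2 * math.pi * t**3) * math.exp(-(a - mu * t) ** 2 / (2 * t))

def tada_loglik(t, mu1, mu2, switch, a):
    """Surrogate log-likelihood: the constant-drift formula evaluated
    at the trial's time-averaged drift."""
    return math.log(fpt_density(t, averaged_drift(mu1, mu2, switch, t), a))

# Hypothetical trial: drift 1.0 for the first second, 3.0 afterwards,
# boundary at 1.0, observed hitting time 2.0
value = tada_loglik(2.0, 1.0, 3.0, 1.0, 1.0)
```

The averaged drift depends on the observed hitting time itself, so the surrogate is not the likelihood of the actual piecewise-drift process. The abstract's inconsistency result concerns estimators built on this kind of replacement.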

2025-12-11T03:18:55Z 37 pages. Includes updates for the first revision Sicheng Liu Alexander Fengler Michael J. Frank Matthew T. Harrison http://arxiv.org/abs/2606.08934v1 Backward Coherence and Hidden-State Stability in Recurrent Neural Networks: A Quasi-Reverse-Martingale Theory 2026-06-08T02:20:29Z

Recurrent neural networks maintain a hidden state $h_t$, but its probabilistic meaning is often unclear. We study hidden-state stability through \emph{backward coherence}: the extent to which $h_t$ can be reconstructed from $h_{t+1}$ by a learned backward projector $g_\phi$. Under contraction and summable backward drift, the hidden-state sequence forms a quasi-reverse-martingale. This yields almost-sure convergence, rates under mixing, an interpretable limiting representation, finite pathwise stopping times, and a theoretical framework for time-uniform confidence sequences. Simulations support the theory. Backward-coherence regularisation reduces the empirical quasi-martingale total $\hat Q$ by $43$--$58\%$, reaches stability $28$--$44\%$ earlier than an unregularised RNN, and gives tracking-error recovery consistent with geometric bounds. Additional tests confirm echo-state forgetting rates bounded by $\rho$ and verify the increment-sum tube $R_t$ with $100\%$ simultaneous coverage, although $R_t$ is conservative; in practice, the defect-tail proxy $\hat Q_t$ is the more useful monitor. The backward-coherence loss is also equivalent to minimising a Kullback--Leibler divergence in a Gaussian backward model, linking the method to variational inference. Extensions cover $\phi$-mixing inputs, change-point tracking, and finite-sample concentration. Three real-data studies further validate the approach. On PhysioNet 2012 ICU data, the Reverse Martingale RNN (RMRNN) matches RNN mortality-prediction AUC while reaching stable representations 13 hours earlier. On FRED-MD, it reduces one-month-ahead forecast error by about fourfold under concept drift. On UCI Human Activity Recognition, it maintains lower post-transition tracking error with geometric decay. The guarantees apply under the stated assumptions; universality is not claimed.

2026-06-08T02:20:29Z Yuan-chin Ivan Chang http://arxiv.org/abs/2408.02122v2 Graph-Enabled Efficient Federated Bayesian Modeling 2026-06-08T02:19:29Z

Federated Bayesian modeling requires combining evidence from distributed users into a coherent global posterior while keeping users' raw data on-device. We propose Federated Latent Graph MCMC (FLaG-MCMC), a computationally efficient framework for federated learning in which historical posterior samples of a shared global parameter are encoded into a learned low-dimensional latent space, connected via a $k$-nearest-neighbor graph, and transferred sequentially to new users as a nonparametric prior. Each user runs graph-based MCMC in the latent space guided by their own likelihood, returns updated global samples to the server, and retains local latent variables on-device. We demonstrate FLaG-MCMC on Bayesian meta-analysis for opioid use disorder prevalence estimation and on federated topic modeling, where the federated posterior closely approximates the pooled full-data posterior for both global parameters and local user-level inference.

2024-08-04T19:37:09Z 20 pages, 7 figures Chenyang Zhong Shouxuan Ji Tian Zheng http://arxiv.org/abs/2606.08923v1 Scalable Network-Aware Experiment Design for Two-Sided Marketplaces 2026-06-08T02:01:30Z

Measuring causal effects in networked two-sided marketplaces is challenging due to treatment interference between market participants on different sides. When treatment is applied to one side (e.g., job seekers), their interactions with the other side (e.g., job posters) introduce spillover effects that violate the Stable Unit Treatment Value Assumption (SUTVA) and bias causal estimates. While cluster-based randomization mitigates this problem, prior approaches struggle with a fundamental trade-off: reducing spillover requires isolated clusters that will reduce the number of qualifying clusters, which decreases statistical power. This paper introduces EgoCluster V3, an iterative clustering algorithm that reduces spillover by 3x compared to prior versions while preserving node coverage and doubling test power. We further introduce MultiEgoCluster, which extends V3 through a two-stage procedure that first groups highly connected egos into multi-ego clusters before applying the iterative clustering algorithm. This achieves an additional ~56% spillover reduction and ~38% increase in sample size. Both methods are deployed in production at LinkedIn and have systematically enabled high-impact two-sided marketplace experiments. Since residual bias cannot be fully eliminated through clustering alone, we derive a theoretical bias correction method for average treatment effect (ATE) estimation based on graph structure and propose an approach to generalize results to the general population.

2026-06-08T02:01:30Z Yi Su Zhen Yan 10.1145/3770855.3818478 http://arxiv.org/abs/1910.07712v3 Estimating Spatially-Smoothed Fiber Orientation Distribution from Diffusion-MRI Experiments 2026-06-08T00:06:11Z

Diffusion-weighted magnetic resonance imaging (D-MRI) is a noninvasive in vivo technique for probing the microstructural architecture of biological tissues. At each voxel, the fiber orientation distribution (FOD) characterizes local fiber configurations and orientations and is therefore a central object of estimation in D-MRI analysis. We propose the Nearest-Neighbor Adaptive Regression Model (NARM), a spatially adaptive framework for FOD estimation that performs weighted local likelihood estimation over nested spatial neighborhoods, where the weights jointly encode spatial proximity and similarity among neighboring FODs, measured by either the optimal transport or Hellinger distance. To prevent over-smoothing while preserving structural heterogeneity, we introduce a voxel-wise rescaling scheme and a data-driven stopping rule based on minimum nearest-neighbor dissimilarity. We further develop a configuration-aware strategy for selecting the similarity-smoothing parameter, allowing the smoothing strength to adapt to local fiber complexity. Simulation studies demonstrate that NARM improves FOD estimation accuracy relative to voxel-wise methods and the existing spatial smoothing approach PMARM. Application to test-retest data from the Human Connectome Project additionally shows that NARM yields more reproducible FOD estimates. Implementation details and scripts for the simulation and real data analyses are available at https://github.com/jie108/NARM

2019-10-17T05:13:49Z Jilei Yan Seungyong Hwan Mengjie Shi Jie Peng