https://arxiv.org/api/DEAktAycGsbamIwoPp4Tc2BNWic 2026-06-13T13:41:24Z 23522 45 15 http://arxiv.org/abs/2606.10224v1 Spatial Prediction of Local Soil Erosion Distribution in the Wasserstein Space 2026-06-08T22:27:24Z

Obtaining precise erosion measurements requires costly fieldwork, making it infeasible to directly survey large domains such as a province or river basin. To extend fieldwork results across such extensive domains, we propose a novel spatial prediction method that treats local erosion distributions as objects in the Wasserstein space. These distributions are mapped into square-integrable trajectories and represented via basis expansion, forming a multivariate random field that captures spatial dependence. By applying local regression and Kriging in this representation, our approach flexibly models and predicts erosion distributions at arbitrary locations. This framework improves prediction for functionals of the distribution, such as the mean and exceedance probabilities. Simulation studies demonstrate that the proposed method outperforms a misspecified parametric alternative and existing Fréchet regression approaches. We illustrate the approach with a detailed erosion analysis in Shaanxi province, China, where local measurements from surveyed watersheds are extended to predict erosion distributions across the entire province using covariates such as land use and elevation.
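A minimal sketch of why the Wasserstein space is convenient for distributions on the real line: the 2-Wasserstein distance equals the L2 distance between quantile functions, so for two equal-size samples it reduces to comparing sorted values. The function names and toy numbers are illustrative assumptions, and the abstract does not say which transform the authors use to obtain their square-integrable trajectories.

```python
import math

def w2_distance(xs, ys):
    """2-Wasserstein distance between two empirical distributions of equal sample size."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    qx, qy = sorted(xs), sorted(ys)  # order statistics play the role of empirical quantiles
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(qx, qy)) / len(qx))

def w2_barycenter(samples):
    """Equal-weight Wasserstein barycenter of equal-size samples: average the order statistics."""
    ordered = [sorted(s) for s in samples]
    n = len(ordered[0])
    return [sum(s[i] for s in ordered) / len(ordered) for i in range(n)]

# Hypothetical erosion measurements at two surveyed sites.
site_a = [0.0, 1.0, 2.0]
site_b = [1.0, 2.0, 3.0]
print(w2_distance(site_a, site_b))
print(w2_barycenter([site_a, site_b]))
```

Because averaging sorted values yields another valid set of quantiles, linear operations on this representation stay inside the space of distributions, which is one reason regression and Kriging can be applied after such a mapping.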

2026-06-08T22:27:24Z To appear in the Annals of Applied Statistics Jiaming Qiu Xiongtao Dai Zhengyuan Zhu Shuiqing Yin http://arxiv.org/abs/2410.12936v4 Development of COVID-19 Booster Vaccine Policy by Microsimulation and Q-learning 2026-06-08T20:09:53Z

The COVID-19 pandemic highlighted the urgent need for effective vaccine policies, but traditional clinical trials often lack sufficient data to capture the diverse population characteristics necessary for comprehensive public health strategies. Ethical concerns around randomized trials during a pandemic further complicate policy development for public health. Reinforcement Learning (RL) offers a promising alternative for vaccine policy development. However, direct online RL exploration in real-world scenarios can result in suboptimal and potentially harmful decisions. This study proposes a novel framework combining tabular Q-learning with microsimulation, where a Recurrent Neural Network (RNN) serves as a digital twin environment simulator of the target population. This digital twin captures temporal associations between infection and patient characteristics to generate realistic individual disease trajectories, enabling safe and efficient policy learning without real-world interaction. Our tabular Q-learning model produces an interpretable policy table that balances the risks of severe infection against vaccination side effects. Applied to COVID-19 booster policies, the learned Q-learning-based policy outperforms current practices, offering a path toward more effective vaccination strategies. A project webpage introducing our work, including links to the software, a brief introductory video, and a step-by-step tutorial video, is available at https://public.websites.umich.edu/~jiankang/software/dtpl_website_umich/index.html.
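The tabular Q-learning step can be sketched as follows. This is the textbook update rule, not the authors' implementation; the state labels, actions, rewards, and learning rate are hypothetical, and the simulator that would supply transitions (the RNN digital twin in the abstract) is replaced by two hand-written transitions.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update; returns the new value of Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

def greedy_policy(q, states, actions):
    """Read an interpretable policy table off the Q-table (ties go to the first action)."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}

ACTIONS = ("wait", "boost")   # hypothetical action set
q = defaultdict(float)        # Q-table initialised to zero

# Two hypothetical simulated transitions.
q_update(q, "high_risk", "boost", reward=1.0, next_state="low_risk", actions=ACTIONS, alpha=0.5)
q_update(q, "high_risk", "wait", reward=-1.0, next_state="high_risk", actions=ACTIONS, alpha=0.5)

policy = greedy_policy(q, ["high_risk", "low_risk"], ACTIONS)
print(policy)
```

The resulting dictionary is the kind of policy table the abstract describes as interpretable: one recommended action per discrete state.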

2024-10-16T18:22:41Z Guoxuan Ma Sicong Xie Lili Zhao Jian Kang 10.1080/01621459.2026.2682540 http://arxiv.org/abs/2606.10093v1 Predicting Hospitalization from a Whole-Person Health Score with Incomplete Electronic Health Records Data: A Case Study 2026-06-08T19:17:46Z

Embedding a standardized whole-person health measure in electronic health records (EHR) could be instrumental to preventative care. The allostatic load index (ALI), calculated from ten component stressors across three body systems, offers a promising snapshot of holistic health. The ALI can be calculated from EHR data, but many components are missing, since not all patients undergo all tests. Using statistical modeling and machine learning, EHR data for $1000$ patients from a large academic health system were used to predict in-patient hospitalization (as a count or binary) from ALI, controlling for age and sex. Various methods were evaluated to fill in information gaps for patients' missing ALI components, including summary measures combining components or using them separately. Performance was measured using receiver operating characteristic (ROC) curves and corresponding areas under the ROC curve (AUC). Count modeling of hospitalization did not improve upon binary, and logistic regression beat random forest. Overall, summary measures performed similarly, with the complete-case proportion (i.e., the proportion of non-missing components that were "unhealthy") performing best (AUC $= 0.64$) but by $\leq 0.01$. When using components separately, the pattern submodel approach most accurately predicted hospitalization (AUC $= 0.73$) in sample, but did not cross-validate as well (AUC $= 0.63$). All summary measures performed similarly. However, when including the ALI components separately, tailoring models to subsets of patients with the same missing data pattern performed best. Next steps include EHR implementation to enable prediction and support clinician decision-making at scale.
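The complete-case proportion and the grouping step behind a pattern submodel can be sketched as follows. The encoding (1 for an unhealthy component, 0 for healthy, `None` for missing), the four-component toy records, and all function names are assumptions for illustration; the ALI in the abstract has ten components.

```python
from collections import defaultdict

def complete_case_proportion(components):
    """Proportion of non-missing components that are unhealthy (None marks a missing test)."""
    observed = [c for c in components if c is not None]
    if not observed:
        return None  # undefined when every component is missing
    return sum(observed) / len(observed)

def missingness_pattern(components):
    """Tuple of booleans marking which components are missing."""
    return tuple(c is None for c in components)

def group_by_pattern(patients):
    """Group patient IDs by missingness pattern; a pattern submodel fits one model per group."""
    groups = defaultdict(list)
    for patient_id, components in patients.items():
        groups[missingness_pattern(components)].append(patient_id)
    return dict(groups)

# Hypothetical records with four components each.
patients = {
    "p1": [1, 0, None, 1],
    "p2": [0, 0, None, 1],
    "p3": [None, None, None, None],
}
print(complete_case_proportion(patients["p1"]))
print(group_by_pattern(patients))
```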

2026-06-08T19:17:46Z 13 pages, 5 figures, 2 tables, R code and simulated dataset available on GitHub Grayson E. Weavil Joseph Rigdon Sarah C. Lotspeich http://arxiv.org/abs/2606.09638v1 Data-driven discovery of governing differential equations across physical systems 2026-06-08T15:35:06Z

Differential equations play a critical role in scientific discovery because they provide a mathematical framework to describe the behaviour of physical phenomena. As a promising alternative to traditional first principles, data-driven differential equation discovery has attracted increasing attention for its ability to infer governing laws directly from experimental or simulated data, especially when the underlying physics is unclear. However, the field has expanded rapidly along diverse methodological directions, particularly with the emergence of AI-based approaches, and still lacks a clear organizing perspective. In this Review, we propose a problem-oriented perspective on data-driven differential equation discovery. We first introduce a two-dimensional phase diagram of equation discoverability, where discovery problems are organized according to structural complexity and coefficient complexity. This phase diagram shows how the field has moved from the discovery of sparse equations with simple coefficients toward more complex governing laws with richer structures and more flexible parameterizations. It also clarifies why different methodological families succeed or fail in different problem settings. We then present the representation-evaluation-optimization (REO) framework as a fundamental abstraction of the discovery process. By identifying the core problems of equation discovery that persist across algorithmic variations, REO shifts the discussion from individual algorithms to the fundamental principles that determine discoverability. We connect these perspectives to applications across physics and adjacent sciences, and argue that the next challenge is not merely recovering equations, but using them to revise existing theories, distil mechanisms and form new scientific concepts.

2026-06-08T15:35:06Z Siyu Lou Hao Xu Wenguan Wang Lu Lu Hao Sun Yang Liu Linfeng Zhang Dongxiao Zhang Yuntian Chen http://arxiv.org/abs/2504.01148v5 Methodological insights in Bayesian Age-Period-Cohort analysis: an application to the case of Puerto Rico's fertility decline 2026-06-08T14:04:12Z

Age-Period-Cohort (APC) models are of special importance in Demography and Epidemiology for analyzing panel data according to three different factors: biological (age), technological (period) and cultural (cohort). The main goal of APC modeling is to separate the contributions of period and cohort effects to the phenomenon. The objective of this paper is to develop a Bayesian Age-Period-Cohort framework that can model a wide range of demographic and epidemiological phenomena and improve upon existing statistical methodologies. The framework addresses three main challenges: (1) the identification problem of all APC models, usually managed by imposing constraints on effect groups, (2) the incorporation of expert knowledge in the model definition, and (3) the efficient solution of computational issues. By allowing full parameter uncertainty, use of robust priors, and an efficient computational implementation, a Bayesian methodology manages these concerns. Bayesian models also produce results that allow intuitive implementation and support theoretical knowledge. Our original methodology consists of (i) a Scaled Beta2 prior distribution for the scale parameters, (ii) imposing different period and cohort constraints and comparing them, (iii) a user-friendly implementation that can be easily adapted to the event, and (iv) various model comparison criteria that lead to reasonable interpretation of APC effects. We examine the dramatic collapse of fertility in Puerto Rico, an application that is difficult to model due to the accelerated changes and has interesting demographic implications that challenge the predominance of period effects in lowest-low fertility countries, emphasizing the cohort (cultural) momentum. The scope of the methodology introduced here is wide, including applications to obesity or smoking studies, for example.
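The identification problem mentioned above can be illustrated with a small numerical check. Because cohort = period - age, a linear trend can be moved between the three effect groups without changing any fitted value, which is why constraints are needed. The effect values and the shift `delta` below are arbitrary illustrative numbers, not quantities from the paper.

```python
def linear_predictor(age, period, mu, age_eff, period_eff, cohort_eff):
    """Additive APC predictor; the cohort index is determined by age and period."""
    cohort = period - age
    return mu + age_eff[age] + period_eff[period] + cohort_eff[cohort]

ages = range(3)
periods = range(4)
cohorts = range(-2, 4)  # every value of period - age on this grid

mu = 10
age_eff = {a: 2 * a for a in ages}
period_eff = {p: -p for p in periods}
cohort_eff = {c: 3 * c for c in cohorts}

# Shift a linear trend between the effect groups: +delta*age, -delta*period, +delta*cohort.
delta = 5
age_alt = {a: age_eff[a] + delta * a for a in ages}
period_alt = {p: period_eff[p] - delta * p for p in periods}
cohort_alt = {c: cohort_eff[c] + delta * c for c in cohorts}

same = all(
    linear_predictor(a, p, mu, age_eff, period_eff, cohort_eff)
    == linear_predictor(a, p, mu, age_alt, period_alt, cohort_alt)
    for a in ages
    for p in periods
)
print(same)
```

The two parameter sets differ in every effect group yet give identical predictions on every age-period cell, so the data alone cannot distinguish them.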

2025-04-01T19:30:01Z Jomarie Jiménez González Angélica M. Rosario Santos Luis R. Pericchi Guerra Hernando Mattei http://arxiv.org/abs/2507.18683v2 Bayesian Deep Gaussian Processes for Correlated Functional Data: A Case Study in Cosmological Matter Power Spectra 2026-06-08T13:08:43Z

Understanding the structure of our universe and the distribution of matter is an area of active research. As cosmological surveys grow in complexity, the development of emulators to efficiently and effectively predict matter power spectra is essential. We are particularly motivated by the Mira-Titan Universe simulation suite that, for a specified cosmological parameterization (termed a "cosmology"), provides multiple response curves of various fidelities, including correlated functional realizations. Our objective is two-fold. First, we estimate the underlying matter power spectra, with appropriate uncertainty quantification (UQ), from all of the provided curves. To this end, we propose a novel Bayesian deep Gaussian process (DGP) hierarchical model which synthesizes all the simulation information to estimate the underlying matter power spectra while providing effective UQ. Our model extends previous work on Bayesian DGPs from scalar responses to correlated functional outputs. Second, we leverage our predicted power spectra from various cosmologies in order to accurately predict the entire matter power spectra for an unobserved cosmology. For this task, we use basis function representations of the functional spectra to train a separate Gaussian process emulator. Our method performs well in synthetic exercises and against the benchmark cosmological emulator (Cosmic Emu).

2025-07-24T16:48:38Z 22 pages, 14 figures. Revised and accepted version for publication in Data Science in Science Stephen A. Walsh Annie S. Booth David Higdon Jared Clark Kelly R. Moran Katrin Heitmann http://arxiv.org/abs/2602.05483v2 Toward Operationalizing Rasmussen: Drift Observability on the Simplex for Evolving Systems 2026-06-08T12:39:21Z

Software operations increasingly rely on SLOs, traces, deployment specifications, and change events, yet dashboards and thresholding practices often expose share-like operational signals as separate scalar panels or baseline distances. This can create false alarms under benign redistribution and miss movement toward policy boundaries. Rasmussen's dynamic safety model motivates drift under competing pressures, but operationalizing it for software is difficult because relevant state variables (remaining margin, engineering effort, and risk/impact) are often compositional and their parts evolve. We formulate an automated, artifact-derived drift-monitor design that maps changing software artifacts into a stable compositional monitoring state: it extracts a current part inventory and policy constraints, maps telemetry to a positive composition, stabilizes splits, merges, and renames through lineage-aware canonical groups, and analyzes boundary-directed drift in log-ratio coordinates. The proposed monitor would report drift direction, step-to-boundary, balance-level attribution, and model-health indicators under architectural churn. We specify the approach, identify its zero/noise/lineage assumptions, and report a reproducible synthetic sanity check of boundary-aware drift and controlled part churn.

2026-02-05T09:41:49Z Anatoly A. Krasnovsky http://arxiv.org/abs/2504.05912v3 Financial resilience of agricultural and food production companies in Spain: A compositional cluster analysis of the impact of the Ukraine-Russia war (2021-2023) 2026-06-08T11:48:47Z

This study analyses the financial resilience of agricultural and food production companies in Spain amid the Ukraine-Russia war using cluster analysis based on financial ratios. This research utilizes centred log-ratios to transform financial ratios for compositional data analysis. The dataset comprises financial information from 1197 firms in Spain's agricultural and food sectors over the period 2021-2023. The analysis reveals distinct clusters of firms with varying financial performance, characterized by metrics of solvency and profitability. The results highlight an increase in resilient firms by 2023, underscoring sectoral adaptation to the conflict's economic challenges. These findings together provide insights for stakeholders and policymakers to improve sectoral stability and strategic planning.
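For reference, the centred log-ratio (clr) transform can be sketched as follows: each part is log-transformed and the mean of the logs is subtracted. The three-part input is a hypothetical set of positive accounting figures, not data from the study.

```python
import math

def clr(parts):
    """Centred log-ratio transform: log of each part minus the mean of the logs."""
    if any(p <= 0 for p in parts):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical positive accounting figures for one firm.
firm = [2.0, 3.0, 5.0]
print(clr(firm))
print(clr([10 * p for p in firm]))  # rescaling all parts leaves the clr values unchanged up to rounding
```

The transformed values sum to zero and depend only on the ratios between parts, so firms of different size become comparable before clustering.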

2025-04-08T11:08:51Z European Accounting and Management Review, 11, 1 (2025), 55-80 Mike Hernandez-Romero Germà Coenders 10.26595/eamr.v11i1.03 http://arxiv.org/abs/2603.24215v3 Adapting Altman's bankruptcy prediction model to the compositional data methodology 2026-06-08T11:43:46Z

Using standard financial ratios as variables in statistical analyses has been related to several serious problems, such as extreme outliers, asymmetry, non-normality, and non-linearity. The compositional-data methodology has been successfully applied to solve these problems and has always yielded substantially different results when compared to standard financial ratios. An under-researched area is the use of financial log-ratios computed with the compositional-data methodology to predict bankruptcy or the related terms of business default, insolvency or failure. Another under-researched area is the use of machine learning methods in combination with compositional log-ratios. The present article adapts the classical Altman bankruptcy prediction model and some of its extensions to the compositional methodology with pairwise log-ratios and three common statistical and machine learning tools: logistic regression models, k-nearest neighbours, and random forests, and compares the results with standard financial ratios. Data from the sector in the Spanish economy with the largest number of bankrupt firms according to the first two digits of the NACE code (46XX "wholesale trade, except of motor vehicles and motorcycles") were obtained from the Iberian Balance sheet Analysis System. The sample (31,131 firms, of which 97 were bankrupt) was divided into a training and a validation dataset. The training dataset was downsampled to one healthy firm for each bankrupt firm. No outliers were removed. Focusing on predictive performance, the results show that compositional methods are better than standard ratios in terms of sensitivity (recall), with mixed results regarding specificity; compositional random forests and compositional logistic regression perform best.

2026-03-25T11:44:20Z 22 pages, 2 figures Fatemeh Keivani Universitat de Girona Germà Coenders Universitat de Girona Geòrgia Escaramís Universitat de Girona CEEISCAT. Department of Health. Government of Catalonia http://arxiv.org/abs/2606.09313v1 Machine-Learning Emulation of Satellite Greenhouse Gas Retrievals: Stability over Time 2026-06-08T10:19:11Z

Retrieval algorithms are used to estimate atmospheric concentrations of greenhouse gases (GHGs), such as carbon dioxide (CO2) and methane (CH4), by solving inverse problems from high-spectral-resolution satellite radiance measurements. However, these algorithms are computationally expensive, which makes real-time estimation at scale difficult. Machine-learning models have therefore been proposed as fast emulators of retrieval algorithms. Most existing studies, however, evaluate them only on test data from the same period as the training data. We study the stability over time of such emulators using data from the Greenhouse Gases Observing SATellite (GOSAT). We show that prediction accuracy generally deteriorates when the test period moves away from the training period. We also show that including time as an input feature substantially improves XCH4 prediction for Lasso and neural-network models. Among the methods considered, a simple Lasso model performs as well as or better than more complex methods such as neural networks, and yields more stable predictions over time. We further validate the results using the Total Carbon Column Observing Network (TCCON), a ground-based observation network. On the TCCON-matched dataset, the time-augmented Lasso achieves errors against TCCON that are comparable to the disagreement between GOSAT and TCCON for both XCO2 and XCH4.

2026-06-08T10:19:11Z 48 pages, 9 figures, 15 tables Nugzar Gognadze Motonobu Kanagawa Yu Someya Hisashi Yashiro http://arxiv.org/abs/2606.09307v1 Robust high-dimensional Bayesian regression with non-Gaussian errors under global--local shrinkage priors 2026-06-08T10:15:23Z

Multivariate regression with many correlated responses and predictors commonly violates Gaussian error assumptions due to heavy tails, outliers, and asymmetry. Gaussian procedures then lose efficiency in coefficient estimation and produce biased estimates of conditional dependence graphs. We develop a robust Bayesian framework using a scale-location mixture error distribution and horseshoe+ global-local priors on both the regression coefficients and off-diagonals of the error precision matrix, coupling sparsity in the regression map with sparsity in the residual dependence structure. Theoretical contributions include joint posterior contraction, selection consistency for both supports, a Kullback-Leibler risk bound showing the dominance of horseshoe+ over horseshoe, and bounded sensitivity, ensuring that a single large outlier has vanishing influence under t errors. Simulations across four error regimes, contamination, and varying dimensions show that our estimator matches Gaussian procedures under normality and dominates them under heavy tails and skewness. Applications to FRED-MD macroeconomic data and S&P 500 daily returns recover interpretable sparse coefficient maps and residual dependence graphs while automatically down-weighting crisis-period observations.

2026-06-08T10:15:23Z 21 pages, 9 figures, 6 tables Mohammad Arashi http://arxiv.org/abs/2606.09283v1 Towards personalised intervention: A causal-dynamical framework to determine psychological treatment trajectories 2026-06-08T09:51:31Z

For approximately half of the individuals receiving mental health care, the results are suboptimal, even when treatments align with evidence-based guidelines. These limited effects may partly stem from how clinical decisions on treatment focus are made in mental health care. Typically, treatment strategy is guided by the diagnostic classification combined with the individualized case conceptualization. While standard, this approach may fall short for several reasons, such as biases on the part of both the patient and therapist, and treatment guidelines being based on average effects that may not (exactly) suit the individual patient. To address these challenges, we propose a novel framework that reduces biases in clinical decision-making and makes it genuinely possible to tailor treatment focus to the individual patient. This framework involves (a) constructing causal graphs and estimating causal effects from intensively collected, longitudinal patient data, (b) simulating new time series based upon the causal relationships, and (c) using these simulations to identify the most effective treatment focus for the individual patient. By simulating and comparing different intervention strategies and examining both the individual's estimated responsiveness and the long-term effectiveness of each strategy, this approach may generate useful insights to guide treatment focus and strategy, which can lead to a significant improvement of treatment outcomes in mental health care.

2026-06-08T09:51:31Z Lourens Waldorp Titus Mürtz Anita Jansen Jonas Haslbeck http://arxiv.org/abs/2410.23786v3 Conformal inference for cell type annotation with graph-structured constraints 2026-06-08T08:19:12Z

Conformal prediction is a framework for constructing prediction sets for machine learning models, relying solely on the exchangeability of training and test data and without requiring the specification of a parametric distribution. Despite its wide applicability and popularity, its application in single-cell transcriptomics remains underexplored. This paper addresses this gap by developing an approach that leverages the rich information about cell-type relations, encoded in the graph structure of cell ontologies, to enhance the interpretability of reference-based cell-type annotation. Leveraging conformal risk control, we develop a novel conformal algorithm for graph-structured predictions and we demonstrate how incorporating graph constraints can improve the interpretation of cell-type predictions. This approach aims to generate more coherent conformal sets that align with the inherent relationships among classes, facilitating clearer and more intuitive interpretations of model predictions. Additionally, we provide a technique to address non-exchangeability, particularly when the cell-type distribution changes between training and test datasets. We implemented our method in the open-source R package scConform, available at https://bioconductor.posit.co/packages/release/bioc/html/scConform.html.
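As background, plain split conformal prediction for classification can be sketched as follows. This is the standard baseline, not the graph-structured algorithm or the scConform API; the calibration scores, class labels, and probabilities are made-up values, and the nonconformity score (one minus the predicted probability of a class) is one common choice assumed here.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return math.inf  # too few calibration points: every class is included
    return sorted(cal_scores)[k - 1]

def prediction_set(class_probs, threshold):
    """All classes whose nonconformity score (1 - probability) is within the threshold."""
    return {label for label, p in class_probs.items() if 1.0 - p <= threshold}

# Hypothetical calibration scores: 1 - predicted probability of the true class.
cal_scores = [0.1, 0.3, 0.2]
qhat = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical predicted probabilities for one new cell.
new_cell = {"T cell": 0.75, "B cell": 0.2, "NK cell": 0.05}
print(qhat, prediction_set(new_cell, qhat))
```

Under exchangeability, sets built this way contain the true class with probability at least 1 - alpha; a graph-aware method would additionally constrain which sets of labels are allowed.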

2024-10-31T10:00:40Z Daniela Corbetta Livio Finos Ludwig Geistlinger Davide Risso http://arxiv.org/abs/2512.10250v2 Time-Averaged Drift Approximations are Inconsistent for Inference in Drift Diffusion Models 2026-06-08T03:22:53Z

Drift diffusion models (DDMs) have found widespread use in computational neuroscience, cognitive science, mathematical psychology as well as other fields. They model evidence accumulation in simple decision tasks as a stochastic process drifting towards decision barriers. In models where the drift is both time-varying within a trial and variable across trials, the high computational cost for accurate likelihood evaluation has often led to the use of a computationally convenient surrogate for parameter inference, the time-averaged drift approximation (TADA). In each trial, TADA assumes that the time-varying drift rate can be replaced by its temporal average throughout the trial. This approach enables fast parameter inference using analytical likelihood formulas for DDMs with constant drift. In this work, we show that such an estimator is inconsistent: it does not converge to the true drift, posing a risk of biasing scientific conclusions when parameter estimates are obtained by TADA and similar approximations. We provide an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a Brownian motion with piecewise constant drift hitting a one-sided upper boundary. Furthermore, numerical examples based on an attentional DDM (aDDM) show that using TADA leads to systematic misestimation of attentional effects in decision making and can lead to false conclusions in scientific hypothesis testing.
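The approximation can be sketched for the simplest setting named above. The sketch assumes unit diffusion variance, a drift that is piecewise constant in time, and averaging over the interval from stimulus onset to the observed hitting time, which is one reading of "throughout the trial"; function names and numbers are illustrative. The constant-drift first-passage density used is the standard inverse-Gaussian form for a one-sided upper boundary.

```python
import math

def time_averaged_drift(segments, t_end):
    """Average of a piecewise constant drift over [0, t_end].

    segments: list of (start_time, drift) pairs sorted by start_time, beginning at time 0.
    """
    total = 0.0
    for i, (start, drift) in enumerate(segments):
        if start >= t_end:
            break
        stop = segments[i + 1][0] if i + 1 < len(segments) else t_end
        total += drift * (min(stop, t_end) - start)
    return total / t_end

def first_passage_density(t, boundary, drift):
    """First hitting time density of level boundary > 0 for unit-variance Brownian motion with constant drift."""
    return boundary / math.sqrt(2 * math.pi * t ** 3) * math.exp(-((boundary - drift * t) ** 2) / (2 * t))

# Hypothetical trial: drift 1.0 during the first second, then 3.0; boundary hit at t = 2.
segments = [(0.0, 1.0), (1.0, 3.0)]
hit_time = 2.0
avg = time_averaged_drift(segments, hit_time)
surrogate_likelihood = first_passage_density(hit_time, boundary=1.0, drift=avg)
print(avg, surrogate_likelihood)
```

The surrogate evaluates a constant-drift likelihood at the averaged drift; the abstract's result is that parameter estimates obtained this way do not converge to the true drift.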

2025-12-11T03:18:55Z 37 pages. Includes updates for the first revision Sicheng Liu Alexander Fengler Michael J. Frank Matthew T. Harrison http://arxiv.org/abs/2606.08934v1 Backward Coherence and Hidden-State Stability in Recurrent Neural Networks: A Quasi-Reverse-Martingale Theory 2026-06-08T02:20:29Z

Recurrent neural networks maintain a hidden state $h_t$, but its probabilistic meaning is often unclear. We study hidden-state stability through \emph{backward coherence}: the extent to which $h_t$ can be reconstructed from $h_{t+1}$ by a learned backward projector $g_\phi$. Under contraction and summable backward drift, the hidden-state sequence forms a quasi-reverse-martingale. This yields almost-sure convergence, rates under mixing, an interpretable limiting representation, finite pathwise stopping times, and a theoretical framework for time-uniform confidence sequences. Simulations support the theory. Backward-coherence regularisation reduces the empirical quasi-martingale total $\hat Q$ by $43$--$58\%$, reaches stability $28$--$44\%$ earlier than an unregularised RNN, and gives tracking-error recovery consistent with geometric bounds. Additional tests confirm echo-state forgetting rates bounded by $\rho$ and verify the increment-sum tube $R_t$ with $100\%$ simultaneous coverage, although $R_t$ is conservative; in practice, the defect-tail proxy $\hat Q_t$ is the more useful monitor. The backward-coherence loss is also equivalent to minimising a Kullback--Leibler divergence in a Gaussian backward model, linking the method to variational inference. Extensions cover $\phi$-mixing inputs, change-point tracking, and finite-sample concentration. Three real-data studies further validate the approach. On PhysioNet 2012 ICU data, the Reverse Martingale RNN (RMRNN) matches RNN mortality-prediction AUC while reaching stable representations 13 hours earlier. On FRED-MD, it reduces one-month-ahead forecast error by about fourfold under concept drift. On UCI Human Activity Recognition, it maintains lower post-transition tracking error with geometric decay. The guarantees apply under the stated assumptions; universality is not claimed.
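One way to read the stated link to a Gaussian backward model, under assumptions not spelled out in the abstract: if the backward model is Gaussian with mean $g_\phi(h_{t+1})$ and a fixed isotropic covariance $\sigma^2 I_d$, its negative log-likelihood is a squared reconstruction error up to constants, so a squared-error backward-coherence loss has the same minimiser as maximum likelihood in that model. The symbols $\sigma$, $d$, and $T$, and the squared-error form of the loss, are introduced here for illustration.

```latex
\begin{align*}
h_t \mid h_{t+1} &\sim \mathcal{N}\bigl(g_\phi(h_{t+1}),\, \sigma^2 I_d\bigr), \\
-\log p_\phi(h_t \mid h_{t+1}) &= \frac{1}{2\sigma^2}\,\bigl\| h_t - g_\phi(h_{t+1}) \bigr\|^2 + \frac{d}{2}\log\bigl(2\pi\sigma^2\bigr), \\
\arg\min_\phi \sum_{t=1}^{T-1} -\log p_\phi(h_t \mid h_{t+1}) &= \arg\min_\phi \sum_{t=1}^{T-1} \bigl\| h_t - g_\phi(h_{t+1}) \bigr\|^2 .
\end{align*}
```

Minimising an expected negative log-likelihood differs from minimising the Kullback--Leibler divergence to the model only by an entropy term that does not depend on $\phi$, which is consistent with the equivalence the abstract states.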

2026-06-08T02:20:29Z Yuan-chin Ivan Chang