http://arxiv.org/abs/2605.22755v1Assessing global drivers of forest transpiration using clustered machine learning models2026-05-21T17:22:38ZUnderstanding the environmental drivers of forest transpiration is critical for improving global predictions of water availability and ecosystem health. Due to many competing controls on plant water stress and ecosystem transpiration, however, these drivers may vary widely across tree species that have adapted hydraulically to local climate conditions. Here, clustered machine learning models were used to analyze global drivers of forest transpiration rates using the SAPFLUXNET database. Sap flux data from a total of ninety-five sites spanning seven biomes were grouped using two clustering strategies: by biome and by plant functional type. Two supervised machine learning algorithms, a random forest algorithm and a neural network algorithm, were used to predict rates of sap flux for each cluster. The performance and feature importance in each model were analyzed and compared to evaluate the environmental variables that control each cluster's performance. By defining site clusters, these models are able to predict transpiration and its environmental drivers across a wide variety of geographical sites and tree species. Unlike models trained on the entire dataset, high-performing clustered models achieved R$^2$ values against measurement data in the range of 0.74 to 0.90, with the highest performance being achieved in mid-sized clusters of up to thirty-six sites. There was high variance in feature importance between clusters, indicating that key predictors of transpiration varied strongly across both plant functional type and biome. Overall, water-limited climates tended to be more controlled by soil moisture, whereas climates with high mean annual temperature tended to be more controlled by solar radiation and less dependent on air temperature. 
These findings provide insights into how forest transpiration responds to environmental factors across a wide range of climate types and tree species.2026-05-21T17:22:38ZMorgan ThornwellDavid YangCheng-Wei HuangPeyman AbbaszadehSamantha Hartzellhttp://arxiv.org/abs/2604.24365v2Persistent and anti-persistent stride-to-stride fluctuations: an ARFIMA decomposition consistent with closed-loop sensorimotor control2026-05-21T13:34:12ZStride-to-stride fluctuations in human walking carry a fractal correlation structure that reverses sign under external cueing: self-paced gait is persistent, whereas metronomic or visually cued gait is anti-persistent. Three decades of detrended fluctuation analysis (DFA) have established this reversal as a scaling-exponent shift, but DFA cannot distinguish genuine long-memory dynamics from short-memory autoregressive moving-average (ARMA) processes that produce the same apparent exponent. We fit the full eight-model ARFIMA(1,d,1) family to stride interval and stride speed series from three datasets (N = 70 subjects) spanning overground walking, fixed-speed treadmill walking, metronomic and visual cueing, and graded positional constraint. Model evidence is aggregated through BIC-based Schwarz weights, and the fractional differencing parameter d together with the autoregressive and moving-average coefficients phi and theta are estimated by Bayesian model averaging. Three findings emerge. (i) Long-memory specifications decisively outweigh ARMA alternatives under both persistent and anti-persistent conditions, establishing cued gait anti-persistence as a genuine fractional phenomenon. (ii) DFA alpha overestimates d + 0.5 by 0.25 to 0.34 units, a discrepancy jointly attributable to short-memory components that DFA conflates with long-memory persistence and to a finite-sample negative bias inherent to exact ML-ARFIMA estimation. 
(iii) The estimated (d, phi, theta) parameters are consistent with a corrective sensorimotor model in which a fractal intrinsic generator, a reactive feedback correction, and a motor-delay component together shape stride-to-stride fluctuations. Whether a single mechanistic model can account quantitatively for the observed parameter ranges across rhythmic, spatial, and unconstrained conditions is a question that the present analysis motivates but cannot alone resolve.2026-04-27T12:00:29ZMain article: pp. 1-42 (5 figures, 3 tables). Supplementary Materials appended: S1 - Effect of series length on ARFIMA and DFA outcomes (Hausdorff Tier 3), pp. 43-47; S2 - Morris elementary-effects screening of the ARFIMA/DFA pipeline, pp. 48-59. Reproduction archive: doi:10.5281/zenodo.19676064Philippe Terrierhttp://arxiv.org/abs/2605.22075v1Can Breath Biomarkers Causally Influence Blood Glucose? Investigating VOC-Mediated Modulation in Diabetes2026-05-21T07:13:28ZDiabetes is a global health burden, and early detection is critical for timely intervention. This study explores a non-invasive, data-driven framework to identify individuals at risk of diabetes using Volatile Organic Compounds (VOCs) and lifestyle variables. We use causal inference techniques to estimate the impact of VOCs such as acetone, isopropanol, isoprene, and ethanol on blood glucose levels. Additionally, we designed a classifier to distinguish diabetics from non-diabetics using non-invasive markers. We created a risk-based ranking system for individuals in the "gray zone," and identified natural clusters in the population using a Gaussian Mixture Model. Our results suggest that specific VOCs exhibit a strong causal influence on glucose levels and that machine learning models can reliably classify and stratify individuals at high risk. 
This integrated causal-explainable analysis can support the development of tools for non-invasive early screening of diabetes.2026-05-21T07:13:28ZProceedings of the IJCAI workshop on Advanced Neural Systems for Next-Generation Biomedical Intelligence, 2025Varsha SharmaPrasanta K. GuhaAvik Ghosehttp://arxiv.org/abs/2601.14577v2FBApro: A fast, simple linear transformation for diverse metabolic modeling tasks2026-05-21T05:42:29ZConstraint-based metabolic modeling is the predominant framework for simulating cellular metabolism. The central assumption of these models is that metabolism operates at a steady state, meaning that the production and consumption rates of each metabolite are balanced. This assumption imposes linear constraints on the fluxes of biochemical reactions. Flux Balance Analysis (FBA), a fundamental method in the field, is formulated as an optimization problem maximizing a cellular objective (e.g., growth) over the resulting linear subspace of steady state fluxes. Many other methods in the field are expressed either as a modification to FBA, or use FBA as a black box within an algorithm. Here, we propose a general alternative to optimization called FBApro. For any given vector of reference fluxes, FBApro finds the closest flux vector within the steady-state subspace, and accounts for both partially given reference fluxes and exact constraints on reactions. While FBApro is the solution to a quadratic program, we show that it can be implemented as a single linear operation using orthogonal projections to corresponding affine spaces and sets of linear equations. The overall approach is computationally efficient, does not require a cellular objective, and is easy to implement. 
We formally derive the closed-form expressions for FBApro and simpler variants, and validate it on both synthetic and real cancer cell line data.2026-01-21T01:25:21Z23 pages, 6 figuresAriel BrunerMona Singhhttp://arxiv.org/abs/2605.22009v1SDFStent: Real-time interactive virtual stenting via SDF deformation fields2026-05-21T05:12:07ZStenting is among the most common transcatheter interventions for congenital heart disease (CHD). Patient-specific computational fluid dynamics (CFD) simulations can predict hemodynamic outcomes of intervention scenarios but require post-operative vascular geometries that reflect stent-induced shape changes, which existing tools either model inadequately or require extensive time or manual effort to generate. We present SDFStent, a signed distance function (SDF) based mesh deformation method for virtual stenting that operates in real time, maintains mesh integrity, and preserves junction geometry. The stent is modeled as a pipe surface composed of piecewise-capsule SDFs joined by a smooth-minimum operator. Mesh vertices near the expanding SDF surface are displaced along the SDF gradient with a compactly supported fall-off function and an alpha blending mask. SDFStent was benchmarked against three existing approaches and validated on three tetralogy of Fallot (ToF) patients and three coarctation of the aorta (CoA) patients using rigid-wall steady-state CFD simulations against clinical catheterization measurements. Against a prescribed diameter of 6.0 mm, the method produced a mean stented diameter of 5.92 $\pm$ 0.08 mm in 1.5 s, over 100$\times$ faster than the best stenting-specific comparator. All output meshes were watertight and self-intersection-free. CFD-simulated post-operative pressure drops agreed with clinical measurements within 4 mmHg (mean error 2 mmHg). SDFStent produces simulation-ready post-stent models that match prescribed stent dimensions at interactive speeds, from pre-operative anatomy and catheterization data alone. 
The implementation is open-source and available in 3D Slicer. Its scriptable architecture enables automated generation of large synthetic cohorts for data-driven surrogate modeling.2026-05-21T05:12:07Z39 pages, 12 figures, 4 tables. Under review at Computer Methods and Programs in BiomedicineBohan J. LiNicholas C. DornAndras LassoMatthew A. JolleyJeffrey A. FeinsteinDoug L. JamesAlison L. Marsdenhttp://arxiv.org/abs/2509.06503v3An AI system to help scientists write expert-level empirical software2026-05-21T01:47:23ZThe cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments\cite{hannay2009how}. To address this, we present Empirical Research Assistance (ERA), an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS)\cite{silver2016mastering} to systematically improve the quality metric and intelligently navigate the large space of possible solutions. ERA achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a diverse range of tasks. In bioinformatics, ERA discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, ERA generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. ERA also produced expert-level software for geospatial analysis, neural activity prediction in zebrafish, and numerical solution of integrals, and a novel rule-based construction for time series forecasting. 
By devising and implementing novel solutions to diverse tasks, ERA represents a significant step towards accelerating scientific progress.2025-09-08T10:08:36Z78 pages, 31 figures, 22 tablesEser AygünAnastasiya BelyaevaGheorghe ComaniciMarc CoramHao CuiJake GarrisonRenee Johnston Anton KastCory Y. McLeanPeter NorgaardZahra ShamsiDavid SmallingJames ThompsonSubhashini VenugopalanBrian P. WilliamsChujun HeSarah MartinsonMartyna PlomeckaLai WeiYuchen ZhouQian-Ze ZhuMatthew AbrahamErica BrandAnna BulanovaJeffrey A. CardilleChris CoScott EllsworthGrace JosephMalcolm KaneRyan KruegerJohan KartiwaDan LieblingJan-Matthis LueckmannPaul RaccugliaXuefei (Julie) WangKatherine ChouJames ManyikaYossi MatiasJohn C. PlattLizzie DorfmanShibl MouradMichael P. Brennerhttp://arxiv.org/abs/2605.21859v1PhylaFlow: Hybrid Flow Matching in Billera-Holmes-Vogtmann Tree Space for Phylogenetic Inference2026-05-21T01:13:46ZPhylogenetic trees are hybrid objects: branch lengths vary continuously, while topologies change discretely through edge contractions and expansions. Billera-Holmes-Vogtmann (BHV) tree space provides a canonical geometry for this structure, representing each resolved topology as a Euclidean orthant and topological changes as motion across shared lower-dimensional boundaries. We introduce PhylaFlow, a hybrid flow-matching model that learns posterior-basin transport in BHV tree space. PhylaFlow is trained on BHV geodesic paths from random starting trees to short-run posterior samples, coupling continuous branch-length motion within orthants with learned boundary events and discrete topology transitions. 
We evaluate the learned geometry operationally: if the flow reaches posterior-relevant regions, finite-budget Bayesian refinement initialized from, or guided by, its terminal trees should recover posterior-supported topologies more efficiently. Across DS1-DS8 phylogenetic posterior benchmarks, PhylaFlow substantially reduces initial Tree-KL relative to classical initializers. After finite-budget MrBayes refinement, direct PhylaFlow improves early and intermediate topology-recovery trajectories on most datasets, while split-guided PhylaFlow-MCMC obtains the strongest hard-case results. The best PhylaFlow variant outperforms short-warmup on seven of eight datasets and PhyloGFN on five of eight under the same refinement budget. In a joint sequence-conditioned experiment, sequence embeddings steer posterior split recovery, although exact posterior topology recovery remains preliminary. These results show that hybrid flow matching can learn actionable transport in BHV tree space and provide a geometry-aware proposal mechanism for Bayesian phylogenetic inference.2026-05-21T01:13:46Z9 pages, 3 figuresYasha EktefaieLeo CuiShrey JainMarinka ZitnikPardis Sabetihttp://arxiv.org/abs/2405.17032v4Exact phylodynamic likelihood via structured Markov genealogy processes2026-05-20T21:34:15ZWe show that each member of a broad class of Markovian population models induces a unique stochastic process on the space of genealogies. We construct this genealogy process and derive exact expressions for the likelihood of an observed genealogy in terms of a filter equation, the structure of which is completely determined by the population model. We show that existing phylodynamic methods based on the coalescent and linear birth-death processes are special cases. We derive some properties of filter equations and describe a class of algorithms that can be used to numerically solve them. 
Importantly, because these algorithms rely only on simulation of the population model, they retain the plug-and-play property upon which simulation-based inference depends. Our results open the door to statistically efficient likelihood-based phylodynamic inference for a much wider class of models than is currently possible.2024-05-27T10:39:18ZAaron A. KingQianying LinEdward L. Ionideshttp://arxiv.org/abs/2605.16108v2Estimating Association Between Paired Outcomes in Clustered Data with Informative Subgroup Size2026-05-20T20:43:50ZInformative cluster size (ICS) and informative subgroup size (ISS) can distort marginal association estimates when the number of observed units, or their distribution across outcome-defined categories, is related to the outcomes under study. This issue is especially relevant for paired outcomes, where the observed association can depend on cluster size, paired-category composition, and the process by which units become available for analysis. We propose three weighted estimating approaches for marginal association between paired outcomes in clustered data. The weights are derived from within-cluster resampling arguments and extend inverse cluster-size and subgroup-size weighting to paired outcome categories. We also modify an existing ISS testing procedure by utilizing Stouffer's method to reduce computational burden. To evaluate the methods, we develop a simulator for clustered paired outcomes that separates unit-level association, latent cluster-level association, and outcome-dependent retention. Simulations show that pair-based weighting can reduce bias when association arises through unit-level dependence and subgroup composition is informative, but can attenuate association carried by latent cluster-level structure. Typical inverse-cluster weighting remains more stable when the association is primarily cluster-level. 
Application to NHANES oral-health data shows small positive periodontal and caries associations overall, with filled-surface outcomes showing stronger ISS evidence and greater sensitivity to pair-based weighting than decayed-surface outcomes. These results indicate that marginal association under ICS and ISS should be interpreted in relation to the source of association, observed-unit structure, and assumptions used to choose the weighting scheme.2026-05-15T15:56:03ZOwen VisserSomnath Dattahttp://arxiv.org/abs/2603.21743v4CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning2026-05-20T19:52:16ZBuilding virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories (biological function, structural validity, and morphological correctness) and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond "visually realistic" generations towards "biologically meaningful" ones.2026-03-23T09:33:18ZDongxia WuShiye SuYuhui ZhangElaine SuiEmma LundbergEmily B. 
FoxSerena Yeung-Levyhttp://arxiv.org/abs/2605.21454v1ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction2026-05-20T17:43:43ZWe introduce ProtoPathway, an interpretable-by-design multimodal framework for cancer survival prediction that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations on both sides of the fusion. On the histopathology side, $K$ learnable morphological prototypes, trained end-to-end with the survival objective, serve as the slide representation itself: patches flow into prototype tokens via soft assignment, compressing variable-length patch sets into fixed task-adaptive tokens. On the genomic side, a bipartite graph neural network encodes gene expression within the Reactome pathway hierarchy, producing pathway embeddings that reflect both constituent genes and their broader biological context through bidirectional message passing over a shared gene--pathway graph. Cross-modal attention then operates over a compact prototype $\times$ pathway matrix in which prototypes query pathways, modeling the biological direction in which molecular programs give rise to tissue morphology. Because both axes carry stable task-learned identity, the attention matrix is itself an interpretability output, yielding native inference-time attribution across the full biological hierarchy, from genes through pathways and prototypes to spatial tissue maps. We evaluate on five TCGA cancer cohorts, demonstrating competitive or superior survival prediction with substantially improved biological interpretability and reduced computational cost, with interpretability claims validated through fold-stratified rank-based population-level analysis. 
Our source code, model weights, and Reactome pathways, together with a unified codebase reimplementing all multimodal survival baselines under identical preprocessing and evaluation, are available at: https://github.com/AmayaGS/ProtoPathway.2026-05-20T17:43:43ZCurrently under peer reviewAmaya Gallagher-SyedCostantino PitzalisMyles J. LewisMichael R. BarnesGregory Slabaughhttp://arxiv.org/abs/2602.04916v3AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design2026-05-20T16:39:37ZLarge language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. 
By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.2026-02-04T05:09:39ZLing LuoWenbin JiangHongyuan ChangXinkang WangXushi ZhangYueting XiongMengsha TongRongshan Yuhttp://arxiv.org/abs/2605.03690v2Graph Neural Network based Hierarchy-Aware Embeddings of Knowledge Graphs: Applications to Yeast Phenotype Prediction2026-05-20T14:35:34ZWe present a method for finding hierarchy-aware embeddings of knowledge graphs (KGs) using graph neural networks (GNNs) enriched with a semantic loss derived from underlying ontologies. This method yields embeddings that better reflect domain knowledge. To demonstrate their utility, we predict and interpret the effects of gene deletions in the yeast Saccharomyces cerevisiae and learn box embeddings for KGs in the absence of a prediction task. We further show how box embeddings can serve as the basis for evaluating KG revisions.
Our yeast KG is constructed from community databases and ontology terms. Low-dimensional box embeddings combined with GNNs are used to predict cell growth for double gene knockouts. Over 10-fold cross validation, these predictions have a mean $R^2$ score of 0.360, significantly higher than baseline comparisons, demonstrating that high-level qualitative knowledge is informative about experimental outcomes. Incorporating semantic loss terms in the training of the models improves their predictive performance ($R^2$=0.377) by aligning embeddings with ontology structure. This shows that class hierarchies from ontologies can be exploited for quantitative prediction. We also test the trained models on triple gene knockouts, showing they generalise to data beyond those seen in training.
Additionally, by identifying co-occurring relations in the yeast KG important for the cell-growth predictions, we construct hypotheses about interacting traits in yeast. A biological experiment validates one such finding, revealing an association between inositol utilisation and osmotic stress resistance, highlighting the model's potential to guide biological discovery.2026-05-05T12:34:45ZFilip KronströmAlexander H. GowerDaniel BrunnsåkerIevgeniia A. TiukovaRoss D. Kinghttp://arxiv.org/abs/2508.10821v3SimAQ: Mitigating Experimental Artifacts in Soft X-Ray Tomography using Simulated Acquisitions2026-05-20T12:10:03ZSoft X-ray tomography provides detailed structural insight into whole cells but is hindered by experimental artifacts such as the missing wedge and by limited availability of annotated datasets. We present SimAQ, a simulation pipeline that generates realistic yeast phantoms and applies synthetic imaging artifacts to produce paired noisy volumes, sinograms, and reconstructions. We validate our approach by training a neural network primarily on synthetic data and demonstrate effective few-shot and zero-shot transfer learning on real X-ray tomograms. Our model delivers accurate segmentations, enabling quantitative analysis of noisy tomograms without relying on large labeled datasets.2025-08-14T16:47:10ZJacob EgebjergDaniel Wüstnerhttp://arxiv.org/abs/2605.20885v1Training distribution determines the ceiling of drug-blind cancer sensitivity prediction2026-05-20T08:24:56ZPrecision oncology requires predicting which drugs will suppress a specific tumor from its molecular profile, but drug-blind sensitivity prediction has plateaued despite increasingly complex drug representations. Here we show that this stagnation reflects a metric artifact rather than a representational bottleneck. The standard benchmark, global Pearson r, is dominated by between-drug potency differences that a trivial drug-mean predictor captures without any cell-specific learning. 
Per-drug Pearson r, which isolates within-drug cell ranking, reveals that no drug encoding improves over cell-only features across four independent datasets. A controlled experiment channeling mechanism-of-action identity as either a drug feature or a training-distribution constraint identifies the cause. Supplying MoA as a feature yields negligible benefit, whereas using it to stratify training raises per-drug r substantially for targeted kinase inhibitors, because pan-cancer co-training suppresses pathway-specific sensitivity signals. Mechanism-stratified training and response matching from pilot observations provide two deployable strategies that together recover the principal sources of predictive gain in drug-blind sensitivity prediction.2026-05-20T08:24:56ZTaekyung Heo
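
The contrast between global and per-drug Pearson r described in the abstract above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the drug names, the response values, and the fixed offsets are hypothetical toy numbers, not data or code from the paper. The predictor uses each drug's mean response (taken from the toy observations as a stand-in) plus offsets that carry no cell-specific information.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy responses: one list per drug, one value per cell line.
# The drugs differ strongly in overall potency (means near 1, 5, and 9).
observed = {
    "drug_A": [1.0, 1.2, 0.8, 1.1],
    "drug_B": [5.0, 5.3, 4.9, 5.2],
    "drug_C": [9.0, 8.7, 9.4, 9.1],
}

# Hypothetical predictor: each drug's mean response plus fixed offsets that
# are unrelated to how cell lines actually rank within a drug. The offsets
# only keep the per-drug correlation defined (a constant has zero variance).
offsets = [0.1, -0.1, -0.1, 0.1]
predicted = {
    drug: [sum(vals) / len(vals) + o for o in offsets]
    for drug, vals in observed.items()
}

# Global Pearson r: pool every (drug, cell line) pair into one correlation.
all_obs = [v for vals in observed.values() for v in vals]
all_pred = [v for vals in predicted.values() for v in vals]
global_r = pearson(all_obs, all_pred)

# Per-drug Pearson r: correlate within each drug, then average across drugs.
per_drug_r = {d: pearson(observed[d], predicted[d]) for d in observed}
mean_per_drug_r = sum(per_drug_r.values()) / len(per_drug_r)

print(f"global r         = {global_r:.3f}")
print(f"mean per-drug r  = {mean_per_drug_r:.3f}")
```

In this toy setting the pooled correlation is close to 1 because it mostly reflects the potency gap between drugs, while the per-drug average stays near zero because the predictor says nothing about cell ranking within a drug.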