https://arxiv.org/api/k+Tu/MWrNqIoTen6YQYXO0BPzYg 2026-03-16T06:36:36Z 12843 0 15 http://arxiv.org/abs/2603.13051v1 Causal Cellular Context Transfer Learning (C3TL): An Efficient Architecture for Prediction of Unseen Perturbation Effects 2026-03-13T15:02:49Z Predicting the effects of chemical and genetic perturbations on quantitative cell states is a central challenge in computational biology, molecular medicine and drug discovery. Recent work has leveraged large-scale single-cell data and massive foundation models to address this task. However, such computational resources and extensive datasets are not always accessible in academic or clinical settings, hence limiting utility. Here we propose a lightweight framework for perturbation effect prediction that exploits the structured nature of biological interventions and specific inductive biases/invariances. Our approach leverages available information concerning perturbation effects to allow generalization to novel contexts and requires only widely-available bulk molecular data. Extensive testing, comparing predictions of context-specific perturbation effects against real, large-scale interventional experiments, demonstrates accurate prediction in new contexts. The proposed approach is competitive with SOTA foundation models but requires simpler data, much smaller model sizes and less time. Focusing on robust bulk signals and efficient architectures, we show that accurate prediction of perturbation effects is possible without proprietary hardware or very large models, hence opening up ways to leverage causal learning approaches in biomedicine generally. 
2026-03-13T15:02:49Z 12 Pages, 3 figures, Keywords: perturbation prediction, context transfer, lightweight, machine learning Michael Scholkemper Sach Mukherjee http://arxiv.org/abs/2603.12694v1 RXNRECer Enables Fine-grained Enzymatic Function Annotation through Active Learning and Protein Language Models 2026-03-13T06:20:14Z A key challenge in enzyme annotation is identifying the biochemical reactions catalyzed by proteins. Most existing methods rely on Enzyme Commission (EC) numbers as intermediaries: they first predict an EC number and then retrieve the associated reactions. This indirect strategy introduces ambiguity due to the complex many-to-many mappings among proteins, EC numbers, and reactions, and is further complicated by frequent updates to EC numbers and inconsistencies across databases. To address these challenges, we present RXNRECer, a transformer-based ensemble framework that directly predicts enzyme-catalyzed reactions without relying on EC numbers. It integrates protein language modeling and active learning to capture both high-level sequence semantics and fine-grained transformation patterns. Evaluations on curated cross-validation and temporal test sets demonstrate consistent improvements over six EC-based baselines, with gains of 16.54% in F1 score and 15.43% in accuracy. Beyond accuracy gains, the framework offers clear advantages for downstream applications, including scalable proteome-wide reaction annotation, enhanced specificity in refining generic reaction schemas, systematic annotation of previously uncurated proteins, and reliable identification of enzyme promiscuity. By incorporating large language models, it also provides interpretable rationales for predictions. These capabilities make RXNRECer a robust and versatile solution for EC-free, fine-grained enzyme function prediction, with potential applications across multiple areas of enzyme research and industrial applications. 
2026-03-13T06:20:14Z Zhenkun Shi Jun Zhu Dehang Wang BoYu Chen Qianqian Yuan Zhitao Mao Fan Wei Weining Wu Xiaoping Liao Hongwu Ma http://arxiv.org/abs/2401.02739v5 Denoising Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors 2026-03-13T02:13:07Z We propose denoising diffusion variational inference (DDVI), a black-box variational inference algorithm for latent variable models which relies on diffusion models as flexible approximate posteriors. Specifically, our method introduces an expressive class of diffusion-based variational posteriors that perform iterative refinement in latent space; we train these posteriors with a novel regularized evidence lower bound (ELBO) on the marginal likelihood inspired by the wake-sleep algorithm. Our method is easy to implement (it fits a regularized extension of the ELBO), is compatible with black-box variational inference, and outperforms alternative classes of approximate posteriors based on normalizing flows or adversarial networks. We find that DDVI improves inference and learning in deep latent variable models across common benchmarks as well as on a motivating task in biology -- inferring latent ancestry from human genomes -- where it outperforms strong baselines on the Thousand Genomes dataset. 2024-01-05T10:27:44Z published at AAAI 2025; the first two authors contribute equally to this work; code available at https://github.com/topwasu/DDVI Wasu Top Piriyakulkij Yingheng Wang Volodymyr Kuleshov http://arxiv.org/abs/2507.18557v4 Deep Learning for Blood-Brain Barrier Permeability Prediction: From Discriminative Models to Mechanism-Aware Design 2026-03-12T19:41:48Z Predicting whether a molecule can cross the blood-brain barrier (BBB) is a key step in early-stage neuro-pharmaceutical design, directly influencing the efficiency and success rate of drug development. 
Traditional methods based on physicochemical properties are prone to systematic misjudgements due to their reliance on previous empirical evidence. Early machine learning (ML) models, although data-driven, often suffer from limited capacity, poor generalization, and insufficient interpretability. In recent years, more advanced models have become essential tools for predicting BBB permeability and guiding related drug design, owing to their ability to simulate molecular structures and capture complex biological mechanisms. This article systematically reviews the evolution of this field, from deep neural networks to graph-based structural modelling, highlighting the advantages of multi-task and multimodal learning strategies in identifying mechanism-related features. We further explore the emerging potential of generative models and causal inference methods for integrating permeability prediction with mechanism-aware drug design. ML-based BBB crossing prediction is now in a critical transition from mere discriminative classification toward structure-function modelling from a mechanistic perspective. This paradigm shift provides a methodological progression and future roadmap for the integration of AI into neuropharmacological development. 2025-07-24T16:30:46Z Zihan Yang Yuchen Xiao http://arxiv.org/abs/2603.12351v1 Probabilistic Joint and Individual Variation Explained (ProJIVE) for Data Integration 2026-03-12T18:16:07Z Collecting multiple types of data on the same set of subjects is common in modern scientific applications including genomics, metabolomics, and neuroimaging. Joint and Individual Variation Explained (JIVE) seeks a low-rank approximation of the joint variation between two or more sets of features captured on common subjects and isolates this variation from that unique to each set of features. We develop an expectation-maximization (EM) algorithm to estimate a probabilistic model for the JIVE framework. 
The model extends probabilistic principal components analysis to multiple data sets. Our maximum likelihood approach simultaneously estimates joint and individual components, which can lead to greater accuracy compared to other methods. We apply ProJIVE to measures of brain morphometry and cognition in Alzheimer's disease. ProJIVE learns biologically meaningful sources of variation, and the joint morphometry and cognition subject scores are strongly related to more expensive existing biomarkers. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Code to reproduce the analysis is available on our GitHub page. 2026-03-12T18:16:07Z Journal of Computational and Graphical Statistics (2026) Raphiel J. Murden Ganzhong Tian Deqiang Qiu Benjamin B. Risk 10.1080/10618600.2026.2639081 http://arxiv.org/abs/2603.12349v1 Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection 2026-03-12T18:09:53Z Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evaluation framework exists for comparing selection strategies -- a gap intensified by large language models (LLMs), which generate plausible scientific proposals without reliable downstream evaluation. We introduce the Budget-Sensitive Discovery Score (BSDS), a formally verified metric -- 20 theorems machine-checked by the Lean 4 proof assistant -- that jointly penalizes false discoveries (lambda-weighted FDR) and excessive abstention (gamma-weighted coverage gap) at each budget level. Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget. As a case study, we apply BSDS/DQS to the question: do LLMs add marginal value to an existing ML pipeline for drug discovery candidate selection? 
We evaluate 39 proposers -- 11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations -- using SMILES representations on MoleculeNet HIV (41,127 compounds, 3.5% active, 1,000 bootstrap replicates) under both random and scaffold splits. Three findings emerge. First, the simple RF-based Greedy-ML proposer achieves the best DQS (-0.046), outperforming all MLP variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21, establishing that LLMs provide no marginal value over an existing trained classifier. Third, the proposer hierarchy generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug AV safety domain, and a 9x7 grid of penalty parameters (tau >= 0.636, mean tau = 0.863). The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs. 2026-03-12T18:09:53Z Abhinaba Basu Pavan Chakraborty http://arxiv.org/abs/2603.12341v1 Self-Reported Side Effects of Semaglutide and Tirzepatide in Online Communities 2026-03-12T18:03:22Z Social media can reveal patient experiences with glucagon-like peptide-1 receptor agonists (GLP-1 RAs) that extend beyond clinical trial data. We analyzed 410,198 Reddit posts (May 2019-June 2025) mentioning semaglutide or tirzepatide. A total of 67,008 users self-reported using these medications, and 43.5% described at least one side effect. Gastrointestinal symptoms predominated, including nausea (36.9%), fatigue (16.7%), vomiting (16.3%), constipation (15.3%), and diarrhea (12.6%). Notably, reproductive symptoms (e.g., menstrual irregularities) and temperature-related complaints (e.g., chills, hot flashes) emerged as unrecognized potential effects. These findings highlight patient concerns not well captured in current labeling or trials. 
Large-scale social media analysis can complement traditional pharmacovigilance by detecting emerging safety signals and expanding understanding of the real-world safety profile of GLP-1 RAs. 2026-03-12T18:03:22Z Neil K. R. Sehgal Jena Shaw Tronieri Lyle Ungar Sharath Chandra Guntuku http://arxiv.org/abs/2603.12253v1 Binding Free Energies without Alchemy 2026-03-12T17:58:49Z Absolute Binding Free Energy (ABFE) methods are among the most accurate computational techniques for predicting protein-ligand binding affinities, but their utility is limited by the need for many simulations of alchemically modified intermediate states. We propose Direct Binding Free Energy (DBFE), an end-state ABFE method in implicit solvent that requires no alchemical intermediates. DBFE outperforms OBC2 double decoupling on a host-guest benchmark and performs comparably to OBC2 MM/GBSA on a protein-ligand benchmark. Since receptor and ligand simulations can be precomputed and amortized across compounds, DBFE requires only one complex simulation per ligand compared to the many lambda windows needed for double decoupling, making it a promising candidate for virtual screening workflows. We publicly release the code for this method at https://github.com/molecularmodelinglab/dbfe. 2026-03-12T17:58:49Z 14 pages, 4 figures Michael Brocidiacono Brandon Novy Rishabh Dey Konstantin I. Popov Alexander Tropsha http://arxiv.org/abs/2506.17373v5 A practical identifiability criterion leveraging weak-form parameter estimation 2026-03-12T17:28:30Z In this work, we define a practical identifiability criterion, (e, q)-identifiability, based on a parameter e, reflecting the noise in observed variables, and a parameter q, reflecting the mean-square error of the parameter estimator. This criterion is better able to encompass changes in the quality of the parameter estimate due to increased noise in the data (compared to existing criteria based solely on average relative errors). 
Furthermore, we leverage a weak-form equation error-based method of parameter estimation for systems with unobserved variables to assess practical identifiability far more quickly than with output error-based parameter estimation. We do so by generating weak-form input-output equations using differential algebra techniques, as previously proposed by Boulier et al. [1], and then applying Weak form Estimation of Nonlinear Dynamics (WENDy) to obtain parameter estimates. This method is computationally efficient and robust to noise, as demonstrated through two classical biological modelling examples. 2025-06-20T16:11:47Z Nora Heitzman-Breen Vanja Dukic David M. Bortz http://arxiv.org/abs/2603.12016v1 Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era 2026-03-12T14:58:07Z Modern imaging instruments can produce terabytes to petabytes of data for a single experiment. The biggest barrier to processing big image datasets has been computational, where image analysis algorithms often lack the efficiency needed to process such large datasets or make tradeoffs in robustness and accuracy. Deep learning algorithms have vastly improved the accuracy of the first step in an analysis workflow (region segmentation), but the expansion of domain specific feature extraction libraries across scientific disciplines has made it difficult to compare the performance and accuracy of extracted features. To address these needs, we developed a novel feature extraction library called Nyxus. Nyxus is designed from the ground up for scalable out-of-core feature extraction for 2D and 3D image data and rigorously tested against established standards. The comprehensive feature set of Nyxus covers multiple biomedical domains including radiomics and cellular analysis, and is designed for computational scalability across CPUs and GPUs. 
Nyxus has been packaged to be accessible to users of various skill sets and needs: as a Python package for code developers, as a command line tool, as a Napari plugin for low to no-code users or users that want to visualize results, and as an Open Container Initiative (OCI) compliant container that can be used in cloud or super-computing workflows aimed at processing large data sets. Further, Nyxus enables a new methodological approach to feature extraction allowing for programmatic tuning of many feature sets for optimal computational efficiency or coverage for use in novel machine learning and deep learning applications. 2026-03-12T14:58:07Z 29 pages, 9 figures, 6 supplemental tables Nicholas Schaub Andriy Kharchenko Hamdah Abbasi Sameeul Samee Hythem Sidky Nathan Hotaling http://arxiv.org/abs/2603.12307v1 SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules 2026-03-12T12:04:03Z Cryo-electron microscopy (cryo-EM) has emerged as a powerful technique for determining the three-dimensional structures of biological molecules at near-atomic resolution. However, reconstructing helical assemblies presents unique challenges due to their inherent symmetry and the need to determine unknown helical symmetry parameters. Traditional approaches require an accurate initial estimation of these parameters, which is often obtained through trial and error or prior knowledge. These requirements can lead to incorrect reconstructions, limiting the reliability of ab initio helical reconstruction. In this work, we present SHREC (Spectral Helical REConstruction), an algorithm that directly recovers the projection angles of helical segments from their two-dimensional cryo-EM images, without requiring prior knowledge of helical symmetry parameters. Our approach leverages the insight that projections of helical segments form a one-dimensional manifold, which can be recovered using spectral embedding techniques. 
Experimental validation on publicly available datasets demonstrates that SHREC achieves high resolution reconstructions while accurately recovering helical parameters, requiring only knowledge of the specimen's axial symmetry group. By eliminating the need for initial symmetry estimates, SHREC offers a more robust and automated pathway for determining helical structures in cryo-EM. 2026-03-12T12:04:03Z Guy Shapira Yoel Shkolnisky http://arxiv.org/abs/2603.11476v1 Leveraging Phytolith Research using Artificial Intelligence 2026-03-12T02:57:23Z Phytolith analysis is a crucial tool for reconstructing past vegetation and human activities, but traditional methods are severely limited by labour-intensive, time-consuming manual microscopy. To address this bottleneck, we present Sorometry: a comprehensive end-to-end artificial intelligence pipeline for the high-throughput digitisation, inference, and interpretation of phytoliths. Our workflow processes z-stacked optical microscope scans to automatically generate synchronised 2D orthoimages and 3D point clouds of individual microscopic particles. We developed a multimodal fusion model that combines ConvNeXt for 2D image analysis and PointNet++ for 3D point cloud analysis, supported by a graphical user interface for expert annotation and review. Tested on reference collections and archaeological samples from the Bolivian Amazon, our fusion model achieved a global classification accuracy of 77.9% across 24 diagnostic morphotypes and 84.5% for segmentation quality. Crucially, the integration of 3D data proved essential for distinguishing complex morphotypes (such as grass silica short cell phytoliths) whose diagnostic features are often obscured by their orientation in 2D projections. 
Beyond individual object classification, Sorometry incorporates Bayesian finite mixture modelling to predict overall plant source contributions at the assemblage level, successfully identifying specific plants like maize and palms in complex mixed samples. This integrated platform transforms phytolith research into an "omics"-scale discipline, dramatically expanding analytical capacity, standardising expert judgements, and enabling reproducible, population-level characterisations of archaeological and paleoecological assemblages. 2026-03-12T02:57:23Z 45 pages, 23 figures Andrés G. Mejía Ramón Kate Dudgeon Nina Witteveen Dolores Piperno Michael Kloster Luigi Palopoli Mónica Moraes R. José M. Capriles Umberto Lombardo http://arxiv.org/abs/2603.11387v1 Framing local structural identifiability and observability in terms of parameter-state symmetries 2026-03-12T00:01:18Z We introduce a subclass of Lie symmetries, called parameter-state symmetries, to analyse the local structural identifiability and observability of mechanistic models consisting of state-dependent ODEs with observed outputs. These symmetries act on parameters and states while preserving observed outputs at every time point. We prove that locally structurally identifiable parameter combinations and locally structurally observable states correspond to universal invariants of all parameter-state symmetries of a given model. We illustrate the framework on four previously studied mechanistic models, confirming known identifiability results and revealing novel insights into which states are observable, providing a unified symmetry-based approach for analysing structural properties of dynamical systems. 2026-03-12T00:01:18Z 52 pages, 0 figures. Supplementary calculations included in the appendices Johannes G. Borgqvist Alexander P. Browning Fredrik Ohlsson Ruth E. 
Baker http://arxiv.org/abs/2603.11344v1 Hybrid eTFCE-GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry 2026-03-11T22:20:03Z Threshold-free cluster enhancement (TFCE) integrates cluster extent across thresholds to improve voxel-wise neuroimaging inference, but permutation testing makes it prohibitively slow for large datasets. Probabilistic TFCE (pTFCE) uses analytical Gaussian random field (GRF) p-values but discretises the threshold grid. Exact TFCE (eTFCE) eliminates discretisation via a union-find data structure but still requires permutations. We combine eTFCE's union-find for exact cluster-size retrieval with pTFCE's analytical GRF inference. The union-find builds the cluster hierarchy in one pass over sorted voxels and enables exact size queries at any threshold; GRF theory then converts these sizes to analytical p-values without permutations. Validation on synthetic phantoms (64^3, 80 subjects): FWER controlled at nominal level (0/200 null rejections, 95% CI [0.0%, 1.9%]); power matches baseline pTFCE (Dice >= 0.999); smoothness error below 1%; concordance r > 0.99. On UK Biobank (N=500) and IXI (N=563), significance maps form strict subsets of reference R pTFCE, which supports conservative error control. Implemented in pytfce (pip install pytfce): baseline completes whole-brain VBM in ~5s (75x faster than R pTFCE), hybrid in ~85s (4.6x faster) with exact cluster sizes; both >1000x faster than permutation TFCE. 2026-03-11T22:20:03Z 25 pages, 7 figures, 3 tables. Submitted to NeuroImage. Open-source package: https://github.com/Don-Yin/pytfce Don Yin Hao Chen Takeshi Miki Boxing Liu Enyu Yang http://arxiv.org/abs/2603.11330v1 Ill-Conditioning in Dictionary-Based Dynamic-Equation Learning: A Systems Biology Case Study 2026-03-11T21:42:02Z Data-driven discovery of governing equations from time-series data provides a powerful framework for understanding complex biological systems. 
Library-based approaches that use sparse regression over candidate functions have shown considerable promise, but they face a critical challenge when candidate functions become strongly correlated: numerical ill-conditioning. Poor or restricted sampling, together with particular choices of candidate libraries, can produce strong multicollinearity and numerical instability. In such cases, measurement noise may lead to widely different recovered models, obscuring the true underlying dynamics and hindering accurate system identification. Although sparse regularization promotes parsimonious solutions and can partially mitigate conditioning issues, strong correlations may persist, regularization may bias the recovered models, and the regression problem may remain highly sensitive to small perturbations in the data. We present a systematic analysis of how ill-conditioning affects sparse identification of biological dynamics using benchmark models from systems biology. We show that combinations involving as few as two or three terms can already exhibit strong multicollinearity and extremely large condition numbers. We further show that orthogonal polynomial bases do not consistently resolve ill-conditioning and can perform worse than monomial libraries when the data distribution deviates from the weight function associated with the orthogonal basis. Finally, we demonstrate that when data are sampled from distributions aligned with the appropriate weight functions corresponding to the orthogonal basis, numerical conditioning improves, and orthogonal polynomial bases can yield improved model recovery accuracy across two baseline models. 2026-03-11T21:42:02Z Yuxiang Feng Niall M Mangan Manu Jayadharan