https://arxiv.org/api/k+Tu/MWrNqIoTen6YQYXO0BPzYg
2026-03-16T06:36:36Z
12843015

http://arxiv.org/abs/2603.13051v1
Causal Cellular Context Transfer Learning (C3TL): An Efficient Architecture for Prediction of Unseen Perturbation Effects
2026-03-13T15:02:49Z
Predicting the effects of chemical and genetic perturbations on quantitative cell states is a central challenge in computational biology, molecular medicine and drug discovery. Recent work has leveraged large-scale single-cell data and massive foundation models to address this task. However, such computational resources and extensive datasets are not always accessible in academic or clinical settings, hence limiting utility. Here we propose a lightweight framework for perturbation effect prediction that exploits the structured nature of biological interventions and specific inductive biases/invariances. Our approach leverages available information concerning perturbation effects to allow generalization to novel contexts and requires only widely-available bulk molecular data. Extensive testing, comparing predictions of context-specific perturbation effects against real, large-scale interventional experiments, demonstrates accurate prediction in new contexts. The proposed approach is competitive with SOTA foundation models but requires simpler data, much smaller model sizes and less time.
Focusing on robust bulk signals and efficient architectures, we show that accurate prediction of perturbation effects is possible without proprietary hardware or very large models, hence opening up ways to leverage causal learning approaches in biomedicine generally.
2026-03-13T15:02:49Z
12 Pages, 3 figures, Keywords: perturbation prediction, context transfer, lightweight, machine learning
Michael Scholkemper, Sach Mukherjee

http://arxiv.org/abs/2603.12694v1
RXNRECer Enables Fine-grained Enzymatic Function Annotation through Active Learning and Protein Language Models
2026-03-13T06:20:14Z
A key challenge in enzyme annotation is identifying the biochemical reactions catalyzed by proteins. Most existing methods rely on Enzyme Commission (EC) numbers as intermediaries: they first predict an EC number and then retrieve the associated reactions. This indirect strategy introduces ambiguity due to the complex many-to-many mappings among proteins, EC numbers, and reactions, and is further complicated by frequent updates to EC numbers and inconsistencies across databases. To address these challenges, we present RXNRECer, a transformer-based ensemble framework that directly predicts enzyme-catalyzed reactions without relying on EC numbers. It integrates protein language modeling and active learning to capture both high-level sequence semantics and fine-grained transformation patterns. Evaluations on curated cross-validation and temporal test sets demonstrate consistent improvements over six EC-based baselines, with gains of 16.54% in F1 score and 15.43% in accuracy. Beyond accuracy gains, the framework offers clear advantages for downstream applications, including scalable proteome-wide reaction annotation, enhanced specificity in refining generic reaction schemas, systematic annotation of previously uncurated proteins, and reliable identification of enzyme promiscuity. By incorporating large language models, it also provides interpretable rationales for predictions.
These capabilities make RXNRECer a robust and versatile solution for EC-free, fine-grained enzyme function prediction, with potential applications across multiple areas of enzyme research and industrial applications.
2026-03-13T06:20:14Z
Zhenkun Shi, Jun Zhu, Dehang Wang, BoYu Chen, Qianqian Yuan, Zhitao Mao, Fan Wei, Weining Wu, Xiaoping Liao, Hongwu Ma

http://arxiv.org/abs/2401.02739v5
Denoising Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors
2026-03-13T02:13:07Z
We propose denoising diffusion variational inference (DDVI), a black-box variational inference algorithm for latent variable models which relies on diffusion models as flexible approximate posteriors. Specifically, our method introduces an expressive class of diffusion-based variational posteriors that perform iterative refinement in latent space; we train these posteriors with a novel regularized evidence lower bound (ELBO) on the marginal likelihood inspired by the wake-sleep algorithm. Our method is easy to implement (it fits a regularized extension of the ELBO), is compatible with black-box variational inference, and outperforms alternative classes of approximate posteriors based on normalizing flows or adversarial networks.
We find that DDVI improves inference and learning in deep latent variable models across common benchmarks as well as on a motivating task in biology -- inferring latent ancestry from human genomes -- where it outperforms strong baselines on the Thousand Genomes dataset.
2024-01-05T10:27:44Z
published at AAAI 2025; the first two authors contribute equally to this work; code available at https://github.com/topwasu/DDVI
Wasu Top Piriyakulkij, Yingheng Wang, Volodymyr Kuleshov

http://arxiv.org/abs/2507.18557v4
Deep Learning for Blood-Brain Barrier Permeability Prediction: From Discriminative Models to Mechanism-Aware Design
2026-03-12T19:41:48Z
Predicting whether a molecule can cross the blood-brain barrier (BBB) is a key step in early-stage neuro-pharmaceutical design, directly influencing the efficiency and success rate of drug development. Traditional methods based on physicochemical properties are prone to systematic misjudgements due to their reliance on previous empirical evidence. Early machine learning (ML) models, although data-driven, often suffer from limited capacity, poor generalization, and insufficient interpretability. In recent years, more advanced models have become essential tools for predicting BBB permeability and guiding related drug design, owing to their ability to simulate molecular structures and capture complex biological mechanisms. This article systematically reviews the evolution of this field, from deep neural networks to graph-based structural modelling, highlighting the advantages of multi-task and multimodal learning strategies in identifying mechanism-related features. We further explore the emerging potential of generative models and causal inference methods for integrating permeability prediction with mechanism-aware drug design. ML-based BBB crossing prediction is now in a critical transition from mere discriminative classification toward structure-function modelling from a mechanistic perspective.
This paradigm shift provides a methodological progression and future roadmap for the integration of AI into neuropharmacological development.
2025-07-24T16:30:46Z
Zihan Yang, Yuchen Xiao

http://arxiv.org/abs/2603.12351v1
Probabilistic Joint and Individual Variation Explained (ProJIVE) for Data Integration
2026-03-12T18:16:07Z
Collecting multiple types of data on the same set of subjects is common in modern scientific applications including genomics, metabolomics, and neuroimaging. Joint and Individual Variation Explained (JIVE) seeks a low-rank approximation of the joint variation between two or more sets of features captured on common subjects and isolates this variation from that unique to each set of features. We develop an expectation-maximization (EM) algorithm to estimate a probabilistic model for the JIVE framework. The model extends probabilistic principal components analysis to multiple data sets. Our maximum likelihood approach simultaneously estimates joint and individual components, which can lead to greater accuracy compared to other methods. We apply ProJIVE to measures of brain morphometry and cognition in Alzheimer's disease. ProJIVE learns biologically meaningful sources of variation, and the joint morphometry and cognition subject scores are strongly related to more expensive existing biomarkers. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Code to reproduce the analysis is available on our GitHub page.
2026-03-12T18:16:07Z
Journal of Computational and Graphical Statistics (2026)
Raphiel J. Murden, Ganzhong Tian, Deqiang Qiu, Benjamin B.
Risk
10.1080/10618600.2026.2639081

http://arxiv.org/abs/2603.12349v1
Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection
2026-03-12T18:09:53Z
Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evaluation framework exists for comparing selection strategies -- a gap intensified by large language models (LLMs), which generate plausible scientific proposals without reliable downstream evaluation. We introduce the Budget-Sensitive Discovery Score (BSDS), a formally verified metric -- 20 theorems machine-checked by the Lean 4 proof assistant -- that jointly penalizes false discoveries (lambda-weighted FDR) and excessive abstention (gamma-weighted coverage gap) at each budget level. Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget.
As a case study, we apply BSDS/DQS to the question: do LLMs add marginal value to an existing ML pipeline for drug discovery candidate selection? We evaluate 39 proposers -- 11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations -- using SMILES representations on MoleculeNet HIV (41,127 compounds, 3.5% active, 1,000 bootstrap replicates) under both random and scaffold splits. Three findings emerge. First, the simple RF-based Greedy-ML proposer achieves the best DQS (-0.046), outperforming all MLP variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21, establishing that LLMs provide no marginal value over an existing trained classifier. Third, the proposer hierarchy generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug AV safety domain, and a 9x7 grid of penalty parameters (tau >= 0.636, mean tau = 0.863). The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs.
2026-03-12T18:09:53Z
Abhinaba Basu, Pavan Chakraborty

http://arxiv.org/abs/2603.12341v1
Self-Reported Side Effects of Semaglutide and Tirzepatide in Online Communities
2026-03-12T18:03:22Z
Social media can reveal patient experiences with glucagon-like peptide-1 receptor agonists (GLP-1 RAs) that extend beyond clinical trial data. We analyzed 410,198 Reddit posts (May 2019-June 2025) mentioning semaglutide or tirzepatide. A total of 67,008 users self-reported using these medications, and 43.5% described at least one side effect. Gastrointestinal symptoms predominated, including nausea (36.9%), fatigue (16.7%), vomiting (16.3%), constipation (15.3%), and diarrhea (12.6%). Notably, reproductive symptoms (e.g., menstrual irregularities) and temperature-related complaints (e.g., chills, hot flashes) emerged as unrecognized potential effects.
These findings highlight patient concerns not well captured in current labeling or trials. Large-scale social media analysis can complement traditional pharmacovigilance by detecting emerging safety signals and expanding understanding of the real-world safety profile of GLP-1 RAs.
2026-03-12T18:03:22Z
Neil K. R. Sehgal, Jena Shaw Tronieri, Lyle Ungar, Sharath Chandra Guntuku

http://arxiv.org/abs/2603.12253v1
Binding Free Energies without Alchemy
2026-03-12T17:58:49Z
Absolute Binding Free Energy (ABFE) methods are among the most accurate computational techniques for predicting protein-ligand binding affinities, but their utility is limited by the need for many simulations of alchemically modified intermediate states. We propose Direct Binding Free Energy (DBFE), an end-state ABFE method in implicit solvent that requires no alchemical intermediates. DBFE outperforms OBC2 double decoupling on a host-guest benchmark and performs comparably to OBC2 MM/GBSA on a protein-ligand benchmark. Since receptor and ligand simulations can be precomputed and amortized across compounds, DBFE requires only one complex simulation per ligand compared to the many lambda windows needed for double decoupling, making it a promising candidate for virtual screening workflows. We publicly release the code for this method at https://github.com/molecularmodelinglab/dbfe.
2026-03-12T17:58:49Z
14 pages, 4 figures
Michael Brocidiacono, Brandon Novy, Rishabh Dey, Konstantin I. Popov, Alexander Tropsha

http://arxiv.org/abs/2506.17373v5
A practical identifiability criterion leveraging weak-form parameter estimation
2026-03-12T17:28:30Z
In this work, we define a practical identifiability criterion, (e, q)-identifiability, based on a parameter e, reflecting the noise in observed variables, and a parameter q, reflecting the mean-square error of the parameter estimator.
This criterion is better able to encompass changes in the quality of the parameter estimate due to increased noise in the data (compared to existing criteria based solely on average relative errors). Furthermore, we leverage a weak-form equation error-based method of parameter estimation for systems with unobserved variables to assess practical identifiability far more quickly in comparison to output error-based parameter estimation. We do so by generating weak-form input-output equations using differential algebra techniques, as previously proposed by Boulier et al. [1], and then applying Weak form Estimation of Nonlinear Dynamics (WENDy) to obtain parameter estimates. This method is computationally efficient and robust to noise, as demonstrated through two classical biological modelling examples.
2025-06-20T16:11:47Z
Nora Heitzman-Breen, Vanja Dukic, David M. Bortz

http://arxiv.org/abs/2603.12016v1
Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era
2026-03-12T14:58:07Z
Modern imaging instruments can produce terabytes to petabytes of data for a single experiment. The biggest barrier to processing big image datasets has been computational, where image analysis algorithms often lack the efficiency needed to process such large datasets or make tradeoffs in robustness and accuracy. Deep learning algorithms have vastly improved the accuracy of the first step in an analysis workflow (region segmentation), but the expansion of domain specific feature extraction libraries across scientific disciplines has made it difficult to compare the performance and accuracy of extracted features. To address these needs, we developed a novel feature extraction library called Nyxus. Nyxus is designed from the ground up for scalable out-of-core feature extraction for 2D and 3D image data and rigorously tested against established standards.
The comprehensive feature set of Nyxus covers multiple biomedical domains including radiomics and cellular analysis, and is designed for computational scalability across CPUs and GPUs. Nyxus has been packaged to be accessible to users of various skill sets and needs: as a Python package for code developers, a command line tool, as a Napari plugin for low to no-code users or users that want to visualize results, and as an Open Container Initiative (OCI) compliant container that can be used in cloud or super-computing workflows aimed at processing large data sets. Further, Nyxus enables a new methodological approach to feature extraction allowing for programmatic tuning of many feature sets for optimal computational efficiency or coverage for use in novel machine learning and deep learning applications.
2026-03-12T14:58:07Z
29 pages, 9 figures, 6 supplemental tables
Nicholas Schaub, Andriy Kharchenko, Hamdah Abbasi, Sameeul Samee, Hythem Sidky, Nathan Hotaling

http://arxiv.org/abs/2603.12307v1
SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules
2026-03-12T12:04:03Z
Cryo-electron microscopy (cryo-EM) has emerged as a powerful technique for determining the three-dimensional structures of biological molecules at near-atomic resolution. However, reconstructing helical assemblies presents unique challenges due to their inherent symmetry and the need to determine unknown helical symmetry parameters. Traditional approaches require an accurate initial estimation of these parameters, which is often obtained through trial and error or prior knowledge. These requirements can lead to incorrect reconstructions, limiting the reliability of ab initio helical reconstruction.
In this work, we present SHREC (Spectral Helical REConstruction), an algorithm that directly recovers the projection angles of helical segments from their two-dimensional cryo-EM images, without requiring prior knowledge of helical symmetry parameters. Our approach leverages the insight that projections of helical segments form a one-dimensional manifold, which can be recovered using spectral embedding techniques. Experimental validation on publicly available datasets demonstrates that SHREC achieves high-resolution reconstructions while accurately recovering helical parameters, requiring only knowledge of the specimen's axial symmetry group. By eliminating the need for initial symmetry estimates, SHREC offers a more robust and automated pathway for determining helical structures in cryo-EM.
2026-03-12T12:04:03Z
Guy Shapira, Yoel Shkolnisky

http://arxiv.org/abs/2603.11476v1
Leveraging Phytolith Research using Artificial Intelligence
2026-03-12T02:57:23Z
Phytolith analysis is a crucial tool for reconstructing past vegetation and human activities, but traditional methods are severely limited by labour-intensive, time-consuming manual microscopy. To address this bottleneck, we present Sorometry: a comprehensive end-to-end artificial intelligence pipeline for the high-throughput digitisation, inference, and interpretation of phytoliths. Our workflow processes z-stacked optical microscope scans to automatically generate synchronised 2D orthoimages and 3D point clouds of individual microscopic particles. We developed a multimodal fusion model that combines ConvNeXt for 2D image analysis and PointNet++ for 3D point cloud analysis, supported by a graphical user interface for expert annotation and review. Tested on reference collections and archaeological samples from the Bolivian Amazon, our fusion model achieved a global classification accuracy of 77.9% across 24 diagnostic morphotypes and 84.5% for segmentation quality.
Crucially, the integration of 3D data proved essential for distinguishing complex morphotypes (such as grass silica short cell phytoliths) whose diagnostic features are often obscured by their orientation in 2D projections. Beyond individual object classification, Sorometry incorporates Bayesian finite mixture modelling to predict overall plant source contributions at the assemblage level, successfully identifying specific plants like maize and palms in complex mixed samples. This integrated platform transforms phytolith research into an "omics"-scale discipline, dramatically expanding analytical capacity, standardising expert judgements, and enabling reproducible, population-level characterisations of archaeological and paleoecological assemblages.
2026-03-12T02:57:23Z
45 pages, 23 figures
Andrés G. Mejía Ramón, Kate Dudgeon, Nina Witteveen, Dolores Piperno, Michael Kloster, Luigi Palopoli, Mónica Moraes R., José M. Capriles, Umberto Lombardo

http://arxiv.org/abs/2603.11387v1
Framing local structural identifiability and observability in terms of parameter-state symmetries
2026-03-12T00:01:18Z
We introduce a subclass of Lie symmetries, called parameter-state symmetries, to analyse the local structural identifiability and observability of mechanistic models consisting of state-dependent ODEs with observed outputs. These symmetries act on parameters and states while preserving observed outputs at every time point. We prove that locally structurally identifiable parameter combinations and locally structurally observable states correspond to universal invariants of all parameter-state symmetries of a given model. We illustrate the framework on four previously studied mechanistic models, confirming known identifiability results and revealing novel insights into which states are observable, providing a unified symmetry-based approach for analysing structural properties of dynamical systems.
2026-03-12T00:01:18Z
52 pages, 0 figures. Supplementary calculations included in the appendices
Johannes G.
Borgqvist, Alexander P. Browning, Fredrik Ohlsson, Ruth E. Baker

http://arxiv.org/abs/2603.11344v1
Hybrid eTFCE-GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry
2026-03-11T22:20:03Z
Threshold-free cluster enhancement (TFCE) integrates cluster extent across thresholds to improve voxel-wise neuroimaging inference, but permutation testing makes it prohibitively slow for large datasets. Probabilistic TFCE (pTFCE) uses analytical Gaussian random field (GRF) p-values but discretises the threshold grid. Exact TFCE (eTFCE) eliminates discretisation via a union-find data structure but still requires permutations. We combine eTFCE's union-find for exact cluster-size retrieval with pTFCE's analytical GRF inference. The union-find builds the cluster hierarchy in one pass over sorted voxels and enables exact size queries at any threshold; GRF theory then converts these sizes to analytical p-values without permutations. Validation on synthetic phantoms (64^3, 80 subjects): FWER controlled at nominal level (0/200 null rejections, 95% CI [0.0%, 1.9%]); power matches baseline pTFCE (Dice >= 0.999); smoothness error below 1%; concordance r > 0.99. On UK Biobank (N=500) and IXI (N=563), significance maps form strict subsets of reference R pTFCE, which supports conservative error control. Implemented in pytfce (pip install pytfce): baseline completes whole-brain VBM in ~5s (75x faster than R pTFCE), hybrid in ~85s (4.6x faster) with exact cluster sizes; both >1000x faster than permutation TFCE.
2026-03-11T22:20:03Z
25 pages, 7 figures, 3 tables. Submitted to NeuroImage. Open-source package: https://github.com/Don-Yin/pytfce
Don Yin, Hao Chen, Takeshi Miki, Boxing Liu, Enyu Yang

http://arxiv.org/abs/2603.11330v1
Ill-Conditioning in Dictionary-Based Dynamic-Equation Learning: A Systems Biology Case Study
2026-03-11T21:42:02Z
Data-driven discovery of governing equations from time-series data provides a powerful framework for understanding complex biological systems.
Library-based approaches that use sparse regression over candidate functions have shown considerable promise, but they face a critical challenge when candidate functions become strongly correlated: numerical ill-conditioning. Poor or restricted sampling, together with particular choices of candidate libraries, can produce strong multicollinearity and numerical instability. In such cases, measurement noise may lead to widely different recovered models, obscuring the true underlying dynamics and hindering accurate system identification. Although sparse regularization promotes parsimonious solutions and can partially mitigate conditioning issues, strong correlations may persist, regularization may bias the recovered models, and the regression problem may remain highly sensitive to small perturbations in the data. We present a systematic analysis of how ill-conditioning affects sparse identification of biological dynamics using benchmark models from systems biology. We show that combinations involving as few as two or three terms can already exhibit strong multicollinearity and extremely large condition numbers. We further show that orthogonal polynomial bases do not consistently resolve ill-conditioning and can perform worse than monomial libraries when the data distribution deviates from the weight function associated with the orthogonal basis. Finally, we demonstrate that when data are sampled from distributions aligned with the appropriate weight functions corresponding to the orthogonal basis, numerical conditioning improves, and orthogonal polynomial bases can yield improved model recovery accuracy across two baseline models.
2026-03-11T21:42:02Z
Yuxiang Feng, Niall M Mangan, Manu Jayadharan
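
The claim in the last abstract above, that as few as two library terms can already be strongly collinear under restricted sampling, can be illustrated with a small sketch. This is not the authors' code or benchmark. The two-term monomial library {x, x^2}, the sampling intervals, and the helper names `pearson`, `two_term_condition_number` and `grid` are all assumptions chosen for illustration.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def two_term_condition_number(u, v):
    """2-norm condition number of the two-column library [u, v] after
    centring each column and scaling it to unit length.

    The Gram matrix of such columns is [[1, r], [r, 1]], whose eigenvalues
    are 1 + |r| and 1 - |r|, so the condition number of the library matrix
    is sqrt((1 + |r|) / (1 - |r|)).
    """
    r = abs(pearson(u, v))
    return math.sqrt((1.0 + r) / (1.0 - r))


def grid(lo, hi, n):
    """n evenly spaced sample points on [lo, hi] (hypothetical sampling)."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


# Restricted sampling: the state only visits [1, 2] (assumed range).
restricted = grid(1.0, 2.0, 201)
# Sampling symmetric about zero, where x and x^2 are uncorrelated.
symmetric = grid(-1.0, 1.0, 201)

cond_restricted = two_term_condition_number(restricted, [x**2 for x in restricted])
cond_symmetric = two_term_condition_number(symmetric, [x**2 for x in symmetric])

print("condition number, x in [1, 2]: ", cond_restricted)
print("condition number, x in [-1, 1]:", cond_symmetric)
```

On the restricted interval the two columns are almost perfectly correlated, so the condition number is much larger than 1. On the symmetric interval it stays close to 1. The same library can therefore be well or badly conditioned depending only on where the data were sampled.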