https://arxiv.org/api/q0l2rrl8Lgbl0NShXmFVB3NCK4Q 2026-03-22T18:51:43Z 6642 240 15 http://arxiv.org/abs/2506.19865v2 Scalable and Cost-Efficient de Novo Template-Based Molecular Generation 2025-11-04T08:59:49Z

Template-based molecular generation offers a promising avenue for drug design by ensuring generated compounds are synthetically accessible through predefined reaction templates and building blocks. In this work, we tackle three core challenges in template-based GFlowNets: (1) minimizing synthesis cost, (2) scaling to large building block libraries, and (3) effectively utilizing small fragment sets. We propose Recursive Cost Guidance, a backward policy framework that employs auxiliary machine learning models to approximate synthesis cost and viability. This guidance steers generation toward low-cost synthesis pathways, significantly enhancing cost-efficiency, molecular diversity, and quality, especially when paired with an Exploitation Penalty that balances the trade-off between exploration and exploitation. To enhance performance in smaller building block libraries, we develop a Dynamic Library mechanism that reuses intermediate high-reward states to construct full synthesis trees. Our approach establishes state-of-the-art results in template-based molecular generation.

2025-06-10T15:16:09Z Piotr Gaiński Oussama Boussif Andrei Rekesh Dmytro Shevchuk Ali Parviz Mike Tyers Robert A. Batey Michał Koziarski http://arxiv.org/abs/2510.15975v2 Generative AI for Biosciences: Emerging Threats and Roadmap to Biosecurity 2025-11-04T08:03:22Z

The rapid adoption of generative artificial intelligence (GenAI) in the biosciences is transforming biotechnology, medicine, and synthetic biology. Yet this advancement is intrinsically linked to new vulnerabilities, as GenAI lowers the barrier to misuse and introduces novel biosecurity threats, such as generating synthetic viral proteins or toxins. These dual-use risks are often overlooked, as existing safety guardrails remain fragile and can be circumvented through deceptive prompts or jailbreak techniques. In this Perspective, we first outline the current state of GenAI in the biosciences and emerging threat vectors ranging from jailbreak attacks and privacy risks to the dual-use challenges posed by autonomous AI agents. We then examine urgent gaps in regulation and oversight, drawing on insights from 130 expert interviews across academia, government, industry, and policy. A large majority ($\approx 76$\%) expressed concern over AI misuse in biology, and 74\% called for the development of new governance frameworks. Finally, we explore technical pathways to mitigation, advocating a multi-layered approach to GenAI safety. These defenses include rigorous data filtering, alignment with ethical principles during development, and real-time monitoring to block harmful requests. Together, these strategies provide a blueprint for embedding security throughout the GenAI lifecycle. As GenAI becomes integrated into the biosciences, safeguarding this frontier requires an immediate commitment to both adaptive governance and secure-by-design technologies.

2025-10-13T00:24:41Z Zaixi Zhang Souradip Chakraborty Amrit Singh Bedi Emilin Mathew Varsha Saravanan Le Cong Alvaro Velasquez Sheng Lin-Gibson Megan Blewett Dan Hendrycs Alex John London Ellen Zhong Ben Raphael Adji Bousso Dieng Jian Ma Eric Xing Russ Altman George Church Mengdi Wang http://arxiv.org/abs/2511.02128v1 DL4Proteins Jupyter Notebooks Teach how to use Artificial Intelligence for Biomolecular Structure Prediction and Design 2025-11-03T23:43:20Z

Computational methods for predicting and designing biomolecular structures are increasingly powerful. While previous approaches relied on physics-based modeling, modern tools, such as AlphaFold2 in CASP14, leverage artificial intelligence (AI) to achieve significantly improved performance. The growing impact of AI-based tools in protein science necessitates enhanced educational materials that improve AI literacy among both established scientists seeking to deepen their expertise and new researchers entering the field. To address this need, we developed DL4Proteins, a series of ten interactive notebook modules that introduce fundamental machine learning (ML) concepts, guide users through training ML models for protein-related tasks, and ultimately present cutting-edge protein structure prediction and design pipelines. With nothing more than a web browser, learners can now access state-of-the-art computational tools employed by professional protein engineers - ranging from all-atom protein design to fine-tuning protein language models for biophysically relevant functional tasks. By increasing accessibility, this notebook series broadens participation in AI-driven protein research. The complete notebook series is publicly available at https://github.com/Graylab/DL4Proteins-notebooks.

2025-11-03T23:43:20Z 27 pages, 5 figures Michael Chungyoun Gabe Au Britnie Carpentier Sreevarsha Puvada Courtney Thomas Jeffrey J. Gray http://arxiv.org/abs/2510.19660v2 Machine Olfaction and Embedded AI Are Shaping the New Global Sensing Industry 2025-11-03T16:02:34Z

Machine olfaction is rapidly emerging as a transformative capability, with applications spanning non-invasive medical diagnostics, industrial monitoring, agriculture, and security and defense. Recent advances in stabilizing mammalian olfactory receptors and integrating them into biophotonic and bioelectronic systems have enabled detection at near single-molecule resolution, thus placing machines on par with trained detection dogs. As this technology converges with multimodal AI and distributed sensor networks imbued with embedded AI, it introduces a new, biochemical layer to a sensing ecosystem currently dominated by machine vision and audition. This review and industry roadmap surveys the scientific foundations, technological frontiers, and strategic applications of machine olfaction, making the case that we are currently witnessing the rise of a new industry that brings with it a global chemosensory infrastructure. We cover exemplary industrial, military and consumer applications and address some of the ethical and legal concerns arising. We find that machine olfaction is poised to bring forth a planet-wide molecular awareness tech layer with the potential of spawning vast emerging markets in health, security, and environmental sensing via scent.

2025-10-22T15:05:01Z 23 pages, 116 citations, combination tech review/industry roadmap/white paper on the rise of machine olfaction as an essential AI modality Andreas Mershin Nikolas Stefanou Adan Rotteveel Matthew Kung George Kung Alexandru Dan Howard Kivell Zoia Okulova Zoi Kountouri Paul Pu Liang http://arxiv.org/abs/2409.07189v2 AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems 2025-11-03T11:59:01Z

Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently emerged as a "human-in-the-loop" strategy for efficiently navigating hyper-dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular simulations running on high-performance computing architectures, iMD-VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high-dimensional molecular systems. Moreover, iMD-VR simulations generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the use of researcher-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across robotics and multi-agent systems domains, which are comparable to iMD-VR, and discuss how iMD-VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof-of-principle study where iMD-VR data was used to train a CNN on a simple molecular manipulation task, namely threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.

2024-09-11T11:21:02Z (First presented at the First Workshop on "eXtended Reality \& Intelligent Agents" (XRIA24) @ ECAI24, Santiago De Compostela (Spain), 20 October 2024) SN COMPUT. SCI. 6, 922 (2025) Mohamed Dhouioui Jonathan Barnoud Rhoslyn Roebuck Williams Harry J. Stroud Phil Bates David R. Glowacki 10.1007/s42979-025-04465-5 http://arxiv.org/abs/2510.27074v2 How Do Proteins Fold? 2025-11-03T03:44:42Z

How proteins fold remains a central unsolved problem in biology. While the idea of a folding code embedded in the amino acid sequence was introduced more than six decades ago, this code remains undefined. Although we now have powerful tools to predict the final native structure of proteins, we still lack a predictive framework for how sequences dictate folding pathways. Two main conceptual models dominate as explanations of folding mechanism: the funnel model, in which folding proceeds through many alternative routes on a rugged, hyperdimensional energy landscape; and the foldon model, which proposes a hierarchical sequence of discrete intermediates. Recent advances on two fronts are now enabling folding studies in unprecedented ways. Powerful experimental approaches, in particular single-molecule force spectroscopy and hydrogen/deuterium exchange assays, allow time-resolved tracking of the folding process at high resolution. At the same time, computational breakthroughs culminating in algorithms such as AlphaFold have revolutionized static structure prediction, opening opportunities to extend machine learning toward dynamics. Together, these developments mark a turning point: for the first time, we are positioned to resolve how proteins fold, why they misfold, and how this knowledge can be harnessed for biology and medicine.

2025-10-31T00:46:57Z 13 pages, 3 figures Carlos Bustamante Christian Kaiser Erik Lindahl Robert Sosa Giovanni Volpe http://arxiv.org/abs/2511.00951v1 Design, Assessment, and Application of Machine Learning Potential Energy Surfaces 2025-11-02T14:25:00Z

Potential Energy Surfaces (PESs) are an indispensable tool to investigate, characterise and understand chemical and biological systems in the gas and condensed phases. Advances in Machine Learning (ML) methodologies have led to the development of Machine Learned Potential Energy Surfaces (ML-PESs), which are now widely used to simulate such systems. The present work provides an overview of concepts, methodologies and recommendations for constructing and using ML-PESs. The choice of topics is focused on practical and recurrent issues in conceiving and using such models. Application of the principles discussed is illustrated through two different systems of biomolecular importance: the non-reactive dynamics of the Alanine-Lysine-Alanine tripeptide in gas and solution phases, and double proton transfer reactions in DNA base pairs.

2025-11-02T14:25:00Z Valerii Andreichev Sena Aydin Kai Töpfer Markus Meuwly Luis Itza Vazquez-Salazar http://arxiv.org/abs/2510.27539v1 The transitional kinetics between open and closed Rep structures can be tuned by salt via two intermediate states 2025-10-31T15:14:49Z

DNA helicases undergo conformational changes; however, their structural dynamics are poorly understood. Here, we study single molecules of superfamily 1A DNA helicase Rep, which undergo conformational transitions during bacterial DNA replication, repair and recombination. We use time-correlated single-photon counting (TCSPC), fluorescence correlation spectroscopy (FCS), rapid single-molecule Förster resonance energy transfer (smFRET), Anti-Brownian ELectrokinetic (ABEL) trapping and molecular dynamics simulations (MDS) to provide unparalleled temporal and spatial resolution of Rep's domain movements. We detect four states, revealing two hitherto hidden intermediates (S2, S3) between the open (S1) and closed (S4) structures, whose stability is salt dependent. Rep's open-to-closed switch involves multiple changes to all four subdomains 1A, 1B, 2A and 2B along the S1 to S2 to S3 to S4 transitional pathway, comprising an initial truncated swing of 2B, which then rolls across the 1B surface, followed by combined rotations of 1B, 2A and 2B. High forward and reverse rates for S1 to S2 suggest that 1B may act to frustrate 2B movement to prevent premature Rep closure in the absence of DNA. These observations support a more general binding model for accessory DNA helicases that utilises conformational plasticity to explore a multiplicity of structures whose landscape can be tuned by salt prior to locking-in upon DNA binding.

2025-10-31T15:14:49Z Jamieson A L Howard Benjamin Ambrose Mahmoud A S Abdelhamid Lewis Frame Antoinette Alevropoulos-Borrill Ayesha Ejaz Lara Dresser Maria Dienerowitz Steven D Quinn Allison H Squires Agnes Noy Timothy D Craggs Mark C Leake http://arxiv.org/abs/2510.27212v1 The Demon Hidden Behind Life's Ultra-Energy-Efficient Information Processing -- Demonstrated by Biological Molecular Motors 2025-10-31T06:13:41Z

The remarkable progress of artificial intelligence (AI) has revealed the enormous energy demands of modern digital architectures, raising deep concerns about sustainability. In stark contrast, the human brain operates efficiently on only ~20 watts, and individual cells process gigabit-scale genetic information using energy on the order of trillionths of a watt. Under the same energy budget, a general-purpose digital processor can perform only a few simple operations per second. This striking disparity suggests that biological systems follow algorithms fundamentally distinct from conventional computation. The framework of information thermodynamics, especially Maxwell's demon and the Szilard engine, offers a theoretical clue, setting the lower bound of energy required for information processing. However, digital processors exceed this limit by about six orders of magnitude. Recent single-molecule studies have revealed that biological molecular motors convert Brownian motion into mechanical work, realizing a "demon-like" operational principle. These findings suggest that living systems have already implemented an ultra-efficient information-energy conversion mechanism that transcends digital computation. Here, we experimentally establish a quantitative correspondence between positional information (bits) and mechanical work, demonstrating that molecular machines selectively exploit rare but functional fluctuations arising from Brownian motion to achieve ATP-level energy efficiency. This integration of information, energy, and timescale indicates that life realizes a Maxwell's demon-like mechanism for energy-efficient information processing.
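
The lower bound mentioned in this abstract can be made concrete with a short calculation. The sketch below is not taken from the paper: it assumes the standard Szilard-engine result of k_B T ln 2 of work per bit and an illustrative temperature of 300 K, and the function name is hypothetical. The factor of one million simply restates the abstract's "about six orders of magnitude".

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def szilard_bound_joules(temperature_kelvin: float, bits: float = 1.0) -> float:
    """Work equivalent of `bits` of information at the given temperature,
    using the textbook Szilard-engine value k_B * T * ln(2) per bit."""
    return bits * K_B * temperature_kelvin * math.log(2)


# Assumption: 300 K, chosen only for illustration (roughly room temperature).
bound_per_bit = szilard_bound_joules(300.0)

# The abstract states that digital processors exceed the bound by about
# six orders of magnitude; this line only restates that ratio numerically.
digital_per_bit = bound_per_bit * 1e6

print(f"bound per bit at 300 K: {bound_per_bit:.3e} J")
print(f"six orders above bound: {digital_per_bit:.3e} J")
```

At 300 K the bound is a few zeptojoules per bit, which gives a sense of the scale against which the abstract compares molecular motors and digital hardware.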

2025-10-31T06:13:41Z 8 pages, 5 figures, 1 table Toshio Yanagida Keisuke Fujita Mitsuhiro Iwaki http://arxiv.org/abs/2410.01755v3 Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes 2025-10-30T17:28:23Z

Breast cancer's complexity and variability pose significant challenges in understanding its progression and guiding effective treatment. This study aims to integrate protein sequence data with expression levels to improve the molecular characterization of breast cancer subtypes and predict clinical outcomes. Using ProtGPT2, a language model specifically designed for protein sequences, we generated embeddings that capture the functional and structural properties of proteins. These embeddings were integrated with protein expression levels to form enriched biological representations, which were analyzed using machine learning methods, such as ensemble K-means for clustering and XGBoost for classification. Our approach enabled the successful clustering of patients into biologically distinct groups and accurately predicted clinical outcomes such as survival and biomarker status, achieving high performance metrics, notably an F1 score of 0.88 for survival and 0.87 for biomarker status prediction. Feature importance analysis identified KMT2C, CLASP2, and MYO1B as key proteins involved in hormone signaling, cytoskeletal remodeling, and therapy resistance in hormone receptor-positive and triple-negative breast cancer, with potential influence on breast cancer subtype behavior and progression. Furthermore, protein-protein interaction networks and correlation analyses revealed functional interdependencies among proteins that may influence the behavior and progression of breast cancer subtypes. These findings suggest that integrating protein sequence and expression data provides valuable insights into tumor biology and has significant potential to enhance personalized treatment strategies in breast cancer care.

2024-10-02T17:05:48Z Hossein Sholehrasa Majid Jaberi-Douraki http://arxiv.org/abs/2510.26728v1 Modelling ion channels with a view towards identifiability 2025-10-30T17:27:25Z

Aggregated Markov models provide a flexible framework for stochastic dynamics that develops on multiple timescales. For example, Markov models for ion channels often consist of multiple open and closed states to account for "slow" and "fast" openings and closings of the channel. The approach is a popular tool in the construction of mechanistic models of ion channels - instead of viewing model states as generators of sojourn times of a certain characteristic length, each individual model state is interpreted as a representation of a distinct biophysical state. We will review the properties of aggregated Markov models and discuss the implications for mechanistic modelling. First, we show how the aggregated Markov models with a given number of states can be calculated using Pólya enumeration. However, models with $n_O$ open and $n_C$ closed states that exceed the maximum number $2 n_O n_C$ of parameters are non-identifiable. We will present two derivations for this classical result and investigate non-identifiability further via a detailed analysis of the non-identifiable fully connected three-state model. Finally, we will discuss the implications of non-identifiability for mechanistic modelling of ion channels. We will argue that instead of designing models based on assumed transitions between distinct biophysical states which are modulated by ligand binding, it is preferable to build models based on additional sources of data that give more direct insight into the dynamics of conformational changes.
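
The parameter bound quoted in this abstract can be checked by simple counting. The sketch below is illustrative and not from the paper: the function names are hypothetical, and it assumes that a fully connected continuous-time model has one rate constant per ordered pair of distinct states, that is $n(n-1)$ rates for $n$ states.

```python
def max_identifiable_parameters(n_open: int, n_closed: int) -> int:
    """Upper bound 2 * n_O * n_C on identifiable parameters, as quoted in the abstract."""
    return 2 * n_open * n_closed


def fully_connected_rate_count(n_states: int) -> int:
    """Rate constants in a fully connected model: one per ordered pair of
    distinct states (assumption stated in the lead-in)."""
    return n_states * (n_states - 1)


def exceeds_bound(n_open: int, n_closed: int) -> bool:
    """True if the fully connected model has more rates than the bound allows."""
    n_rates = fully_connected_rate_count(n_open + n_closed)
    return n_rates > max_identifiable_parameters(n_open, n_closed)


# Fully connected three-state model: 6 rates against a bound of 4,
# whichever way the three states are split into open and closed classes.
for n_open, n_closed in [(1, 2), (2, 1)]:
    print(n_open, n_closed,
          fully_connected_rate_count(n_open + n_closed),
          max_identifiable_parameters(n_open, n_closed),
          exceeds_bound(n_open, n_closed))
```

Under these counting assumptions the three-state case has six rate constants but at most four identifiable parameters, which is consistent with the abstract calling the fully connected three-state model non-identifiable.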

2025-10-30T17:27:25Z 37 pages, 6 figures, presented at MATRIX workshop "Parameter Identifiability in Mathematical Biology" https://www.matrix-inst.org.au/events/parameter-identifiability-in-mathematical-biology/? Ivo Siekmann http://arxiv.org/abs/2508.16398v2 Multiscale Growth Kinetics of Model Biomolecular Condensates Under Passive and Active Conditions 2025-10-30T16:43:11Z

Living cells exhibit a complex organization comprising numerous compartments, among which are RNA- and protein-rich membraneless, liquid-like organelles known as biomolecular condensates. Energy-consuming processes regulate their formation and dissolution, with (de-)phosphorylation by specific enzymes being among the most commonly involved reactions. By employing a model system consisting of a phosphorylatable peptide and homopolymeric RNA, we elucidate how enzymatic activity modulates the growth kinetics and alters the local structure of biomolecular condensates. Under passive conditions, time-resolved ultra-small-angle X-ray scattering with a synchrotron source reveals a nucleation-driven coalescence mechanism maintained over four decades in time, similar to the coarsening of simple binary fluid mixtures. Coarse-grained molecular dynamics simulations show that peptide-decorated RNA chains assembled shortly after mixing constitute the relevant subunits. In contrast, actively-formed condensates initially display a local mass fractal structure, which gradually matures upon enzymatic activity before condensates undergo coalescence. Both types of condensate eventually reach a steady state, but fluorescence recovery after photobleaching indicates a peptide diffusivity twice as high in actively-formed condensates, consistent with their loosely-packed local structure. We expect multiscale, integrative approaches implemented with model systems to link effectively the functional properties of membraneless organelles to their formation and dissolution kinetics as regulated by cellular active processes.

2025-08-22T14:00:08Z Tamizhmalar Sundararajan Matteo Boccalini Roméo Suss Sandrine Mariot Emerson R. Da Silva Fernando C. Giacomelli Austin Hubley Theyencheri Narayanan Alessandro Barducci Guillaume Tresset http://arxiv.org/abs/2508.06576v2 GFlowNets for Learning Better Drug-Drug Interaction Representations 2025-10-30T13:59:28Z

Drug-drug interactions (DDIs) pose a significant challenge in clinical pharmacology, with severe class imbalance among interaction types limiting the effectiveness of predictive models. Common interactions dominate datasets, while rare but critical interactions remain underrepresented, leading to poor model performance on infrequent cases. Existing methods often treat DDI prediction as a binary problem, ignoring class-specific nuances and exacerbating bias toward frequent interactions. To address this, we propose a framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generating effective and novel DDI pairs. Our approach enhances predictive performance across interaction types, ensuring better clinical reliability.

2025-08-07T14:03:23Z Accepted to ICANN 2025:AIDD and NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling (https://openreview.net/forum?id=LZW1jSgfCI) Azmine Toushik Wasi http://arxiv.org/abs/2506.05768v2 AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation 2025-10-30T06:11:45Z

Virtual screening (VS) is a critical component of modern drug discovery, yet most existing methods--whether physics-based or deep learning-based--are developed around holo protein structures with known ligand-bound pockets. Consequently, their performance degrades significantly on apo or predicted structures such as those from AlphaFold2, which are more representative of real-world early-stage drug discovery, where pocket information is often missing. In this paper, we introduce an alignment-and-aggregation framework to enable accurate virtual screening under structural uncertainty. Our method comprises two core components: (1) a tri-modal contrastive learning module that aligns representations of the ligand, the holo pocket, and cavities detected from structures, thereby enhancing robustness to pocket localization error; and (2) a cross-attention based adapter for dynamically aggregating candidate binding sites, enabling the model to learn from activity data even without precise pocket annotations. We evaluated our method on a newly curated benchmark of apo structures, where it significantly outperforms state-of-the-art methods in the blind apo setting, improving the early enrichment factor (EF1%) from 11.75 to 37.19. Notably, it also maintains strong performance on holo structures. These results demonstrate the promise of our approach in advancing first-in-class drug discovery, particularly in scenarios lacking experimentally resolved protein-ligand complexes. Our implementation is publicly available at https://github.com/Wiley-Z/AANet.
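
For readers unfamiliar with the EF1% metric reported here, the sketch below shows one common way to compute an enrichment factor. It is not the AANet evaluation code: the function name, the rounding of the top fraction, and the toy data are assumptions made for illustration, and the paper's benchmark may handle ties or cutoffs differently.

```python
import math
from typing import Sequence


def enrichment_factor(scores: Sequence[float], labels: Sequence[int],
                      fraction: float = 0.01) -> float:
    """Hit rate among the top-scoring `fraction` of compounds divided by the
    hit rate in the whole library. Labels are 1 for active, 0 for inactive."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    total_actives = sum(labels)
    if n_total == 0 or total_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    # Assumption: the cutoff is rounded to the nearest whole compound, minimum 1.
    n_top = max(1, round(n_total * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:n_top])
    return (hits_in_top / n_top) / (total_actives / n_total)


# Toy library (hypothetical): 1000 compounds, 10 actives, 4 of them ranked in the top 10.
toy_scores = [float(1000 - i) for i in range(1000)]
toy_labels = [0] * 1000
for idx in (0, 1, 2, 3, 500, 501, 502, 503, 504, 505):
    toy_labels[idx] = 1

ef1 = enrichment_factor(toy_scores, toy_labels, fraction=0.01)
print(f"toy EF1% = {ef1:.2f}")
```

In this toy case 40% of the top 1% are active against a 1% base rate, so the enrichment factor is about 40. On the same scale, the reported move from 11.75 to 37.19 is roughly a 3.2-fold increase in early enrichment.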

2025-06-06T05:52:19Z Accepted at NeurIPS 2025 Wenyu Zhu Jianhui Wang Bowen Gao Yinjun Jia Haichuan Tan Ya-Qin Zhang Wei-Ying Ma Yanyan Lan http://arxiv.org/abs/2504.17247v2 OmegAMP: Targeted AMP Discovery through Biologically Informed Generation 2025-10-29T11:30:12Z

Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as limited controllability, lack of representations that efficiently model antimicrobial properties, and low experimental hit rates. To address these challenges, we introduce OmegAMP, a framework designed for reliable AMP generation with increased controllability. Its diffusion-based generative model leverages a novel conditioning mechanism to achieve fine-grained control over desired physicochemical properties and to direct generation towards specific activity profiles, including species-specific effectiveness. This is further enhanced by a biologically informed encoding space that significantly improves overall generative performance. Complementing these generative capabilities, OmegAMP leverages a novel synthetic data augmentation strategy to train classifiers for AMP filtering, drastically reducing false positive rates and thereby increasing the likelihood of experimental success. Our in silico experiments demonstrate that OmegAMP delivers state-of-the-art performance across key stages of the AMP discovery pipeline, enabling us to achieve an unprecedented success rate in wet lab experiments. Of the 25 candidate peptides we tested, 24 (96%) demonstrated antimicrobial activity, proving effective even against multi-drug resistant strains. Our findings underscore OmegAMP's potential to significantly advance computational frameworks in the fight against antimicrobial resistance.

2025-04-24T04:53:04Z Diogo Soares Leon Hetzel Paulina Szymczak Marcelo Der Torossian Torres Johanna Sommer Cesar de la Fuente-Nunez Fabian Theis Stephan Günnemann Ewa Szczurek