http://arxiv.org/abs/2506.19865v2
Scalable and Cost-Efficient de Novo Template-Based Molecular Generation
2025-11-04T08:59:49Z
Template-based molecular generation offers a promising avenue for drug design by ensuring generated compounds are synthetically accessible through predefined reaction templates and building blocks. In this work, we tackle three core challenges in template-based GFlowNets: (1) minimizing synthesis cost, (2) scaling to large building block libraries, and (3) effectively utilizing small fragment sets. We propose Recursive Cost Guidance, a backward policy framework that employs auxiliary machine learning models to approximate synthesis cost and viability. This guidance steers generation toward low-cost synthesis pathways, significantly enhancing cost-efficiency, molecular diversity, and quality, especially when paired with an Exploitation Penalty that balances the trade-off between exploration and exploitation. To enhance performance in smaller building block libraries, we develop a Dynamic Library mechanism that reuses intermediate high-reward states to construct full synthesis trees. Our approach establishes state-of-the-art results in template-based molecular generation.
2025-06-10T15:16:09Z
Piotr Gaiński, Oussama Boussif, Andrei Rekesh, Dmytro Shevchuk, Ali Parviz, Mike Tyers, Robert A. Batey, Michał Koziarski

http://arxiv.org/abs/2510.15975v2
Generative AI for Biosciences: Emerging Threats and Roadmap to Biosecurity
2025-11-04T08:03:22Z
The rapid adoption of generative artificial intelligence (GenAI) in the biosciences is transforming biotechnology, medicine, and synthetic biology. Yet this advancement is intrinsically linked to new vulnerabilities, as GenAI lowers the barrier to misuse and introduces novel biosecurity threats, such as generating synthetic viral proteins or toxins.
These dual-use risks are often overlooked, as existing safety guardrails remain fragile and can be circumvented through deceptive prompts or jailbreak techniques. In this Perspective, we first outline the current state of GenAI in the biosciences and emerging threat vectors ranging from jailbreak attacks and privacy risks to the dual-use challenges posed by autonomous AI agents. We then examine urgent gaps in regulation and oversight, drawing on insights from 130 expert interviews across academia, government, industry, and policy. A large majority (≈76%) expressed concern over AI misuse in biology, and 74% called for the development of new governance frameworks. Finally, we explore technical pathways to mitigation, advocating a multi-layered approach to GenAI safety. These defenses include rigorous data filtering, alignment with ethical principles during development, and real-time monitoring to block harmful requests. Together, these strategies provide a blueprint for embedding security throughout the GenAI lifecycle. As GenAI becomes integrated into the biosciences, safeguarding this frontier requires an immediate commitment to both adaptive governance and secure-by-design technologies.
2025-10-13T00:24:41Z
Zaixi Zhang, Souradip Chakraborty, Amrit Singh Bedi, Emilin Mathew, Varsha Saravanan, Le Cong, Alvaro Velasquez, Sheng Lin-Gibson, Megan Blewett, Dan Hendrycs, Alex John London, Ellen Zhong, Ben Raphael, Adji Bousso Dieng, Jian Ma, Eric Xing, Russ Altman, George Church, Mengdi Wang

http://arxiv.org/abs/2511.02128v1
DL4Proteins Jupyter Notebooks Teach how to use Artificial Intelligence for Biomolecular Structure Prediction and Design
2025-11-03T23:43:20Z
Computational methods for predicting and designing biomolecular structures are increasingly powerful. While previous approaches relied on physics-based modeling, modern tools, such as AlphaFold2 in CASP14, leverage artificial intelligence (AI) to achieve significantly improved performance.
The growing impact of AI-based tools in protein science necessitates enhanced educational materials that improve AI literacy among both established scientists seeking to deepen their expertise and new researchers entering the field. To address this need, we developed DL4Proteins, a series of ten interactive notebook modules that introduce fundamental machine learning (ML) concepts, guide users through training ML models for protein-related tasks, and ultimately present cutting-edge protein structure prediction and design pipelines. With nothing more than a web browser, learners can now access state-of-the-art computational tools employed by professional protein engineers, ranging from all-atom protein design to fine-tuning protein language models for biophysically relevant functional tasks. By increasing accessibility, this notebook series broadens participation in AI-driven protein research. The complete notebook series is publicly available at https://github.com/Graylab/DL4Proteins-notebooks.
2025-11-03T23:43:20Z
27 pages, 5 figures
Michael Chungyoun, Gabe Au, Britnie Carpentier, Sreevarsha Puvada, Courtney Thomas, Jeffrey J. Gray

http://arxiv.org/abs/2510.19660v2
Machine Olfaction and Embedded AI Are Shaping the New Global Sensing Industry
2025-11-03T16:02:34Z
Machine olfaction is rapidly emerging as a transformative capability, with applications spanning non-invasive medical diagnostics, industrial monitoring, agriculture, and security and defense. Recent advances in stabilizing mammalian olfactory receptors and integrating them into biophotonic and bioelectronic systems have enabled detection at near single-molecule resolution, thus placing machines on par with trained detection dogs. As this technology converges with multimodal AI and distributed sensor networks imbued with embedded AI, it introduces a new, biochemical layer to a sensing ecosystem currently dominated by machine vision and audition.
This review and industry roadmap surveys the scientific foundations, technological frontiers, and strategic applications of machine olfaction, making the case that we are currently witnessing the rise of a new industry that brings with it a global chemosensory infrastructure. We cover exemplary industrial, military and consumer applications and address some of the ethical and legal concerns arising. We find that machine olfaction is poised to bring forth a planet-wide molecular awareness tech layer with the potential of spawning vast emerging markets in health, security, and environmental sensing via scent.
2025-10-22T15:05:01Z
23 pages, 116 citations, combination tech review/industry roadmap/white paper on the rise of machine olfaction as an essential AI modality
Andreas Mershin, Nikolas Stefanou, Adan Rotteveel, Matthew Kung, George Kung, Alexandru Dan, Howard Kivell, Zoia Okulova, Zoi Kountouri, Paul Pu Liang

http://arxiv.org/abs/2409.07189v2
AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems
2025-11-03T11:59:01Z
Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently emerged as a "human-in-the-loop" strategy for efficiently navigating hyper-dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular simulations running on high-performance computing architectures, iMD-VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high-dimensional molecular systems.
Moreover, iMD-VR simulations generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the use of researcher-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across robotics and multi-agent systems domains which are comparable to iMD-VR, and discuss how iMD-VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof-of-principle study where iMD-VR data was used to train a CNN on a simple molecular manipulation task; namely, threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.
2024-09-11T11:21:02Z
(First presented at the First Workshop on "eXtended Reality & Intelligent Agents" (XRIA24) @ ECAI24, Santiago De Compostela (Spain), 20 October 2024)
SN COMPUT. SCI. 6, 922 (2025)
Mohamed Dhouioui, Jonathan Barnoud, Rhoslyn Roebuck Williams, Harry J. Stroud, Phil Bates, David R. Glowacki
10.1007/s42979-025-04465-5

http://arxiv.org/abs/2510.27074v2
How Do Proteins Fold?
2025-11-03T03:44:42Z
How proteins fold remains a central unsolved problem in biology. While the idea of a folding code embedded in the amino acid sequence was introduced more than six decades ago, this code remains undefined. While we now have powerful tools to predict the final native structure of proteins, we still lack a predictive framework for how sequences dictate folding pathways.
Two main conceptual models dominate as explanations of folding mechanism: the funnel model, in which folding proceeds through many alternative routes on a rugged, hyperdimensional energy landscape; and the foldon model, which proposes a hierarchical sequence of discrete intermediates. Recent advances on two fronts are now enabling folding studies in unprecedented ways. Powerful experimental approaches, in particular single-molecule force spectroscopy and hydrogen-deuterium exchange assays, allow time-resolved tracking of the folding process at high resolution. At the same time, computational breakthroughs culminating in algorithms such as AlphaFold have revolutionized static structure prediction, opening opportunities to extend machine learning toward dynamics. Together, these developments mark a turning point: for the first time, we are positioned to resolve how proteins fold, why they misfold, and how this knowledge can be harnessed for biology and medicine.
2025-10-31T00:46:57Z
13 pages, 3 figures
Carlos Bustamante, Christian Kaiser, Erik Lindahl, Robert Sosa, Giovanni Volpe

http://arxiv.org/abs/2511.00951v1
Design, Assessment, and Application of Machine Learning Potential Energy Surfaces
2025-11-02T14:25:00Z
Potential Energy Surfaces (PESs) are an indispensable tool to investigate, characterise and understand chemical and biological systems in the gas and condensed phases. Advances in Machine Learning (ML) methodologies have led to the development of Machine Learned Potential Energy Surfaces (ML-PES) which are now widely used to simulate such systems. The present work provides an overview of concepts, methodologies and recommendations for constructing and using ML-PESs. The choice of topics is focused on practical and recurrent issues in conceiving and using such models.
Application of the principles discussed is illustrated through two different systems of biomolecular importance: the non-reactive dynamics of the Alanine-Lysine-Alanine tripeptide in gas and solution phases, and double proton transfer reactions in DNA base pairs.
2025-11-02T14:25:00Z
Valerii Andreichev, Sena Aydin, Kai Töpfer, Markus Meuwly, Luis Itza Vazquez-Salazar

http://arxiv.org/abs/2510.27539v1
The transitional kinetics between open and closed Rep structures can be tuned by salt via two intermediate states
2025-10-31T15:14:49Z
DNA helicases undergo conformational changes; however, their structural dynamics are poorly understood. Here, we study single molecules of superfamily 1A DNA helicase Rep, which undergo conformational transitions during bacterial DNA replication, repair and recombination. We use time-correlated single-photon counting (TCSPC), fluorescence correlation spectroscopy (FCS), rapid single-molecule Förster resonance energy transfer (smFRET), Anti-Brownian ELectrokinetic (ABEL) trapping and molecular dynamics simulations (MDS) to provide unparalleled temporal and spatial resolution of Rep's domain movements. We detect four states, revealing two hitherto hidden intermediates (S2, S3) between the open (S1) and closed (S4) structures, whose stability is salt dependent. Rep's open-to-closed switch involves multiple changes to all four subdomains 1A, 1B, 2A and 2B along the S1 to S2 to S3 to S4 transitional pathway, comprising an initial truncated swing of 2B, which then rolls across the 1B surface, followed by combined rotations of 1B, 2A and 2B. High forward and reverse rates for S1 to S2 suggest that 1B may act to frustrate 2B movement to prevent premature Rep closure in the absence of DNA.
These observations support a more general binding model for accessory DNA helicases that utilises conformational plasticity to explore a multiplicity of structures whose landscape can be tuned by salt prior to locking-in upon DNA binding.
2025-10-31T15:14:49Z
Jamieson A L Howard, Benjamin Ambrose, Mahmoud A S Abdelhamid, Lewis Frame, Antoinette Alevropoulos-Borrill, Ayesha Ejaz, Lara Dresser, Maria Dienerowitz, Steven D Quinn, Allison H Squires, Agnes Noy, Timothy D Craggs, Mark C Leake

http://arxiv.org/abs/2510.27212v1
The Demon Hidden Behind Life's Ultra-Energy-Efficient Information Processing -- Demonstrated by Biological Molecular Motors
2025-10-31T06:13:41Z
The remarkable progress of artificial intelligence (AI) has revealed the enormous energy demands of modern digital architectures, raising deep concerns about sustainability. In stark contrast, the human brain operates efficiently on only ~20 watts, and individual cells process gigabit-scale genetic information using energy on the order of trillionths of a watt. Under the same energy budget, a general-purpose digital processor can perform only a few simple operations per second. This striking disparity suggests that biological systems follow algorithms fundamentally distinct from conventional computation. The framework of information thermodynamics, especially Maxwell's demon and the Szilard engine, offers a theoretical clue, setting the lower bound of energy required for information processing. However, digital processors exceed this limit by about six orders of magnitude. Recent single-molecule studies have revealed that biological molecular motors convert Brownian motion into mechanical work, realizing a "demon-like" operational principle. These findings suggest that living systems have already implemented an ultra-efficient information-energy conversion mechanism that transcends digital computation.
Here, we experimentally establish a quantitative correspondence between positional information (bits) and mechanical work, demonstrating that molecular machines selectively exploit rare but functional fluctuations arising from Brownian motion to achieve ATP-level energy efficiency. This integration of information, energy, and timescale indicates that life realizes a Maxwell's demon-like mechanism for energy-efficient information processing.
2025-10-31T06:13:41Z
8 pages, 5 figures, 1 table
Toshio Yanagida, Keisuke Fujita, Mitsuhiro Iwaki

http://arxiv.org/abs/2410.01755v3
Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes
2025-10-30T17:28:23Z
Breast cancer's complexity and variability pose significant challenges in understanding its progression and guiding effective treatment. This study aims to integrate protein sequence data with expression levels to improve the molecular characterization of breast cancer subtypes and predict clinical outcomes. Using ProtGPT2, a language model specifically designed for protein sequences, we generated embeddings that capture the functional and structural properties of proteins. These embeddings were integrated with protein expression levels to form enriched biological representations, which were analyzed using machine learning methods, such as ensemble K-means for clustering and XGBoost for classification. Our approach enabled the successful clustering of patients into biologically distinct groups and accurately predicted clinical outcomes such as survival and biomarker status, achieving high performance metrics, notably an F1 score of 0.88 for survival and 0.87 for biomarker status prediction.
Feature importance analysis identified KMT2C, CLASP2, and MYO1B as key proteins involved in hormone signaling, cytoskeletal remodeling, and therapy resistance in hormone receptor-positive and triple-negative breast cancer, with potential influence on breast cancer subtype behavior and progression. Furthermore, protein-protein interaction networks and correlation analyses revealed functional interdependencies among proteins that may influence the behavior and progression of breast cancer subtypes. These findings suggest that integrating protein sequence and expression data provides valuable insights into tumor biology and has significant potential to enhance personalized treatment strategies in breast cancer care.
2024-10-02T17:05:48Z
Hossein Sholehrasa, Majid Jaberi-Douraki

http://arxiv.org/abs/2510.26728v1
Modelling ion channels with a view towards identifiability
2025-10-30T17:27:25Z
Aggregated Markov models provide a flexible framework for stochastic dynamics that develops on multiple timescales. For example, Markov models for ion channels often consist of multiple open and closed states to account for "slow" and "fast" openings and closings of the channel. The approach is a popular tool in the construction of mechanistic models of ion channels: instead of viewing model states as generators of sojourn times of a certain characteristic length, each individual model state is interpreted as a representation of a distinct biophysical state. We will review the properties of aggregated Markov models and discuss the implications for mechanistic modelling. First, we show how the aggregated Markov models with a given number of states can be calculated using Pólya enumeration. However, models with $n_O$ open and $n_C$ closed states that exceed the maximum number $2 n_O n_C$ of parameters are non-identifiable.
We will present two derivations for this classical result and investigate non-identifiability further via a detailed analysis of the non-identifiable fully connected three-state model. Finally, we will discuss the implications of non-identifiability for mechanistic modelling of ion channels. We will argue that instead of designing models based on assumed transitions between distinct biophysical states which are modulated by ligand binding, it is preferable to build models based on additional sources of data that give more direct insight into the dynamics of conformational changes.
2025-10-30T17:27:25Z
37 pages, 6 figures, presented at MATRIX workshop "Parameter Identifiability in Mathematical Biology" https://www.matrix-inst.org.au/events/parameter-identifiability-in-mathematical-biology/?
Ivo Siekmann

http://arxiv.org/abs/2508.16398v2
Multiscale Growth Kinetics of Model Biomolecular Condensates Under Passive and Active Conditions
2025-10-30T16:43:11Z
Living cells exhibit a complex organization comprising numerous compartments, among which are RNA- and protein-rich membraneless, liquid-like organelles known as biomolecular condensates. Energy-consuming processes regulate their formation and dissolution, with (de-)phosphorylation by specific enzymes being among the most commonly involved reactions. By employing a model system consisting of a phosphorylatable peptide and homopolymeric RNA, we elucidate how enzymatic activity modulates the growth kinetics and alters the local structure of biomolecular condensates. Under passive conditions, time-resolved ultra-small-angle X-ray scattering with a synchrotron source reveals a nucleation-driven coalescence mechanism maintained over four decades in time, similar to the coarsening of simple binary fluid mixtures. Coarse-grained molecular dynamics simulations show that peptide-decorated RNA chains assembled shortly after mixing constitute the relevant subunits.
In contrast, actively-formed condensates initially display a local mass fractal structure, which gradually matures upon enzymatic activity before condensates undergo coalescence. Both types of condensate eventually reach a steady state, but fluorescence recovery after photobleaching indicates a peptide diffusivity twice as high in actively-formed condensates, consistent with their loosely-packed local structure. We expect multiscale, integrative approaches implemented with model systems to link effectively the functional properties of membraneless organelles to their formation and dissolution kinetics as regulated by cellular active processes.
2025-08-22T14:00:08Z
Tamizhmalar Sundararajan, Matteo Boccalini, Roméo Suss, Sandrine Mariot, Emerson R. Da Silva, Fernando C. Giacomelli, Austin Hubley, Theyencheri Narayanan, Alessandro Barducci, Guillaume Tresset

http://arxiv.org/abs/2508.06576v2
GFlowNets for Learning Better Drug-Drug Interaction Representations
2025-10-30T13:59:28Z
Drug-drug interactions pose a significant challenge in clinical pharmacology, with severe class imbalance among interaction types limiting the effectiveness of predictive models. Common interactions dominate datasets, while rare but critical interactions remain underrepresented, leading to poor model performance on infrequent cases. Existing methods often treat DDI prediction as a binary problem, ignoring class-specific nuances and exacerbating bias toward frequent interactions. To address this, we propose a framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generating effective and novel DDI pairs.
Our approach enhances predictive performance across interaction types, ensuring better clinical reliability.
2025-08-07T14:03:23Z
Accepted to ICANN 2025:AIDD and NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling (https://openreview.net/forum?id=LZW1jSgfCI)
Azmine Toushik Wasi

http://arxiv.org/abs/2506.05768v2
AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation
2025-10-30T06:11:45Z
Virtual screening (VS) is a critical component of modern drug discovery, yet most existing methods, whether physics-based or deep learning-based, are developed around holo protein structures with known ligand-bound pockets. Consequently, their performance degrades significantly on apo or predicted structures such as those from AlphaFold2, which are more representative of real-world early-stage drug discovery, where pocket information is often missing. In this paper, we introduce an alignment-and-aggregation framework to enable accurate virtual screening under structural uncertainty. Our method comprises two core components: (1) a tri-modal contrastive learning module that aligns representations of the ligand, the holo pocket, and cavities detected from structures, thereby enhancing robustness to pocket localization error; and (2) a cross-attention based adapter for dynamically aggregating candidate binding sites, enabling the model to learn from activity data even without precise pocket annotations. We evaluated our method on a newly curated benchmark of apo structures, where it significantly outperforms state-of-the-art methods in the blind apo setting, improving the early enrichment factor (EF1%) from 11.75 to 37.19. Notably, it also maintains strong performance on holo structures. These results demonstrate the promise of our approach in advancing first-in-class drug discovery, particularly in scenarios lacking experimentally resolved protein-ligand complexes.
Our implementation is publicly available at https://github.com/Wiley-Z/AANet.
2025-06-06T05:52:19Z
Accepted at NeurIPS 2025
Wenyu Zhu, Jianhui Wang, Bowen Gao, Yinjun Jia, Haichuan Tan, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan

http://arxiv.org/abs/2504.17247v2
OmegAMP: Targeted AMP Discovery through Biologically Informed Generation
2025-10-29T11:30:12Z
Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as limited controllability, lack of representations that efficiently model antimicrobial properties, and low experimental hit rates. To address these challenges, we introduce OmegAMP, a framework designed for reliable AMP generation with increased controllability. Its diffusion-based generative model leverages a novel conditioning mechanism to achieve fine-grained control over desired physicochemical properties and to direct generation towards specific activity profiles, including species-specific effectiveness. This is further enhanced by a biologically informed encoding space that significantly improves overall generative performance. Complementing these generative capabilities, OmegAMP leverages a novel synthetic data augmentation strategy to train classifiers for AMP filtering, drastically reducing false positive rates and thereby increasing the likelihood of experimental success. Our in silico experiments demonstrate that OmegAMP delivers state-of-the-art performance across key stages of the AMP discovery pipeline, enabling us to achieve an unprecedented success rate in wet lab experiments. We tested 25 candidate peptides, of which 24 (96%) demonstrated antimicrobial activity, proving effective even against multi-drug resistant strains. Our findings underscore OmegAMP's potential to significantly advance computational frameworks in the fight against antimicrobial resistance.
2025-04-24T04:53:04Z
Diogo Soares, Leon Hetzel, Paulina Szymczak, Marcelo Der Torossian Torres, Johanna Sommer, Cesar de la Fuente-Nunez, Fabian Theis, Stephan Günnemann, Ewa Szczurek