https://arxiv.org/api/ZjJIsaCotjE9Hn17ZAQcqGfpfZs 2026-03-22T13:19:18Z 6642 180 15 http://arxiv.org/abs/2506.04490v2 Multiscale guidance of protein structure prediction with heterogeneous cryo-EM data 2025-12-01T01:05:46Z Protein structure prediction models are now capable of generating accurate 3D structural hypotheses from sequence alone. However, they routinely fail to capture the conformational diversity of dynamic biomolecular complexes, often requiring heuristic MSA subsampling approaches for generating alternative states. In parallel, cryo-electron microscopy (cryo-EM) has emerged as a powerful tool for imaging near-native structural heterogeneity, but is challenged by arduous pipelines to transform raw experimental data into atomic models. Here, we bridge the gap between these modalities, combining cryo-EM density maps with the rich sequence and biophysical priors learned by protein structure prediction models. Our method, CryoBoltz, guides the sampling trajectory of a pretrained biomolecular structure prediction model using both global and local structural constraints derived from density maps, driving predictions towards conformational states consistent with the experimental data. We demonstrate that this flexible yet powerful inference-time approach allows us to build atomic models into heterogeneous cryo-EM maps across a variety of dynamic biomolecular systems including transporters and antibodies. Code is available at https://github.com/ml-struct-bio/cryoboltz. 2025-06-04T22:16:27Z NeurIPS 2025 Rishwanth Raghu Axel Levy Gordon Wetzstein Ellen D. Zhong http://arxiv.org/abs/2512.00708v1 Towards Precision Protein-Ligand Affinity Prediction Benchmark: A Complete and Modification-Aware DAVIS Dataset 2025-11-30T03:14:39Z Advancements in AI for science unlock capabilities for critical drug discovery tasks such as protein-ligand binding affinity prediction. 
However, current models overfit to existing oversimplified datasets that do not represent naturally occurring and biologically relevant proteins with modifications. In this work, we curate a complete and modification-aware version of the widely used DAVIS dataset by incorporating 4,032 kinase-ligand pairs involving substitutions, insertions, deletions, and phosphorylation events. This enriched dataset enables benchmarking of predictive models under biologically realistic conditions. Based on this new dataset, we propose three benchmark settings (Augmented Dataset Prediction, Wild-Type to Modification Generalization, and Few-Shot Modification Generalization) designed to assess model robustness in the presence of protein modifications. Through extensive evaluation of both docking-free and docking-based methods, we find that docking-based models generalize better in zero-shot settings. In contrast, docking-free models tend to overfit to wild-type proteins and struggle with unseen modifications but show notable improvement when fine-tuned on a small set of modified examples. We anticipate that the curated dataset and benchmarks offer a valuable foundation for developing models that better generalize to protein modifications, ultimately advancing precision medicine in drug discovery. The benchmark is available at: https://github.com/ZhiGroup/DAVIS-complete 2025-11-30T03:14:39Z Ming-Hsiu Wu Ziqian Xie Shuiwang Ji Degui Zhi http://arxiv.org/abs/2512.00642v1 DeepFRI Demystified: Interpretability vs. Accuracy in AI Protein Function Prediction 2025-11-29T21:42:53Z Machine learning technologies for protein function prediction are black box models. Despite their potential to identify key drug targets with high accuracy and accelerate therapy development, the adoption of these methods depends on verifying their findings. 
This study evaluates DeepFRI, a leading Graph Convolutional Network (GCN) based tool, using advanced explainability techniques (GradCAM, Excitation Backpropagation, and PGExplainer) and adversarial robustness tests. Our findings reveal that the model's predictions often prioritize conserved motifs over truly deterministic residues, complicating the identification of functional sites. Quantitative analyses show that explainability methods differ significantly in granularity, with GradCAM providing broad relevance and PGExplainer pinpointing specific active sites. These results highlight tradeoffs between accuracy and interpretability, suggesting areas for improvement in DeepFRI's architecture to enhance its trustworthiness in drug discovery and regulatory settings. 2025-11-29T21:42:53Z CVPL Interpretable Systems for Artificial Intelligence Transparency Workshop 2025 Ananya Krishna Valentina Simon Arjan Kohli http://arxiv.org/abs/2512.00384v1 Efficient and Programmable Exploration of Synthesizable Chemical Space 2025-11-29T08:21:21Z The constrained nature of synthesizable chemical space poses a significant challenge for sampling molecules that are both synthetically accessible and possess desired properties. In this work, we present PrexSyn, an efficient and programmable model for molecular discovery within synthesizable chemical space. PrexSyn is based on a decoder-only transformer trained on a billion-scale datastream of synthesizable pathways paired with molecular properties, enabled by a real-time, high-throughput C++-based data generation engine. The large-scale training data allows PrexSyn to reconstruct the synthesizable chemical space nearly perfectly at a high inference speed and learn the association between properties and synthesizable molecules. 
Based on its learned property-pathway mappings, PrexSyn can generate synthesizable molecules that satisfy not only single-property conditions but also composite property queries joined by logical operators, thereby allowing users to "program" generation objectives. Moreover, by exploiting this property-based querying capability, PrexSyn can efficiently optimize molecules against black-box oracle functions via iterative query refinement, achieving higher sampling efficiency than even synthesis-agnostic baselines, making PrexSyn a powerful general-purpose molecular optimization tool. Overall, PrexSyn pushes the frontier of synthesizable molecular design by setting a new state of the art in synthesizable chemical space coverage, molecular sampling efficiency, and inference speed. 2025-11-29T08:21:21Z Shitong Luo Connor W. Coley http://arxiv.org/abs/2512.00379v1 EnzyCLIP: A Cross-Attention Dual Encoder Framework with Contrastive Learning for Predicting Enzyme Kinetic Constants 2025-11-29T08:13:06Z Accurate prediction of enzyme kinetic parameters is crucial for drug discovery, metabolic engineering, and synthetic biology applications. Current computational approaches face limitations in capturing complex enzyme-substrate interactions and often focus on single parameters while neglecting the joint prediction of catalytic turnover numbers (Kcat) and Michaelis-Menten constants (Km). We present EnzyCLIP, a novel dual-encoder framework that leverages contrastive learning and cross-attention mechanisms to predict enzyme kinetic parameters from protein sequences and substrate molecular structures. Our approach integrates ESM-2 protein language model embeddings with ChemBERTa chemical representations through a CLIP-inspired architecture enhanced with bidirectional cross-attention for dynamic enzyme-substrate interaction modeling. 
EnzyCLIP combines InfoNCE contrastive loss with Huber regression loss to learn aligned multimodal representations while predicting log10-transformed kinetic parameters. The model is trained on the CatPred-DB database containing 23,151 Kcat and 41,174 Km experimentally validated measurements, and achieved competitive performance with R2 scores of 0.593 for Kcat and 0.607 for Km prediction. XGBoost ensemble methods applied to the learned embeddings further improved Km prediction (R2 = 0.61) while maintaining robust Kcat performance. 2025-11-29T08:13:06Z Anas Aziz Khan Md Shah Fahad Priyanka Ramesh Chandra Guransh Singh http://arxiv.org/abs/2403.01528v3 Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey 2025-11-28T13:15:22Z The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. 
(2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are maintained and updated at https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling. 2024-03-03T14:59:47Z 2025.11.28 Updated Version Qizhi Pei Zhimeng Zhou Kaiyuan Gao Jinhua Zhu Yue Wang Zun Wang Tao Qin Lijun Wu Rui Yan http://arxiv.org/abs/2412.12979v3 Guiding Generative Protein Language Models with Reinforcement Learning 2025-11-27T19:20:45Z Protein language models (pLMs) have demonstrated success at generating functional proteins across vast sequence spaces but lack the ability to design high-fitness variants on demand. Here, we iteratively guide pLMs toward user-defined objectives by applying reinforcement learning (RL). We demonstrate that RL can steer pLMs toward various protein properties, such as topologies or binding affinities, in a few iterations through long evolutionary trajectories. We apply our framework to the design of epidermal growth factor receptor (EGFR) binders, achieving a 26-fold increase in binding affinity in two iterations. 2024-12-17T14:58:37Z 28 pages including main text and supporting information Filippo Stocco Maria Artigues-Lleixa Andrea Hunklinger Talal Widatalla Marc Guell Noelia Ferruz http://arxiv.org/abs/2509.02196v4 Beyond Ensembles: Simulating All-Atom Protein Dynamics in a Learned Latent Space 2025-11-27T18:33:23Z Simulating the long-timescale dynamics of biomolecules is a central challenge in computational science. 
While enhanced sampling methods can accelerate these simulations, they rely on pre-defined collective variables that are often difficult to identify, restricting their ability to model complex switching mechanisms between metastable states. A recent generative model, LD-FPG, demonstrated that this problem could be bypassed by learning to sample the static equilibrium ensemble as all-atom deformations from a reference structure, establishing a powerful method for all-atom ensemble generation. However, while this approach successfully captures a system's probable conformations, it does not model the temporal evolution between them. We introduce the Graph Latent Dynamics Propagator (GLDP), a modular component for simulating dynamics within the learned latent space of LD-FPG. We then compare three classes of propagators: (i) score-guided Langevin dynamics, (ii) Koopman-based linear operators, and (iii) autoregressive neural networks. Within a unified encoder-propagator-decoder framework, we evaluate long-horizon stability, backbone and side-chain ensemble fidelity, and temporal kinetics via TICA. Benchmarks on systems ranging from small peptides to mixed-topology proteins and large GPCRs reveal that autoregressive neural networks deliver the most robust long rollouts and coherent physical timescales; score-guided Langevin best recovers side-chain thermodynamics when the score is well learned; and Koopman provides an interpretable, lightweight baseline that tends to damp fluctuations. These results clarify the trade-offs among propagators and offer practical guidance for latent-space simulators of all-atom protein dynamics. 2025-09-02T11:09:06Z Aditya Sengar Jiying Zhang Pierre Vandergheynst Patrick Barth http://arxiv.org/abs/2503.02058v5 RiboGen: RNA Sequence and Structure Co-Generation with Equivariant MultiFlow 2025-11-27T15:38:38Z Ribonucleic acid (RNA) plays fundamental roles in biological systems, from carrying genetic information to performing enzymatic function. 
Understanding and designing RNA can enable novel therapeutic applications and biotechnological innovation. To enhance RNA design, in this paper we introduce RiboGen, the first deep learning model to simultaneously generate RNA sequence and all-atom 3D structure. RiboGen combines standard Flow Matching with Discrete Flow Matching in a multimodal data representation. RiboGen is based on Euclidean Equivariant neural networks for efficiently processing and learning three-dimensional geometry. Our experiments show that RiboGen can efficiently generate chemically plausible and self-consistent RNA samples, suggesting that co-generation of sequence and structure is a competitive approach for modeling RNA. 2025-03-03T21:19:11Z 6 pages Dana Rubin Allan dos Santos Costa Manvitha Ponnapati Joseph Jacobson http://arxiv.org/abs/2511.22239v1 DeepPNI: Language- and graph-based model for mutation-driven protein-nucleic acid energetics 2025-11-27T09:08:32Z The interaction between proteins and nucleic acids is crucial for processes that sustain cellular function, including DNA maintenance and the regulation of gene expression and translation. Amino acid mutations in protein-nucleic acid complexes often lead to vital diseases. Experimental techniques have their own specific limitations in predicting mutational effects in protein-nucleic acid complexes. In this study, we compiled a large dataset of 1951 mutations including both protein-DNA and protein-RNA complexes and integrated structural and sequential features to build a deep learning-based regression model named DeepPNI. This model estimates mutation-induced binding free energy changes in protein-nucleic acid complexes. The structural features are encoded via an edge-aware RGCN and the sequential features are extracted using the protein language model ESM-2. We have achieved a high average Pearson correlation coefficient (PCC) of 0.76 in the large dataset via five-fold cross-validation. 
Consistent performance across the individual protein-DNA and protein-RNA datasets, and across datasets split by experimental temperature, indicates that the model generalizes. Our model showed good performance in complex-based five-fold cross-validation, which supports its robustness. In addition, DeepPNI outperformed in external dataset validation, and comparison with existing tools 2025-11-27T09:08:32Z Somnath Mondal Tinkal Mondal Soumajit Pramanik Rukmankesh Mehra http://arxiv.org/abs/2511.21900v1 Beyond Atoms: Evaluating Electron Density Representation for 3D Molecular Learning 2025-11-26T20:42:31Z Machine learning models for 3D molecular property prediction typically rely on atom-based representations, which may overlook subtle physical information. Electron density maps -- the direct output of X-ray crystallography and cryo-electron microscopy -- offer a continuous, physically grounded alternative. We compare three voxel-based input types for 3D convolutional neural networks (CNNs): atom types, raw electron density, and density gradient magnitude, across two molecular tasks -- protein-ligand binding affinity prediction (PDBbind) and quantum property prediction (QM9). We focus on voxel-based CNNs because electron density is inherently volumetric, and voxel grids provide the most natural representation for both experimental and computed densities. On PDBbind, all representations perform similarly with full data, but in low-data regimes, density-based inputs outperform atom types, while a shape-based baseline performs comparably -- suggesting that spatial occupancy dominates this task. On QM9, where labels are derived from Density Functional Theory (DFT) but input densities from a lower-level method (XTB), density-based inputs still outperform atom-based ones at scale, reflecting the rich structural and electronic information encoded in density. 
Overall, these results highlight the task- and regime-dependent strengths of density-derived inputs, improving data efficiency in affinity prediction and accuracy in quantum property modeling. 2025-11-26T20:42:31Z Patricia Suriana Joshua A. Rackers Ewa M. Nowara Pedro O. Pinheiro John M. Nicoloudis Vishnu Sresht http://arxiv.org/abs/2511.00209v2 Diffusion Models at the Drug Discovery Frontier: A Review on Generating Small Molecules versus Therapeutic Peptides 2025-11-26T17:21:10Z Diffusion models have emerged as a leading framework in generative modeling, poised to transform the traditionally slow and costly process of drug discovery. This review provides a systematic comparison of their application in designing two principal therapeutic modalities: small molecules and therapeutic peptides. We dissect how the unified framework of iterative denoising is adapted to the distinct molecular representations, chemical spaces, and design objectives of each modality. For small molecules, these models excel at structure-based design, generating novel, pocket-fitting ligands with desired physicochemical properties, yet face the critical hurdle of ensuring chemical synthesizability. Conversely, for therapeutic peptides, the focus shifts to generating functional sequences and designing de novo structures, where the primary challenges are achieving biological stability against proteolysis, ensuring proper folding, and minimizing immunogenicity. Despite these distinct challenges, both domains face shared hurdles: the scarcity of high-quality experimental data, the reliance on inaccurate scoring functions for validation, and the crucial need for experimental validation. 
We conclude that the full potential of diffusion models will be unlocked by bridging these modality-specific gaps and integrating them into automated, closed-loop Design-Build-Test-Learn (DBTL) platforms, thereby shifting the paradigm from mere chemical exploration to the on-demand engineering of novel therapeutics. 2025-10-31T19:11:41Z Published in Biology Biology 2025, 14(12), 1665 Yiquan Wang Yahui Ma Yuhan Chang Jiayao Yan Jialin Zhang Minnuo Cai Kai Wei 10.3390/biology14121665 http://arxiv.org/abs/2511.21781v1 BeeRNA: tertiary structure-based RNA inverse folding using Artificial Bee Colony 2025-11-26T12:43:32Z The Ribonucleic Acid (RNA) inverse folding problem, designing nucleotide sequences that fold into specific tertiary structures, is a fundamental computational biology problem with important applications in synthetic biology and bioengineering. The design of complex three-dimensional RNA architectures remains computationally demanding and mostly unresolved, as most existing approaches focus on secondary structures. In order to address tertiary RNA inverse folding, we present BeeRNA, a bio-inspired method that employs the Artificial Bee Colony (ABC) optimization algorithm. Our approach combines base-pair distance filtering with RMSD-based structural assessment using RhoFold for structure prediction, resulting in a two-stage fitness evaluation strategy. To guarantee biologically plausible sequences with balanced GC content, the algorithm takes thermodynamic constraints and adaptive mutation rates into consideration. In this work, we focus primarily on short and medium-length RNAs (< 100 nucleotides), a biologically significant regime that includes microRNAs (miRNAs), aptamers, and ribozymes, where BeeRNA achieves high structural fidelity with practical CPU runtimes. The lightweight, training-free implementation will be publicly released for reproducibility, offering a promising bio-inspired approach for RNA design in therapeutics and biotechnology. 
2025-11-26T12:43:32Z Accepted at the AI in Drug Discovery Workshop, AAAI 2026, Singapore Mehyar Mlaweh Tristan Cazenave Ines Alaya http://arxiv.org/abs/2511.19264v1 Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry 2025-11-24T16:16:18Z Generative Flow Networks, or GFlowNets, offer a promising framework for molecular design, but their internal decision policies remain opaque. This limits adoption in drug discovery, where chemists require clear and interpretable rationales for proposed structures. We present an interpretability framework for SynFlowNet, a GFlowNet trained on documented chemical reactions and purchasable starting materials that generates both molecules and the synthetic routes that produce them. Our approach integrates three complementary components. Gradient based saliency combined with counterfactual perturbations identifies which atomic environments influence reward and how structural edits change molecular outcomes. Sparse autoencoders reveal axis aligned latent factors that correspond to physicochemical properties such as polarity, lipophilicity, and molecular size. Motif probes show that functional groups including aromatic rings and halogens are explicitly encoded and linearly decodable from the internal embeddings. Together, these results expose the chemical logic inside SynFlowNet and provide actionable and mechanistic insight that supports transparent and controllable molecular design. 2025-11-24T16:16:18Z 13 pages, 7 figures. Accepted for presentation at NeurIPS 2025 WiML Workshop and Molecular Machine Learning Conference (MoML) 2025 Amirtha Varshini A S Duminda S. Ranasinghe Hok Hei Tam http://arxiv.org/abs/2511.19184v1 Torsion-Space Diffusion for Protein Backbone Generation with Geometric Refinement 2025-11-24T14:51:29Z Designing new protein structures is fundamental to computational biology, enabling advances in therapeutic molecule discovery and enzyme engineering. 
Existing diffusion-based generative models typically operate in Cartesian coordinate space, where adding noise disrupts strict geometric constraints such as fixed bond lengths and angles, often producing physically invalid structures. To address this limitation, we propose a Torsion-Space Diffusion Model that generates protein backbones by denoising torsion angles, ensuring perfect local geometry by construction. A differentiable forward-kinematics module reconstructs 3D coordinates with fixed 3.8 Angstrom backbone bond lengths while a constrained post-processing refinement optimizes global compactness via Radius of Gyration (Rg) correction, without violating bond constraints. Experiments on standard PDB proteins demonstrate 100% bond-length accuracy and significantly improved structural compactness, reducing Rg error from 70% to 18.6% compared to Cartesian diffusion baselines. Overall, this hybrid torsion-diffusion plus geometric-refinement framework generates physically valid and compact protein backbones, providing a promising path toward full-atom protein generation. 2025-11-24T14:51:29Z 5 pages, 4 figures Lakshaditya Singh Adwait Shelke Divyansh Agrawal
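
The torsion-space idea in the last abstract above (rebuilding 3D coordinates from angles so that bond lengths are fixed by construction, then measuring compactness with the radius of gyration) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a C-alpha-only trace with a 3.8 Angstrom virtual bond, uses the standard NeRF (natural extension reference frame) placement rule, and the names `place_next`, `build_trace`, and `radius_of_gyration` are hypothetical. The Rg correction step described in the abstract is not reproduced here.

```python
import math

CA_CA = 3.8  # assumed fixed C-alpha to C-alpha virtual bond length (angstroms)


def _sub(u, v):
    return [u[i] - v[i] for i in range(3)]


def _unit(u):
    norm = math.sqrt(sum(x * x for x in u))
    return [x / norm for x in u]


def _cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def place_next(a, b, c, angle, torsion, bond=CA_CA):
    """NeRF step: place point d so that |c-d| = bond and angle(b, c, d) = angle.

    `torsion` is the dihedral about the b-c axis (radians). Assumes a, b, c
    are not collinear, otherwise the local frame is undefined.
    """
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))  # normal to the a-b-c plane
    m = _cross(n, bc)                  # completes the orthonormal frame
    d = (-bond * math.cos(angle),
         bond * math.sin(angle) * math.cos(torsion),
         bond * math.sin(angle) * math.sin(torsion))
    return [c[i] + d[0] * bc[i] + d[1] * m[i] + d[2] * n[i] for i in range(3)]


def build_trace(angles, torsions, bond=CA_CA):
    """Build len(torsions) + 3 points from bond angles and torsions (radians)."""
    if len(angles) != len(torsions) + 1:
        raise ValueError("need exactly one more bond angle than torsions")
    t0 = angles[0]
    # Seed the chain with three points in the xy-plane.
    pts = [[0.0, 0.0, 0.0],
           [bond, 0.0, 0.0],
           [bond - bond * math.cos(t0), bond * math.sin(t0), 0.0]]
    for angle, torsion in zip(angles[1:], torsions):
        pts.append(place_next(pts[-3], pts[-2], pts[-1], angle, torsion, bond))
    return pts


def radius_of_gyration(pts):
    """Root-mean-square distance of the points from their centroid."""
    n = len(pts)
    center = [sum(p[i] for p in pts) / n for i in range(3)]
    return math.sqrt(sum(sum((p[i] - center[i]) ** 2 for i in range(3))
                         for p in pts) / n)
```

Because every new point is placed at exactly `bond` from its predecessor in an orthonormal local frame, any noise applied to the angles leaves consecutive distances unchanged, which is the property the abstract describes as perfect local geometry by construction. A compactness correction would then have to act on the angles rather than on the coordinates directly to preserve that property.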