http://arxiv.org/abs/2503.21681v3 A Comprehensive Benchmark for RNA 3D Structure-Function Modeling 2025-10-22T11:51:47Z

The relationship between RNA structure and function has recently attracted interest within the deep learning community, a trend expected to intensify as nucleic acid structure models advance. Despite this momentum, the lack of standardized, accessible benchmarks for applying deep learning to RNA 3D structures hinders progress. To this end, we introduce a collection of seven benchmarking datasets specifically designed to support RNA structure-function prediction. Built on top of the established Python package rnaglib, our library streamlines data distribution and encoding, provides tools for dataset splitting and evaluation, and offers a comprehensive, user-friendly environment for model comparison. The modular and reproducible design of our datasets encourages community contributions and enables rapid customization. To demonstrate the utility of our benchmarks, we report baseline results for all tasks using a relational graph neural network.

2025-03-27T16:49:31Z Luis Wyss Vincent Mallet Wissam Karroucha Karsten Borgwardt Carlos Oliver http://arxiv.org/abs/2510.19484v1 KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge 2025-10-22T11:23:58Z

Molecular large language models have garnered widespread attention for their promise in molecular applications. However, current molecular large language models face significant limitations in understanding molecules due to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose a chemically informative molecular representation that addresses limitations in existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks. GitHub: https://github.com/yzf-code/KnowMol Huggingface: https://hf.co/datasets/yzf1102/KnowMol-100K

2025-10-22T11:23:58Z Zaifei Yang Hong Chang Ruibing Hou Shiguang Shan Xilin Chen http://arxiv.org/abs/2311.08076v6 Determining the optimal structural resolution of proteins through an information-theoretic analysis of their conformational ensemble 2025-10-22T09:04:01Z

The choice of structural resolution is a fundamental aspect of protein modelling, determining the balance between descriptive power and interpretability. Although atomistic simulations provide maximal detail, much of this information is redundant for understanding the relevant large-scale motions and conformational states. Here, we introduce an unsupervised, information-theoretic framework that determines the minimal number of atoms required to retain a maximally informative description of the configurational space sampled by a protein. This framework quantifies the informativeness of coarse-grained representations obtained by systematically decimating atomic degrees of freedom and evaluating the resulting clustering of sampled conformations. Application to molecular dynamics trajectories of dynamically diverse proteins shows that the optimal number of retained atoms scales linearly with system size, averaging about four heavy atoms per residue, remarkably consistent with the resolution of well-established coarse-grained models such as MARTINI and SIRAH. Furthermore, the analysis shows that the optimal number of retained atoms depends not only on molecular size but also on the extent of conformational exploration, decreasing for systems dominated by collective motions. The proposed method establishes a general criterion to identify the minimal structural detail that preserves the essential configurational information, thereby offering a new viewpoint on the structure-dynamics-function relationship in proteins and guiding the construction of parsimonious yet informative multiscale models.

2023-11-14T11:01:36Z Margherita Mele Raffaele Fiorentini Thomas Tarenzi Giovanni Mattiotti Raffaello Potestio http://arxiv.org/abs/2406.18851v2 LICO: Large Language Models for In-Context Molecular Optimization 2025-10-22T03:44:08Z

Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO performs competitively on PMO, a challenging molecular optimization benchmark comprising 23 objective functions, and achieves state-of-the-art performance on its low-budget version PMO-1K.

2024-06-27T02:43:18Z International Conference on Learning Representations (ICLR 2025) Tung Nguyen Aditya Grover http://arxiv.org/abs/2505.01919v2 From Possibility to Precision in Macromolecular Ensemble Prediction 2025-10-21T11:51:14Z

Proteins and other macromolecules exist not in a single state but as dynamic ensembles of interconverting conformations, which are essential for catalysis, allosteric regulation, and molecular recognition. While AI-based structure predictors like AlphaFold have revolutionized static structure prediction, they are not yet capable of capturing conformational ensembles. Progress towards the next generation of AI models capable of ensemble prediction is currently limited by the lack of accurate, high-resolution ground truth ensembles at the scale required for training and validation. This is because no single experimental technique can fully resolve the atomistic complexity of conformational landscapes, and fundamental challenges remain in defining, representing, comparing, and validating structural ensembles. Here, we outline the infrastructure and methodological advances needed to overcome these barriers. We highlight emerging strategies for integrating heterogeneous experimental data into unified ensemble encoding representations, and we describe how to leverage these new methodologies to build benchmarks and establish ensemble-specific validation protocols. Finally, we discuss how ensemble prediction will proceed as an interactive cycle of experimental and computational innovation. Establishing this ecosystem will allow structural biology to move beyond static snapshots toward a dynamic understanding of molecular behavior that captures the full complexity of biological systems.

2025-05-03T20:53:40Z Stephanie A. Wankowicz Massimiliano Bonomi http://arxiv.org/abs/2505.15093v2 Steering Generative Models with Experimental Data for Protein Fitness Optimization 2025-10-20T23:15:03Z

Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent advances in steering protein generative models (e.g., diffusion models and language models) with labeled data offer a promising approach. However, most previous studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured through low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next-generation protein fitness optimization.
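For readers unfamiliar with the Thompson sampling reference in the abstract above, the following is a minimal, generic sketch of Thompson sampling on a Bernoulli bandit. It is not the authors' guided-generation procedure: the arms, the success rates, and the function name `thompson_sampling` are all hypothetical and chosen only to illustrate the select-by-posterior-sample idea.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return pulls per arm.

    true_rates: hypothetical success probability of each arm (unknown to
    the sampler; used only to simulate rewards).
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [1] * n_arms  # Beta(1, 1) uniform prior on every arm
    failures = [1] * n_arms
    pulls = [0] * n_arms

    for _ in range(rounds):
        # Draw one plausible success rate per arm from its posterior.
        draws = [rng.betavariate(a, b) for a, b in zip(successes, failures)]
        # Act greedily with respect to the sampled rates.
        arm = max(range(n_arms), key=draws.__getitem__)
        reward = rng.random() < true_rates[arm]
        pulls[arm] += 1
        # Update the chosen arm's posterior with the observed outcome.
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
print(pulls)
```

Sampling from the posterior, rather than using its mean, is what balances exploration and exploitation: uncertain arms occasionally produce high draws and get tried, while clearly inferior arms are chosen less and less often.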

2025-05-21T04:30:48Z NeurIPS 2025 Jason Yang Wenda Chu Daniel Khalil Raul Astudillo Bruce J. Wittmann Frances H. Arnold Yisong Yue http://arxiv.org/abs/2506.01857v5 Protein folding classes -- High-dimensional geometry of amino acid composition space revisited 2025-10-20T14:29:17Z

In this study, the distributions of protein structure classes (or folding types) of experimentally determined structures from a legacy dataset and a comprehensive database (SCOP) are modeled precisely with geometric constructs such as convex polytopes in high-dimensional amino acid composition space. This is a follow-up of a previous non-statistical, geometry-motivated modeling of protein classes with ellipsoidal models, which is superseded presently in three important respects: (1) as a paradigm shift, a descriptive 'distribution model' of experimental data is de-coupled from, and serves as the basis for, a possible future predictive 'domain model' generalizable to proteins in the same class for which 3D structures have yet to be determined experimentally, (2) the geometric and analytic characteristics of class distributions are obtained via exact computational geometry calculations, and (3) the full data from a comprehensive database are included in such calculations, eschewing training set selection and biases. In contrast to statistical and machine-learning approaches, the analytical, non-statistical geometry models of protein class distributions demonstrated in this study furnish complete and precise information on their size and relative disposition in the high-dimensional space (vis-à-vis any overlaps leading to ambiguity and classification limits). Intended primarily as an accurate and summary description of the complex relationships between amino acid composition and protein classes, and suitably as a basis for predictive modeling where possible, the results suggest that penultimately they may be useful adjuncts for validating sequence-based protein structure predictions and contribute to theoretical and fundamental understanding of secondary structure formation and protein folding, demonstrating the role of high-dimensional amino acid composition space in protein studies.

2025-06-02T16:44:02Z 50 pages, 6 figures, 4 tables Boryeu Mao http://arxiv.org/abs/2509.06849v3 Canonicalization of the E value from BLAST similarity search -- dissimilarity measure and distance function for a metric space of protein sequences 2025-10-20T14:22:24Z

Sequence matching algorithms such as BLAST and FASTA have been widely used in searching for the evolutionary origin and biological functions of newly discovered nucleic acid and protein sequences. As parts of these search tools, alignment scores and E values are useful indicators of the quality of search results from querying a database of annotated sequences, whereby a high alignment score (and inversely a low E value) reflects significant similarity between the query and the subject (target) sequences. For cross-comparison of results from sufficiently different queries, however, the interpretation of the alignment score as a similarity measure and the E value as a dissimilarity measure becomes somewhat nuanced, and prompts herein a judicious distinction of different types of similarity. We show that an adjustment of the E value to account for self-matching of query and subject sequences corrects for certain ostensibly anomalous similarity comparisons, resulting in canonical dissimilarity and similarity measures that would be more appropriate for database applications, such as all-on-all sequence alignment or selection of diverse subsets. In actual practice, the canonicalization of E value dissimilarity improves clustering and the diversity of subset selection. While both the E value and the canonical E value share positivity and symmetry, two of the four axiomatic properties of a metric space, the canonical E value is also reflexive and meets the condition of triangle inequality, and is thus itself an appropriate distance function for a metric space of protein sequences.
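The four metric-space properties named in the abstract above can be checked mechanically on any pairwise dissimilarity table. The sketch below is a generic checker, not the authors' canonicalization: the function name `check_metric_axioms` and both toy tables are hypothetical, and the numbers are not real E values. "Reflexive" is taken here to mean that every item is at distance zero from itself.

```python
import itertools


def check_metric_axioms(d, tol=1e-12):
    """Report which metric axioms a square dissimilarity table satisfies."""
    idx = range(len(d))
    return {
        # Positivity: no negative dissimilarities.
        "positive": all(d[i][j] >= -tol for i in idx for j in idx),
        # Symmetry: d(i, j) equals d(j, i).
        "symmetric": all(abs(d[i][j] - d[j][i]) <= tol for i in idx for j in idx),
        # Reflexivity: d(i, i) is zero.
        "reflexive": all(abs(d[i][i]) <= tol for i in idx),
        # Triangle inequality: d(i, k) <= d(i, j) + d(j, k) for all triples.
        "triangle_inequality": all(
            d[i][k] <= d[i][j] + d[j][k] + tol
            for i, j, k in itertools.product(idx, repeat=3)
        ),
    }


# Hypothetical table that satisfies all four axioms.
toy_metric = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# Hypothetical table with non-zero self-dissimilarity and a triangle violation.
toy_non_metric = [[0.5, 1, 5], [1, 0.5, 1], [5, 1, 0.5]]

print(check_metric_axioms(toy_metric))
print(check_metric_axioms(toy_non_metric))
```

The second table is positive and symmetric yet fails the other two checks, which mirrors the situation the abstract describes for the unadjusted measure.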

2025-09-08T16:18:13Z 36 pages, 4 figures, 3 tables Boryeu Mao http://arxiv.org/abs/2510.17187v1 A Standardized Benchmark for Machine-Learned Molecular Dynamics using Weighted Ensemble Sampling 2025-10-20T06:02:36Z

The rapid evolution of molecular dynamics (MD) methods, including machine-learned dynamics, has outpaced the development of standardized tools for method validation. Objective comparison between simulation approaches is often hindered by inconsistent evaluation metrics, insufficient sampling of rare conformational states, and the absence of reproducible benchmarks. To address these challenges, we introduce a modular benchmarking framework that systematically evaluates protein MD methods using enhanced sampling analysis. Our approach uses weighted ensemble (WE) sampling via The Weighted Ensemble Simulation Toolkit with Parallelization and Analysis (WESTPA), based on progress coordinates derived from Time-lagged Independent Component Analysis (TICA), enabling fast and efficient exploration of protein conformational space. The framework includes a flexible, lightweight propagator interface that supports arbitrary simulation engines, allowing both classical force fields and machine learning-based models. Additionally, the framework offers a comprehensive evaluation suite capable of computing more than 19 different metrics and visualizations across a variety of domains. We further contribute a dataset of nine diverse proteins, ranging from 10 to 224 residues, that span a variety of folding complexities and topologies. Each protein has been extensively simulated at 300 K for one million MD steps per starting point (4 ns). To demonstrate the utility of our framework, we perform validation tests using classic MD simulations with implicit solvent and compare protein conformational sampling using a fully trained versus under-trained CGSchNet model. By standardizing evaluation protocols and enabling direct, reproducible comparisons across MD approaches, our open-source platform lays the groundwork for consistent, rigorous benchmarking across the molecular simulation community.
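The simulation length quoted in the abstract above implies an integration timestep, which the short calculation below makes explicit. It assumes only the two figures stated there (one million steps, 4 ns) and a constant timestep; the variable names are illustrative.

```python
# Figures taken from the abstract: one million MD steps cover 4 ns.
n_steps = 1_000_000
total_time_ns = 4

# 1 ns = 1e6 fs, so convert the total time to femtoseconds first.
total_time_fs = total_time_ns * 1_000_000

# Implied timestep, assuming every step has the same length.
timestep_fs = total_time_fs / n_steps
print(f"implied timestep: {timestep_fs} fs")
```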

2025-10-20T06:02:36Z 37 Pages (Main Text), 10 Figures, Submitted to Journal of Physical Chemistry B Alexander Aghili Andy Bruce Daniel Sabo Sanya Murdeshwar Kevin Bachelor Ionut Mistreanu Ashwin Lokapally Razvan Marinescu http://arxiv.org/abs/2507.09251v2 Reshaping Biomolecular Structure Prediction through Strategic Conformational Exploration with HelixFold-S1 2025-10-20T02:45:49Z

Generating large ensembles of candidate conformations is standard for improving biomolecular structure prediction. Yet aimless sampling is inefficient and costly, producing many redundant conformations with limited diversity, so additional computation often yields little improvement. Here, we present HelixFold-S1, a guided planning approach that strategically targets the most informative regions of conformational space to produce accurate conformations. For each biomolecule, predicted inter-chain contact probabilities serve as a blueprint of the conformational space, guiding computational effort toward higher-probability, low-redundancy contacts that constrain structure generation. Across diverse biomolecular benchmarks, HelixFold-S1 achieves markedly higher structural accuracy than traditional unguided methods while reducing sampling requirements by an order of magnitude. Predicted contact probabilities also provide a rough indicator of prediction difficulty and sampling utility. These results demonstrate that guided planning reshapes conformational exploration and enables more efficient and accurate structural inference.

2025-07-12T11:15:40Z Lihang Liu Yang Liu Xianbin Ye Shanzhuo Zhang Yuxin Li Kunrui Zhu Yang Xue Xiaonan Zhang Xiaomin Fang http://arxiv.org/abs/2506.13174v2 GeoRecon: Graph-Level Representation Learning for 3D Molecules via Reconstruction-Based Pretraining 2025-10-20T00:21:40Z

The pretraining-finetuning paradigm has powered major advances in domains such as natural language processing and computer vision, with representative examples including masked language modeling and next-token prediction. In molecular representation learning, however, pretraining tasks remain largely restricted to node-level denoising, which effectively captures local atomic environments but is often insufficient for encoding the global molecular structure critical to graph-level property prediction tasks such as energy estimation and molecular regression. To address this gap, we introduce GeoRecon, a graph-level pretraining framework that shifts the focus from individual atoms to the molecule as an integrated whole. GeoRecon formulates a graph-level reconstruction task: during pretraining, the model is trained to produce an informative graph representation that guides geometry reconstruction while inducing smoother and more transferable latent spaces. This encourages the learning of coherent, global structural features beyond isolated atomic details. Without relying on external supervision, GeoRecon generally improves over backbone baselines on multiple molecular benchmarks including QM9, MD17, MD22, and 3BPA, demonstrating the effectiveness of graph-level reconstruction for holistic and geometry-aware molecular embeddings.

2025-06-16T07:35:49Z Shaoheng Yan Zian Li Muhan Zhang http://arxiv.org/abs/2510.16612v1 Accelerated Learning on Large Scale Screens using Generative Library Models 2025-10-18T18:33:51Z

Biological machine learning is often bottlenecked by a lack of scaled data. One promising route to relieving data bottlenecks is through high throughput screens, which can experimentally test the activity of $10^6$ to $10^{12}$ protein sequences in parallel. In this article, we introduce algorithms to optimize high throughput screens for data creation and model training. We focus on the large scale regime, where dataset sizes are limited by the cost of measurement and sequencing. We show that when active sequences are rare, we maximize information gain if we only collect positive examples of active sequences, i.e., $x$ with $y>0$. We can correct for the missing negative examples using a generative model of the library, producing a consistent and efficient estimate of the true $p(y | x)$. We demonstrate this approach in simulation and on a large scale screen of antibodies. Overall, co-design of experiments and inference lets us accelerate learning dramatically.
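One way to see why positives plus a library model can suffice is Bayes' rule: $p(y>0 \mid x) = p(y>0)\, p(x \mid y>0) / p(x)$, where $p(x)$ comes from the library model and $p(x \mid y>0)$ from the sequenced positives. The sketch below applies this identity to a toy discrete library. It is an illustration under stated assumptions, not the authors' estimator: the three-item library, its probabilities, the overall active rate, and the function name `posterior_active` are all hypothetical.

```python
from collections import Counter


def posterior_active(positive_samples, library_prob, active_rate):
    """Estimate p(active | x) from positives only, via Bayes' rule.

    positive_samples: items observed among the active (y > 0) sequences.
    library_prob: p(x) under the (assumed known) generative library model.
    active_rate: overall fraction of active sequences, p(y > 0).
    """
    counts = Counter(positive_samples)
    total = sum(counts.values())
    return {
        # Empirical p(x | y > 0) is counts[x] / total.
        x: active_rate * (counts[x] / total) / p_x
        for x, p_x in library_prob.items()
    }


# Hypothetical library over three variants and eight sequenced positives.
library_prob = {"A": 0.5, "B": 0.3, "C": 0.2}
positives = ["A"] + ["B"] * 3 + ["C"] * 4
estimate = posterior_active(positives, library_prob, active_rate=0.08)
print(estimate)
```

No inactive sequence is ever observed in this toy example; the library probabilities in the denominator stand in for the missing negatives.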

2025-10-18T18:33:51Z Eli N. Weinstein Andrei Slabodkin Mattia G. Gollub Elizabeth B. Wood http://arxiv.org/abs/2510.16590v1 Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration 2025-10-18T17:27:44Z

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring labeled training data. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a one-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Beyond solving complex chemical tasks, our work also provides a method to generate theoretically grounded synthetic datasets by mapping chemical knowledge onto the molecular structure and thereby addressing data scarcity.

2025-10-18T17:27:44Z Alan Kai Hassen and Andrius Bernatavicius contributed equally to this work Alan Kai Hassen Andrius Bernatavicius Antonius P. A. Janssen Mike Preuss Gerard J. P. van Westen Djork-Arné Clevert http://arxiv.org/abs/2510.26806v1 Molecular glues stabilize water-mediated hydrogen bonds in ternary complexes 2025-10-18T16:11:33Z

By stabilizing weak and transient protein-protein interactions (PPIs), molecular glues address the challenge of targeting proteins previously considered undruggable. Rapamycin and WDB002 are molecular glues that bind to FK506-binding protein (FKBP12) and target the FKBP12-rapamycin-associated protein (FRAP) and the centrosomal protein 250 (CEP250), respectively. Here, we used molecular dynamics simulations to gain insights into the effects of molecular glues on protein conformation and PPIs. The molecular glues modulated protein flexibility, leading to less flexibility in some regions, and changed the pattern and stability of water-mediated hydrogen bonds between the proteins. Our findings highlight the importance of considering water-mediated hydrogen bonds in developing strategies for the rational design of molecular glues.

2025-10-18T16:11:33Z 8 pages, 4 figures, Supplementary information included Apoorva Mathur Mariona Alegre Canela Max von Graevenitz Chiara Gerstner Ariane Nunes-Alves http://arxiv.org/abs/2510.16510v1 CryoDyna: Multiscale end-to-end modeling of cryo-EM macromolecule dynamics with physics-aware neural network 2025-10-18T13:53:12Z

Single-particle cryo-EM has transformed structural biology but still faces challenges in resolving conformational heterogeneity at atomic resolution. Existing cryo-EM heterogeneity analysis methods either lack atomic details or tend to be subject to overfitting due to image noise and limited information in single views. To obtain multiple conformations in atomic detail and make full use of particle images of different orientations, we present CryoDyna, a deep learning framework to infer macromolecular dynamics directly from 2D projections by integrating cross-view attention and multi-scale deformation modeling. Combining a coarse-grained MARTINI representation with atomic backmapping, CryoDyna achieves near-atomic interpretation of protein conformational landscapes. Validated on multiple simulated and experimental datasets, CryoDyna demonstrates improved modeling accuracy and robustly recovers multi-scale complex structure changes hidden in the cryo-EM particle stacks. As examples, we generated protein-RNA coordinated motions, resolved dynamics in the unseen region of the RAG signal end complex, mapped translocating ribosome states in a one-shot manner, and revealed step-wise closure of a membrane-anchored protein multimer. This work bridges the gap between cryo-EM heterogeneity analysis and atomic-scale structural dynamics, offering a promising tool for exploration of complex biological mechanisms.

2025-10-18T13:53:12Z Chengwei Zhang Shimian Li Yihao Niu Zhen Zhu Sihao Yuan Sirui Liu Yi Qin Gao