OneProt: Towards Multi-Modal Protein Foundation Models

2025-10-18T11:37:56Z

Recent advances in Artificial Intelligence have enabled multi-modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, text, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of protein modality encoders in a lightweight fine-tuning scheme that focuses on pairwise alignment with sequence data rather than requiring full matches. This novel approach comprises a mix of Graph Neural Networks and transformer architectures. It demonstrates strong performance in retrieval tasks and showcases the efficacy of multi-modal systems in Protein Machine Learning through a broad spectrum of downstream baselines, including enzyme function prediction and binding site analysis. Furthermore, OneProt enables the transfer of representational information from specialized encoders to the sequence encoder, enhancing capabilities for distinguishing evolutionarily related and unrelated sequences and exhibiting representational properties where evolutionarily related proteins align in similar directions within the latent space. In addition, we extensively investigate modality ablations to identify the encoders that contribute most to predictive performance, highlighting the significance of the binding site encoder, which has not been used in similar models previously. This work expands the horizons of multi-modal protein models, paving the way for transformative applications in drug discovery, biocatalytic reaction planning, and protein engineering.
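
The pairwise alignment idea can be illustrated with a symmetric InfoNCE-style contrastive loss, the objective used by ImageBind. The following is a minimal sketch and not the OneProt implementation: the function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration. Each sequence embedding is pulled toward its paired embedding from one other modality and pushed away from the other items in the batch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(seq_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE loss between sequence embeddings and one other modality.

    Row i of seq_emb is assumed to be paired with row i of other_emb.
    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    a = [l2_normalize(v) for v in seq_emb]
    b = [l2_normalize(v) for v in other_emb]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


# Toy batch: two proteins, 2-D embeddings (hypothetical values).
seq = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # structure embeddings that match their sequences
swapped = [[0.1, 0.9], [0.9, 0.1]]   # the same embeddings paired with the wrong sequences
loss_aligned = info_nce(seq, aligned)
loss_swapped = info_nce(seq, swapped)
```

Because every modality is only ever paired with the sequence modality, such a loss needs (sequence, structure) or (sequence, text) pairs but never a protein with all modalities present at once.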

Protein Folding with Neural Ordinary Differential Equations

2025-10-17T22:56:03Z

Recent advances in protein structure prediction, such as AlphaFold, have demonstrated the power of deep neural architectures like the Evoformer for capturing complex spatial and evolutionary constraints on protein conformation. However, the depth of the Evoformer, comprising 48 stacked blocks, introduces high computational costs and rigid layerwise discretization. Inspired by Neural Ordinary Differential Equations (Neural ODEs), we propose a continuous-depth formulation of the Evoformer, replacing its 48 discrete blocks with a Neural ODE parameterization that preserves its core attention-based operations. This continuous-time Evoformer achieves constant memory cost (in depth) via the adjoint method, while allowing a principled trade-off between runtime and accuracy through adaptive ODE solvers. Benchmarking on protein structure prediction tasks, we find that the Neural ODE-based Evoformer produces structurally plausible predictions and reliably captures certain secondary structure elements, such as alpha-helices, though it does not fully replicate the accuracy of the original architecture. However, our model achieves this performance using dramatically fewer resources, just 17.5 hours of training on a single GPU, highlighting the promise of continuous-depth models as a lightweight and interpretable alternative for biomolecular modeling. This work opens new directions for efficient and adaptive protein structure prediction frameworks.
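
The relation between a stack of residual blocks and a continuous-depth model can be sketched as follows. This is a toy illustration and not the authors' code: `toy_field` is a hypothetical stand-in for an Evoformer-like update, and a fixed-step Euler solver replaces the adaptive solvers and adjoint method mentioned in the abstract. It shows that one residual block corresponds to one Euler step of size 1, and that the number of solver steps trades runtime against accuracy.

```python
import math


def residual_stack(h, block, n_blocks):
    """Discrete depth: apply h <- h + block(h) a fixed number of times."""
    for _ in range(n_blocks):
        h = [x + dx for x, dx in zip(h, block(h))]
    return h


def euler_integrate(h, vector_field, n_steps, t_end=1.0):
    """Continuous depth: integrate dh/dt = vector_field(h, t) from 0 to t_end."""
    dt = t_end / n_steps
    for k in range(n_steps):
        t = k * dt
        h = [x + dt * dx for x, dx in zip(h, vector_field(h, t))]
    return h


def toy_field(h, t):
    """Hypothetical vector field (simple decay) with a known exact solution."""
    return [-x for x in h]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


h0 = [1.0, 2.0]
exact = [x * math.exp(-1.0) for x in h0]      # closed-form solution at t = 1

coarse = euler_integrate(h0, toy_field, n_steps=4)    # cheap, less accurate
fine = euler_integrate(h0, toy_field, n_steps=48)     # costlier, more accurate
err_coarse = max_abs_error(coarse, exact)
err_fine = max_abs_error(fine, exact)
```

In a real Neural ODE the vector field is a learned network and the step count is chosen by the solver's error tolerance rather than fixed in advance.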

Sparing of DNA irradiated with Ultra-High Dose-Rates under Physiological Oxygen and Salt conditions

2025-10-17T09:41:45Z

Cancer treatment with radiotherapy aims to kill tumor cells and spare healthy tissue. Thus, the experimentally observed sparing of healthy tissue by the FLASH effect during irradiation with ultra-high dose rates (UHDR) enables clinicians to extend the therapeutic window. However, the underlying radiobiological and chemical mechanisms are far from being understood. DNA is one of the main molecular targets for radiotherapy. Ionizing radiation damage to DNA in water depends strongly on the salt, pH, buffer and oxygen content of the solvent. Here we present a study of plasmid DNA pUC19, irradiated with 18 MeV electrons at low dose rates (LDR) and UHDR under tightly controlled ambient and physiological oxygen conditions in PBS at pH 7.4. For the first time, a sparing effect of DNA strand-break induction between UHDR (>10 MGy/s) and LDR (<0.1 Gy/s) irradiated plasmid DNA under physiological oxygen, salt and pH is observed for total doses above 10 Gy. Under physiological oxygen (physoxia, 5% O2, 40 mmHg), more single (SSB) and double strand-breaks (DSB) are observed after exposure to LDR than to UHDR. This behaviour is absent for ambient oxygen (normoxia, 21% O2, 150-160 mmHg). The experiments are accompanied by TOPAS-nBio based particle-scattering and chemical Monte Carlo simulations (MCS) to obtain information about the yields of reactive oxygen species (ROS). An extended set of chemical reactions was considered, which reduced the discrepancy between experiment and simulation found in previous works and allowed the prediction of dose-rate (DR) dependent g-values of hydrogen peroxide (H2O2). To explain the observed DNA sparing effect under FLASH conditions at physoxia, a model is proposed that accounts for the interplay of O2 with OH-induced H-abstraction at the phosphate backbone and for the conversion of DNA base damage to SSB via beta-elimination processes, under consideration of the dose-rate dependent H3O+ yield.

Coder as Editor: Code-driven Interpretable Molecular Optimization

2025-10-16T08:55:06Z

Molecular optimization is a central task in drug discovery that requires precise structural reasoning and domain knowledge. While large language models (LLMs) have shown promise in generating high-level editing intentions in natural language, they often struggle to faithfully execute these modifications, particularly when operating on non-intuitive representations like SMILES. We introduce MECo, a framework that bridges reasoning and execution by translating editing actions into executable code. MECo reformulates molecular optimization for LLMs as a cascaded framework: generating human-interpretable editing intentions from a molecule and property goal, followed by translating those intentions into executable structural edits via code generation. Our approach achieves over 98% accuracy in reproducing held-out realistic edits derived from chemical reactions and target-specific compound pairs. On downstream optimization benchmarks spanning physicochemical properties and target activities, MECo substantially improves consistency by 38-86 percentage points to 90%+ and achieves higher success rates over SMILES-based baselines while preserving structural similarity. By aligning intention with execution, MECo enables consistent, controllable and interpretable molecular design, laying the foundation for high-fidelity feedback loops and collaborative human-AI workflows in drug discovery.

Learning Inter-Atomic Potentials without Explicit Equivariance

2025-10-15T17:55:37Z

Accurate and scalable machine-learned inter-atomic potentials (MLIPs) are essential for molecular simulations ranging from drug discovery to new material design. Current state-of-the-art models enforce roto-translational symmetries through equivariant neural network architectures, a hard-wired inductive bias that can often lead to reduced flexibility, computational efficiency, and scalability. In this work, we introduce TransIP: Transformer-based Inter-Atomic Potentials, a novel training paradigm for interatomic potentials achieving symmetry compliance without explicit architectural constraints. Our approach guides a generic non-equivariant Transformer-based model to learn SO(3)-equivariance by optimizing its representations in the embedding space. Trained on the recent Open Molecules (OMol25) collection, a large and diverse molecular dataset built specifically for MLIPs and covering different types of molecules (including small organics, biomolecular fragments, and electrolyte-like species), TransIP effectively learns symmetry in its latent space, providing low equivariance error. Further, compared to a data augmentation baseline, TransIP achieves 40% to 60% improvement in performance across varying OMol25 dataset sizes. More broadly, our work shows that learned equivariance can be a powerful and efficient alternative to augmentation-based MLIP models.
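
What "equivariance error" means for a force predictor can be made concrete with a short check: rotating the input and then predicting should give the same result as predicting and then rotating the output. The sketch below is an output-level diagnostic written under that assumption; it is not the embedding-space objective used to train TransIP, and both toy models (`centroid_pull`, `axis_biased`) are hypothetical.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply(rotation, vec):
    """Matrix-vector product for a 3x3 matrix and a 3-vector."""
    return [sum(rotation[i][k] * vec[k] for k in range(3)) for i in range(3)]


def equivariance_error(model, positions, rotation):
    """Root-mean-square norm of model(R x) - R model(x) over atoms."""
    out_of_rotated = model([apply(rotation, p) for p in positions])
    rotated_out = [apply(rotation, f) for f in model(positions)]
    sq = sum(
        (a - b) ** 2
        for u, v in zip(out_of_rotated, rotated_out)
        for a, b in zip(u, v)
    )
    return math.sqrt(sq / len(positions))


def centroid_pull(positions):
    """Toy equivariant 'force': each atom is pulled toward the centroid."""
    n = len(positions)
    center = [sum(p[i] for p in positions) / n for i in range(3)]
    return [[center[i] - p[i] for i in range(3)] for p in positions]


def axis_biased(positions):
    """Toy non-equivariant 'force': only ever acts along the x axis."""
    return [[-p[0], 0.0, 0.0] for p in positions]


atoms = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
R = rot_z(0.7)
err_equivariant = equivariance_error(centroid_pull, atoms, R)
err_biased = equivariance_error(axis_biased, atoms, R)
```

A full check would average this quantity over many random rotations rather than a single rotation about one axis.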

Multi-state Protein Design with DynamicMPNN

2025-10-15T12:29:20Z

Structural biology has long been dominated by the "one sequence, one structure, one function" paradigm, yet many critical biological processes, from enzyme catalysis to membrane transport, depend on proteins that adopt multiple conformational states. Existing multi-state design approaches rely on post-hoc aggregation of single-state predictions, achieving poor experimental success rates compared to single-state design. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations through joint learning across conformational ensembles. Trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using AlphaFold 3, DynamicMPNN outperforms ProteinMPNN by up to 25% on decoy-normalized RMSD and by 12% on sequence recovery across our challenging multi-state protein benchmark.

Precision Design of Cyclic Peptides using AlphaFold

2025-10-15T03:57:57Z

This independent research investigates methods to improve the precision of cyclic peptide generation targeting the HIV gp120 trimer using AlphaFold. The study explores proximity-based hotspot mapping at the CD4 binding site, centroid distance penalization, generative loss tuning, and custom loss function development. These enhancements produced cyclic peptides that closely resemble the binding conformation of the CD4 attachment inhibitor BMS-818251. The proposed methodology demonstrates improved structural control and precision in cyclic peptide generation, advancing the applicability of AlphaFold in structure-based drug discovery.

Superior Molecular Representations from Intermediate Encoder Layers

2025-10-15T01:55:53Z

Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we first analyze the information flow in five diverse molecular encoders and find that intermediate layers retain more general-purpose features, whereas the final layer specializes and compresses information. We then perform an empirical layer-wise evaluation across 22 property prediction tasks. We find that using frozen embeddings from optimal intermediate layers improves downstream performance by an average of 5.4%, up to 28.6%, compared to the final layer. Furthermore, finetuning encoders truncated at intermediate depths achieves even greater average improvements of 8.5%, with increases as high as 40.8%, obtaining new state-of-the-art results on several benchmarks. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code will be made publicly available.
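
A layer-wise evaluation of frozen embeddings can be sketched as a loop that scores each layer with a cheap probe and keeps the best one. The code below is only an illustration: the leave-one-out nearest-neighbour probe, the function names, and the toy one-dimensional embeddings are assumptions, and the paper's own probes, tasks, and metrics are not reproduced here.

```python
def loo_nearest_neighbor_accuracy(embeddings, labels):
    """Leave-one-out 1-nearest-neighbour accuracy, used here as a simple probe."""
    correct = 0
    for i, x in enumerate(embeddings):
        best_j, best_d = None, float("inf")
        for j, y in enumerate(embeddings):
            if i == j:
                continue  # never let a point vote for itself
            d = sum((p - q) ** 2 for p, q in zip(x, y))
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(embeddings)


def select_best_layer(layer_embeddings, labels):
    """Score every layer's frozen embeddings and return (best_layer, all_scores)."""
    scores = {
        layer: loo_nearest_neighbor_accuracy(embs, labels)
        for layer, embs in layer_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical embeddings of four molecules taken from two encoder depths.
labels = [0, 0, 1, 1]
layer_embeddings = {
    2: [[0.0], [0.1], [1.0], [1.1]],   # intermediate layer: classes are separated
    4: [[0.0], [1.0], [0.1], [1.1]],   # final layer: classes are mixed
}
best_layer, layer_scores = select_best_layer(layer_embeddings, labels)
```

In practice the layer should be chosen on a validation split so that the reported test performance is not biased by the selection.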

Protein Design with Dynamic Protein Vocabulary

2025-10-14T14:04:33Z

Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.

HelixVS: Deep Learning-Enhanced Structure-Based Platform for Screening and Design

2025-10-14T07:35:23Z

Drug discovery through virtual screening (VS) has become a popular strategy for identifying hits against protein targets. Alongside VS, molecular design further expands accessible chemical space. Together, these approaches have the potential to reduce the cost and time needed for manual selection and wet-laboratory experiments, thereby accelerating drug discovery pipelines. Improving the cost-effectiveness of virtual screening is a significant challenge, aiming to explore larger compound libraries while maintaining lower screening costs. Here, we present HelixVS, a structure-based VS platform enhanced by deep learning models. HelixVS integrates a precise deep learning-based pose-scoring model and a pose-screening module into a multi-stage VS process, enabling more effective screening of active compounds. Compared to classic molecular docking tools like Vina, HelixVS demonstrated significantly improved screening performance across nearly a hundred targets, achieving an average 2.6-fold higher enrichment factor (EF) and more than 10 times faster screening speed. We applied HelixVS in four drug development pipelines, targeting both traditional competitive drug-binding pockets and novel protein-protein interaction interfaces. Wet-lab validations across these pipelines consistently identified active compounds, with over 10% of the molecules tested in wet labs demonstrating activity at µM or even nM levels. This demonstrates the ability of HelixVS to identify high-affinity ligands for various targets and pockets. In addition, the HelixVS platform has been extended with HelixVS-Syn, which enables design of novel compounds from reference scaffolds. These designed molecules are seamlessly integrated into the HelixVS screening workflow, allowing researchers to explore both existing chemical libraries and novel chemical space with high affinity, synthetic accessibility, and structural novelty.
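
For readers unfamiliar with the enrichment factor, the sketch below computes it in its common form: the hit rate among the top-ranked fraction of the library divided by the hit rate of the whole library. The function name, the convention that higher scores are better, and the toy library are assumptions for illustration; the abstract does not state which top fraction was used.

```python
def enrichment_factor(scores, is_active, top_fraction=0.01):
    """EF = (actives in top fraction / size of top fraction) / (all actives / library size).

    Assumes a higher score means a better predicted binder.
    """
    n = len(scores)
    n_top = max(1, int(round(n * top_fraction)))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(1 for _, active in ranked[:n_top] if active)
    total_actives = sum(1 for active in is_active if active)
    if total_actives == 0:
        raise ValueError("EF is undefined when the library contains no actives")
    return (hits_top / n_top) / (total_actives / n)


# Toy library of 10 compounds with 2 actives, both ranked at the top.
scores = [0.95, 0.90, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]
is_active = [True, True, False, False, False, False, False, False, False, False]
ef_top20 = enrichment_factor(scores, is_active, top_fraction=0.2)
```

An EF of 1 corresponds to random ranking, so a 2.6-fold higher EF means the top of the ranked list is 2.6 times richer in actives than the baseline tool's list at the same cutoff.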

An Efficient Algorithm for Exploring RNA Branching Conformations under the Nearest-Neighbor Thermodynamic Model

2025-10-14T01:57:14Z

Background: In the Nearest-Neighbor Thermodynamic Model, a standard approach for RNA secondary structure prediction, the energy of the multiloops is modeled using a linear entropic penalty governed by three branching parameters. Although these parameters are typically fixed, recent work has shown that reparametrizing the multiloop score and considering alternative branching conformations can lead to significantly better structure predictions. However, prior approaches for exploring the alternative branching structures were computationally inefficient for long sequences. Results: We present a novel algorithm that partitions the parameter space, identifying all distinct branching structures (optimal under different branching parameters) for a given RNA sequence using the fewest possible minimum free energy computations. Our method efficiently computes the full parameter-space partition and the associated optimal structures, enabling a comprehensive evaluation of the structural landscape across parameter choices. We apply this algorithm to the Archive II benchmarking dataset, assessing the maximum attainable prediction accuracy for each sequence under the reparameterized multiloop model. We find that the potential for improvement over default predictions is substantial in many cases, and that the optimal prediction accuracy is highly sensitive to auxiliary modeling decisions, such as the treatment of lonely base pairs and dangling ends. Conclusion: Our results support the hypothesis that the conventional choice of multiloop parameters may limit prediction accuracy and that exploring alternative parameterizations is both tractable and worthwhile. The efficient partitioning algorithm we introduce makes this exploration feasible for longer sequences and larger datasets. Furthermore, we identify several open challenges in identifying the optimal structure.
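
The role of the three branching parameters can be shown with a small sketch of the linear multiloop penalty and of how the optimal structure changes with the parameter choice. The parameter names `a`, `b`, `c`, their assignment to the offset, branch, and unpaired-nucleotide terms, and all numeric values are illustrative assumptions (naming conventions differ between software packages). The two candidate structures are hypothetical, and this brute-force comparison is not the partitioning algorithm of the paper.

```python
def multiloop_penalty(a, b, c, branches, unpaired):
    """Linear multiloop score: offset + per-branch term + per-unpaired-nucleotide term."""
    return a + b * branches + c * unpaired


def total_energy(structure, a, b, c):
    """Energy of all non-multiloop parts plus the penalty of each multiloop."""
    return structure["base_energy"] + sum(
        multiloop_penalty(a, b, c, branches, unpaired)
        for branches, unpaired in structure["multiloops"]
    )


def best_structure(candidates, a, b, c):
    """Return the candidate with minimum free energy under the given parameters."""
    return min(candidates, key=lambda s: total_energy(s, a, b, c))


# Two hypothetical candidate structures for the same sequence (energies in kcal/mol).
candidates = [
    {"name": "branched", "base_energy": -12.0, "multiloops": [(3, 2)]},
    {"name": "unbranched", "base_energy": -8.0, "multiloops": []},
]

# A high branching penalty favours the unbranched fold, a low one the branched fold.
winner_high_penalty = best_structure(candidates, a=3.0, b=0.5, c=0.0)["name"]
winner_low_penalty = best_structure(candidates, a=1.0, b=0.2, c=0.0)["name"]
```

Each region of (a, b, c) space in which the same structure wins corresponds to one cell of the parameter-space partition described in the abstract.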

Engineering Supercomputing Platforms for Biomolecular Applications

2025-10-13T16:30:33Z

A range of computational biology software (GROMACS, AMBER, NAMD, LAMMPS, OpenMM, Psi4 and RELION) was benchmarked on a representative selection of HPC hardware, including AMD EPYC 7742 CPU nodes, NVIDIA V100 and AMD MI250X GPU nodes, and an NVIDIA GH200 testbed. The raw performance, power efficiency and data storage requirements of the software were evaluated for each HPC facility, along with qualitative factors such as the user experience and software environment. It was found that the diversity of methods used within computational biology means that there is no single HPC hardware that can optimally run every type of HPC job, and that diverse hardware is the only way to properly support all methods. New hardware, such as AMD GPUs and NVIDIA AI chips, is mostly compatible with existing methods, but is also more labour-intensive to support. GPUs offer the most efficient way to run most computational biology tasks, though some tasks still require CPUs. A fast HPC node running molecular dynamics can produce around 10 GB of data per day; however, most facilities and research institutions lack short-term and long-term means to store this data. Finally, as the HPC landscape has become more complex, deploying software and keeping HPC systems online has become more difficult. This situation could be improved through hiring/training in DevOps practices, expanding the consortium model to provide greater support to HPC system administrators, and implementing build frameworks/containerisation/virtualisation tools to allow users to configure their own software environment, rather than relying on centralised software installations.

RiboFlow: Conditional De Novo RNA Co-Design via Synergistic Flow Matching

2025-10-13T12:28:02Z

Ribonucleic acid (RNA) binds to molecules to achieve specific biological functions. While generative models are advancing biomolecule design, existing methods for designing RNA that target specific ligands face limitations in capturing RNA's conformational flexibility, ensuring structural validity, and overcoming data scarcity. To address these challenges, we introduce RiboFlow, a synergistic flow matching model to co-design RNA structures and sequences based on target molecules. By integrating RNA backbone frames, torsion angles, and sequence features in a unified architecture, RiboFlow explicitly models RNA's dynamic conformations while enforcing sequence-structure consistency to improve validity. Additionally, we curate RiboBind, a large-scale dataset of RNA-molecule interactions, to resolve the scarcity of high-quality structural data. Extensive experiments reveal that RiboFlow not only outperforms state-of-the-art RNA design methods by a large margin but also showcases controllable capabilities for achieving high binding affinity to target ligands. Our work bridges critical gaps in controllable RNA design, offering a framework for structure-aware, data-efficient generation.

Quantification of protein homodimer affinity using native mass spectrometry

2025-10-13T11:19:56Z

Biological processes rely on finely tuned homo- and heteromeric interactions between (biomacro)molecules. The strength of an interaction, typically given by the dissociation constant (KD), plays a crucial role in basic research and must be monitored throughout the development of drugs and agrochemicals. An ideal method for KD determination is applicable to various analytes with a large range of affinities, tolerates complex matrix compositions, does not require labeling, and simultaneously provides information on the structural integrity of the binding partners. Native mass spectrometry meets these criteria but typically struggles with homooligomeric complexes due to overlapping mass signals. To overcome this, we resolve monomer/dimer contributions to overlapping MS peaks by separately analyzing the charge state distribution of each oligomeric species via sample dilution and covalent crosslinking. Following this approach, we show that quantitative Laser-Induced Liquid Bead Ion Desorption mass spectrometry (qLILBID-MS) accurately captures the affinities of Bovine Serum Albumin and chemically induced dimers of Tryparedoxin, an oxidoreductase from human pathogenic Trypanosoma brucei parasites, with various molecular glues and homodimer affinities. Conveniently, qLILBID-MS requires a fraction of the sample used by other methods such as isothermal titration calorimetry and yields previously inaccessible protein homodimer KDs in the high micromolar range, which allowed us to monitor the gradual decrease in homodimer affinity via mutation of crucial dimer interface contacts. Overall, qLILBID-MS is a sensitive, robust, fast, scalable, and cost-effective alternative to quantify protein/protein interactions that can accelerate contemporary drug discovery workflows, e.g. the efficient screening for proximity-inducing molecules like proteolysis targeting chimeras and molecular glues.
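
Why dilution helps separate monomer and dimer contributions follows from the mass-action law for a homodimer, KD = [M]^2 / [D], combined with conservation of protein, c_total = [M] + 2[D]. The sketch below solves these two equations for the monomer concentration; it is a textbook equilibrium calculation and not the qLILBID-MS analysis pipeline. The function names and the example concentrations are assumptions.

```python
import math


def monomer_concentration(c_total, kd):
    """Free monomer [M] from 2[M]^2/KD + [M] - c_total = 0 (same units for both inputs)."""
    return kd * (math.sqrt(1.0 + 8.0 * c_total / kd) - 1.0) / 4.0


def dimer_fraction(c_total, kd):
    """Fraction of all protein chains that are bound in dimers, 2[D] / c_total."""
    monomer = monomer_concentration(c_total, kd)
    return (c_total - monomer) / c_total


# Hypothetical homodimer with KD = 100 (e.g. in micromolar).
kd = 100.0
fraction_at_kd = dimer_fraction(100.0, kd)      # total concentration equal to KD
fraction_diluted = dimer_fraction(10.0, kd)     # tenfold dilution shifts toward monomer
```

When the total chain concentration equals KD, exactly half of the chains are in dimers, and dilution lowers that fraction, which is what makes a dilution series informative for a homodimer.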

Protein as a Second Language for LLMs

2025-10-13T09:21:45Z

Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence-question-answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement (average +7%) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
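
The ROUGE-L metric quoted above is based on the longest common subsequence (LCS) between a generated answer and a reference answer. The sketch below computes an LCS-based F1 score with lowercase whitespace tokenization. The tokenization, the use of the balanced F1 variant, and the example sentences are assumptions, so the numbers it produces need not match the scoring script used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the harmonic mean of LCS precision and LCS recall."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical model answer and reference annotation.
reference = "the protein binds ATP"
candidate = "the protein hydrolyzes ATP"
score = rouge_l_f1(candidate, reference)
```

Because the LCS respects word order but allows gaps, the score rewards answers that keep the reference's key terms in sequence even when individual words differ.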