https://arxiv.org/api/r92DCv9f1OfJ6f24+A2xseWJmoM 2026-03-25T10:13:23Z 6650 375 15 http://arxiv.org/abs/2509.10410v1 Knotted DNA Configurations in Bacteriophage Capsids: A Liquid Crystal Theory Approach 2025-09-12T17:06:21Z Bacteriophages, viruses that infect bacteria, store their micron long DNA inside an icosahedral capsid with a typical diameter of 40 nm to 100 nm. Consistent with experimental observations, such confinement conditions induce an arrangement of DNA that corresponds to a hexagonal chromonic liquid-crystalline phase, and increase the topological complexity of the genome in the form of knots. A mathematical model that implements a chromonic liquid-crystalline phase and that captures the changes in topology has been lacking. We adopt a mathematical model that represents the viral DNA as a pair of a vector field and a line. The vector field is a minimizer of the total Oseen-Frank energy for nematic liquid crystals under chromonic constraints, while the line is identified with the tangent to the field at selected locations, representing the central axis of the DNA molecule. The fact that the Oseen-Frank functional assigns infinite energy to topological defects (point defects in two dimensions and line defects in three dimensions) precludes the presence of singularities and, in particular, of knot structures. To address this issue, we begin with the optimal vector field and helical line, and propose a new algorithm to introduce knots through stochastic perturbations associated with splay and twist deformations, modeled by means of a Langevin system. We conclude by comparing knot distributions generated by the model and by interpreting them in the context of previously published experimental results. Altogether, this work relies on the synergy of modeling, analysis and computation in the study of viral DNA organization in capsids. 2025-09-12T17:06:21Z Pei Liu Zhijie Wang Tamara Christiani Mariel Vazquez M. 
Carme Calderer Javier Arsuaga http://arxiv.org/abs/2509.07983v2 Steering Protein Language Models 2025-09-12T12:39:45Z Protein Language Models (PLMs), pre-trained on extensive evolutionary data from natural proteins, have emerged as indispensable tools for protein design. While powerful, PLMs often struggle to produce proteins with precisely specified functionalities or properties due to inherent challenges in controlling their outputs. In this work, we investigate the potential of Activation Steering, a technique originally developed for controlling text generation in Large Language Models (LLMs), to direct PLMs toward generating protein sequences with targeted properties. We propose a simple yet effective method that employs activation editing to steer PLM outputs, and extend this approach to protein optimization through a novel editing site identification module. Through comprehensive experiments on lysozyme-like sequence generation and optimization, we demonstrate that our methods can be seamlessly integrated into both auto-encoding and autoregressive PLMs without requiring additional training. These results highlight a promising direction for precise protein engineering using foundation models. 2025-07-01T16:03:55Z Accepted to ICML 2025 Long-Kai Huang Rongyi Zhu Bing He Jianhua Yao http://arxiv.org/abs/2502.16446v2 Auxiliary Discrminator Sequence Generative Adversarial Networks (ADSeqGAN) for Few Sample Molecule Generation 2025-09-11T19:05:03Z In this work, we introduce Auxiliary Discriminator Sequence Generative Adversarial Networks (ADSeqGAN), a novel approach for molecular generation in small-sample datasets. Traditional generative models often struggle with limited training data, particularly in drug discovery, where molecular datasets for specific therapeutic targets, such as nucleic acids binders and central nervous system (CNS) drugs, are scarce. 
ADSeqGAN addresses this challenge by integrating an auxiliary random forest classifier as an additional discriminator into the GAN framework, significantly improving molecular generation quality and class specificity. Our method incorporates a pretrained generator and the Wasserstein distance to enhance training stability and diversity. We evaluate ADSeqGAN across three representative cases. First, on nucleic acid- and protein-targeting molecules, ADSeqGAN shows superior capability in generating nucleic acid binders compared to baseline models. Second, through oversampling, it markedly improves CNS drug generation, achieving higher yields than traditional de novo models. Third, in cannabinoid receptor type 1 (CB1) ligand design, ADSeqGAN generates novel druglike molecules, with 32.8% predicted actives, surpassing the hit rates of CB1-focused and general-purpose libraries when assessed by a target-specific LRIP-SF scoring function. Overall, ADSeqGAN offers a versatile framework for molecular design in data-scarce scenarios, with demonstrated applications in nucleic acid binders, CNS drugs, and CB1 ligands. 2025-02-23T05:22:53Z Accepted by Journal of Chemical Information and Modeling, ASAP Haocheng Tang Jing Long Beihong Ji Junmei Wang http://arxiv.org/abs/2509.06465v4 CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction 2025-09-11T05:09:47Z Antibody binding site prediction plays a pivotal role in computational immunology and therapeutic antibody design. Existing sequence or structure methods rely on single-view features and fail to identify antibody-specific binding sites on the antigens. In this paper, we propose CAME-AB, a novel Cross-modality Attention framework with a Mixture-of-Experts (MoE) backbone for robust antibody binding site prediction. 
CAME-AB integrates five biologically grounded modalities, including raw amino acid encodings, BLOSUM substitution profiles, pretrained language model embeddings, structure-aware features, and GCN-refined biochemical graphs, into a unified multimodal representation. To enhance adaptive cross-modal reasoning, we propose an adaptive modality fusion module that learns to dynamically weight each modality based on its global relevance and input-specific contribution. A Transformer encoder combined with an MoE module further promotes feature specialization and capacity expansion. We additionally incorporate a supervised contrastive learning objective to explicitly shape the latent space geometry, encouraging intra-class compactness and inter-class separability. To improve optimization stability and generalization, we apply stochastic weight averaging during training. Extensive experiments on benchmark antibody-antigen datasets demonstrate that CAME-AB consistently outperforms strong baselines on multiple metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation studies further validate the effectiveness of each architectural component and the benefit of multimodal feature integration. The model implementation details and the code are available at https://anonymous.4open.science/r/CAME-AB-C525 2025-09-08T09:24:09Z Hongzong Li Jiahao Ma Zhanpeng Shi Rui Xiao Fanming Jin Ye-Fan Hu Hangjun Che Jian-Dong Huang http://arxiv.org/abs/2509.03084v2 SurGBSA: Learning Representations From Molecular Dynamics Simulations 2025-09-10T23:46:01Z Self-supervised pretraining from static structures of drug-like compounds and proteins enables powerful learned feature representations. Learned features demonstrate state-of-the-art performance on a range of predictive tasks including molecular properties, structure generation, and protein-ligand interactions. 
The majority of approaches are limited by their use of static structures, and it remains an open question how best to use atomistic molecular dynamics (MD) simulations to develop more generalized models to improve prediction accuracy for novel molecular structures. We present SURrogate mmGBSA (SurGBSA) as a new modeling approach for MD-based representation learning, which learns a surrogate function of the Molecular Mechanics Generalized Born Surface Area (MMGBSA). We show for the first time the benefits of physics-informed pre-training to train a surrogate MMGBSA model on a collection of over 1.4 million 3D trajectories collected from MD simulations of the CASF-2016 benchmark. SurGBSA demonstrates a dramatic 27,927x speedup versus a traditional physics-based single-point MMGBSA calculation while nearly matching single-point MMGBSA accuracy on the challenging pose ranking problem for identification of the correct top pose (-0.4% difference). Our work advances the development of molecular foundation models by showing model improvements when training on MD simulations. Models, code and training data are made publicly available. 2025-09-03T07:27:21Z Derek Jones Yue Yang Felice C. Lightstone Niema Moshiri Jonathan E. Allen Tajana S. Rosing http://arxiv.org/abs/2509.08707v1 Tokenizing Loops of Antibodies 2025-09-10T15:56:19Z The complementarity-determining regions of antibodies are loop structures that are key to their interactions with antigens, and of high importance to the design of novel biologics. Since the 1980s, categorizing the diversity of CDR structures into canonical clusters has enabled the identification of key structural motifs of antibodies. However, existing approaches have limited coverage and cannot be readily incorporated into protein foundation models. Here we introduce ImmunoGlobulin LOOp Tokenizer, Igloo, a multimodal antibody loop tokenizer that encodes backbone dihedral angles and sequence. 
Igloo is trained using a contrastive learning objective to map loops with similar backbone dihedral angles closer together in latent space. Igloo can efficiently retrieve the closest matching loop structures from a structural antibody database, outperforming existing methods on identifying similar H3 loops by 5.9%. Igloo assigns tokens to all loops, addressing the limited coverage issue of canonical clusters, while retaining the ability to recover canonical loop conformations. To demonstrate the versatility of Igloo tokens, we show that they can be incorporated into protein language models with IglooLM and IglooALM. On predicting binding affinity of heavy chain variants, IglooLM outperforms the base protein language model on 8 out of 10 antibody-antigen targets. Additionally, it is on par with existing state-of-the-art sequence-based and multimodal protein language models, performing comparably to models with 7× more parameters. IglooALM samples antibody loops which are diverse in sequence and more consistent in structure than state-of-the-art antibody inverse folding models. Igloo demonstrates the benefit of introducing multimodal tokens for antibody loops for encoding the diverse landscape of antibody loops, improving protein foundation models, and for antibody CDR design. 2025-09-10T15:56:19Z 21 pages, 7 figures, 10 tables, code available at https://github.com/prescient-design/igloo Ada Fang Robert G. Alberstein Simon Kelow Frédéric A. Dreyer http://arxiv.org/abs/2509.08633v1 Quantifying the liquid-liquid transition in cold water/glycerol mixtures by ih-RIDME 2025-09-10T14:27:53Z Water/glycerol mixtures are common for experiments with biomacromolecules at cryogenic temperatures due to their vitrification properties. Above the glass transition temperature, they undergo liquid-liquid phase separation. 
Using the novel EPR technique called intermolecular hyperfine Relaxation-Induced Dipolar Modulation Enhancement (ih-RIDME), we quantified the molar composition in frozen water/glycerol mixtures with one or the other component deuterated after the phase transition. Our experiments reveal nearly equal phase composition regardless of the proton/deuterium isotope balance. With the new ih-RIDME data, we can also revisit the already reported body of glass transition data for such mixtures and build a consistent picture for water/glycerol freezing and phase transitions. Our results also indicate that ih-RIDME has the potential for investigating the solvation shells of spin-labelled macromolecules. 2025-09-10T14:27:53Z Manuscript prepared for submission Sergei Kuzin Maxim Yulikov http://arxiv.org/abs/2509.07458v1 Unveiling Biological Models Through Turing Patterns 2025-09-09T07:26:36Z Turing patterns play a fundamental role in morphogenesis and population dynamics, encoding key information about the underlying biological mechanisms. Yet, traditional inverse problems have largely relied on non-biological data such as boundary measurements, neglecting the rich information embedded in the patterns themselves. Here we introduce a new research direction that directly leverages physical observables from nature--the amplitude of Turing patterns--to achieve complete parameter identification. We present a framework that uses the spatial amplitude profile of a single pattern to simultaneously recover all system parameters, including wavelength, diffusion constants, and the full nonlinear forms of chemotactic and kinetic coefficient functions. Demonstrated on models of chemotactic bacteria, this amplitude-based approach establishes a biologically grounded, mathematically rigorous paradigm for reverse-engineering pattern formation mechanisms across diverse biological systems. 
2025-09-09T07:26:36Z 22 pages keywords: inverse reaction-diffusion equations, Turing patterns, Turing instability, periodic solutions, sinusoidal form Yuhan Li Hongyu Liu Catharine W. K. Lo http://arxiv.org/abs/2411.10548v5 BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery 2025-09-08T19:12:19Z Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use. 2024-11-15T19:46:16Z Peter St. John Dejun Lin Polina Binder Malcolm Greaves Vega Shah John St. 
John Adrian Lange Patrick Hsu Rajesh Illango Arvind Ramanathan Anima Anandkumar David H Brookes Akosua Busia Abhishaike Mahajan Stephen Malina Neha Prasad Sam Sinai Lindsay Edwards Thomas Gaudelet Cristian Regep Martin Steinegger Burkhard Rost Alexander Brace Kyle Hippe Luca Naef Keisuke Kamata George Armstrong Kevin Boyd Zhonglin Cao Han-Yi Chou Simon Chu Allan dos Santos Costa Sajad Darabi Eric Dawson Kieran Didi Cong Fu Mario Geiger Michelle Gill Darren J Hsu Gagan Kaushik Maria Korshunova Steven Kothen-Hill Youhan Lee Meng Liu Micha Livne Zachary McClure Jonathan Mitchell Alireza Moradzadeh Ohad Mosafi Youssef Nashed Saee Paliwal Yuxing Peng Sara Rabhi Farhad Ramezanghorbani Danny Reidenbach Camir Ricketts Brian C Roland Kushal Shah Tyler Shimko Hassan Sirelkhatim Savitha Srinivasan Abraham C Stern Dorota Toczydlowska Srimukh Prasad Veccham Niccolò Alberto Elia Venanzi Anton Vorontsov Jared Wilber Isabel Wilkinson Wei Jing Wong Eva Xue Cory Ye Xin Yu Yang Zhang Guoqing Zhou Becca Zandstein Alejandro Chacon Prashant Sohani Maximilian Stadler Christian Hundt Feiwen Zhu Christian Dallago Bruno Trentini Emine Kucukbenli Saee Paliwal Timur Rvachov Eddie Calleja Johnny Israeli Harry Clifford Risto Haukioja Nicholas Haemel Kyle Tretina Neha Tadimeti Anthony B Costa http://arxiv.org/abs/2404.00081v2 Molecular Generative Adversarial Network with Multi-Property Optimization 2025-09-08T10:22:05Z Deep generative models, such as generative adversarial networks (GANs), have been employed for de novo molecular generation in drug discovery. Most prior studies have utilized reinforcement learning (RL) algorithms, particularly Monte Carlo tree search (MCTS), to handle the discrete nature of molecular representations in GANs. However, due to the inherent instability in training GANs and RL models, along with the high computational cost associated with MCTS sampling, MCTS RL-based GANs struggle to scale to large chemical databases. 
To tackle these challenges, this study introduces a novel GAN based on actor-critic RL with instant and global rewards, called InstGAN, to generate molecules at the token level with multi-property optimization. Furthermore, maximized information entropy is leveraged to alleviate mode collapse. The experimental results demonstrate that InstGAN outperforms other baselines, achieves comparable performance to state-of-the-art models, and efficiently generates molecules with multi-property optimization. The source code will be released upon acceptance of the paper. 2024-03-29T08:55:39Z Huidong Tang Chen Li Sayaka Kamei Yoshihiro Yamanishi Yasuhiko Morimoto http://arxiv.org/abs/2509.06271v1 Computational predictions of nutrient precipitation for intensified cell culture media via amino acid solution thermodynamics 2025-09-08T01:39:27Z The majority of therapeutic monoclonal antibodies (mAbs) on the market are produced using Chinese Hamster Ovary (CHO) cells cultured at scale in chemically defined cell culture medium. Because of the high costs associated with mammalian cell cultures, obtaining high cell densities to produce high product titers is desired. These bioprocesses require high concentrations of nutrients in the basal media and periodic addition of concentrated feed media to sustain cell growth and therapeutic protein productivity. Unfortunately, the desired or optimal nutrient concentrations of the feed media are often solubility limited due to precipitation of chemical complexes that form in the solution. Experimentally screening the various cell culture media configurations, which contain 50 to 100 compounds, can be expensive and laborious. This article lays the foundation for utilizing computational tools to understand precipitation of nutrients in cell culture media by studying the pairwise interactions between amino acids in thermodynamic models. 
Activity coefficient data for one amino acid in water and amino acid solubility data of two amino acids in water have been used to determine a single set of UNIFAC group interaction parameters to predict the thermodynamic behavior of the multi-component systems found in mammalian cell culture media. The data collected in this study is, to our knowledge, the largest set of ternary system amino acid solubility data reported to date. These amino acid precipitation predictions have been verified with experimentally measured ternary and quaternary amino acid solutions. Thus, we demonstrate the utility of our model as a digital twin to identify optimal cell culture media compositions by replacing empirical approaches for nutrient precipitation with computational predictions based on thermodynamics of individual media components in complex mixtures. 2025-09-08T01:39:27Z 32 pages, 8 figures Jayanth Venkatarama Reddy Nelson Ndahiro Lateef Aliyu Ashwin Dravid Tianxin Xang Jinke Wu Michael Betenbaugh Marc Donohue http://arxiv.org/abs/2403.03726v4 Diffusion on language model encodings for protein sequence generation 2025-09-06T08:28:28Z Protein sequence design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present DiMA, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach. 
We extensively evaluate existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family generation, motif scaffolding and infilling, and fold-specific sequence design. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios. Code is released at https://github.com/MeshchaninovViacheslav/DiMA. 2024-03-06T14:15:20Z Viacheslav Meshchaninov Pavel Strashnov Andrey Shevtsov Fedor Nikolaev Nikita Ivanisenko Olga Kardymon Dmitry Vetrov http://arxiv.org/abs/2509.05517v1 Towards better structural models from cryo-electron microscopy data with physics-based methods 2025-09-05T22:00:20Z Cryo-electron microscopy can now routinely deliver atomic resolution structures for a variety of biological systems. The relevance and value of these structures are directly related to their ability to help rationalize experimental observables, which in turn depends on the quality of the model built into the density map. Coupling traditional model building tools with physics-based methods, such as docking, simulation, and modern force fields, has been shown to improve the quality of the resulting structures. Here, we survey the landscape of these hybrid approaches, highlighting their usefulness for medium- and low-resolution datasets, as well as for structures of small molecules, and make the argument that the community stands to benefit from their inclusion in model building and refinement workflows. 
2025-09-05T22:00:20Z 13 pages, 2 figures. Submitted to FEBS Letters Hande Boyaci Selcuk Gabriella Reggiano Jacob Robson-Tull Lichirui Zhang João P. G. L. M. Rodrigues http://arxiv.org/abs/2509.04998v1 Directed Evolution of Proteins via Bayesian Optimization in Embedding Space 2025-09-05T10:47:49Z Directed evolution is an iterative laboratory process for designing proteins with improved function by repeatedly synthesizing new protein variants and evaluating the desired property with expensive and time-consuming biochemical screening. Machine learning methods can help select informative or promising variants for screening to increase their quality and reduce the amount of necessary screening. In this paper, we present a novel method for machine-learning-assisted directed evolution of proteins which combines Bayesian optimization with an informative representation of protein variants extracted from a pre-trained protein language model. We demonstrate that the new representation based on the sequence embeddings significantly improves the performance of Bayesian optimization, yielding better results with the same total number of screenings. At the same time, our method outperforms the state-of-the-art machine-learning-assisted directed evolution methods with a regression objective. 2025-09-05T10:47:49Z 8 pages, 2 figures Proceedings of 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 2024, pp. 91-98 Matouš Soldát Jiří Kléma 10.1109/BIBM62325.2024.10822356 http://arxiv.org/abs/2508.20004v2 Modal Geometry Governs Proteoform Dynamics 2025-09-04T09:55:38Z The fundamental laws governing proteoform dynamics have yet to be formulated. As a result, it is unclear how a specific proteoform, a distinct molecular variant of a protein, dynamically shapes its own future by evolving into new modes that exist only in potential until realised. 
Here, Modal Geometric Field (MGF) Theory couples real and abstract proteoform transitions through four axioms. Axioms 1 to 3 (invariant) dictate that only first-order transitions occur on the discrete, volume-invariant, non-symplectic modal manifold. Axiom 4 (mutable) projects the occupancy and shape of a real, instantiated molecule into the modal manifold, generating occupancy-induced curvature. By coupling what is real to what is abstract, curvature, which is always conserved, governs proteoform dynamics by dictating the least-action modal transition. Because curvature distribution renders activation energy relative, barriers are mutable, and entropy emerges inevitably from curvature transport. This unification of energy, entropy, and curvature yields hysteresis, path dependence, fractal self-similarity, and trajectories that oscillate between order and chaos. As a scale-invariant and universal framework, MGF Theory reveals how modal geometry governs proteoform dynamics. 2025-08-27T16:07:55Z 25 pages, 4 figures James N. Cobley