https://arxiv.org/api/r92DCv9f1OfJ6f24+A2xseWJmoM
Updated: 2026-03-25T10:13:23Z

http://arxiv.org/abs/2509.10410v1
Knotted DNA Configurations in Bacteriophage Capsids: A Liquid Crystal Theory Approach
Updated: 2025-09-12T17:06:21Z
Bacteriophages, viruses that infect bacteria, store their micron-long DNA inside an icosahedral capsid with a typical diameter of 40 nm to 100 nm. Consistent with experimental observations, such confinement conditions induce an arrangement of DNA that corresponds to a hexagonal chromonic liquid-crystalline phase, and increase the topological complexity of the genome in the form of knots. A mathematical model that implements a chromonic liquid-crystalline phase and that captures the changes in topology has been lacking. We adopt a mathematical model that represents the viral DNA as a pair of a vector field and a line. The vector field is a minimizer of the total Oseen-Frank energy for nematic liquid crystals under chromonic constraints, while the line is identified with the tangent to the field at selected locations, representing the central axis of the DNA molecule. The fact that the Oseen-Frank functional assigns infinite energy to topological defects (point defects in two dimensions and line defects in three dimensions) precludes the presence of singularities and, in particular, of knot structures. To address this issue, we begin with the optimal vector field and helical line, and propose a new algorithm to introduce knots through stochastic perturbations associated with splay and twist deformations, modeled by means of a Langevin system. We conclude by comparing knot distributions generated by the model and by interpreting them in the context of previously published experimental results. Altogether, this work relies on the synergy of modeling, analysis and computation in the study of viral DNA organization in capsids.
Published: 2025-09-12T17:06:21Z
Authors: Pei Liu, Zhijie Wang, Tamara Christiani, Mariel Vazquez, M. Carme Calderer, Javier Arsuaga

http://arxiv.org/abs/2509.07983v2
Steering Protein Language Models
Updated: 2025-09-12T12:39:45Z
Protein Language Models (PLMs), pre-trained on extensive evolutionary data from natural proteins, have emerged as indispensable tools for protein design. While powerful, PLMs often struggle to produce proteins with precisely specified functionalities or properties due to inherent challenges in controlling their outputs. In this work, we investigate the potential of Activation Steering, a technique originally developed for controlling text generation in Large Language Models (LLMs), to direct PLMs toward generating protein sequences with targeted properties. We propose a simple yet effective method that employs activation editing to steer PLM outputs, and extend this approach to protein optimization through a novel editing site identification module. Through comprehensive experiments on lysozyme-like sequence generation and optimization, we demonstrate that our methods can be seamlessly integrated into both auto-encoding and autoregressive PLMs without requiring additional training. These results highlight a promising direction for precise protein engineering using foundation models.
Published: 2025-07-01T16:03:55Z
Comment: Accepted to ICML 2025
Authors: Long-Kai Huang, Rongyi Zhu, Bing He, Jianhua Yao

http://arxiv.org/abs/2502.16446v2
Auxiliary Discriminator Sequence Generative Adversarial Networks (ADSeqGAN) for Few Sample Molecule Generation
Updated: 2025-09-11T19:05:03Z
In this work, we introduce Auxiliary Discriminator Sequence Generative Adversarial Networks (ADSeqGAN), a novel approach for molecular generation in small-sample datasets. Traditional generative models often struggle with limited training data, particularly in drug discovery, where molecular datasets for specific therapeutic targets, such as nucleic acid binders and central nervous system (CNS) drugs, are scarce.
ADSeqGAN addresses this challenge by integrating an auxiliary random forest classifier as an additional discriminator into the GAN framework, significantly improving molecular generation quality and class specificity. Our method incorporates a pretrained generator and the Wasserstein distance to enhance training stability and diversity. We evaluate ADSeqGAN across three representative cases. First, on nucleic acid- and protein-targeting molecules, ADSeqGAN shows superior capability in generating nucleic acid binders compared to baseline models. Second, through oversampling, it markedly improves CNS drug generation, achieving higher yields than traditional de novo models. Third, in cannabinoid receptor type 1 (CB1) ligand design, ADSeqGAN generates novel druglike molecules, with 32.8% predicted actives, surpassing hit rates of CB1-focused and general-purpose libraries when assessed by a target-specific LRIP-SF scoring function. Overall, ADSeqGAN offers a versatile framework for molecular design in data-scarce scenarios, with demonstrated applications in nucleic acid binders, CNS drugs, and CB1 ligands.
Published: 2025-02-23T05:22:53Z
Comment: Accepted by Journal of Chemical Information and Modeling, ASAP
Authors: Haocheng Tang, Jing Long, Beihong Ji, Junmei Wang

http://arxiv.org/abs/2509.06465v4
CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction
Updated: 2025-09-11T05:09:47Z
Antibody binding site prediction plays a pivotal role in computational immunology and therapeutic antibody design. Existing sequence or structure methods rely on single-view features and fail to identify antibody-specific binding sites on the antigens. In this paper, we propose CAME-AB, a novel Cross-modality Attention framework with a Mixture-of-Experts (MoE) backbone for robust antibody binding site prediction.
CAME-AB integrates five biologically grounded modalities, including raw amino acid encodings, BLOSUM substitution profiles, pretrained language model embeddings, structure-aware features, and GCN-refined biochemical graphs, into a unified multimodal representation. To enhance adaptive cross-modal reasoning, we propose an adaptive modality fusion module that learns to dynamically weight each modality based on its global relevance and input-specific contribution. A Transformer encoder combined with an MoE module further promotes feature specialization and capacity expansion. We additionally incorporate a supervised contrastive learning objective to explicitly shape the latent space geometry, encouraging intra-class compactness and inter-class separability. To improve optimization stability and generalization, we apply stochastic weight averaging during training. Extensive experiments on benchmark antibody-antigen datasets demonstrate that CAME-AB consistently outperforms strong baselines on multiple metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation studies further validate the effectiveness of each architectural component and the benefit of multimodal feature integration. The model implementation details and code are available at https://anonymous.4open.science/r/CAME-AB-C525
Published: 2025-09-08T09:24:09Z
Authors: Hongzong Li, Jiahao Ma, Zhanpeng Shi, Rui Xiao, Fanming Jin, Ye-Fan Hu, Hangjun Che, Jian-Dong Huang

http://arxiv.org/abs/2509.03084v2
SurGBSA: Learning Representations From Molecular Dynamics Simulations
Updated: 2025-09-10T23:46:01Z
Self-supervised pretraining from static structures of drug-like compounds and proteins enables powerful learned feature representations. Learned features demonstrate state-of-the-art performance on a range of predictive tasks including molecular properties, structure generation, and protein-ligand interactions.
The majority of approaches are limited by their use of static structures, and it remains an open question how best to use atomistic molecular dynamics (MD) simulations to develop more generalized models to improve prediction accuracy for novel molecular structures. We present SURrogate mmGBSA (SurGBSA) as a new modeling approach for MD-based representation learning, which learns a surrogate function of the Molecular Mechanics Generalized Born Surface Area (MMGBSA). We show for the first time the benefits of physics-informed pre-training to train a surrogate MMGBSA model on a collection of over 1.4 million 3D trajectories collected from MD simulations of the CASF-2016 benchmark. SurGBSA demonstrates a dramatic 27,927x speedup versus a traditional physics-based single-point MMGBSA calculation while nearly matching single-point MMGBSA accuracy on the challenging pose ranking problem for identification of the correct top pose (-0.4% difference). Our work advances the development of molecular foundation models by showing model improvements when training on MD simulations. Models, code and training data are made publicly available.
Published: 2025-09-03T07:27:21Z
Authors: Derek Jones, Yue Yang, Felice C. Lightstone, Niema Moshiri, Jonathan E. Allen, Tajana S. Rosing

http://arxiv.org/abs/2509.08707v1
Tokenizing Loops of Antibodies
Updated: 2025-09-10T15:56:19Z
The complementarity-determining regions of antibodies are loop structures that are key to their interactions with antigens, and of high importance to the design of novel biologics. Since the 1980s, categorizing the diversity of CDR structures into canonical clusters has enabled the identification of key structural motifs of antibodies. However, existing approaches have limited coverage and cannot be readily incorporated into protein foundation models. Here we introduce ImmunoGlobulin LOOp Tokenizer, Igloo, a multimodal antibody loop tokenizer that encodes backbone dihedral angles and sequence.
Igloo is trained using a contrastive learning objective to map loops with similar backbone dihedral angles closer together in latent space. Igloo can efficiently retrieve the closest matching loop structures from a structural antibody database, outperforming existing methods on identifying similar H3 loops by 5.9%. Igloo assigns tokens to all loops, addressing the limited coverage issue of canonical clusters, while retaining the ability to recover canonical loop conformations. To demonstrate the versatility of Igloo tokens, we show that they can be incorporated into protein language models with IglooLM and IglooALM. On predicting binding affinity of heavy chain variants, IglooLM outperforms the base protein language model on 8 out of 10 antibody-antigen targets. Additionally, it is on par with existing state-of-the-art sequence-based and multimodal protein language models, performing comparably to models with 7x more parameters. IglooALM samples antibody loops that are diverse in sequence and more consistent in structure than state-of-the-art antibody inverse folding models. Igloo demonstrates the benefit of introducing multimodal tokens for antibody loops for encoding the diverse landscape of antibody loops, improving protein foundation models, and for antibody CDR design.
Published: 2025-09-10T15:56:19Z
Comment: 21 pages, 7 figures, 10 tables, code available at https://github.com/prescient-design/igloo
Authors: Ada Fang, Robert G. Alberstein, Simon Kelow, Frédéric A. Dreyer

http://arxiv.org/abs/2509.08633v1
Quantifying the liquid-liquid transition in cold water/glycerol mixtures by ih-RIDME
Updated: 2025-09-10T14:27:53Z
Water/glycerol mixtures are common for experiments with biomacromolecules at cryogenic temperatures due to their vitrification properties. Above the glass transition temperature, they undergo liquid-liquid phase separation.
Using the novel EPR technique called intermolecular hyperfine Relaxation-Induced Dipolar Modulation Enhancement (ih-RIDME), we quantified the molar composition in frozen water/glycerol mixtures with one or the other component deuterated after the phase transition. Our experiments reveal nearly equal phase composition regardless of the proton/deuterium isotope balance. With the new ih-RIDME data, we can also revisit the already reported body of glass transition data for such mixtures and build a consistent picture for water/glycerol freezing and phase transitions. Our results also indicate that ih-RIDME has the potential for investigating the solvation shells of spin-labelled macromolecules.
Published: 2025-09-10T14:27:53Z
Comment: Manuscript prepared for submission
Authors: Sergei Kuzin, Maxim Yulikov

http://arxiv.org/abs/2509.07458v1
Unveiling Biological Models Through Turing Patterns
Updated: 2025-09-09T07:26:36Z
Turing patterns play a fundamental role in morphogenesis and population dynamics, encoding key information about the underlying biological mechanisms. Yet, traditional inverse problems have largely relied on non-biological data such as boundary measurements, neglecting the rich information embedded in the patterns themselves. Here we introduce a new research direction that directly leverages physical observables from nature, namely the amplitude of Turing patterns, to achieve complete parameter identification. We present a framework that uses the spatial amplitude profile of a single pattern to simultaneously recover all system parameters, including wavelength, diffusion constants, and the full nonlinear forms of chemotactic and kinetic coefficient functions.
Demonstrated on models of chemotactic bacteria, this amplitude-based approach establishes a biologically grounded, mathematically rigorous paradigm for reverse-engineering pattern formation mechanisms across diverse biological systems.
Published: 2025-09-09T07:26:36Z
Comment: 22 pages. Keywords: inverse reaction-diffusion equations, Turing patterns, Turing instability, periodic solutions, sinusoidal form
Authors: Yuhan Li, Hongyu Liu, Catharine W. K. Lo

http://arxiv.org/abs/2411.10548v5
BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery
Updated: 2025-09-08T19:12:19Z
Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphics processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three-billion-parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.
Published: 2024-11-15T19:46:16Z
Authors: Peter St. John, Dejun Lin, Polina Binder, Malcolm Greaves, Vega Shah, John St. John, Adrian Lange, Patrick Hsu, Rajesh Illango, Arvind Ramanathan, Anima Anandkumar, David H Brookes, Akosua Busia, Abhishaike Mahajan, Stephen Malina, Neha Prasad, Sam Sinai, Lindsay Edwards, Thomas Gaudelet, Cristian Regep, Martin Steinegger, Burkhard Rost, Alexander Brace, Kyle Hippe, Luca Naef, Keisuke Kamata, George Armstrong, Kevin Boyd, Zhonglin Cao, Han-Yi Chou, Simon Chu, Allan dos Santos Costa, Sajad Darabi, Eric Dawson, Kieran Didi, Cong Fu, Mario Geiger, Michelle Gill, Darren J Hsu, Gagan Kaushik, Maria Korshunova, Steven Kothen-Hill, Youhan Lee, Meng Liu, Micha Livne, Zachary McClure, Jonathan Mitchell, Alireza Moradzadeh, Ohad Mosafi, Youssef Nashed, Saee Paliwal, Yuxing Peng, Sara Rabhi, Farhad Ramezanghorbani, Danny Reidenbach, Camir Ricketts, Brian C Roland, Kushal Shah, Tyler Shimko, Hassan Sirelkhatim, Savitha Srinivasan, Abraham C Stern, Dorota Toczydlowska, Srimukh Prasad Veccham, Niccolò Alberto Elia Venanzi, Anton Vorontsov, Jared Wilber, Isabel Wilkinson, Wei Jing Wong, Eva Xue, Cory Ye, Xin Yu, Yang Zhang, Guoqing Zhou, Becca Zandstein, Alejandro Chacon, Prashant Sohani, Maximilian Stadler, Christian Hundt, Feiwen Zhu, Christian Dallago, Bruno Trentini, Emine Kucukbenli, Saee Paliwal, Timur Rvachov, Eddie Calleja, Johnny Israeli, Harry Clifford, Risto Haukioja, Nicholas Haemel, Kyle Tretina, Neha Tadimeti, Anthony B Costa

http://arxiv.org/abs/2404.00081v2
Molecular Generative Adversarial Network with Multi-Property Optimization
Updated: 2025-09-08T10:22:05Z
Deep generative models, such as generative adversarial networks (GANs), have been employed for de novo molecular generation in drug discovery. Most prior studies have utilized reinforcement learning (RL) algorithms, particularly Monte Carlo tree search (MCTS), to handle the discrete nature of molecular representations in GANs. However, due to the inherent instability in training GANs and RL models, along with the high computational cost associated with MCTS sampling, MCTS RL-based GANs struggle to scale to large chemical databases.
To tackle these challenges, this study introduces a novel GAN based on actor-critic RL with instant and global rewards, called InstGAN, to generate molecules at the token level with multi-property optimization. Furthermore, maximized information entropy is leveraged to alleviate mode collapse. The experimental results demonstrate that InstGAN outperforms other baselines, achieves comparable performance to state-of-the-art models, and efficiently generates molecules with multi-property optimization. The source code will be released upon acceptance of the paper.
Published: 2024-03-29T08:55:39Z
Authors: Huidong Tang, Chen Li, Sayaka Kamei, Yoshihiro Yamanishi, Yasuhiko Morimoto

http://arxiv.org/abs/2509.06271v1
Computational predictions of nutrient precipitation for intensified cell culture media via amino acid solution thermodynamics
Updated: 2025-09-08T01:39:27Z
The majority of therapeutic monoclonal antibodies (mAbs) on the market are produced using Chinese Hamster Ovary (CHO) cells cultured at scale in chemically defined cell culture medium. Because of the high costs associated with mammalian cell cultures, obtaining high cell densities to produce high product titers is desired. These bioprocesses require high concentrations of nutrients in the basal media and periodic addition of concentrated feed media to sustain cell growth and therapeutic protein productivity. Unfortunately, the desired or optimal nutrient concentrations of the feed media are often solubility limited due to precipitation of chemical complexes that form in the solution. Experimentally screening the various cell culture media configurations, which contain 50 to 100 compounds, can be expensive and laborious. This article lays the foundation for utilizing computational tools to understand precipitation of nutrients in cell culture media by studying the pairwise interactions between amino acids in thermodynamic models.
Activity coefficient data for one amino acid in water and amino acid solubility data of two amino acids in water have been used to determine a single set of UNIFAC group interaction parameters to predict the thermodynamic behavior of the multi-component systems found in mammalian cell culture media. The data collected in this study is, to our knowledge, the largest set of ternary system amino acid solubility data reported to date. These amino acid precipitation predictions have been verified with experimentally measured ternary and quaternary amino acid solutions. Thus, we demonstrate the utility of our model as a digital twin to identify optimal cell culture media compositions by replacing empirical approaches for nutrient precipitation with computational predictions based on thermodynamics of individual media components in complex mixtures.
Published: 2025-09-08T01:39:27Z
Comment: 32 pages, 8 figures
Authors: Jayanth Venkatarama Reddy, Nelson Ndahiro, Lateef Aliyu, Ashwin Dravid, Tianxin Xang, Jinke Wu, Michael Betenbaugh, Marc Donohue

http://arxiv.org/abs/2403.03726v4
Diffusion on language model encodings for protein sequence generation
Updated: 2025-09-06T08:28:28Z
Protein sequence design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present DiMA, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach.
We extensively evaluate existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family generation, motif scaffolding and infilling, and fold-specific sequence design. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios. Code is released at https://github.com/MeshchaninovViacheslav/DiMA.
Published: 2024-03-06T14:15:20Z
Authors: Viacheslav Meshchaninov, Pavel Strashnov, Andrey Shevtsov, Fedor Nikolaev, Nikita Ivanisenko, Olga Kardymon, Dmitry Vetrov

http://arxiv.org/abs/2509.05517v1
Towards better structural models from cryo-electron microscopy data with physics-based methods
Updated: 2025-09-05T22:00:20Z
Cryo-electron microscopy can now routinely deliver atomic resolution structures for a variety of biological systems. The relevance and value of these structures are directly related to their ability to help rationalize experimental observables, which in turn depends on the quality of the model built into the density map. Coupling traditional model building tools with physics-based methods, such as docking, simulation, and modern force fields, has been shown to improve the quality of the resulting structures.
Here, we survey the landscape of these hybrid approaches, highlighting their usefulness for medium- and low-resolution datasets, as well as for structures of small molecules, and make the argument that the community stands to benefit from their inclusion in model building and refinement workflows.
Published: 2025-09-05T22:00:20Z
Comment: 13 pages, 2 figures. Submitted to FEBS Letters
Authors: Hande Boyaci Selcuk, Gabriella Reggiano, Jacob Robson-Tull, Lichirui Zhang, João P. G. L. M. Rodrigues

http://arxiv.org/abs/2509.04998v1
Directed Evolution of Proteins via Bayesian Optimization in Embedding Space
Updated: 2025-09-05T10:47:49Z
Directed evolution is an iterative laboratory process of designing proteins with improved function by repeatedly synthesizing new protein variants and evaluating their desired property with expensive and time-consuming biochemical screening. Machine learning methods can help select informative or promising variants for screening to increase their quality and reduce the amount of necessary screening. In this paper, we present a novel method for machine-learning-assisted directed evolution of proteins which combines Bayesian optimization with an informative representation of protein variants extracted from a pre-trained protein language model. We demonstrate that the new representation based on the sequence embeddings significantly improves the performance of Bayesian optimization, yielding better results with the same total number of screenings. At the same time, our method outperforms the state-of-the-art machine-learning-assisted directed evolution methods with a regression objective.
Published: 2025-09-05T10:47:49Z
Comment: 8 pages, 2 figures
Journal reference: Proceedings of 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 2024, pp. 91-98
Authors: Matouš Soldát, Jiří Kléma
DOI: 10.1109/BIBM62325.2024.10822356

http://arxiv.org/abs/2508.20004v2
Modal Geometry Governs Proteoform Dynamics
Updated: 2025-09-04T09:55:38Z
The fundamental laws governing proteoform dynamics have yet to be formulated.
As a result, it is unclear how a specific proteoform, a distinct molecular variant of a protein, dynamically shapes its own future by evolving into new modes that exist only in potential until realised. Here, Modal Geometric Field (MGF) Theory couples real and abstract proteoform transitions through four axioms. Axioms 1 to 3 (invariant) dictate that only first-order transitions occur on the discrete, volume-invariant, non-symplectic modal manifold. Axiom 4 (mutable) projects the occupancy and shape of a real, instantiated molecule into the modal manifold, generating occupancy-induced curvature. By coupling what is real to what is abstract, curvature, which is always conserved, governs proteoform dynamics by dictating the least-action modal transition. Because curvature distribution renders activation energy relative, barriers are mutable, and entropy emerges inevitably from curvature transport. This unification of energy, entropy, and curvature yields hysteresis, path dependence, fractal self-similarity, and trajectories that oscillate between order and chaos. As a scale-invariant and universal framework, MGF Theory reveals how modal geometry governs proteoform dynamics.
Published: 2025-08-27T16:07:55Z
Comment: 25 pages, 4 figures
Authors: James N. Cobley