Feed: https://arxiv.org/api/dQlW2DD5YOFWqvGWe673tvCJoHQ
Feed updated: 2026-03-16T14:02:12Z

http://arxiv.org/abs/2602.00157v2
ProDCARL: Reinforcement Learning-Aligned Diffusion Models for De Novo Antimicrobial Peptide Design
Updated: 2026-02-04T05:28:09Z
Antimicrobial resistance threatens healthcare sustainability and motivates low-cost computational discovery of antimicrobial peptides (AMPs). De novo peptide generation must optimize antimicrobial activity and safety through low predicted toxicity, but likelihood-trained generators do not enforce these goals explicitly. We introduce ProDCARL, a reinforcement-learning alignment framework that couples a diffusion-based protein generator (EvoDiff OA-DM 38M) with sequence property predictors for AMP activity and peptide toxicity. We fine-tune the diffusion prior on AMP sequences to obtain a domain-aware generator. Top-k policy-gradient updates use classifier-derived rewards plus entropy regularization and early stopping to preserve diversity and reduce reward hacking. In silico experiments show ProDCARL increases the mean predicted AMP score from 0.081 after fine-tuning to 0.178. The joint high-quality hit rate reaches 6.3\% with pAMP $>$0.7 and pTox $<$0.3. ProDCARL maintains high diversity, with $1-$mean pairwise identity equal to 0.929. Qualitative analyses with AlphaFold3 and ProtBERT embeddings suggest candidates show plausible AMP-like structural and semantic characteristics. ProDCARL serves as a candidate generator that narrows experimental search space, and experimental validation remains future work.
Published: 2026-01-29T19:16:39Z
Authors: Fang Sheng, Mohammad Noaeen, Zahra Shakeri

http://arxiv.org/abs/2505.17914v4
Flexible MOF Generation with Torsion-Aware Flow Matching
Updated: 2026-02-04T04:17:30Z
Designing metal-organic frameworks (MOFs) with novel chemistries is a longstanding challenge due to their large combinatorial space and complex 3D arrangements of the building blocks.
While recent deep generative models have enabled scalable MOF generation, they assume (1) a fixed set of building blocks and (2) known local 3D coordinates of building blocks. However, this limits their ability to (1) design novel MOFs and (2) generate the structure using novel building blocks. We propose a two-stage MOF generation framework that overcomes these limitations by modeling both chemical and geometric degrees of freedom. First, we train a SMILES-based autoregressive model to generate metal and organic building blocks, paired with a cheminformatics toolkit for 3D structure initialization. Second, we introduce a flow matching model that predicts translations, rotations, and torsional angles to assemble the blocks into valid 3D frameworks. Our experiments demonstrate improved reconstruction accuracy, the generation of valid, novel, and unique MOFs, and the ability to create novel building blocks. Our code is available at https://github.com/nayoung10/MOFFlow-2.
Published: 2025-05-23T13:56:30Z
Comments: 24 pages, 9 figures
Journal reference: Neural Information Processing Systems (NeurIPS) 2025
Authors: Nayoung Kim, Seongsu Kim, Sungsoo Ahn

http://arxiv.org/abs/2602.03779v1
Generative AI for Enzyme Design and Biocatalysis
Updated: 2026-02-03T17:40:20Z
Sparked by innovations in generative artificial intelligence (AI), the field of protein design has undergone a paradigm shift with an explosion of new models for optimizing existing enzymes or creating them from scratch. After more than one decade of low success rates for computationally designed enzymes, generative AI models are now frequently used for designing proficient enzymes. Here, we provide a comprehensive overview and classification of generative AI models for enzyme design, highlighting models with experimental validation relevant to real-world settings and outlining their respective limitations. We argue that generative AI models now have the maturity to create and optimize enzymes for industrial applications.
Wider adoption of generative AI models with experimental feedback loops can speed up the development of biocatalysts and serve as a community assessment to inform the next generation of models.
Published: 2026-02-03T17:40:20Z
Authors: Lasse Middendorf, Noelia Ferruz

http://arxiv.org/abs/2512.20581v2
MERGE-RNA: a physics-based model to predict RNA secondary structure ensembles with chemical probing
Updated: 2026-02-02T13:14:03Z
RNA function is deeply tied to secondary structure, with most molecules operating through dynamic and heterogeneous structural ensembles. While current analysis tools typically output single static structures or averaged contact maps, chemical probing methods like DMS capture nucleotide-resolution signals that represent the full structural ensemble, which, however, remain difficult to interpret structurally. To address this, we present MERGE-RNA, a framework that explicitly describes and outputs RNA as a structural ensemble. By modeling the experimental pipeline through a statistical physics framework, MERGE-RNA learns a small set of transferable and interpretable parameters, enabling the seamless integration of measurements across different molecules, probe concentrations, and replicates in a single optimization to improve robustness. Our model employs a maximum-entropy principle to predict thermodynamic populations, introducing only the minimal sequence-specific adjustments necessary to align the ensemble with experimental data. We validate MERGE-RNA on diverse RNAs, showing that it achieves strong structural accuracy and the ability to fill knowledge gaps in single-conformation reference structures. Furthermore, in a designed RNA construct for which we report new DMS data, MERGE-RNA deconvolves mixed states to reveal transient intermediate populations involved in strand displacement, dynamics that remain invisible to traditional analysis methods.
Published: 2025-12-23T18:26:57Z
Authors: Giuseppe Sacco, Jianhui Li, Redmond P.
Smyth, Guido Sanguinetti, Giovanni Bussi

http://arxiv.org/abs/2602.13249v1
Boltz is a Strong Baseline for Atom-level Representation Learning
Updated: 2026-02-02T08:11:53Z
Foundation models in molecular learning have advanced along two parallel tracks: protein models, which typically utilize evolutionary information to learn amino acid-level representations for folding, and small-molecule models, which focus on learning atom-level representations for property prediction tasks such as ADMET. Notably, cutting-edge protein-centric models such as Boltz now operate at atom-level granularity for protein-ligand co-folding, yet their atom-level expressiveness for small-molecule tasks remains unexplored. A key open question is whether these protein co-folding models capture transferable chemical physics or rely on protein evolutionary signals, which would limit their utility for small-molecule tasks. In this work, we investigate the quality of Boltz atom-level representations across diverse small-molecule benchmarks. Our results show that Boltz is competitive with specialized baselines on ADMET property prediction tasks and effective for molecular generation and optimization. These findings suggest that the representational capacity of cutting-edge protein-centric models has been underexplored and position Boltz as a strong baseline for atom-level representation learning for small molecules.
Published: 2026-02-02T08:11:53Z
Authors: Hyosoon Jang, Hyunjin Seo, Yunhui Jang, Seonghyun Park, Sungsoo Ahn

http://arxiv.org/abs/2511.06356v3
Reaction Prediction via Interaction Modeling of Symmetric Difference Shingle Sets
Updated: 2026-02-01T11:24:32Z
Chemical reaction prediction remains a fundamental challenge in organic chemistry, where existing machine learning models face two critical limitations: sensitivity to input permutations (molecule/atom orderings) and inadequate modeling of substructural interactions governing reactivity. These shortcomings lead to inconsistent predictions and poor generalization to real-world scenarios.
To address these challenges, we propose ReaDISH, a novel reaction prediction model that learns permutation-invariant representations while incorporating interaction-aware features. It introduces two innovations: (1) symmetric difference shingle encoding, which extends the differential reaction fingerprint (DRFP) by representing shingles as continuous high-dimensional embeddings, capturing structural changes while eliminating order sensitivity; and (2) geometry-structure interaction attention, a mechanism that models intra- and inter-molecular interactions at the shingle level. Extensive experiments demonstrate that ReaDISH improves reaction prediction performance across diverse benchmarks. It shows enhanced robustness with an average improvement of 8.76% on R$^2$ under permutation perturbations.
Published: 2025-11-09T12:29:16Z
Authors: Runhan Shi, Letian Chen, Gufeng Yu, Yang Yang

http://arxiv.org/abs/2602.00782v1
Controlling Repetition in Protein Language Models
Updated: 2026-01-31T15:47:57Z
Protein language models (PLMs) have enabled advances in structure prediction and de novo protein design, yet they frequently collapse into pathological repetition during generation. Unlike in text, where repetition merely reduces readability, in proteins it undermines structural confidence and functional viability. To unify this problem, we present the first systematic study of repetition in PLMs. We first propose quantitative metrics to characterize motif-level and homopolymer repetition and then demonstrate their negative impact on folding reliability. To address this challenge, we propose UCCS (Utility-Controlled Contrastive Steering), which steers protein generation with a constrained dataset. Instead of naively contrasting high- vs. low-repetition sequences, we construct contrastive sets that maximize differences in repetition while tightly controlling for structural utility. This disentanglement yields steering vectors that specifically target repetition without degrading foldability.
Injected at inference, these vectors consistently reduce repetition without retraining or heuristic decoding. Experiments with ESM-3 and ProtGPT2 in CATH, UniRef50, and SCOP show that our method outperforms decoding penalties and other baselines, substantially lowering repetition while preserving AlphaFold confidence scores. Our results establish repetition control as a central challenge for PLMs and highlight dataset-guided steering as a principled approach for reliable protein generation.
Published: 2026-01-31T15:47:57Z
Comments: Published as a conference paper at ICLR 2026
Authors: Jiahao Zhang, Zeqing Zhang, Di Wang, Lijie Hu

http://arxiv.org/abs/2602.00660v1
Phase Transitions in Unsupervised Feature Selection
Updated: 2026-01-31T11:13:56Z
Identifying minimal and informative feature sets is a central challenge in data analysis, particularly when few data points are available. Here we present a theoretical analysis of an unsupervised feature selection pipeline based on the Differentiable Information Imbalance (DII). We consider the specific case of structural and physico-chemical features describing a set of proteins. We show that if one considers the features as coordinates of a (hypothetical) statistical physics model, this model undergoes a phase transition as a function of the number of retained features. For physico-chemical descriptors, this transition is between a glass-like phase when the features are few and a liquid-like phase. The glass-like phase exhibits bimodal order-parameter distributions and Binder cumulant minima. In contrast, for structural descriptors the transition is less sharp. Remarkably, for physico-chemical descriptors the critical number of features identified from the DII coincides with the saturation of downstream binary classification performance.
These results provide a principled, unsupervised criterion for minimal feature sets in protein classification and reveal distinct mechanisms of criticality across different feature types.
Published: 2026-01-31T11:13:56Z
Comments: 15 pages, 4 figures in main text, 7 figures in supplemental material
Authors: Jonathan Fiorentino, Michele Monti, Dimitrios Miltiadis-Vrachnos, Vittorio Del Tatto, Alessandro Laio, Gian Gaetano Tartaglia

http://arxiv.org/abs/2601.23212v1
Disentangling multispecific antibody function with graph neural networks
Updated: 2026-01-30T17:36:19Z
Multispecific antibodies offer transformative therapeutic potential by engaging multiple epitopes simultaneously, yet their efficacy is an emergent property governed by complex molecular architectures. Rational design is often bottlenecked by the inability to predict how subtle changes in domain topology influence functional outcomes, a challenge exacerbated by the scarcity of comprehensive experimental data. Here, we introduce a computational framework to address part of this gap. First, we present a generative method for creating large-scale, realistic synthetic functional landscapes that capture non-linear interactions where biological activity depends on domain connectivity. Second, we propose a graph neural network architecture that explicitly encodes these topological constraints, distinguishing between format configurations that appear identical to sequence-only models. We demonstrate that this model, trained on synthetic landscapes, recapitulates complex functional properties and, via transfer learning, has the potential to achieve high predictive accuracy on limited biological datasets. We showcase the model's utility by optimizing trade-offs between efficacy and toxicity in trispecific T-cell engagers and retrieving optimal common light chains.
This work provides a robust benchmarking environment for disentangling the combinatorial complexity of multispecifics, accelerating the design of next-generation therapeutics.
Published: 2026-01-30T17:36:19Z
Comments: 16 pages, 5 figures, code available at https://github.com/prescient-design/synapse
Authors: Joshua Southern, Changpeng Lu, Santrupti Nerli, Samuel D. Stanton, Andrew M. Watkins, Franziska Seeger, Frédéric A. Dreyer

http://arxiv.org/abs/2602.11189v1
MuCO: Generative Peptide Cyclization Empowered by Multi-stage Conformation Optimization
Updated: 2026-01-30T10:02:15Z
Modeling peptide cyclization is critical for the virtual screening of candidate peptides with desirable physical and pharmaceutical properties. This task is challenging because a cyclic peptide often exhibits diverse, ring-shaped conformations, which cannot be well captured by deterministic prediction models derived from linear peptide folding. In this study, we propose MuCO (Multi-stage Conformation Optimization), a generative peptide cyclization method that models the distribution of cyclic peptide conformations conditioned on the corresponding linear peptide. In principle, MuCO decouples the peptide cyclization task into three stages: topology-aware backbone design, generative side-chain packing, and physics-aware all-atom optimization, thereby generating and optimizing conformations of cyclic peptides in a coarse-to-fine manner. This multi-stage framework enables an efficient parallel sampling strategy for conformation generation and allows for rapid exploration of diverse, low-energy conformations.
Experiments on the large-scale CPSea dataset demonstrate that MuCO significantly and consistently outperforms state-of-the-art methods in physical stability, structural diversity, secondary structure recovery, and computational efficiency, making it a promising computational tool for exploring and designing cyclic peptides.
Published: 2026-01-30T10:02:15Z
Authors: Yitian Wang, Fanmeng Wang, Angxiao Yue, Wentao Guo, Yaning Cui, Hongteng Xu

http://arxiv.org/abs/2601.22757v1
Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation
Updated: 2026-01-30T09:32:12Z
Molecular generative models, often employing GPT-style language modeling on molecular string representations, have shown promising capabilities when scaled to large datasets and model sizes. However, it remains unclear and subject to debate whether these models adhere to predictable scaling laws under fixed computational budgets, an understanding that is crucial for optimally allocating resources between model size, data volume, and molecular representation. In this study, we systematically investigate the scaling behavior of molecular language models across both pretraining and downstream tasks. We train 300 models and conduct over 10,000 experiments, rigorously controlling compute budgets while independently varying model size, number of training tokens, and molecular representation. Our results demonstrate clear scaling laws in molecular models for both pretraining and downstream transfer, reveal the substantial impact of molecular representation on performance, and explain previously observed inconsistencies in scaling behavior for molecular generation. Additionally, we publicly release the largest library of molecular language models to date to facilitate future research and development.
Code and models are available at https://github.com/SZU-ADDG/MLM-Scaling.
Published: 2026-01-30T09:32:12Z
Comments: 34 pages, 51 figures
Authors: Dong Xu, Qihua Pan, Sisi Yuan, Jianqiang Li, Zexuan Zhu, Junkai Ji

http://arxiv.org/abs/2601.22408v1
Minimal-Action Discrete Schrödinger Bridge Matching for Peptide Sequence Design
Updated: 2026-01-29T23:38:36Z
Generative modeling of peptide sequences requires navigating a discrete and highly constrained space in which many intermediate states are chemically implausible or unstable. Existing discrete diffusion and flow-based methods rely on reversing fixed corruption processes or following prescribed probability paths, which can force generation through low-likelihood regions and require countless sampling steps. We introduce Minimal-action discrete Schrödinger Bridge Matching (MadSBM), a rate-based generative framework for peptide design that formulates generation as a controlled continuous-time Markov process on the amino-acid edit graph. To yield probability trajectories that remain near high-likelihood sequence neighborhoods throughout generation, MadSBM 1) defines generation relative to a biologically informed reference process derived from pre-trained protein language model logits and 2) learns a time-dependent control field that biases transition rates to produce low-action transport paths from a masked prior to the data distribution.
We finally introduce guidance to the MadSBM sampling procedure towards a specific functional objective, expanding the design space of therapeutic peptides; to our knowledge, this represents the first-ever application of discrete classifier guidance to Schrödinger bridge-based generative models.
Published: 2026-01-29T23:38:36Z
Authors: Shrey Goel, Pranam Chatterjee

http://arxiv.org/abs/2510.24736v2
RNAGenScape: Property-Guided, Optimized Generation of mRNA Sequences with Manifold Langevin Dynamics
Updated: 2026-01-29T16:34:46Z
Generating property-optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence-function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property-guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property-organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property-guided Langevin dynamics procedure that performs optimization along the manifold.
Across three real-world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.
Published: 2025-10-14T19:55:41Z
Comments: ICML 2025 Generative AI and Biology (GenBio) Workshop, Oral presentation (top 9.7%)
Authors: Danqi Liao, Chen Liu, Xingzhi Sun, Dié Tang, Haochen Wang, Scott Youlten, Srikar Krishna Gopinath, Haejeong Lee, Ethan C. Strayer, Antonio J. Giraldez, Smita Krishnaswamy

http://arxiv.org/abs/2601.21216v1
Multiple binding modes of AKT on PIP$_3$-containing membranes
Updated: 2026-01-29T03:26:27Z
The PI3K/AKT signaling pathway is triggered by recruitment of AKT to cellular membranes. Although AKT is a multidomain serine/threonine kinase composed of an N-terminal pleckstrin homology (PH) domain and a C-terminal kinase domain, how these domains cooperate to regulate AKT activation on membranes remains unclear at the molecular level. Here, using molecular dynamics simulations of full-length AKT on PIP$_3$-containing lipid bilayers, we identify four distinct membrane-binding modes that differ in the orientations and membrane contacts of the PH and kinase domains. In addition to PIP$_3$ binding to the PH domain, we observe specific PIP$_3$ interactions with basic residues in the kinase domain. In the most stable mode, PIP$_3$ interacts with both the canonical and a secondary binding site in the PH domain, while the kinase domain adopts an orientation in which the activation-loop phosphorylation site is exposed to the solvent. Interestingly, the populations of these binding modes depend on the PIP$_3$ concentration in the membrane, leading to changes in the preferred orientation of AKT.
These findings shed light on how lipid recognition by the PH domain and the kinase domain of AKT cooperatively shape its membrane-bound conformations.
Published: 2026-01-29T03:26:27Z
Authors: Yuki Nakagaki, Eiji Yamamoto

http://arxiv.org/abs/2601.17138v2
AI Developments for T and B Cell Receptor Modeling and Therapeutic Design
Updated: 2026-01-28T19:03:26Z
Artificial intelligence (AI) is accelerating progress in modeling T and B cell receptors by enabling predictive and generative frameworks grounded in sequence data and immune context. This chapter surveys recent advances in the use of protein language models, machine learning, and multimodal integration for immune receptor modeling. We highlight emerging strategies to leverage single-cell and repertoire-scale datasets, and optimize immune receptor candidates for therapeutic design. These developments point toward a new generation of data-efficient, generalizable, and clinically relevant models that better capture the diversity and complexity of adaptive immunity.
Published: 2026-01-23T19:28:08Z
Authors: Linhui Xie, Aurelien Pelissier, Yanjun Shao, Maria Rodriguez Martinez