https://arxiv.org/api/KVGB4cjjCl3hKzjQOlnLG7pAEos 2026-03-25T14:58:44Z 6650 420 15 http://arxiv.org/abs/2508.12629v1 FlowMol3: Flow Matching for 3D De Novo Small-Molecule Generation 2025-08-18T05:13:27Z

A generative model capable of sampling realistic molecules with desired properties could accelerate chemical discovery across a wide range of applications. Toward this goal, significant effort has focused on developing models that jointly sample molecular topology and 3D structure. We present FlowMol3, an open-source, multi-modal flow matching model that advances the state of the art for all-atom, small-molecule generation. Its substantial performance gains over previous FlowMol versions are achieved without changes to the graph neural network architecture or the underlying flow matching formulation. Instead, FlowMol3's improvements arise from three architecture-agnostic techniques that incur negligible computational cost: self-conditioning, fake atoms, and train-time geometry distortion. FlowMol3 achieves nearly 100% molecular validity for drug-like molecules with explicit hydrogens, more accurately reproduces the functional group composition and geometry of its training data, and does so with an order of magnitude fewer learnable parameters than comparable methods. We hypothesize that these techniques mitigate a general pathology affecting transport-based generative models, enabling detection and correction of distribution drift during inference. Our results highlight simple, transferable strategies for improving the stability and quality of diffusion- and flow-based molecular generative models.

2025-08-18T05:13:27Z Ian Dunn David R. Koes http://arxiv.org/abs/2508.10775v1 IBEX: Information-Bottleneck-EXplored Coarse-to-Fine Molecular Generation under Limited Data 2025-08-14T15:59:22Z

Three-dimensional generative models increasingly drive structure-based drug discovery, yet they remain constrained by the scarcity of publicly available protein-ligand complexes. Under such data scarcity, almost all existing pipelines struggle to learn transferable geometric priors and consequently overfit to training-set biases. We therefore present IBEX, an Information-Bottleneck-EXplored coarse-to-fine pipeline that tackles the chronic shortage of protein-ligand complex data in structure-based drug design. Specifically, we use PAC-Bayesian information-bottleneck theory to quantify the information density of each sample. This analysis reveals how different masking strategies affect generalization and indicates that, compared with conventional de novo generation, the constrained Scaffold Hopping task endows the model with greater effective capacity and improved transfer performance. IBEX retains the original TargetDiff architecture and hyperparameters when training the model to generate molecules compatible with the binding pocket; it then applies an L-BFGS optimization step that refines each conformation by optimizing five physics-based terms and adjusting six translational and rotational degrees of freedom in under one second. With only these modifications, IBEX raises the zero-shot docking success rate on the CrossDocked2020-based CBGBench benchmark from 53% to 64%, improves the mean Vina score from $-7.41\ \mathrm{kcal\,mol^{-1}}$ to $-8.07\ \mathrm{kcal\,mol^{-1}}$, and achieves the best median Vina energy in 57 of 100 pockets versus 3 for the original TargetDiff. IBEX also increases the QED by 25%, achieves state-of-the-art validity and diversity, and markedly reduces extrapolation error.

2025-08-14T15:59:22Z 10 pages, 8 figures Dong Xu Zhangfan Yang Jenna Xinyi Yao Shuangbao Song Zexuan Zhu Junkai Ji http://arxiv.org/abs/2507.08162v2 AmpLyze: A Deep Learning Model for Predicting the Hemolytic Concentration 2025-08-13T23:59:49Z

Red-blood-cell lysis (HC50) is the principal safety barrier for antimicrobial-peptide (AMP) therapeutics, yet existing models only say "toxic" or "non-toxic." AmpLyze closes this gap by predicting the actual HC50 value from sequence alone and explaining the residues that drive toxicity. The model couples residue-level ProtT5/ESM2 embeddings with sequence-level descriptors in dual local and global branches, aligned by a cross-attention module and trained with log-cosh loss for robustness to assay noise. The optimal AmpLyze model reaches a PCC of 0.756 and an MSE of 0.987, outperforming classical regressors and the state of the art. Ablations confirm that both branches are essential, and cross-attention adds a further 1% PCC and 3% MSE improvement. Expected-Gradients attributions reveal known toxicity hotspots and suggest safer substitutions. By turning hemolysis assessment into a quantitative, sequence-based, and interpretable prediction, AmpLyze facilitates AMP design and offers a practical tool for early-stage toxicity screening.
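
The log-cosh loss named in the abstract above behaves like a squared error for small residuals and like an absolute error for large ones, which is why it is often chosen for noisy regression targets. The following is a minimal sketch of the generic loss only; the function names are hypothetical and nothing here is taken from the AmpLyze code.

```python
import math


def log_cosh(residual: float) -> float:
    """Numerically stable log(cosh(residual)).

    Uses log(cosh(a)) = |a| + log(1 + exp(-2|a|)) - log(2),
    which avoids overflow in cosh for large residuals.
    """
    a = abs(residual)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def log_cosh_loss(predictions, targets):
    """Mean log-cosh loss over paired predictions and targets."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    total = sum(log_cosh(p - t) for p, t in zip(predictions, targets))
    return total / len(predictions)
```

For a residual r, the loss is approximately r^2/2 when |r| is small and approximately |r| - log 2 when |r| is large, so single outlying assay values contribute linearly rather than quadratically.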

2025-07-10T20:47:06Z Peng Qiu Hanqi Feng Meng-Chun Zhang Barnabas Poczos http://arxiv.org/abs/2508.10114v1 Physical Principles of Size and Frequency Scaling of Active Cytoskeletal Spirals 2025-08-13T18:20:43Z

Cytoskeletal filaments transported by surface-immobilized molecular motors with one end pinned to the surface have been observed to spiral in a myosin-driven actin 'gliding assay'. The radius of the spiral was shown to scale with motor density with an exponent of -1/3, while the frequency was theoretically predicted to scale with an exponent of 4/3. While both the spiraling radius and frequency depend on motor density, the theory assumed independence of filament length and remained to be tested on cytoskeletal systems other than actin-myosin. Here, we reconstitute dynein-driven microtubule (MT) spiraling and compare experiments to theory and numerical simulations. We characterize the scaling laws of spiraling MTs and find the radius dependence on force density to be consistent with previous results. Frequency, on the other hand, scales with force density with an exponent of ~1/3, contrary to previous predictions. We also predict that the spiral radius scales proportionally and the frequency scales inversely with filament length, both with an exponent of ~1/3. A model of variable persistence length best explains the length dependence observed in experiments. Our findings, which reconcile theory, simulations, and experiments, improve our understanding of the role of cytoskeletal filament elasticity, the mechanics of microtubule buckling and motor transport, and the physical principles of active filaments.
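
Scaling exponents such as the -1/3 quoted above are commonly estimated as the slope of a straight-line fit in log-log space. The sketch below illustrates that generic procedure on synthetic, noise-free data; the prefactor 3.0 and the density values are arbitrary assumptions, not measurements from the paper.

```python
import math


def fit_power_law_exponent(x, y):
    """Least-squares slope of log(y) against log(x).

    For data following y = c * x**alpha, the returned slope estimates alpha.
    All values must be strictly positive.
    """
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    s_xx = sum((a - mean_x) ** 2 for a in log_x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    return s_xy / s_xx


# Synthetic illustration: a radius that scales as density**(-1/3).
density = [1.0, 2.0, 4.0, 8.0, 16.0]
radius = [3.0 * d ** (-1.0 / 3.0) for d in density]
radius_exponent = fit_power_law_exponent(density, radius)
```

With real measurements the fitted slope carries uncertainty, so an exponent of "~1/3" would normally be reported with a confidence interval rather than as an exact value.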

2025-08-13T18:20:43Z 5 figures Aman Soni Shivani A. Yadav Chaitanya A. Athale http://arxiv.org/abs/2506.17857v2 AbRank: A Benchmark Dataset and Metric-Learning Framework for Antibody-Antigen Affinity Ranking 2025-08-13T17:13:41Z

Accurate prediction of antibody-antigen (Ab-Ag) binding affinity is essential for therapeutic design and vaccine development, yet the performance of current models is limited by noisy experimental labels, heterogeneous assay conditions, and poor generalization across the vast antibody and antigen sequence space. We introduce AbRank, a large-scale benchmark and evaluation framework that reframes affinity prediction as a pairwise ranking problem. AbRank aggregates over 380,000 binding assays from nine heterogeneous sources, spanning diverse antibodies, antigens, and experimental conditions, and introduces standardized data splits that systematically increase distribution shift, from local perturbations such as point mutations to broad generalization across novel antigens and antibodies. To ensure robust supervision, AbRank defines an m-confident ranking framework by filtering out comparisons with marginal affinity differences, focusing training on pairs with at least an m-fold difference in measured binding strength. As a baseline for the benchmark, we introduce WALLE-Affinity, a graph-based approach that integrates protein language model embeddings with structural information to predict pairwise binding preferences. Our benchmarks reveal significant limitations in current methods under realistic generalization settings and demonstrate that ranking-based training improves robustness and transferability. In summary, AbRank offers a robust foundation for machine learning models to generalize across the antibody-antigen space, with direct relevance for scalable, structure-aware antibody therapeutic design.
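
The m-confident filtering described above can be illustrated with a short sketch. It assumes, for illustration only, that binding strength is given as a dissociation constant (lower means stronger binding); the function name, the dictionary layout, and the complex labels are hypothetical and not taken from the AbRank release.

```python
from itertools import combinations


def m_confident_pairs(kd_by_complex, m=10.0):
    """Return (stronger, weaker) pairs whose affinities differ at least m-fold.

    kd_by_complex maps a complex identifier to a positive dissociation
    constant; pairs with a smaller fold difference are discarded as
    too marginal to provide a reliable ranking label.
    """
    if any(kd <= 0 for kd in kd_by_complex.values()):
        raise ValueError("dissociation constants must be positive")
    pairs = []
    for (a, kd_a), (b, kd_b) in combinations(sorted(kd_by_complex.items()), 2):
        if kd_a <= kd_b:
            stronger, weaker = (a, kd_a), (b, kd_b)
        else:
            stronger, weaker = (b, kd_b), (a, kd_a)
        if weaker[1] / stronger[1] >= m:
            pairs.append((stronger[0], weaker[0]))
    return pairs


# Hypothetical affinities in nM.
example = {"ab1-ag": 1.0, "ab2-ag": 5.0, "ab3-ag": 50.0}
confident = m_confident_pairs(example, m=10.0)
```

Here the 5-fold gap between `ab1-ag` and `ab2-ag` is dropped, while the two pairs with at least a 10-fold gap are kept as training comparisons.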

2025-06-21T23:34:46Z Chunan Liu Aurelien Pelissier Yanjun Shao Lilian Denzler Andrew C. R. Martin Brooks Paige María Rodríguez Martínez http://arxiv.org/abs/2508.21076v1 Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics 2025-08-12T20:39:50Z

Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS$^2$) is the major analytical technique in proteomics that identifies peptides by ionizing them, fragmenting them, and using the resulting mass spectra to identify and quantify proteins in biological samples. In MS$^2$ analysis, peptide fragment ion probability prediction plays a critical role, enhancing the accuracy of peptide identification from mass spectra as a complement to the intensity information. Current approaches rely on global statistics of fragmentation, which assume that a fragment's probability is uniform across all peptides. This assumption, however, is an oversimplification from a biochemical standpoint and limits accurate prediction. To address this gap, we present Pep2Prob, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The proposed dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of peptide sequence and charge state), summarized from more than 183 million high-quality, high-resolution, HCD MS$^2$ spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods using only global fragmentation statistics. Furthermore, performance across benchmark models with increasing capacities suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches.
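
The contrast between global and peptide-specific fragmentation statistics can be made concrete with a toy calculation. The sketch below assumes that a fragment's probability is estimated as the fraction of spectra in which it is observed; that definition, the tuple layout, and the peptide strings are illustrative assumptions and may differ from how Pep2Prob is actually constructed.

```python
from collections import defaultdict


def precursor_fragment_probabilities(spectra):
    """Per-precursor observation frequency of each fragment ion.

    spectra is a list of (peptide, charge, observed_fragment_labels) tuples;
    a precursor is the (peptide, charge) pair.
    """
    n_spectra = defaultdict(int)
    n_observed = defaultdict(lambda: defaultdict(int))
    for peptide, charge, fragments in spectra:
        precursor = (peptide, charge)
        n_spectra[precursor] += 1
        for fragment in fragments:
            n_observed[precursor][fragment] += 1
    return {
        precursor: {f: c / total for f, c in n_observed[precursor].items()}
        for precursor, total in n_spectra.items()
    }


def global_fragment_probabilities(spectra):
    """Observation frequency of each fragment ion pooled over all spectra."""
    counts = defaultdict(int)
    for _, _, fragments in spectra:
        for fragment in fragments:
            counts[fragment] += 1
    return {f: c / len(spectra) for f, c in counts.items()}


# Toy spectra with made-up peptides and fragment labels.
toy_spectra = [
    ("PEPTIDEK", 2, {"b2", "y3"}),
    ("PEPTIDEK", 2, {"y3"}),
    ("ACDEFGHK", 2, {"b2"}),
    ("ACDEFGHK", 2, {"b2"}),
]
per_precursor = precursor_fragment_probabilities(toy_spectra)
pooled = global_fragment_probabilities(toy_spectra)
```

In this toy set the pooled estimate for `b2` is 0.75, yet the two precursors individually show 0.5 and 1.0, which is the kind of peptide-specific variation a single global statistic cannot represent.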

2025-08-12T20:39:50Z Dataset is available at HuggingFace: https://huggingface.co/datasets/bandeiralab/Pep2Prob Hao Xu Zhichao Wang Shengqi Sang Pisit Wajanasara Nuno Bandeira http://arxiv.org/abs/2504.03590v2 Towards a Unified Framework for Determining Conformational Ensembles of Disordered Proteins 2025-08-12T14:40:48Z

Disordered proteins play essential roles in myriad cellular processes, yet their structural characterization remains a major challenge due to their dynamic and heterogeneous nature. Here we present a community-driven initiative to address this problem by advocating a unified framework for determining conformational ensembles of disordered proteins. Our aim is to integrate state-of-the-art experimental techniques with advanced computational methods, including knowledge-based sampling, enhanced molecular dynamics, and machine learning models. The modular framework comprises three interconnected components: experimental data acquisition, computational ensemble generation, and validation. The systematic development of this framework will ensure the accurate and reproducible determination of conformational ensembles of disordered proteins. We highlight the open challenges that must be addressed to achieve this goal, including force field accuracy, efficient sampling, and environmental dependency, and we advocate collaborative benchmarking and standardized protocols.

2025-04-04T16:57:49Z Hamidreza Ghafouri Pavel Kadeřávek Ana M Melo Maria Cristina Aspromonte Pau Bernadó Juan Cortes Zsuzsanna Dosztányi Gabor Erdos Michael Feig Giacomo Janson Kresten Lindorff-Larsen Frans A. A. Mulder Peter Nagy Richard Pestell Damiano Piovesan Marco Schiavina Benjamin Schuler Nathalie Sibille Giulio Tesei Peter Tompa Michele Vendruscolo Jiri Vondrasek Wim Vranken Lukas Zidek Silvio C. E. Tosatto Alexander Miguel Monzon http://arxiv.org/abs/2508.08109v1 Probing the Dark Energy in the Functional Protein Universe 2025-08-11T15:48:20Z

We show how to localize and quantify the functional evolutionary constraints on natural proteins. The method compares the perturbations caused by local sequence variants to the energetics of the protein folding process and to the corresponding change to the apparent selection landscape of sequences over the evolutionary time scale. The difference between the physical folding free energies and the evolutionary free energies can be called a "Dark Energy". We analyse various protein sets and thereby show that Dark Energy is largely localized at functional sites, which are often energetically frustrated from the point of view of folding. Overall, we find that about 25% of the positions of the folded globular proteins display some significant Dark Energy. When a function relies on a free energy that can be thermodynamically quantified, such as a binding energy to a partner, the relationship of this physical free energy with Dark Energy can be used to define a Functional Selection Temperature. We show that selection for folding and binding functions bear similar weights in specific protein-protein interactions.

2025-08-11T15:48:20Z 34 pages, 5 figures, Sup Table and 4 Sup figures Ezequiel A. Galpern Carlos Bueno Ignacio E. Sánchez Peter G. Wolynes Diego U. Ferreiro http://arxiv.org/abs/2406.13839v4 RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design 2025-08-11T08:09:33Z

We introduce RNA-FrameFlow, the first generative model for 3D RNA backbone design. We build upon SE(3) flow matching for protein backbone generation and establish protocols for data preparation and evaluation to address unique challenges posed by RNA modeling. We formulate RNA structures as a set of rigid-body frames and associated loss functions which account for larger, more conformationally flexible RNA backbones (13 atoms per nucleotide) vs. proteins (4 atoms per residue). Toward tackling the lack of diversity in 3D RNA datasets, we explore training with structural clustering and cropping augmentations. Additionally, we define a suite of evaluation metrics to measure whether the generated RNA structures are globally self-consistent (via inverse folding followed by forward folding) and locally recover RNA-specific structural descriptors. The most performant version of RNA-FrameFlow generates locally realistic RNA backbones of 40-150 nucleotides, over 40% of which pass our validity criteria as measured by a self-consistency TM-score >= 0.45, at which two RNAs have the same global fold. Open-source code: https://github.com/rish-16/rna-backbone-design

2024-06-19T21:06:44Z Published in Transactions on Machine Learning Research (https://openreview.net/forum?id=wOc1Yx5s09). Also presented as an Oral at Machine Learning in Computational Biology 2024, ICML 2024 Structured Probabilistic Inference & Generative Modeling Workshop, and a Spotlight at ICML 2024 AI4Science Workshop Rishabh Anand Chaitanya K. Joshi Alex Morehead Arian R. Jamasb Charles Harris Simon V. Mathis Kieran Didi Rex Ying Bryan Hooi Pietro Liò http://arxiv.org/abs/2405.14968v2 Compound Mutations in the Abl1 Kinase Cause Inhibitor Resistance by Shifting DFG Flip Mechanisms and Relative State Populations 2025-08-11T00:59:55Z

The intrinsic dynamics of most proteins are central to their function. Protein tyrosine kinases such as Abl1 undergo significant conformational changes that modulate their activity in response to different stimuli. These conformational changes constitute a conserved mechanism for self-regulation that dramatically impacts kinases' affinities for inhibitors. Few studies have attempted to extensively sample the pathways and elucidate the mechanisms that underlie kinase inactivation. Seeking to bridge this knowledge gap, we present a thorough analysis of the "DFG flip" inactivation pathway in Abl1 kinase. By leveraging the power of the Weighted Ensemble methodology, which accelerates sampling without the use of biasing forces, we have comprehensively simulated DFG flip events in Abl1 and its inhibitor-resistant variants, revealing a rugged landscape punctuated by potentially druggable intermediate states. Through our strategy, we successfully simulated dozens of uncorrelated DFG flip events distributed along two principal pathways, identified the molecular mechanisms that govern them, and measured their relative probabilities. Further, we show that the compound Glu255Lys/Val Thr315Ile Abl1 variants owe their inhibitor resistance phenotype to an increase in the free energy barrier associated with completing the DFG flip. This barrier stabilizes Abl1 variants in conformations that can lead to loss of binding for Type-II inhibitors such as Imatinib or Ponatinib. Finally, we contrast our Abl1 observations with the relative state distributions and propensity for undergoing a DFG flip of evolutionarily related protein tyrosine kinases with diverging Type-II inhibitor binding affinities. Altogether, we expect that our work will be of significant importance for protein tyrosine kinase inhibitor discovery.

2024-05-23T18:17:41Z 34 pages, 10 figures eLife 14:RP104519 (2025) Gabriel Monteiro da Silva Kyle Lam David C. Dalgarno Brenda M. Rubenstein 10.7554/eLife.104519.1 http://arxiv.org/abs/2502.06891v3 ScaffoldGPT: A Scaffold-based GPT Model for Drug Optimization 2025-08-10T13:08:35Z

Drug optimization has become increasingly crucial in light of fast-mutating virus strains and drug-resistant cancer cells. Nevertheless, it remains challenging because it requires retaining the beneficial properties of the original drug while simultaneously enhancing desired attributes beyond its scope. In this work, we aim to tackle this challenge by introducing ScaffoldGPT, a novel Generative Pretrained Transformer (GPT) designed for drug optimization based on molecular scaffolds. Our work comprises three key components: (1) a three-stage drug optimization approach that integrates pretraining, finetuning, and decoding optimization; (2) a novel two-phase incremental pretraining strategy for scaffold-based drug optimization; and (3) a token-level decoding optimization strategy, Top-N, that enables controlled, reward-guided generation using the pretrained or finetuned GPT. We demonstrate via a comprehensive evaluation on COVID and cancer benchmarks that ScaffoldGPT outperforms the competing baselines in drug optimization, while excelling at preserving the original functional scaffold and enhancing desired properties.

2025-02-09T10:36:33Z Xuefeng Liu Songhao Jiang Ian Foster Jinbo Xu Rick Stevens http://arxiv.org/abs/2507.08920v3 AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model 2025-08-08T17:43:12Z

We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology encompassing pretraining scaling laws, emergent capability analysis, an in-context learning mechanism, and a test-time scaling algorithm. To guarantee robust scalability, we establish a predictive scaling law and reveal the progressive emergence of structural understanding from a loss perspective, culminating in a strong 1.7-billion-parameter model. Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework, where AMix-1 recognizes deep evolutionary signals among MSAs and consistently generates structurally and functionally coherent proteins. This framework enables the successful design of a dramatically improved AmeR variant with up to a $50\times$ increase in activity over its wild type. Pushing the boundaries of protein engineering, we further empower AMix-1 with an evolutionary test-time scaling algorithm for in silico directed evolution that delivers substantial, scalable performance gains as verification budgets are intensified, laying the groundwork for next-generation lab-in-the-loop protein design.

2025-07-11T17:02:25Z Changze Lv Jiang Zhou Siyu Long Lihao Wang Jiangtao Feng Dongyu Xue Yu Pei Hao Wang Zherui Zhang Yuchen Cai Zhiqiang Gao Ziyuan Ma Jiakai Hu Chaochen Gao Jingjing Gong Yuxuan Song Shuyi Zhang Xiaoqing Zheng Deyi Xiong Lei Bai Wanli Ouyang Ya-Qin Zhang Wei-Ying Ma Bowen Zhou Hao Zhou http://arxiv.org/abs/2508.06364v1 ActivityDiff: A diffusion model with Positive and Negative Activity Guidance for De Novo Drug Design 2025-08-08T14:48:47Z

Achieving precise control over a molecule's biological activity, encompassing targeted activation/inhibition, cooperative multi-target modulation, and off-target toxicity mitigation, remains a critical challenge in de novo drug design. However, existing generative methods primarily focus on producing molecules with a single desired activity, lacking integrated mechanisms for the simultaneous management of multiple intended and unintended molecular interactions. Here, we propose ActivityDiff, a generative approach based on the classifier-guidance technique of diffusion models. It leverages separately trained drug-target classifiers for both positive and negative guidance, enabling the model to enhance desired activities while minimizing harmful off-target effects. Experimental results show that ActivityDiff effectively handles essential drug design tasks, including single-/dual-target generation, fragment-constrained dual-target design, selective generation to enhance target specificity, and reduction of off-target effects. These results demonstrate the effectiveness of classifier-guided diffusion in balancing efficacy and safety in molecular design. Overall, our work introduces a novel paradigm for achieving integrated control over molecular activity, and provides ActivityDiff as a versatile and extensible framework.
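
Classifier guidance in general steers a diffusion model by adding the gradient of a classifier's log-probability to the model's score. One plausible way to combine positive and negative guidance is to add the gradient for the desired target and subtract the gradient for the off-target, as sketched below. This is an assumption for illustration, not ActivityDiff's published formulation; the toy logistic classifiers, the weights `w_pos` and `w_neg`, and all function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad_log_prob(weights, bias, x):
    """Gradient of log sigmoid(w . x + b) with respect to x.

    For a logistic classifier this is w * (1 - p), where p is the
    predicted probability of the positive class.
    """
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return [w * (1.0 - p) for w in weights]


def guided_score(base_score, x, positive, negative, w_pos=1.0, w_neg=1.0):
    """Base score plus positive guidance minus negative guidance.

    positive and negative are (weights, bias) tuples for toy classifiers of
    the desired target and the off-target, respectively.
    """
    g_pos = grad_log_prob(positive[0], positive[1], x)
    g_neg = grad_log_prob(negative[0], negative[1], x)
    return [s + w_pos * gp - w_neg * gn for s, gp, gn in zip(base_score, g_pos, g_neg)]


# Toy 2D latent: the target classifier looks along axis 0, the off-target along axis 1.
x = [0.0, 0.0]
score = guided_score([0.0, 0.0], x, positive=([1.0, 0.0], 0.0), negative=([0.0, 1.0], 0.0))
```

In this toy setting the guided score pushes the sample toward higher predicted on-target activity along the first axis and away from predicted off-target activity along the second.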

2025-08-08T14:48:47Z Renyi Zhou Huimin Zhu Jing Tang Min Li http://arxiv.org/abs/2508.16587v1 HemePLM-Diffuse: A Scalable Generative Framework for Protein-Ligand Dynamics in Large Biomolecular System 2025-08-07T17:29:52Z

Understanding the long-timescale dynamics of protein-ligand complexes is important for drug discovery and structural biology, but it remains computationally challenging for large biomolecular systems. We introduce HemePLM-Diffuse, a generative transformer model designed to accurately simulate protein-ligand trajectories, inpaint missing ligand fragments, and sample transition paths in systems with more than 10,000 atoms. HemePLM-Diffuse features an SE(3)-invariant tokenization approach for proteins and ligands and uses time-aware cross-attentional diffusion to effectively capture atomic motion. We demonstrate its capabilities using the 3CQV HEME system, showing enhanced accuracy and scalability compared to leading models such as TorchMD-Net, MDGEN, and Uni-Mol.

2025-08-07T17:29:52Z 7 pages, 9 figures and 1 table Rakesh Thakur Riya Gupta http://arxiv.org/abs/2508.14050v1 The Importance of Aqueous Metabolites in the Martian Subsurface for Understanding Habitability, Organic Chemical Evolution, and Potential Biology 2025-08-06T13:06:04Z

Aqueous metabolites in terrestrial subsurface environments provide critical analog frameworks for assessing the habitability of Martian subsurface ice. On Earth, they play critical roles in sustaining microbial life within soils, permafrost, and groundwater environments, and their availability shapes microbial community composition, activity, and adaptability to changes in environmental conditions, enabling communities to persist over millennial timescales. The counterpart to aqueous-soluble organics is the insoluble organic matter pool, which makes up the largest portion of organic matter in natural samples and includes most types of organic signatures indicative of biological processes. Employing a range of sample preparation, molecular separation, detection, and imaging techniques enables the characterization of both labile (i.e., soluble and reactive) and recalcitrant (i.e., insoluble and non-reactive, including macromolecules) organic pools. Multiple orthogonal analytical modalities strengthen interpretations of signatures that we associate with biology as we know it and as we don't know it, by constraining possible abiotic sources, validating measurements across distinct techniques, and ensuring flexibility to interrogate diverse organic chemistries encountered in Martian subsurface environments. This holistic triage approach aligns with the priorities articulated in the Mars Exploration Program Analysis Group's Search for Life - Science Analysis Group (SFL-SAG) Charter for a medium-class Mars mission focused on extant life detection.

2025-08-06T13:06:04Z a white paper for MEPAG SFL-SAG Jennifer L. Eigenbrode Luoth Chou