http://arxiv.org/abs/2601.16151v2 In vitro binding energies capture Klf4 occupancy across the human genome 2026-03-04T09:32:15Z

Transcription factors (TFs) regulate gene expression by binding to specific genomic loci determined by DNA sequence. Their sequence specificity is commonly summarized by a consensus binding motif. However, eukaryotic genomes contain billions of low-affinity DNA sequences to which TFs associate with a sequence-dependent binding energy. We currently lack insight into how the genomic sequence defines this spectrum of binding energies and the resulting pattern of TF localization. Here, we set out to obtain a quantitative understanding of sequence-dependent TF binding to both motif and non-motif sequences. We achieve this by first pursuing accurate measurements of physical binding energies of the human TF Klf4 to a library of short DNA sequences in a fluorescence-anisotropy-based bulk competitive binding assay. Second, we show that the highly non-linear sequence dependence of Klf4 binding energies can be captured by combining a linear model of binding energies with an Ising model of the coupled recognition of nucleotides by a TF. We find that this statistical mechanics model parametrized by our in vitro measurements captures Klf4 binding patterns on individual long DNA molecules stretched in the optical tweezer, and is predictive for Klf4 occupancy across the entire human genome without additional fit parameters.
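
The relation between a sequence-dependent binding energy and site occupancy can be illustrated with a minimal sketch. It covers only the linear (independent-nucleotide) part of the energy model and a two-state equilibrium occupancy; the Ising coupling between recognized nucleotides described in the abstract is not included. The energy matrix, the 3-bp site width, and the chemical potential are hypothetical values chosen for illustration, not the paper's measurements.

```python
import math

KT = 1.0  # assumption: all energies are expressed in units of k_B*T

# Hypothetical per-position energy penalties (k_B*T) for a 3-bp site.
ENERGY_MATRIX = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 1.5},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 2.5},
]


def linear_binding_energy(site):
    """Sum independent per-nucleotide contributions (linear model only)."""
    return sum(ENERGY_MATRIX[i][base] for i, base in enumerate(site))


def occupancy(energy, chemical_potential):
    """Two-state equilibrium occupancy of a single, isolated site."""
    return 1.0 / (1.0 + math.exp((energy - chemical_potential) / KT))


def occupancy_profile(sequence, chemical_potential):
    """Occupancy of every window along a sequence, ignoring site overlap."""
    width = len(ENERGY_MATRIX)
    return [
        occupancy(linear_binding_energy(sequence[i:i + width]), chemical_potential)
        for i in range(len(sequence) - width + 1)
    ]
```

Calling `occupancy_profile("ACGTACG", 0.0)` returns one value per 3-bp window. A fuller treatment would also account for exclusion between overlapping sites.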

2026-01-22T17:49:53Z A.S., J.N., and Y.S. contributed equally to this work. Update 2025/03: correction of a few typos Anne Schwager Jonas Neipel Yahor Savich Douglas Diehl Frank Jülicher Anthony A. Hyman Stephan W. Grill http://arxiv.org/abs/2510.02578v4 FLOWR.root: A flow matching based foundation model for joint multi-purpose structure-aware 3D ligand generation and affinity prediction 2026-03-03T21:29:52Z

We present FLOWR.root, an SE(3)-equivariant flow-matching model for pocket-aware 3D ligand generation with joint potency and binding affinity prediction and confidence estimation. The model supports de novo generation, interaction- and pharmacophore-conditional sampling, fragment elaboration and replacement, and multi-endpoint affinity prediction (pIC50, pKi, pKd, pEC50). Training combines large-scale ligand libraries with mixed-fidelity protein-ligand complexes, refined on curated co-crystal datasets and adapted to project-specific data through parameter-efficient finetuning. The base FLOWR.root model achieves state-of-the-art performance in unconditional 3D molecule and pocket-conditional ligand generation. On HiQBind, the pre-trained and finetuned model demonstrates highly accurate affinity predictions, and outperforms recent state-of-the-art methods such as Boltz-2 on the FEP+/OpenFE benchmark with substantial speed advantages. However, we show that addressing unseen structure-activity landscapes requires domain adaptation; parameter-efficient LoRA finetuning yields marked improvements on diverse proprietary datasets and PDE10A. Joint generation and affinity prediction enable inference-time scaling through importance sampling, steering design toward higher-affinity compounds. Case studies validate this: selective CK2$\alpha$ ligand generation against CLK3 shows significant correlation between predicted and quantum-mechanical binding energies. Scaffold elaboration on ER$\alpha$, TYK2, and BACE1 demonstrates strong agreement between predicted affinities and QM calculations while confirming geometric fidelity. By integrating structure-aware generation, affinity estimation, property-guided sampling, and efficient domain adaptation, FLOWR.root provides a comprehensive foundation for structure-based drug design from hit identification through lead optimization.
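
The idea of steering generation with importance sampling can be sketched generically. The abstract does not state how the weights are formed, so the exponential tilt by predicted affinity, the `temperature` parameter, and the function name `importance_resample` are all assumptions made for illustration; this is not FLOWR.root's API.

```python
import math
import random


def importance_resample(candidates, predicted_affinity, temperature=1.0, k=1, rng=None):
    """Resample candidates with probability proportional to exp(affinity / temperature).

    `predicted_affinity` is any callable returning a score where higher is
    better (for example a predicted pKd). The exponential weighting is an
    assumption of this sketch.
    """
    rng = rng or random.Random(0)
    scores = [predicted_affinity(c) / temperature for c in candidates]
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    picks = rng.choices(candidates, weights=probs, k=k)
    return picks, probs
```

Lower `temperature` values concentrate the resampling on the highest-scoring candidates, while large values approach uniform sampling.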

2025-10-02T21:38:26Z Julian Cremer Tuan Le Mohammad M. Ghahremanpour Emilia Sługocka Filipe Menezes Djork-Arné Clevert http://arxiv.org/abs/2508.07326v2 Nonparametric Reaction Coordinate Optimization with Histories: A Framework for Rare Event Dynamics 2026-03-03T12:43:35Z

Rare but critical events in complex systems, such as protein folding, chemical reactions, disease progression, and extreme weather or climate phenomena, are governed by complex, high-dimensional, stochastic dynamics. Identifying an optimal reaction coordinate (RC) that accurately captures the progress of these dynamics is crucial for understanding and simulating such processes. However, determining an optimal RC for realistic systems is notoriously difficult, due to methodological challenges that limit the success of standard machine learning techniques. These challenges include the absence of ground truth, the lack of a loss function for general nonequilibrium dynamics, the difficulty of selecting expressive neural network architectures that avoid overfitting, the irregular and incomplete nature of many real world trajectories, limited sampling and the extreme data imbalance inherent in rare event problems. Here, we introduce a nonparametric RC optimization framework that incorporates trajectory histories and circumvents these challenges, enabling robust analysis of irregular or incomplete data without requiring extensive sampling. The power of the method is demonstrated through increasingly challenging analyses of protein folding dynamics, where it yields accurate committor estimates that pass stringent validation tests and produce high resolution free energy profiles. Its generality is further illustrated through applications to phase space dynamics, a conceptual ocean circulation model, and a longitudinal clinical dataset. These results demonstrate that rare event dynamics can be accurately characterized without extensive sampling of the configuration space, establishing a general, flexible, and robust framework for analyzing complex dynamical systems and longitudinal datasets.

2025-08-10T12:54:41Z expanded the discussion of conceptual and methodological challenges in the Introduction; no changes to results Polina V. Banushkina Sergei V. Krivov http://arxiv.org/abs/2510.08946v2 Physically Valid Biomolecular Interaction Modeling with Gauss-Seidel Projection 2026-03-03T08:43:52Z

Biomolecular interaction modeling has been substantially advanced by foundation models, yet they often produce all-atom structures that violate basic steric feasibility. We address this limitation by enforcing physical validity as a strict constraint during both training and inference with a unified module. At its core is a differentiable projection that maps the provisional atom coordinates from the diffusion model to the nearest physically valid configuration. This projection is achieved using a Gauss-Seidel scheme, which exploits the locality and sparsity of the constraints to ensure stable and fast convergence at scale. By implicit differentiation to obtain gradients, our module integrates seamlessly into existing frameworks for end-to-end finetuning. With our Gauss-Seidel projection module in place, two denoising steps are sufficient to produce biomolecular complexes that are both physically valid and structurally accurate. Across six benchmarks, our 2-step model achieves the same structural accuracy as state-of-the-art 200-step diffusion baselines, delivering approximately 10 times faster wall-clock speed while guaranteeing physical validity. The code is available at https://github.com/chensiyuan030105/ProteinGS.git.
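
A Gauss-Seidel-style projection onto steric constraints can be sketched as follows. This is a simplified illustration, not the authors' module: it enforces only a single minimum pairwise distance, loops over all pairs rather than exploiting sparsity, and omits the implicit differentiation mentioned in the abstract. The function name and parameters are hypothetical.

```python
import math


def gauss_seidel_projection(points, min_dist, max_sweeps=100, tol=1e-9):
    """Push points apart until every pair is at least `min_dist` apart.

    Constraints are visited one at a time and each correction is applied
    immediately, so later constraints in the same sweep see the updated
    coordinates (the Gauss-Seidel ordering).
    """
    pts = [list(p) for p in points]
    n = len(pts)
    for _ in range(max_sweeps):
        worst = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                delta = [a - b for a, b in zip(pts[i], pts[j])]
                dist = math.sqrt(sum(d * d for d in delta))
                violation = min_dist - dist
                if violation <= 0.0:
                    continue  # constraint already satisfied
                worst = max(worst, violation)
                if dist < 1e-12:
                    # Coincident points: pick an arbitrary separation axis.
                    direction = [1.0] + [0.0] * (len(delta) - 1)
                else:
                    direction = [d / dist for d in delta]
                shift = 0.5 * violation  # split the correction equally
                for k, u in enumerate(direction):
                    pts[i][k] += shift * u
                    pts[j][k] -= shift * u
        if worst <= tol:
            break
    return pts
```

Because each pair is corrected along the line joining the two points, every individual update is the smallest displacement that resolves that one constraint. Repeated sweeps are needed when fixing one pair creates a violation elsewhere.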

2025-10-10T02:52:15Z Siyuan Chen Minghao Guo Caoliwen Wang Anka He Chen Yikun Zhang Jingjing Chai Yin Yang Wojciech Matusik Peter Yichen Chen http://arxiv.org/abs/2503.20513v2 A Principal Submanifold-based Approach for Clustering and Multiscale RNA Correction 2026-03-03T06:17:21Z

RNA structure determination is essential for understanding its biological functions. However, the reconstruction process often faces challenges, such as atomic clashes, which can lead to inaccurate models. To address these challenges, we introduce the principal submanifold (PSM) approach for analyzing RNA data on a torus. This method provides an accurate, low-dimensional feature representation, overcoming the limitations of previous torus-based methods. By combining PSM with DBSCAN, we propose a novel clustering technique, the principal submanifold-based DBSCAN (PSM-DBSCAN). Our approach achieves superior clustering accuracy and increased robustness to noise. Additionally, we apply this new method for multiscale corrections, effectively resolving RNA backbone clashes at both microscopic and mesoscopic scales. Extensive simulations and comparative studies highlight the enhanced precision and scalability of our method, demonstrating significant improvements over existing approaches. The proposed methodology offers a robust foundation for correcting complex RNA structures and has broad implications for applications in structural biology and bioinformatics.

2025-03-26T12:55:23Z 30 pages, 15 figures Menghao Wu Zhigang Yao http://arxiv.org/abs/2603.02572v1 Molecular Dynamics Simulations Reveal PolyQ-Length-Dependent Conformational Changes in Huntingtin Exon-1: Implications for Environmental Co-Solvent Modulation of Aggregation-Prone States 2026-03-03T03:45:26Z

Huntington's disease (HD) is caused by CAG-repeat expansion in HTT, which lengthens the polyglutamine (polyQ) tract in huntingtin (HTT) and promotes misfolding and aggregation. While polyQ-length-dependent aggregation is well established, the atomistic conformational dynamics preceding aggregation remain less defined. Here we perform all-atom molecular dynamics simulations of HTT exon-1 constructs containing the N17 domain, polyQ tracts of clinically relevant lengths (Q21, wildtype; Q40, adult onset threshold; Q70, juvenile onset), and the polyproline (polyP) region. Multi-copy simulations (four chains) were run for 100 ns in explicit SPC/E water using the OPLS-AA force field. We quantified radius of gyration (Rg), solvent-accessible surface area (SASA), root-mean-square deviation (RMSD), and intra-protein hydrogen bonds as proxies for conformational expansion and aggregation propensity. PolyQ expansion drove progressive increases in Rg and SASA, consistent with more extended, solvent-exposed ensembles. We further tested organic co-solvents (methanol, hexane, trichloroethylene; 0.5 to 1.0 M), which modulated these landscapes in a solvent-dependent manner. Trichloroethylene induced marked expansion in Q21 and Q40, whereas methanol produced mild compaction in Q21. To our knowledge, this is the first MD study to systematically examine co-solvent effects on HTT exon-1 conformational dynamics. Although limited sampling precludes definitive mechanistic conclusions, the observed trends suggest that hydrophobic co-solvents can bias HTT exon-1 toward more expanded ensembles, motivating computational studies of gene-environment modulation in HD.
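
Of the observables listed above, the radius of gyration is the simplest to state explicitly: it is the mass-weighted root-mean-square distance of the atoms from their center of mass. The sketch below assumes coordinates are given as plain 3-tuples and uses equal masses when none are supplied; it is not tied to any particular MD package.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Mass-weighted RMS distance of atoms from their center of mass."""
    if masses is None:
        masses = [1.0] * len(coords)  # assumption: equal weights by default
    total = sum(masses)
    center = [
        sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)
    ]
    mean_sq = sum(
        m * sum((c[k] - center[k]) ** 2 for k in range(3))
        for m, c in zip(masses, coords)
    ) / total
    return math.sqrt(mean_sq)
```

An increase in this value across a trajectory or between constructs corresponds to the more extended ensembles the abstract reports for longer polyQ tracts.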

2026-03-03T03:45:26Z Jai Geddes-Nelson Xiaochen Liu Ken-Tye Yong http://arxiv.org/abs/2603.01873v1 Bi-TEAM: A Unified Cross-Scale Representation Learning Framework for Chemically Modified Biomolecules 2026-03-02T13:51:28Z

Representation learning for protein biochemical space faces a difficult trade-off: protein language models excel at capturing long-range biological semantics but often miss fine-grained chemical details. Conversely, chemical language models encode atomic information but lack broader sequence context. To address this, we introduce Bi-TEAM (Bi-gated Residual Space Modification), a general framework that injects localized chemical variation into global protein contexts. By ensuring robustness against perturbations such as non-canonical amino acids, post-translational modifications (PTMs), and topological constraints, Bi-TEAM uncovers functional chemical dependencies often missed by evolutionary baselines. Mechanistically, Bi-TEAM maps non-canonical residues to their natural counterparts and injects atomic-level data via a bi-gated residual fusion mechanism. Crucially, this process uses modification-aware prompts to ensure that local structural changes influence global functional representations without requiring alphabet expansion. We evaluated Bi-TEAM on ten datasets spanning chemically modified peptides, PTMs, and natural proteins. The model consistently outperformed state-of-the-art baselines, achieving up to a 66 percent improvement in Matthews correlation coefficient (MCC) on scaffold-similarity splits and a 350 percent increase in hemolysis prediction accuracy. Furthermore, when deployed as an oracle for generative modeling, Bi-TEAM nearly quadrupled the success rate for designing cell-penetrating cyclic peptides. By unifying biological semantics with chemical precision, Bi-TEAM provides a versatile foundation for machine learning driven exploration of peptide and protein biochemical space.
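
For reference, the Matthews correlation coefficient cited above is computed from the four entries of a binary confusion matrix. The sketch below uses the standard definition; returning 0.0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction), so a relative improvement of a given percentage depends strongly on the baseline value.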

2026-03-02T13:51:28Z 57 pages, 16 figures Chunbin Gu Zijun Gao Mutian He Jingjie Zhang Haipeng Wen Zihao Luo Xiaorui Wang Hanqun Cao Jiajun Bu Chang-Yu Hsieh Pheng Ann Heng http://arxiv.org/abs/2404.00962v2 Distributional Priors Guided Diffusion for Generating 3D Molecules in Low Data Regimes 2026-03-02T10:47:32Z

Can we train a 3D molecule generator using data from dense regions to generate samples in sparse regions? This challenge can be framed as an out-of-distribution (OOD) generation problem. While prior research on OOD generation predominantly targets property shifts, structural shifts -- such as differences in molecular scaffolds or functional groups -- represent an equally critical source of distributional shifts. This work introduces the Geometric OOD Diffusion Model (GODD), a novel diffusion-based framework that enables training on data-abundant molecular distributions while generalizing to data-scarce distributions under distributional structural shifts. Central to our approach is a designated equivariant asymmetric autoencoder to capture distributional structural priors. The asymmetric design allows the model to generalize to unseen structural variations by capturing distributional priors representing distinct distributions. The encoded structural-grained priors guide generation toward sparse regions without requiring explicit training on such data. Evaluated across standard benchmarks encompassing OOD structural shifts (e.g., scaffolds, rings), GODD achieves an improvement of 12.6% in success rate, defined based on molecular validity, uniqueness, and novelty. Furthermore, the framework demonstrates promising performance and generalization on canonical fragment-based drug design tasks, highlighting its utility in learning-based molecular discovery.

2024-04-01T07:12:27Z 24 pages. Accepted by AAAI 2026 Haokai Hong Wanyu Lin Ming Yang Kay Chen Tan http://arxiv.org/abs/2603.01537v1 Pharmacology Knowledge Graphs: Do We Need Chemical Structure for Drug Repurposing? 2026-03-02T07:07:32Z

The contributions of model complexity, data volume, and feature modalities to knowledge graph-based drug repurposing remain poorly quantified under rigorous temporal validation. We constructed a pharmacology knowledge graph from ChEMBL 36 comprising 5,348 entities including 3,127 drugs, 1,156 proteins, and 1,065 indications. A strict temporal split was enforced with training data up to 2022 and testing data from 2023 to 2025, together with biologically verified hard negatives mined from failed assays and clinical trials. We benchmarked five knowledge graph embedding models and a standard graph neural network with 3.44 million parameters that incorporates drug chemical structure using a graph attention encoder and ESM-2 protein embeddings. Scaling experiments ranging from 0.78 to 9.75 million parameters and from 25 to 100 percent of the data, together with feature ablation studies, were used to isolate the contributions of model capacity, graph density, and node feature modalities. Removing the graph attention based drug structure encoder and retaining only topological embeddings combined with ESM-2 protein features improved drug protein PR-AUC from 0.5631 to 0.5785 while reducing VRAM usage from 5.30 GB to 353 MB. Replacing the drug encoder with Morgan fingerprints further degraded performance, indicating that explicit chemical structure representations can be detrimental for predicting pharmacological network interactions. Increasing model size beyond 2.44 million parameters yielded diminishing returns, whereas increasing training data consistently improved performance. External validation confirmed 6 of the top 14 novel predictions as established therapeutic indications. These results show that drug pharmacological behavior can be accurately predicted using target-centric information and drug network topology alone, without requiring explicit chemical structure representations.

2026-03-02T07:07:32Z 34 pages, 5 figures. Under review at Discover Artificial Intelligence Youssef Abo-Dahab Ruby Hernandez Ismael Caleb Arechiga Duran http://arxiv.org/abs/2603.02283v1 Fast and Versatile RNA Design via Motif-level Divide-and-Conquer and Structure-level Rival Search 2026-03-02T03:54:12Z

RNA design aims to identify RNA sequences that fold into a target secondary structure. This task is challenging in terms of computational efficiency. Most existing methods focus on either minimum free energy (MFE)-based or ensemble-based metrics, leaving a gap for a unified approach that performs well across both. We introduce a fast and versatile RNA design algorithm inspired by our previous work on the undesignability of RNA structures and motifs (i.e., sets of contiguous structural loops). Our approach decomposes a target structure into a tree of sub-targets where each leaf node corresponds to a motif and each internal node corresponds to a substructure. We first design partial sequences for each motif; these partial sequences are then selectively and recursively combined via the cube pruning strategy borrowed from computational linguistics, enabling effective optimization of ensemble-based metrics. Finally, a novel whole-structure rival search further refines sequences to suppress misfolded alternatives and enhance MFE-based performance. Our method is highly efficient and also achieves state-of-the-art results on native RNAsolo structures and the Eterna100 benchmark, excelling in both ensemble- and MFE-based metrics. Additionally, it substantially improves design on a long-structure benchmark derived from 16S rRNA, increasing average folding probability from 0.18 to 0.39 with an order-of-magnitude speedup, demonstrating its effectiveness and scalability. Availability: Source code and data are available at: https://github.com/shanry/FastDesign.
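
The combination step can be illustrated with the core of cube pruning: lazily enumerating the k best pairings of two ranked candidate lists with a heap instead of scoring every pair. The sketch below assumes additive scores, for which the enumeration is exact; in the setting described above the combined score would come from a folding-based metric, so the ordering would only be approximate. The function name and data layout are hypothetical and not taken from FastDesign.

```python
import heapq


def k_best_combinations(left, right, k):
    """Return the k best (score, sequence) pairings of two ranked lists.

    `left` and `right` are lists of (score, partial_sequence), each sorted
    by score in descending order. Scores are assumed to add.
    """
    if not left or not right or k <= 0:
        return []
    # Max-heap via negated scores, seeded with the best-best corner.
    heap = [(-(left[0][0] + right[0][0]), 0, 0)]
    seen = {(0, 0)}
    results = []
    while heap and len(results) < k:
        neg_score, i, j = heapq.heappop(heap)
        results.append((-neg_score, left[i][1] + right[j][1]))
        # Expand only the two grid neighbours of the popped cell.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-(left[ni][0] + right[nj][0]), ni, nj))
    return results
```

Only a number of pairings on the order of k is scored instead of the full product of the two lists, which is what makes recursive combination up a tree of sub-targets affordable.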

2026-03-02T03:54:12Z Tianshuo Zhou David H. Mathews Liang Huang http://arxiv.org/abs/2506.02052v3 General Protein Pretraining or Domain-Specific Designs? Benchmarking Protein Modeling on Realistic Applications 2026-03-02T00:30:52Z

Recently, extensive deep learning architectures and pretraining strategies have been explored to support downstream protein applications. Additionally, domain-specific models incorporating biological knowledge have been developed to enhance performance in specialized tasks. In this work, we introduce $\textbf{Protap}$, a comprehensive benchmark that systematically compares backbone architectures, pretraining strategies, and domain-specific models across diverse and realistic downstream protein applications. Specifically, Protap covers five applications: three general tasks and two novel specialized tasks, i.e., enzyme-catalyzed protein cleavage site prediction and targeted protein degradation, which are industrially relevant yet missing from existing benchmarks. For each application, Protap compares various domain-specific models and general architectures under multiple pretraining settings. Our empirical studies imply that: (i) Though large-scale pretraining encoders achieve great results, they often underperform supervised encoders trained on small downstream training sets. (ii) Incorporating structural information during downstream fine-tuning can match or even outperform protein language models pretrained on large-scale sequence corpora. (iii) Domain-specific biological priors can enhance performance on specialized downstream tasks. Code and datasets are publicly available at https://github.com/Trust-App-AI-Lab/protap.

2025-06-01T08:48:42Z Shuo Yan Yuliang Yan Bin Ma Chenao Li Haochun Tang Jiahua Lu Minhua Lin Yuyuan Feng Enyan Dai http://arxiv.org/abs/2411.02109v3 One protein is all you need 2026-03-01T23:55:04Z

Generalization beyond training data remains a central challenge in machine learning for biology. A common way to enhance generalization is self-supervised pre-training on large datasets. However, aiming to perform well on all possible proteins can limit a model's capacity to excel on any specific one, whereas experimentalists typically need accurate predictions for individual proteins they study, often not covered in training data. To address this limitation, we propose a method that enables self-supervised customization of protein language models to one target protein at a time, on the fly, and without assuming any additional data. We show that our Protein Test-Time Training (ProteinTTT) method consistently enhances generalization across different models, their sizes, and datasets. ProteinTTT improves structure prediction for challenging targets, achieves new state-of-the-art results on protein fitness prediction, and enhances function prediction on two tasks. Through two challenging case studies, we also show that customization via ProteinTTT achieves more accurate antibody-antigen loop modeling and enhances 19% of structures in the Big Fantastic Virus Database, delivering improved predictions where general-purpose AlphaFold2 and ESMFold struggle.

2024-11-04T14:23:59Z Anton Bushuiev Roman Bushuiev Olga Pimenova Nikola Zadorozhny Raman Samusevich Elisabet Manaskova Rachel Seongeun Kim Hannes Stärk Jiri Sedlar Martin Steinegger Tomáš Pluskal Josef Sivic http://arxiv.org/abs/2508.10760v2 FROGENT: An End-to-End Full-process Drug Design Multi-Agent System 2026-03-01T20:34:53Z

Drug discovery is a complex, multi-step pipeline that remains heavily dependent on manual, experience-driven operations; meanwhile, existing customized artificial intelligence tools are fragmented across web applications, desktop software, and code libraries, resulting in incompatible interfaces and inefficient, burdensome workflows. To overcome these challenges, we propose FROGENT, a full-process drug design multi-agent system that leverages the planning, reasoning, and tool-use capabilities of large language models (LLMs) to unify drug discovery within a closed-loop and autonomous framework. FROGENT is a collaborative multi-agent system comprising a central Orchestrate Agent for strategic workflow coordination and three distributed agents, Retrieve, Forge, and Gauge, that employ dynamic biochemical databases, extensible tool libraries, and task-specific computational models via the Model Context Protocol. This architecture enables end-to-end execution of complex drug discovery pipelines, covering target identification, small-molecule generation, peptide optimization, and retrosynthetic planning. Across eight benchmarks spanning core drug discovery tasks, FROGENT consistently outperforms six increasingly advanced ReAct-style agents. Case studies further demonstrate its practicality and generalization across real-world small-molecule and peptide design scenarios. Overall, FROGENT not only achieves substantial gains in efficiency and accuracy, but also demonstrates the potential of LLM-based agentic systems to autonomously orchestrate drug development pipelines, reducing, or even replacing, reliance on manual, experience-driven human intervention.

2025-08-14T15:45:53Z 37 pages, 20 figures Qihua Pan Dong Xu Qianwei Yang Jenna Xinyi Yao Sisi Yuan Zexuan Zhu Jianqiang Li Junkai Ji http://arxiv.org/abs/2603.00614v1 Designing the Haystack: Programmable Chemical Space for Generative Molecular Discovery 2026-02-28T12:16:01Z

Chemical space exploration underlies drug discovery, yet most generative models treat chemical space as a fixed, implicitly learned distribution, focusing on sampling molecules rather than deliberately designing the space itself. We introduce SpaceGFN, a generative framework that elevates chemical space to a programmable computational object: a controllable degree of freedom enabling explicit construction and adaptive traversal of structured molecular universes. SpaceGFN decouples space definition from exploration. Users specify building blocks and reaction rules to construct chemically and synthetically coherent spaces, while a GFlowNet performs efficient, property-biased sampling within them. In Discovery mode, we demonstrate programmable space design through two strategies. A pseudo-natural product space assembles natural product-like architectures. An evolution-inspired (Evo) space recombines endogenous metabolite fragments via enzyme-consistent transformations, introducing an evolutionary prior into chemical generation. This bias yields favorable shifts in predicted metabolic and toxicological profiles while preserving pharmacological diversity, supported by broad docking enrichment across therapeutic targets. In Editing mode, SpaceGFN enables reaction-consistent lead optimization through a curated toolkit of executable synthetic transformations, allowing local, synthesis-aware modification of existing compounds instead of unrestricted graph mutation. Across 96 drug targets, SpaceGFN achieves strong optimization performance while maintaining structural diversity under synthetic constraints. By integrating programmable chemical universe construction with flow-based exploration and reaction-level editing, SpaceGFN establishes a general paradigm for deliberate navigation of therapeutic chemical space.

2026-02-28T12:16:01Z Yuchen Zhu Donghai Zhao Yangyang Zhang Yitong Li Xiaorui Wang Shuwang Li Yue Kong Beichen Zhang Ricki Chen Chang Liu Xingcai Zhang Tingjun Hou Chang-Yu Hsieh http://arxiv.org/abs/2511.13550v2 MDIntrinsicDimension: Dimensionality-Based Analysis of Collective Motions in Macromolecules from Molecular Dynamics Trajectories 2026-02-27T16:36:21Z

Molecular dynamics (MD) simulations provide atomistic insights into the structure, dynamics, and function of biomolecules by generating time-resolved, high-dimensional trajectories. Analyzing such data benefits from estimating the minimal number of variables required to describe the explored conformational manifold, known as the intrinsic dimension (ID). We present MDIntrinsicDimension, an open-source Python package that estimates ID directly from MD trajectories by combining rotation- and translation-invariant molecular projections (e.g., backbone dihedrals and inter-residue distances) with state-of-the-art estimators. The package provides three complementary analysis modes: whole-molecule ID; sliding windows along the sequence; and per-secondary-structure elements. It computes both overall ID (a single summary value) and instantaneous, time-resolved ID that can reveal transitions and heterogeneity over time. We illustrate the approach on fast folding-unfolding trajectories from the DESRES dataset, demonstrating that ID complements conventional geometric descriptors by highlighting spatially localized flexibility and differences across structural segments.
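
To make the notion of intrinsic dimension concrete, here is a minimal sketch of one nearest-neighbour estimator, the maximum-likelihood form of TwoNN. The abstract does not name the estimators the package uses, so this choice is an assumption, and the code does not reflect the MDIntrinsicDimension API. It takes each frame as a feature vector (for example dihedral-derived coordinates) and uses a brute-force O(N^2) neighbour search.

```python
import math


def two_nn_intrinsic_dimension(points):
    """Estimate intrinsic dimension from first and second neighbour distances.

    For each point, mu = r2 / r1 is the ratio of the second to the first
    nearest-neighbour distance. Under the TwoNN model mu is Pareto
    distributed with exponent d, giving the estimate d = N / sum(log(mu)).
    """
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0.0:  # skip exact duplicates, which give an undefined ratio
            log_ratios.append(math.log(r2 / r1))
    total = sum(log_ratios)
    if total <= 0.0:
        raise ValueError("not enough distinct neighbours to estimate dimension")
    return len(log_ratios) / total
```

Points sampled from a flat two-dimensional sheet embedded in three dimensions should give an estimate close to 2, regardless of the embedding dimension.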

2025-11-17T16:21:05Z Published version Irene Cazzaniga Toni Giorgino 10.1021/acs.jcim.5c02716