https://arxiv.org/api/G2/BVs3aqYYdu2/dp2tBGqDuy0k 2026-03-23T04:53:51Z 6644 345 15 http://arxiv.org/abs/2504.18554v2 XDIP: A Curated X-ray Absorption Spectrum Dataset for Iron-Containing Proteins 2025-09-23T17:23:17Z

Earth-abundant iron is an essential metal in regulating the structure and function of proteins. This study presents the development of a comprehensive X-ray Absorption Spectroscopy (XAS) database focused on iron-containing proteins, addressing a critical gap in available high-quality annotated spectral data. The database integrates detailed XAS spectra with their corresponding local structural data of proteins and enables direct comparison between spectral features and structural motifs. Combining manual curation with semi-automated data extraction, we built the dataset through an extensive literature review; it contains 437 protein structures and 1954 XAS spectra. Our methods included careful documentation and validation processes to ensure accuracy and reproducibility. This dataset not only centralizes information on iron-containing proteins but also supports advanced data-driven discoveries, such as machine learning, to predict and analyze protein structure and functions. This work underscores the potential of integrating detailed spectroscopic data with structural biology to advance the field of biological chemistry and catalysis.

2025-04-14T17:49:28Z Yufeng Wang Peiyao Wang Lu Wei Emerita Mendoza Rengifo Dali Yang Lu Ma Yuewei Lin Qun Liu Haibin Ling http://arxiv.org/abs/2509.11044v2 FragmentGPT: A Unified GPT Model for Fragment Growing, Linking, and Merging in Molecular Design 2025-09-23T16:41:27Z

Fragment-Based Drug Discovery (FBDD) is a popular approach in early drug development, but designing effective linkers to combine disconnected molecular fragments into chemically and pharmacologically viable candidates remains challenging. Further complexity arises when fragments contain structural redundancies, like duplicate rings, which cannot be addressed by simply adding or removing atoms or bonds. To address these challenges in a unified framework, we introduce FragmentGPT, which integrates two core components: (1) a novel chemically-aware, energy-based bond cleavage pre-training strategy that equips the GPT-based model with fragment growing, linking, and merging capabilities, and (2) a novel Reward Ranked Alignment with Expert Exploration (RAE) algorithm that combines expert imitation learning for diversity enhancement, data selection and augmentation for Pareto and composite score optimality, and Supervised Fine-Tuning (SFT) to align the learner policy with multi-objective goals. Conditioned on fragment pairs, FragmentGPT generates linkers that connect diverse molecular subunits while simultaneously optimizing for multiple pharmaceutical goals. It also learns to resolve structural redundancies, such as duplicated fragments, through intelligent merging, enabling the synthesis of optimized molecules. FragmentGPT facilitates controlled, goal-driven molecular assembly. Experiments and ablation studies on real-world cancer datasets demonstrate its ability to generate chemically valid, high-quality molecules tailored for downstream drug discovery tasks.
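
The abstract mentions data selection for "Pareto and composite score optimality" without giving details. As a rough illustration of what such a selection step could look like, the sketch below keeps only non-dominated candidates and then ranks them by a weighted sum. The candidate names, the three objectives, their values, and the weights are all hypothetical; this is not the RAE algorithm itself.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Assumes every objective is scaled so that higher is better.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names whose score vectors no other entry dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


def composite(s: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted-sum composite score (the weights are an illustrative assumption)."""
    return sum(w * x for w, x in zip(weights, s))


# Hypothetical candidates scored on three made-up objectives, higher is better.
candidates = {
    "mol_a": (0.9, 0.4, 0.7),
    "mol_b": (0.6, 0.8, 0.6),
    "mol_c": (0.5, 0.3, 0.5),  # dominated by mol_a
    "mol_d": (0.9, 0.4, 0.6),  # dominated by mol_a
}
weights = (0.5, 0.2, 0.3)

front = pareto_front(candidates)
ranked = sorted(front, key=lambda n: composite(candidates[n], weights), reverse=True)
print(front, ranked)
```

Filtering by dominance first keeps trade-off solutions that a single weighted sum might discard; the composite score is then only used to order the survivors.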

2025-09-14T02:17:07Z Xuefeng Liu Songhao Jiang Qinan Huang Tinson Xu Ian Foster Mengdi Wang Hening Lin Rick Stevens http://arxiv.org/abs/2412.19191v2 Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models 2025-09-23T12:55:03Z

Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. Biology-Instructions is publicly available at: https://github.com/hhnqqq/Biology-Instructions.

2024-12-26T12:12:23Z EMNLP 2025 findings Haonan He Yuchen Ren Yining Tang Ziyang Xu Junxian Li Minghao Yang Di Zhang Dong Yuan Tao Chen Shufei Zhang Yuqiang Li Nanqing Dong Wanli Ouyang Dongzhan Zhou Peng Ye http://arxiv.org/abs/2509.18882v1 A Novel Mathematical Model of Protein Interactions from the Perspective of Electron Delocalization 2025-09-23T10:23:04Z

Proteins are the workhorse molecules of the cell and perform their biological functions by binding to other molecules through physical contact. Protein function is then regulated through coupling of bindings on the protein (allosteric regulation). Just as the genetic code provides the blueprint for protein synthesis, the coupling is thought to provide the basis for protein communication and interaction. However, it is not yet fully understood how binding of a molecule at one site affects binding of another molecule at a distal site on a protein, even more than $60$ years after its discovery in 1961. In this paper, I propose a simple mathematical model of protein interactions, using a ``quantized'' version of differential geometry, i.e., the discrete differential geometry of $n$-simplices. The model is based on the concept of electron delocalization, one of the main features of quantum chemistry. Allosteric regulation then follows tautologically from the definition of interactions. No prior knowledge of conventional discrete differential geometry, protein science, or quantum chemistry is required. I hope this paper will provide a starting point for many mathematicians to study chemistry and molecular biology.

2025-09-23T10:23:04Z 36 pages, 12 figures Naoto Morikawa http://arxiv.org/abs/2509.17448v1 Monitoring Nitric Oxide in Trigeminal Neuralgia Rats with a Cerium Single-Atom Nanozyme Electrochemical Biosensor 2025-09-22T07:40:03Z

Trigeminal neuralgia (TN) is the most common neuropathic disorder; however, its pathogenesis remains unclear. A prevailing theory suggests that nitric oxide (NO) may induce nerve compression and irritation via vascular dilation and thereby cause the condition, which makes real-time detection of generated NO critical. However, traditional evaluations of NO rely on indirect colorimetric or chemiluminescence techniques, which offer limited sensitivity and spatial resolution for its real-time assessment in biological environments. Herein, we report the development of a highly sensitive NO electrochemical biosensor based on a cerium single-atom nanozyme (Ce1-CN) with an ultrawide linear range from 1.08 nM to 143.9 μM and an ultralow detection limit of 0.36 nM, which enables efficient, real-time evaluation of NO in TN rats. In-situ attenuated total reflection surface-enhanced infrared spectroscopy combined with density functional theory calculations revealed the high-performance biosensing mechanism, whereby the Ce centers in Ce1-CN nanozymes adsorb NO and subsequently react with OH- to form *HNO2. Results demonstrated that NO concentration was associated with TN onset. Following carbamazepine treatment, NO production from nerves decreased, accompanied by an alleviation of pain. These findings indicate that the biosensor serves as a valuable tool for investigating the pathogenesis of TN and guiding subsequent therapeutic strategies.

2025-09-22T07:40:03Z Kangling Tian Fuhua Li Ran Chen Shihong Chen Wenbin Wei Yihang Shen Muzi Xu Chunxian Guo Luigi G. Occhipinti Hong Bin Yang Fangxin Hu http://arxiv.org/abs/2509.17224v1 AI-based Methods for Simulating, Sampling, and Predicting Protein Ensembles 2025-09-21T20:14:45Z

Advances in deep learning have opened an era of abundant and accurate predicted protein structures; however, similar progress in protein ensembles has remained elusive. This review highlights several recent research directions towards AI-based predictions of protein ensembles, including coarse-grained force fields, generative models, multiple sequence alignment perturbation methods, and modeling of ensemble descriptors. An emphasis is placed on realistic assessments of the technological maturity of current methods, the strengths and weaknesses of broad families of techniques, and promising machine learning frameworks at an early stage of development. We advocate for "closing the loop" between model training, simulation, and inference to overcome challenges in training data availability and to enable the next generation of models.

2025-09-21T20:14:45Z Bowen Jing Bonnie Berger Tommi Jaakkola http://arxiv.org/abs/2509.16877v1 A review of topological data analysis and topological deep learning in molecular sciences 2025-09-21T02:05:06Z

Topological Data Analysis (TDA) has emerged as a powerful framework for extracting robust, multiscale, and interpretable features from complex molecular data for artificial intelligence (AI) modeling and topological deep learning (TDL). This review provides a comprehensive overview of the development, methodologies, and applications of TDA in molecular sciences. We trace the evolution of TDA from early qualitative tools to advanced quantitative and predictive models, highlighting innovations such as persistent homology, persistent Laplacians, and topological machine learning. The paper explores TDA's transformative impact across diverse domains, including biomolecular stability, protein-ligand interactions, drug discovery, materials science, and viral evolution. Special attention is given to recent advances in integrating TDA with machine learning and AI, enabling breakthroughs in protein engineering, solubility and toxicity prediction, and the discovery of novel materials and therapeutics. We also discuss the limitations of current TDA approaches and outline future directions, including the integration of TDA with advanced AI models and the development of new topological invariants. This review aims to serve as a foundational reference for researchers seeking to harness the power of topology in molecular science.
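
To make the persistent homology mentioned above concrete, here is a minimal sketch that computes only the 0-dimensional part (connected components) for a small point cloud, using a union-find pass over edges sorted by length. It assumes a Vietoris-Rips-style filtration in which an edge appears at a scale equal to the pairwise distance (some conventions use half that value), and the coordinates are hypothetical. Real TDA work would normally rely on a dedicated library and include higher-dimensional features.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return the death scales of connected components (all are born at scale 0).

    Components merge in order of increasing edge length, as in Kruskal's
    algorithm; each merge ends one bar. One component never dies, so a cloud
    of n points yields n - 1 finite death values.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge joins two separate components
            parent[ri] = rj
            deaths.append(d)
    return deaths


# Hypothetical coordinates: two points close together and one farther away.
cloud = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(h0_persistence(cloud))
```

Long-lived bars (large death values) indicate well-separated clusters, while short bars are usually treated as noise.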

2025-09-21T02:05:06Z JunJie Wee Jian Jiang http://arxiv.org/abs/2509.21346v1 Spiking Neural Networks for Mental Workload Classification with a Multimodal Approach 2025-09-17T15:26:42Z

Accurately assessing mental workload is crucial in cognitive neuroscience, human-computer interaction, and real-time monitoring, as cognitive load fluctuations affect performance and decision-making. While electroencephalography (EEG)-based machine learning (ML) models can be used to this end, their high computational cost hinders embedded real-time applications. Hardware implementations of spiking neural networks (SNNs) offer a promising alternative for low-power, fast, event-driven processing. This study compares hardware-compatible SNN models with various traditional ML models, using an open-source multimodal dataset. Our results show that multimodal integration improves accuracy, with SNN performance comparable to that of the ML models, demonstrating the potential of SNNs for real-time implementations of cognitive load detection. These findings position event-based processing as a promising solution for low-latency, energy-efficient workload monitoring in adaptive closed-loop embedded devices that dynamically regulate cognitive load.

2025-09-17T15:26:42Z 8 pages Jiahui An Sara Irina Fabrikant Giacomo Indiveri Elisa Donati http://arxiv.org/abs/2509.14029v1 Deep Learning-Driven Peptide Classification in Biological Nanopores 2025-09-17T14:30:55Z

A device capable of performing real-time classification of proteins in a clinical setting would allow for inexpensive and rapid disease diagnosis. One candidate for this technology is the nanopore device. These devices work by measuring a current signal that arises when a protein or peptide enters a nanometer-length-scale pore. Should this current be uniquely related to the structure of the peptide and its interactions with the pore, the signals can be used to perform identification. While such a method would allow for real-time identification of peptides and proteins in a clinical setting, to date, the complexity of these signals has limited the accuracy of identification. In this work, we tackle the issue of classification by converting the current signals into scaleogram images via wavelet transforms, capturing amplitude, frequency, and time information in a modality well-suited to machine learning algorithms. When tested on 42 peptides, our method achieved a classification accuracy of ~$81\,\%$, setting a new state-of-the-art in the field and taking a step toward practical peptide/protein diagnostics at the point of care. In addition, we demonstrate model transfer techniques that will be critical when deploying these models into real hardware, paving the way to a new method for real-time disease diagnosis.
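
The conversion of a current trace into a scaleogram can be illustrated with a small continuous wavelet transform. The sketch below is not the authors' pipeline: it assumes a complex Morlet wavelet with centre frequency `w0 = 6`, uses a plain time-domain convolution, and runs on a synthetic sine wave rather than nanopore data. The names `morlet`, `scale_for_frequency`, and `scalogram` are illustrative.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet wavelet, omitting the small admissibility correction."""
    return cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def scale_for_frequency(freq, w0=6.0):
    """Approximate Morlet scale whose centre frequency equals freq (cycles/sample)."""
    return w0 / (2.0 * math.pi * freq)


def scalogram(signal, scales, dt=1.0, w0=6.0):
    """Return |CWT| as a list of rows: one row per scale, one column per sample."""
    n = len(signal)
    rows = []
    for s in scales:
        half = int(math.ceil(4 * s / dt))  # truncate the wavelet at 4 scale widths
        kernel = [
            morlet(k * dt / s, w0).conjugate() * dt / math.sqrt(s)
            for k in range(-half, half + 1)
        ]
        row = []
        for i in range(n):
            acc = 0j
            for j, kv in enumerate(kernel):
                idx = i + j - half
                if 0 <= idx < n:  # samples outside the trace count as zero
                    acc += signal[idx] * kv
            row.append(abs(acc))
        rows.append(row)
    return rows


# Synthetic stand-in for a current trace: a sine at 0.1 cycles per sample.
n, freq = 200, 0.1
trace = [math.sin(2 * math.pi * freq * i) for i in range(n)]
s0 = scale_for_frequency(freq)
scales = [0.5 * s0, s0, 2.0 * s0]

image = scalogram(trace, scales)
strongest = max(range(len(scales)), key=lambda k: image[k][n // 2])
print(strongest)  # index of the scale with the largest response mid-trace
```

Each row of `image` is one horizontal band of the scaleogram; stacking many scales yields the 2D time-frequency picture that an image classifier can consume. For real traces, an FFT-based implementation from an established wavelet library would be far faster than this direct convolution.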

2025-09-17T14:30:55Z 29 pages (incl. references) 7 figures Samuel Tovey Julian Hoßbach Sandro Kuppel Tobias Ensslen Jan C. Behrends Christian Holm http://arxiv.org/abs/2509.18153v1 A deep reinforcement learning platform for antibiotic discovery 2025-09-16T18:21:42Z

Antimicrobial resistance (AMR) is projected to cause up to 10 million deaths annually by 2050, underscoring the urgent need for new antibiotics. Here we present ApexAmphion, a deep-learning framework for de novo design of antibiotics that couples a 6.4-billion-parameter protein language model with reinforcement learning. The model is first fine-tuned on curated peptide data to capture antimicrobial sequence regularities, then optimised with proximal policy optimization against a composite reward that combines predictions from a learned minimum inhibitory concentration (MIC) classifier with differentiable physicochemical objectives. In vitro evaluation of 100 designed peptides showed low MIC values (nanomolar range in some cases) for all candidates (100% hit rate). Moreover, 99 out of 100 compounds exhibited broad-spectrum antimicrobial activity against at least two clinically relevant bacteria. The lead molecules killed bacteria primarily by potently targeting the cytoplasmic membrane. By unifying generation, scoring and multi-objective optimization with deep reinforcement learning in a single pipeline, our approach rapidly produces diverse, potent candidates, offering a scalable route to peptide antibiotics and a platform for iterative steering toward potency and developability within hours.

2025-09-16T18:21:42Z 42 pages, 16 figures Hanqun Cao Marcelo D. T. Torres Jingjie Zhang Zijun Gao Fang Wu Chunbin Gu Jure Leskovec Yejin Choi Cesar de la Fuente-Nunez Guangyong Chen Pheng-Ann Heng http://arxiv.org/abs/2509.13294v1 Accelerating Protein Molecular Dynamics Simulation with DeepJump 2025-09-16T17:48:58Z

Unraveling the dynamical motions of biomolecules is essential for bridging their structure and function, yet it remains a major computational challenge. Molecular dynamics (MD) simulation provides a detailed depiction of biomolecular motion, but its high-resolution temporal evolution comes at significant computational cost, limiting its applicability to timescales of biological relevance. Deep learning approaches have emerged as promising solutions to overcome these computational limitations by learning to predict long-timescale dynamics. However, generalizable kinetics models for proteins remain largely unexplored, and the fundamental limits of achievable acceleration while preserving dynamical accuracy are poorly understood. In this work, we fill this gap with DeepJump, a Euclidean-Equivariant Flow Matching-based model for predicting protein conformational dynamics across multiple temporal scales. We train DeepJump on trajectories of the diverse proteins of mdCATH, systematically studying our model's performance in generalizing to long-term dynamics of fast-folding proteins and characterizing the trade-off between computational acceleration and prediction accuracy. We demonstrate the application of DeepJump to ab initio folding, showcasing prediction of folding pathways and native states. Our results demonstrate that DeepJump achieves significant $\approx$1000$\times$ computational acceleration while effectively recovering long-timescale dynamics, providing a stepping stone for enabling routine simulation of proteins.

2025-09-16T17:48:58Z Allan dos Santos Costa Manvitha Ponnapati Dana Rubin Tess Smidt Joseph Jacobson http://arxiv.org/abs/2509.13216v1 Flow-Based Fragment Identification via Binding Site-Specific Latent Representations 2025-09-16T16:20:45Z

Fragment-based drug design is a promising strategy leveraging the binding of small chemical moieties that can efficiently guide drug discovery. The initial step of fragment identification remains challenging, as fragments often bind weakly and non-specifically. We developed a protein-fragment encoder that relies on a contrastive learning approach to map both molecular fragments and protein surfaces in a shared latent space. The encoder captures interaction-relevant features and enables virtual screening as well as generative design with our new method LatentFrag. In LatentFrag, fragment embeddings and positions are generated conditioned on the protein surface while being chemically realistic by construction. Our expressive fragment and protein representations allow protein-fragment interaction sites to be located with high sensitivity, and we observe state-of-the-art fragment recovery rates when sampling from the learned distribution of latent fragment embeddings. Our generative method outperforms common methods such as virtual screening at a fraction of its computational cost, providing a valuable starting point for fragment hit discovery. We further show the practical utility of LatentFrag and extend the workflow to full ligand design tasks. Together, these approaches contribute to advancing fragment identification and provide valuable tools for fragment-based drug discovery.

2025-09-16T16:20:45Z Rebecca Manuela Neeser Ilia Igashov Arne Schneuing Michael Bronstein Philippe Schwaller Bruno Correia http://arxiv.org/abs/2509.12976v1 SHREC 2025: Protein surface shape retrieval including electrostatic potential 2025-09-16T11:35:31Z

This SHREC 2025 track, dedicated to protein surface shape retrieval, involved 9 participating teams. We evaluated the retrieval performance of 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor). Retrieval performance was evaluated with several metrics (Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best retrieval performance was achieved by the methods that used the electrostatic potential as a complement to molecular surface shape. This observation also held for classes with limited data, which highlights the importance of taking additional molecular surface descriptors into account.
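
The five metrics named above can be computed from predicted and true class labels as in the following sketch. It assumes macro averaging over classes and uses hypothetical labels; the abstract does not state which averaging the track used, so treat this as one common convention rather than the track's evaluation code.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, and macro precision/recall/F1 for class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def safe_div(a, b):
        return a / b if b else 0.0

    precision = {c: safe_div(tp[c], tp[c] + fp[c]) for c in labels}
    recall = {c: safe_div(tp[c], tp[c] + fn[c]) for c in labels}
    f1 = {
        c: safe_div(2 * precision[c] * recall[c], precision[c] + recall[c])
        for c in labels
    }
    true_labels = sorted(set(y_true))  # balanced accuracy averages recall over true classes
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "balanced_accuracy": safe_div(sum(recall[c] for c in true_labels), len(true_labels)),
        "macro_precision": safe_div(sum(precision.values()), len(labels)),
        "macro_recall": safe_div(sum(recall.values()), len(labels)),
        "macro_f1": safe_div(sum(f1.values()), len(labels)),
    }


# Hypothetical labels for an imbalanced two-class problem.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "B", "B", "A"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data, balanced accuracy (the mean per-class recall) can differ noticeably from plain accuracy, which is why reporting both is informative for classes with limited data.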

2025-09-16T11:35:31Z Published in Computers & Graphics, Elsevier. 59 pages, 12 figures Computers & Graphics Volume 132, November 2025, Article 104394 Taher Yacoub Camille Depenveiller Atsushi Tatsuma Tin Barisin Eugen Rusakov Udo Gobel Yuxu Peng Shiqiang Deng Yuki Kagaya Joon Hong Park Daisuke Kihara Marco Guerra Giorgio Palmieri Andrea Ranieri Ulderico Fugacci Silvia Biasotti Ruiwen He Halim Benhabiles Adnane Cabani Karim Hammoudi Haotian Li Hao Huang Chunyan Li Alireza Tehrani Fanwang Meng Farnaz Heidar-Zadeh Tuan-Anh Yang Matthieu Montes 10.1016/j.cag.2025.104394 http://arxiv.org/abs/2509.12915v1 Synthetic Protein-Ligand Complex Generation for Deep Molecular Docking 2025-09-16T10:09:09Z

The scarcity of experimental protein-ligand complexes poses a significant challenge for training robust deep learning models for molecular docking. Given the prohibitive cost and time constraints associated with experimental structure determination, scalable generation of realistic protein-ligand complexes is needed to expand available datasets for model development. In this study, we introduce a novel workflow for the procedural generation and validation of synthetic protein-ligand complexes, combining a diverse ensemble of generation techniques and rigorous quality control. We assessed the utility of these synthetic datasets by retraining established docking models, Smina and Gnina, and evaluating their performance on standard benchmarks including the PDBBind core set and the PoseBusters dataset. Our results demonstrate that models trained on synthetic data achieve performance comparable to models trained on experimental data, indicating that current synthetic complexes can effectively capture many salient features of protein-ligand interactions. However, we did not observe significant improvements in docking or scoring accuracy over conventional methods or experimental data augmentation. These findings highlight the promise as well as the current limitations of synthetic data for deep learning-based molecular docking and underscore the need for further refinement in generation methodologies and evaluation strategies to fully exploit the potential of synthetic datasets for this application.

2025-09-16T10:09:09Z Sofiene Khiari Matthew R. Masters Amr H. Mahmoud Markus A. Lill http://arxiv.org/abs/2509.12460v1 Computational design of intrinsically disordered proteins 2025-09-15T21:14:13Z

Protein design has the potential to revolutionize biotechnology and medicine. While most efforts have focused on proteins with well-defined structures, increased recognition of the functional significance of intrinsically disordered regions, together with improvements in their modeling, has paved the way to their computational de novo design. This review summarizes recent advances in engineering intrinsically disordered regions with tailored conformational ensembles, molecular recognition, and phase behavior. We discuss challenges in combining predictively accurate models with scalable design workflows and outline emerging strategies that integrate knowledge-based, physics-based, and machine-learning approaches.

2025-09-15T21:14:13Z 10 pages, 3 figures Giulio Tesei Francesco Pesce Kresten Lindorff-Larsen