http://arxiv.org/abs/2509.16877v1 A review of topological data analysis and topological deep learning in molecular sciences 2025-09-21T02:05:06Z

Topological Data Analysis (TDA) has emerged as a powerful framework for extracting robust, multiscale, and interpretable features from complex molecular data for artificial intelligence (AI) modeling and topological deep learning (TDL). This review provides a comprehensive overview of the development, methodologies, and applications of TDA in molecular sciences. We trace the evolution of TDA from early qualitative tools to advanced quantitative and predictive models, highlighting innovations such as persistent homology, persistent Laplacians, and topological machine learning. The paper explores TDA's transformative impact across diverse domains, including biomolecular stability, protein-ligand interactions, drug discovery, materials science, and viral evolution. Special attention is given to recent advances in integrating TDA with machine learning and AI, enabling breakthroughs in protein engineering, solubility and toxicity prediction, and the discovery of novel materials and therapeutics. We also discuss the limitations of current TDA approaches and outline future directions, including the integration of TDA with advanced AI models and the development of new topological invariants. This review aims to serve as a foundational reference for researchers seeking to harness the power of topology in molecular science.

2025-09-21T02:05:06Z JunJie Wee Jian Jiang http://arxiv.org/abs/2509.21346v1 Spiking Neural Networks for Mental Workload Classification with a Multimodal Approach 2025-09-17T15:26:42Z

Accurately assessing mental workload is crucial in cognitive neuroscience, human-computer interaction, and real-time monitoring, as cognitive load fluctuations affect performance and decision-making. While electroencephalography (EEG)-based machine learning (ML) models can be used to this end, their high computational cost hinders embedded real-time applications. Hardware implementations of spiking neural networks (SNNs) offer a promising alternative for low-power, fast, event-driven processing. This study compares hardware-compatible SNN models with various traditional ML models, using an open-source multimodal dataset. Our results show that multimodal integration improves accuracy, with SNN performance comparable to that of the traditional ML models, demonstrating their potential for real-time implementations of cognitive load detection. These findings position event-based processing as a promising solution for low-latency, energy-efficient workload monitoring in adaptive closed-loop embedded devices that dynamically regulate cognitive load.

2025-09-17T15:26:42Z 8 pages Jiahui An Sara Irina Fabrikant Giacomo Indiveri Elisa Donati http://arxiv.org/abs/2509.14029v1 Deep Learning-Driven Peptide Classification in Biological Nanopores 2025-09-17T14:30:55Z

A device capable of performing real-time classification of proteins in a clinical setting would allow for inexpensive and rapid disease diagnosis. Nanopore devices are one candidate for this technology. These devices work by measuring a current signal that arises when a protein or peptide enters a nanometer-length-scale pore. Should this current be uniquely related to the structure of the peptide and its interactions with the pore, the signals can be used to perform identification. While such a method would allow for real-time identification of peptides and proteins in a clinical setting, to date, the complexities of these signals limit their accuracy. In this work, we tackle the issue of classification by converting the current signals into scalogram images via wavelet transforms, capturing amplitude, frequency, and time information in a modality well-suited to machine learning algorithms. When tested on 42 peptides, our method achieved a classification accuracy of ~$81\,\%$, setting a new state-of-the-art in the field and taking a step toward practical peptide/protein diagnostics at the point of care. In addition, we demonstrate model transfer techniques that will be critical when deploying these models into real hardware, paving the way to a new method for real-time disease diagnosis.
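The conversion of a one-dimensional current trace into a time-scale image can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the wavelet family, so the Ricker wavelet, the scale list, the truncation width, and the synthetic sinusoidal "trace" below are all assumptions chosen for illustration.

```python
import math

def ricker(t):
    """Ricker ("Mexican hat") wavelet, unnormalised. Assumed wavelet choice."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)

def scalogram(signal, scales, support=5.0):
    """Return |wavelet coefficients|: one row per scale, one column per sample."""
    n = len(signal)
    rows = []
    for s in scales:
        half = int(support * s)  # truncate the wavelet at +/- support * s samples
        kernel = [ricker(k / s) / math.sqrt(s) for k in range(-half, half + 1)]
        row = []
        for j in range(n):
            acc = 0.0
            for k in range(-half, half + 1):
                idx = j + k
                if 0 <= idx < n:  # samples outside the trace are treated as zero
                    acc += signal[idx] * kernel[k + half]
            row.append(abs(acc))
        rows.append(row)
    return rows

def dominant_scale(signal, scales, margin=50):
    """Scale whose row carries the most energy, ignoring `margin` edge samples."""
    rows = scalogram(signal, scales)
    energy = [sum(v * v for v in row[margin:len(row) - margin]) for row in rows]
    return scales[energy.index(max(energy))]

# Synthetic stand-in for a current trace: a sinusoid with a 20-sample period.
trace = [math.sin(2.0 * math.pi * i / 20.0) for i in range(200)]
scales = [1, 2, 3, 5, 8, 12]
image = scalogram(trace, scales)      # 6 x 200 array of magnitudes
best = dominant_scale(trace, scales)  # scale that responds most to the trace
```

Each row of `image` is the response at one scale, so the array can be rendered as an image and passed to an image classifier; a real trace would need a denser scale grid and a faster convolution than this nested loop.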

2025-09-17T14:30:55Z 29 pages (incl. references) 7 figures Samuel Tovey Julian Hoßbach Sandro Kuppel Tobias Ensslen Jan C. Behrends Christian Holm http://arxiv.org/abs/2509.18153v1 A deep reinforcement learning platform for antibiotic discovery 2025-09-16T18:21:42Z

Antimicrobial resistance (AMR) is projected to cause up to 10 million deaths annually by 2050, underscoring the urgent need for new antibiotics. Here we present ApexAmphion, a deep-learning framework for de novo design of antibiotics that couples a 6.4-billion-parameter protein language model with reinforcement learning. The model is first fine-tuned on curated peptide data to capture antimicrobial sequence regularities, then optimised with proximal policy optimization against a composite reward that combines predictions from a learned minimum inhibitory concentration (MIC) classifier with differentiable physicochemical objectives. In vitro evaluation of 100 designed peptides showed low MIC values (nanomolar range in some cases) for all candidates (100% hit rate). Moreover, 99 out of 100 compounds exhibited broad-spectrum antimicrobial activity against at least two clinically relevant bacteria. The lead molecules killed bacteria primarily by potently targeting the cytoplasmic membrane. By unifying generation, scoring and multi-objective optimization with deep reinforcement learning in a single pipeline, our approach rapidly produces diverse, potent candidates, offering a scalable route to peptide antibiotics and a platform for iterative steering toward potency and developability within hours.

2025-09-16T18:21:42Z 42 pages, 16 figures Hanqun Cao Marcelo D. T. Torres Jingjie Zhang Zijun Gao Fang Wu Chunbin Gu Jure Leskovec Yejin Choi Cesar de la Fuente-Nunez Guangyong Chen Pheng-Ann Heng http://arxiv.org/abs/2509.13294v1 Accelerating Protein Molecular Dynamics Simulation with DeepJump 2025-09-16T17:48:58Z

Unraveling the dynamical motions of biomolecules is essential for bridging their structure and function, yet it remains a major computational challenge. Molecular dynamics (MD) simulation provides a detailed depiction of biomolecular motion, but its high-resolution temporal evolution comes at significant computational cost, limiting its applicability to timescales of biological relevance. Deep learning approaches have emerged as promising solutions to overcome these computational limitations by learning to predict long-timescale dynamics. However, generalizable kinetics models for proteins remain largely unexplored, and the fundamental limits of achievable acceleration while preserving dynamical accuracy are poorly understood. In this work, we fill this gap with DeepJump, a Euclidean-Equivariant Flow Matching-based model for predicting protein conformational dynamics across multiple temporal scales. We train DeepJump on trajectories of the diverse proteins of mdCATH, systematically studying our model's performance in generalizing to long-term dynamics of fast-folding proteins and characterizing the trade-off between computational acceleration and prediction accuracy. We demonstrate the application of DeepJump to ab initio folding, showcasing prediction of folding pathways and native states. Our results demonstrate that DeepJump achieves significant $\approx$1000$\times$ computational acceleration while effectively recovering long-timescale dynamics, providing a stepping stone for enabling routine simulation of proteins.
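For readers unfamiliar with flow matching, the generic training objective can be sketched as follows. This is only the textbook straight-path formulation on plain coordinate lists, not DeepJump itself: the equivariant architecture, the conditioning on protein state, and the choice of path are not described in the abstract, and the function names and toy models here are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity along the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(model, pairs, rng):
    """Average squared error between predicted and target velocity at random t."""
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()
        xt = interpolate(x0, x1, t)
        pred = model(xt, t)
        target = target_velocity(x0, x1)
        total += sum((p - u) ** 2 for p, u in zip(pred, target))
    return total / len(pairs)

def integrate(model, x0, steps=10):
    """Generate a sample by Euler integration of the learned velocity field."""
    x = list(x0)
    for i in range(steps):
        v = model(x, i / steps)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x

# Toy data: every end state is the start state shifted by (1, -2).
pairs = [([0.0, 0.0], [1.0, -2.0]), ([1.0, 3.0], [2.0, 1.0])]
shift_model = lambda xt, t: [1.0, -2.0]  # velocity field that matches the data
zero_model = lambda xt, t: [0.0, 0.0]    # velocity field that ignores the data

rng = random.Random(0)
good_loss = flow_matching_loss(shift_model, pairs, rng)
bad_loss = flow_matching_loss(zero_model, pairs, rng)
end_state = integrate(shift_model, [0.0, 0.0])
```

In practice `model` would be a neural network trained by gradient descent on this loss, and the number of integration steps is one of the knobs that trades speed against accuracy.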

2025-09-16T17:48:58Z Allan dos Santos Costa Manvitha Ponnapati Dana Rubin Tess Smidt Joseph Jacobson http://arxiv.org/abs/2509.13216v1 Flow-Based Fragment Identification via Binding Site-Specific Latent Representations 2025-09-16T16:20:45Z

Fragment-based drug design is a promising strategy leveraging the binding of small chemical moieties that can efficiently guide drug discovery. The initial step of fragment identification remains challenging, as fragments often bind weakly and non-specifically. We developed a protein-fragment encoder that relies on a contrastive learning approach to map both molecular fragments and protein surfaces in a shared latent space. The encoder captures interaction-relevant features and enables virtual screening as well as generative design with our new method LatentFrag. In LatentFrag, fragment embeddings and positions are generated conditioned on the protein surface while being chemically realistic by construction. Our expressive fragment and protein representations allow location of protein-fragment interaction sites with high sensitivity and we observe state-of-the-art fragment recovery rates when sampling from the learned distribution of latent fragment embeddings. Our generative method outperforms common methods such as virtual screening at a fraction of its computational cost providing a valuable starting point for fragment hit discovery. We further show the practical utility of LatentFrag and extend the workflow to full ligand design tasks. Together, these approaches contribute to advancing fragment identification and provide valuable tools for fragment-based drug discovery.
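The idea of placing protein surfaces and fragments in a shared latent space through contrastive learning can be illustrated with an InfoNCE-style loss. The abstract does not state which contrastive objective LatentFrag uses, so the loss form, the temperature value, and the two-dimensional toy embeddings below are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalise(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(protein_emb, fragment_emb, temperature=0.1):
    """Mean cross-entropy of picking fragment i as the partner of protein patch i."""
    p = [normalise(v) for v in protein_emb]
    f = [normalise(v) for v in fragment_emb]
    loss = 0.0
    for i, pi in enumerate(p):
        logits = [dot(pi, fj) / temperature for fj in f]
        m = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(p)

# Toy embeddings: row i of each list is assumed to be a true binding pair.
patches = [[1.0, 0.0], [0.0, 1.0]]
matched = [[2.0, 0.0], [0.0, 3.0]]     # each fragment points the same way as its patch
mismatched = [[0.0, 3.0], [2.0, 0.0]]  # partners swapped

aligned_loss = info_nce(patches, matched)
swapped_loss = info_nce(patches, mismatched)
```

Minimising such a loss pulls true pairs together and pushes other pairs in the batch apart, which is what makes nearest-neighbour lookup in the latent space usable for screening.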

2025-09-16T16:20:45Z Rebecca Manuela Neeser Ilia Igashov Arne Schneuing Michael Bronstein Philippe Schwaller Bruno Correia http://arxiv.org/abs/2509.12976v1 SHREC 2025: Protein surface shape retrieval including electrostatic potential 2025-09-16T11:35:31Z

This SHREC 2025 track dedicated to protein surface shape retrieval involved 9 participating teams. We evaluated the performance in retrieval of 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor). The performance in retrieval of the proposed methods was evaluated through different metrics (Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best retrieval performance was achieved by the proposed methods that used the electrostatic potential complementary to molecular surface shape. This observation was also valid for classes with limited data which highlights the importance of taking into account additional molecular surface descriptors.
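The five reported metrics can be computed from true and predicted class labels as in the sketch below. The abstract does not say how the per-class scores are averaged, so macro averaging is an assumption here; under that choice, balanced accuracy and macro recall are the same number. The label lists are made up for illustration.

```python
def classification_metrics(true_labels, predicted_labels):
    """Accuracy plus macro-averaged precision, recall and F1 over the true classes."""
    classes = sorted(set(true_labels))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == c and p == c)
        fp = sum(1 for t, p in zip(true_labels, predicted_labels) if t != c and p == c)
        fn = sum(1 for t, p in zip(true_labels, predicted_labels) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n_classes = len(classes)
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return {
        "accuracy": correct / len(true_labels),
        "balanced_accuracy": sum(recalls) / n_classes,  # mean per-class recall
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }

# Made-up labels for three protein classes, two query surfaces each.
truth = [0, 0, 1, 1, 2, 2]
retrieved = [0, 1, 1, 1, 2, 0]
scores = classification_metrics(truth, retrieved)
```

Balanced accuracy is the metric most sensitive to the classes with limited data mentioned above, because every class contributes equally regardless of its size.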

2025-09-16T11:35:31Z Published in Computers & Graphics, Elsevier. 59 pages, 12 figures Computers & Graphics Volume 132, November 2025, Article 104394 Taher Yacoub Camille Depenveiller Atsushi Tatsuma Tin Barisin Eugen Rusakov Udo Gobel Yuxu Peng Shiqiang Deng Yuki Kagaya Joon Hong Park Daisuke Kihara Marco Guerra Giorgio Palmieri Andrea Ranieri Ulderico Fugacci Silvia Biasotti Ruiwen He Halim Benhabiles Adnane Cabani Karim Hammoudi Haotian Li Hao Huang Chunyan Li Alireza Tehrani Fanwang Meng Farnaz Heidar-Zadeh Tuan-Anh Yang Matthieu Montes 10.1016/j.cag.2025.104394 http://arxiv.org/abs/2509.12915v1 Synthetic Protein-Ligand Complex Generation for Deep Molecular Docking 2025-09-16T10:09:09Z

The scarcity of experimental protein-ligand complexes poses a significant challenge for training robust deep learning models for molecular docking. Given the prohibitive cost and time constraints associated with experimental structure determination, scalable generation of realistic protein-ligand complexes is needed to expand available datasets for model development. In this study, we introduce a novel workflow for the procedural generation and validation of synthetic protein-ligand complexes, combining a diverse ensemble of generation techniques and rigorous quality control. We assessed the utility of these synthetic datasets by retraining established docking models, Smina and Gnina, and evaluating their performance on standard benchmarks including the PDBBind core set and the PoseBusters dataset. Our results demonstrate that models trained on synthetic data achieve performance comparable to models trained on experimental data, indicating that current synthetic complexes can effectively capture many salient features of protein-ligand interactions. However, we did not observe significant improvements in docking or scoring accuracy over conventional methods or experimental data augmentation. These findings highlight the promise as well as the current limitations of synthetic data for deep learning-based molecular docking and underscore the need for further refinement in generation methodologies and evaluation strategies to fully exploit the potential of synthetic datasets for this application.

2025-09-16T10:09:09Z Sofiene Khiari Matthew R. Masters Amr H. Mahmoud Markus A. Lill http://arxiv.org/abs/2509.12460v1 Computational design of intrinsically disordered proteins 2025-09-15T21:14:13Z

Protein design has the potential to revolutionize biotechnology and medicine. While most efforts have focused on proteins with well-defined structures, increased recognition of the functional significance of intrinsically disordered regions, together with improvements in their modeling, has paved the way to their computational de novo design. This review summarizes recent advances in engineering intrinsically disordered regions with tailored conformational ensembles, molecular recognition, and phase behavior. We discuss challenges in combining models with predictive accuracy with scalable design workflows and outline emerging strategies that integrate knowledge-based, physics-based, and machine-learning approaches.

2025-09-15T21:14:13Z 10 pages, 3 figures Giulio Tesei Francesco Pesce Kresten Lindorff-Larsen http://arxiv.org/abs/2411.05795v2 Polymerization and replication of primordial RNA induced by clay-water interface dynamics 2025-09-15T16:46:10Z

In the study of life's origins, a key challenge is understanding how RNA could have polymerized and subsequently replicated on early Earth. We present a theoretical and computational framework to model the non-enzymatic polymerization of ribonucleotides and the template-dependent replication of primordial RNA molecules, at the interfaces between the aqueous solution and a clay mineral. Our results demonstrate that systematic polymerization and replication of single-stranded RNA polymers, sufficiently long to fold and acquire basic functions ($>$15 nt), were feasible under these conditions. Crucially, this process required a physico-chemical environment characterized by large-amplitude oscillations with periodicity compatible with spring tide dynamics, suggesting that large moons may have played a role in the emergence of RNA-based life on planetary bodies. Interestingly, the theoretical analysis presents rigorous evidence that RNA replication efficiency increases in oscillating environments compared to constant ones. Moreover, the versatility of our framework enables comparisons between different genetic alphabets, showing that a four-letter alphabet -- particularly when allowing non-canonical base pairs, as in current RNA -- represents an optimal balance of replication speed and sequence diversity in the pathway to life.

2024-10-25T08:12:05Z This file contains the main manuscript (6 figures) and the Supplementary Information (4 sections and 11 Supplementary figures) Communications Chemistry 8, 236 (2025) Carla Alejandre Adrián Aguirre-Tamaral Carlos Briones Jacobo Aguirre 10.1038/s42004-025-01632-w http://arxiv.org/abs/2509.13476v1 A Geometric Graph-Based Deep Learning Model for Drug-Target Affinity Prediction 2025-09-15T14:06:39Z

In structure-based drug design, accurately estimating the binding affinity between a candidate ligand and its protein receptor is a central challenge. Recent advances in artificial intelligence, particularly deep learning, have demonstrated superior performance over traditional empirical and physics-based methods for this task, enabled by the growing availability of structural and experimental affinity data. In this work, we introduce DeepGGL, a deep convolutional neural network that integrates residual connections and an attention mechanism within a geometric graph learning framework. By leveraging multiscale weighted colored bipartite subgraphs, DeepGGL effectively captures fine-grained atom-level interactions in protein-ligand complexes across multiple scales. We benchmarked DeepGGL against established models on CASF-2013 and CASF-2016, where it achieved state-of-the-art performance with significant improvements across diverse evaluation metrics. To further assess robustness and generalization, we tested the model on the CSAR-NRC-HiQ dataset and the PDBbind v2019 holdout set. DeepGGL consistently maintained high predictive accuracy, highlighting its adaptability and reliability for binding affinity prediction in structure-based drug discovery.

2025-09-15T14:06:39Z Md Masud Rana Farjana Tasnim Mukta Duc D. Nguyen http://arxiv.org/abs/2509.11782v1 Multimodal Regression for Enzyme Turnover Rates Prediction 2025-09-15T11:07:26Z

The enzyme turnover rate is a fundamental parameter in enzyme kinetics, reflecting the catalytic efficiency of enzymes. However, enzyme turnover rates remain scarce across most organisms due to the high cost and complexity of experimental measurements. To address this gap, we propose a multimodal framework for predicting the enzyme turnover rate by integrating enzyme sequences, substrate structures, and environmental factors. Our model combines a pre-trained language model and a convolutional neural network to extract features from protein sequences, while a graph neural network captures informative representations from substrate molecules. An attention mechanism is incorporated to enhance interactions between enzyme and substrate representations. Furthermore, we leverage symbolic regression via Kolmogorov-Arnold Networks to explicitly learn mathematical formulas that govern the enzyme turnover rate, enabling interpretable and accurate predictions. Extensive experiments demonstrate that our framework outperforms both traditional and state-of-the-art deep learning approaches. This work provides a robust tool for studying enzyme kinetics and holds promise for applications in enzyme engineering, biotechnology, and industrial biocatalysis.

2025-09-15T11:07:26Z 9 pages, 5 figures. This paper was withdrawn from the IJCAI 2025 proceedings due to the lack of participation in the conference and presentation Bozhen Hu Cheng Tan Siyuan Li Jiangbin Zheng Sizhe Qiu Jun Xia Stan Z. Li http://arxiv.org/abs/2402.16445v3 ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing 2025-09-14T04:03:16Z

Recent advances in Protein Language Models (PLMs) have transformed protein engineering, yet unlike their counterparts in Natural Language Processing (NLP), current PLMs exhibit a fundamental limitation: they excel in either Protein Language Understanding (PLU) or Protein Language Generation (PLG), but rarely both. This fragmentation hinders progress in protein engineering. To bridge this gap, we introduce ProLLaMA, a multitask protein language model enhanced by the Evolutionary Protein Generation Framework (EPGF). We construct a comprehensive instruction dataset containing approximately 13 million samples with over 11,000 superfamily annotations to facilitate better modeling of sequence-function landscapes. We leverage a two-stage training approach to develop ProLLaMA, a multitask LLM with protein domain expertise. Our EPGF addresses the mismatch between statistical language modeling and biological constraints through three innovations: a multi-dimensional interpretable scorer, hierarchical efficient decoding, and a probabilistic-biophysical joint selection mechanism. Extensive experiments demonstrate that ProLLaMA excels in both unconditional and controllable protein generation tasks, achieving superior structural quality metrics compared to existing PLMs. Additionally, ProLLaMA demonstrates strong understanding capabilities with a 67.1% exact match rate in superfamily prediction. EPGF significantly enhances the biological viability of generated sequences, as evidenced by improved biophysical scores (+4.3%) and structural metrics (+14.5%). The project is available at https://github.com/PKU-YuanGroup/ProLLaMA.

2024-02-26T09:43:52Z IEEE Transactions on Artificial Intelligence Liuzhenghao Lv Zongying Lin Hao Li Yuyang Liu Jiaxi Cui Calvin Yu-Chian Chen Li Yuan Yonghong Tian http://arxiv.org/abs/2509.11046v1 Hybrid Quantum Neural Networks for Efficient Protein-Ligand Binding Affinity Prediction 2025-09-14T02:20:21Z

Protein-ligand binding affinity is critical in drug discovery, but experimentally determining it is time-consuming and expensive. Artificial intelligence (AI) has been used to predict binding affinity, significantly accelerating this process. However, the high-performance requirements and vast datasets involved in affinity prediction demand increasingly large AI models, requiring substantial computational resources and training time. Quantum machine learning has emerged as a promising solution to these challenges. In particular, hybrid quantum-classical models can reduce the number of parameters while maintaining or improving performance compared to classical counterparts. Despite these advantages, challenges persist: why hybrid quantum models achieve these benefits, whether quantum neural networks (QNNs) can replace classical neural networks, and whether such models are feasible on noisy intermediate-scale quantum (NISQ) devices. This study addresses these challenges by proposing a hybrid quantum neural network (HQNN) that empirically demonstrates the capability to approximate non-linear functions in the latent feature space derived from classical embedding. The primary goal of this study is to achieve a parameter-efficient model in binding affinity prediction while ensuring feasibility on NISQ devices. Numerical results indicate that HQNN achieves comparable or superior performance and parameter efficiency compared to classical neural networks, underscoring its potential as a viable replacement. This study highlights the potential of hybrid QML in computational drug discovery, offering insights into its applicability and advantages in addressing the computational challenges of protein-ligand binding affinity prediction.

2025-09-14T02:20:21Z 43 pages, 9 figures, and 12 tables. Accepted by EPJ Quantum Technology Seon-Geun Jeong Kyeong-Hwan Moon Won-Joo Hwang http://arxiv.org/abs/2509.10871v1 Optimal message passing for molecular prediction is simple, attentive and spatial 2025-09-13T15:55:02Z

The predictive performance of message-passing neural networks (MPNNs) for molecular property prediction can be improved by simplifying how messages are passed and by using descriptors that capture multiple aspects of molecular graphs. In this work, we designed model architectures that achieved state-of-the-art performance, surpassing more complex models such as those pre-trained on external databases. We assessed dataset diversity to complement our performance results, finding that structural diversity influences the need for additional components in our MPNNs and feature sets. In most datasets, our best architecture employs bidirectional message-passing with an attention mechanism, applied to a minimalist message formulation that excludes self-perception, highlighting that relatively simpler models, compared to classical MPNNs, yield higher class separability. In contrast, we found that convolution normalization factors do not benefit the predictive power in all the datasets tested. This was corroborated in both global and node-level outputs. Additionally, we analyzed the influence of both adding spatial features and working with 3D graphs, finding that 2D molecular graphs are sufficient when complemented with appropriately chosen 3D descriptors. This approach not only preserves predictive performance but also reduces computational cost by over 50%, making it particularly advantageous for high-throughput screening campaigns.
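One reading of attentive message passing without self-perception is sketched below: each atom aggregates its neighbours' features with softmax attention weights, and its own features are left out of the incoming message. The abstract does not give the actual update equations, so the dot-product attention score, the absence of learned weights, and the three-atom toy graph are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_message_pass(features, edges):
    """One round of attention-weighted neighbour aggregation without self-messages."""
    n = len(features)
    dim = len(features[0])
    neighbours = [[] for _ in range(n)]
    for a, b in edges:  # every bond carries messages in both directions
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated = []
    for i in range(n):
        if not neighbours[i]:
            updated.append(list(features[i]))  # isolated atom keeps its features
            continue
        # Assumed attention score: dot product of receiver and sender features.
        scores = [dot(features[i], features[j]) for j in neighbours[i]]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        message = [0.0] * dim
        for w, j in zip(weights, neighbours[i]):
            for d in range(dim):
                message[d] += (w / z) * features[j][d]
        updated.append(message)  # the atom's own features are not mixed in
    return updated

# Toy three-atom chain 0 - 1 - 2 with two-dimensional atom features.
atom_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bonds = [(0, 1), (1, 2)]
new_features = attention_message_pass(atom_features, bonds)
```

Atoms 0 and 2 each have a single neighbour and therefore simply receive atom 1's features, while atom 1 receives a weighted mix of its two neighbours.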

2025-09-13T15:55:02Z 32 pages, 12 figures. Preprint submitted to RSC Drug Discovery Digital Discovery, 2025, 4 Alma C. Castaneda-Leautaud Rommie E. Amaro 10.1039/D5DD00193E