http://arxiv.org/abs/2506.03237v3 UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection 2025-11-11T18:16:03Z

The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and code will be made publicly available at https://github.com/quanlin-wu/unisite.
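
To make the idea of IoU-based Average Precision concrete, the following is a minimal sketch and not the authors' evaluation code. It assumes each binding site is a set of residue indices, that each prediction carries a confidence score, and that a prediction counts as a true positive when its IoU with a still-unmatched ground-truth site reaches a threshold. The names `iou`, `average_precision`, and `iou_threshold`, the 0.5 default, and the greedy matching rule are all illustrative assumptions; the abstract does not specify the exact protocol.

```python
from typing import List, Set, Tuple


def iou(a: Set[int], b: Set[int]) -> float:
    """Intersection over Union of two residue-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_precision(
    predictions: List[Tuple[float, Set[int]]],
    ground_truth: List[Set[int]],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the paper
) -> float:
    """Non-interpolated AP: sum of precision at each true positive / #sites."""
    if not ground_truth:
        return 0.0
    matched = [False] * len(ground_truth)
    true_positives = 0
    precision_sum = 0.0
    # Visit predictions from most to least confident.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    for rank, (_, site) in enumerate(ranked, start=1):
        # Greedily pick the best still-unmatched ground-truth site.
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth):
            if matched[idx]:
                continue
            score = iou(site, gt)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched[best_idx] = True
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(ground_truth)
```

Because each ground-truth site can be matched at most once, duplicate predictions of the same site are penalized as false positives, which is one reason a detection-style metric can separate methods that a per-residue metric would score similarly.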

2025-06-03T17:49:41Z Accepted by NeurIPS 2025 as a Spotlight paper NeurIPS 2025 (Spotlight) Jigang Fan Quanlin Wu Shengjie Luo Liwei Wang http://arxiv.org/abs/2511.08648v1 Compact Artificial Neural Network Models for Predicting Protein Residue -- RNA Base Binding 2025-11-11T01:38:52Z

Large Artificial Neural Network (ANN) models have demonstrated success in various domains, including general text and image generation, drug discovery, and protein-RNA (ribonucleic acid) binding tasks. However, these models typically demand substantial computational resources, time, and data for effective training. Given that such extensive resources are often inaccessible to many researchers and that life sciences data sets are frequently limited, we investigated whether small ANN models could achieve acceptable accuracy in protein-RNA binding prediction. We experimented with shallow feed-forward ANNs comprising two hidden layers and various non-linearities. These models did not utilize explicit structural information; instead, a sliding window approach was employed to implicitly consider the context of neighboring residues and bases. We explored different training techniques to address the issue of highly unbalanced data. Among the seven most popular non-linearities for feed-forward ANNs, only three, Rectified Linear Unit (ReLU), Gated Linear Unit (GLU), and Hyperbolic Tangent (Tanh), yielded converging models. Common re-balancing techniques, such as under- and over-sampling of training sets, proved ineffective, whereas increasing the volume of training data and using model ensembles significantly improved performance. The optimal context window size, balancing both false negative and false positive errors, was found to be approximately 30 residues and bases. Our findings indicate that high-accuracy protein-RNA binding prediction is achievable using computing hardware accessible to most educational and research institutions.
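
The sliding-window idea can be sketched as follows. This is an illustrative example only, not the author's pipeline: the padding symbol `PAD`, the one-hot encoding, the restriction to odd window sizes (so that each window has a well-defined center), and the function names are all assumptions introduced here.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"  # hypothetical padding symbol for positions past the sequence ends


def sliding_windows(sequence: str, window: int) -> List[str]:
    """Return one window per residue, centered on it and padded at the ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + window] for i in range(len(sequence))]


def one_hot(window_text: str) -> List[int]:
    """Flatten a window into a one-hot vector; padding maps to all zeros."""
    vector: List[int] = []
    for symbol in window_text:
        row = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(symbol)
        if index >= 0:
            row[index] = 1
        vector.extend(row)
    return vector
```

Each flattened vector could then serve as the input to a small feed-forward network that predicts whether the central residue binds, so the model sees neighboring context without any explicit structural features.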

2025-11-11T01:38:52Z Stanislav Selitskiy 10.1007/978-3-031-82484-5_11 http://arxiv.org/abs/2511.07406v1 Entangled Schrödinger Bridge Matching 2025-11-10T18:55:35Z

Simulating trajectories of multi-particle systems on complex energy landscapes is a central task in molecular dynamics (MD) and drug discovery, but remains challenging at scale due to computationally expensive and long simulations. Previous approaches leverage techniques such as flow or Schrödinger bridge matching to implicitly learn joint trajectories through data snapshots. However, many systems, including biomolecular systems and heterogeneous cell populations, undergo dynamic interactions that evolve over their trajectory and cannot be captured through static snapshots. To close this gap, we introduce Entangled Schrödinger Bridge Matching (EntangledSBM), a framework that learns the first- and second-order stochastic dynamics of interacting, multi-particle systems where the direction and magnitude of each particle's path depend dynamically on the paths of the other particles. We define the Entangled Schrödinger Bridge (EntangledSB) problem as solving a coupled system of bias forces that entangle particle velocities. We show that our framework accurately simulates heterogeneous cell populations under perturbations and rare transitions in high-dimensional biomolecular systems.

2025-11-10T18:55:35Z Sophia Tang Yinuo Zhang Pranam Chatterjee http://arxiv.org/abs/2511.07264v1 Edible polysaccharides as stabilizers and carriers for the delivery of phenolic compounds and pigments in food formulations 2025-11-10T16:08:58Z

Food polysaccharides have emerged as suitable carriers of active substances and as additives to food and nutraceutical formulations, showing potential to stabilize bioactive compounds during the storage of microencapsulated preparations and even in the gastrointestinal tract following intake, thereby improving their bioaccessibility and bioavailability. This review provides a comprehensive overview of the main polysaccharides employed as wall materials, including starch, maltodextrin, alginate, pectin, inulin, chitosan, and gum arabic, and discusses how structural interactions and physicochemical properties can benefit the microencapsulation of polyphenols and pigments. The main findings and principles of the major encapsulation techniques used to produce microparticles, including spray drying, freeze drying, extrusion, emulsification, and coacervation, are briefly described. Polysaccharides can entrap hydrophilic and hydrophobic compounds by physical interactions, forming a barrier around the nucleus or binding to the bioactive compound. Intermolecular binding between the polysaccharides in the wall matrix and the polyphenols and pigments in the nucleus, governed mainly by hydrogen bonds and electrostatic interactions, can confer encapsulation efficiencies of up to 90%. Mixing wall polysaccharides during microparticle synthesis favors the encapsulation, solubility, storage stability, bioaccessibility, and bioactivity of the microencapsulated compounds. Clinical trials on the bioefficacy of polyphenols and pigments loaded in polysaccharide microparticles are scarce, and further evidence is required to reinforce the use of this technology.

2025-11-10T16:08:58Z Liliane Siqueira de Oliveira Davi Vieira Teixeira da Silva Lucileno Rodrigues da Trindade Diego dos Santos Baião Cristine Couto de Almeida Vitor Francisco Ferreira Vania Margaret Flosi Paschoalin http://arxiv.org/abs/2511.06930v1 De Novo Design of SIK3 Inhibitors via Feedback-Driven Fine-Tuning of Seq2Seq-VAE 2025-11-10T10:22:07Z

Alzheimer's disease (AD), a progressive neurodegenerative disorder, currently lacks effective therapeutic strategies that can modify disease progression. Recent studies have highlighted the critical role of the circadian rhythm in AD pathophysiology, implicating circadian clock kinases, such as the Salt-Inducible Kinase 3 (SIK3), as promising therapeutic targets. Generative AI models have surpassed traditional methods of drug discovery, tapping the vast unexplored chemical space of drug-like molecules. We present a sequence-to-sequence Variational Autoencoder (Seq2Seq-VAE) model guided by an Active Learning (AL) approach to optimize molecular generation. Our pipeline iteratively guides a pre-trained Seq2Seq-VAE model towards the pharmacological landscape relevant to SIK3 using a two-step framework: an inner loop that iteratively improves the physicochemical property profile, drug-likeness, and synthesizability, followed by an outer loop that steers the latent space towards high-affinity ligands for SIK3. Our approach introduces feedback-driven optimization without requiring large labeled datasets, making it particularly suited for early-stage drug discovery in under-explored therapeutic targets. Our results demonstrate the model's convergence toward SIK3-specific small molecules with desired properties and high binding affinity. This work highlights the use of generative AI combined with AL for rational drug discovery that can be extended to other protein targets with minimal modifications, offering a scalable solution to the molecular design bottleneck in drug design.

2025-11-10T10:22:07Z ShahZeb Khan Chiara Pallara Barbara Monti Alexis Molina http://arxiv.org/abs/2511.06585v1 Learning Biomolecular Motion: The Physics-Informed Machine Learning Paradigm 2025-11-10T00:24:06Z

The convergence of statistical learning and molecular physics is transforming our approach to modeling biomolecular systems. Physics-informed machine learning (PIML) offers a systematic framework that integrates data-driven inference with physical constraints, resulting in models that are accurate, mechanistic, generalizable, and able to extrapolate beyond observed domains. This review surveys recent advances in physics-informed neural networks and operator learning, differentiable molecular simulation, and hybrid physics-ML potentials, with emphasis on long-timescale kinetics, rare events, and free-energy estimation. We frame these approaches as solutions to the "biomolecular closure problem", recovering unresolved interactions beyond classical force fields while preserving thermodynamic consistency and mechanistic interpretability. We examine theoretical foundations, tools and frameworks, computational trade-offs, and unresolved issues, including model expressiveness and stability. We outline prospective research avenues at the intersection of machine learning, statistical physics, and computational chemistry, contending that future advancements will depend on mechanistic inductive biases, and integrated differentiable physical learning frameworks for biomolecular simulation and discovery.

2025-11-10T00:24:06Z 31 pages, 4 figures, 3 tables. Review article Aaryesh Deshpande http://arxiv.org/abs/2511.04892v1 LG-NuSegHop: A Local-to-Global Self-Supervised Pipeline For Nuclei Instance Segmentation 2025-11-07T00:34:10Z

Nuclei segmentation is the cornerstone task in histology image reading, shedding light on the underlying molecular patterns and leading to disease or cancer diagnosis. Yet, it is a laborious task that requires expertise from trained physicians. The large nuclei variability across different organ tissues and acquisition processes challenges the automation of this task. On the other hand, data annotations are expensive to obtain, and thus, Deep Learning (DL) models are challenged to generalize to unseen organs or different domains. This work proposes Local-to-Global NuSegHop (LG-NuSegHop), a self-supervised pipeline developed on prior knowledge of the problem and molecular biology. There are three distinct modules: (1) a set of local processing operations to generate a pseudolabel, (2) NuSegHop, a novel data-driven feature extraction model, and (3) a set of global operations to post-process the predictions of NuSegHop. Notably, even though the proposed pipeline uses no manually annotated training data or domain adaptation, it maintains good generalization performance on other datasets. Experiments on three publicly available datasets show that our method outperforms other self-supervised and weakly supervised methods while having a competitive standing among fully supervised methods. Remarkably, every module within LG-NuSegHop is transparent and explainable to physicians.

2025-11-07T00:34:10Z 42 pages, 8 figures, 7 tables Asia Pacific Signal and Information Processing Association (APSIPA), 2025 http://www.apsipa.org Vasileios Magoulianitis Catherine A. Alexander Jiaxin Yang C. -C. Jay Kuo http://arxiv.org/abs/2511.04814v1 A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification 2025-11-06T21:10:48Z

Antimicrobial peptides have emerged as promising molecules to combat antimicrobial resistance. However, fragmented datasets, inconsistent annotations, and the lack of standardized benchmarks hinder computational approaches and slow down the discovery of new candidates. To address these challenges, we present the Expanded Standardized Collection for Antimicrobial Peptide Evaluation (ESCAPE), an experimental framework integrating over 80,000 peptides from 27 validated repositories. Our dataset separates antimicrobial peptides from negative sequences and incorporates their functional annotations into a biologically coherent multilabel hierarchy, capturing activities across antibacterial, antifungal, antiviral, and antiparasitic classes. Building on ESCAPE, we propose a transformer-based model that leverages sequence and structural information to predict multiple functional activities of peptides. Our method achieves up to a 2.56% relative average improvement in mean Average Precision over the second-best method adapted for this task, establishing a new state of the art in multilabel peptide classification. ESCAPE provides a comprehensive and reproducible evaluation framework to advance AI-driven antimicrobial peptide research.

2025-11-06T21:10:48Z 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Camera-ready version. Code: https://github.com/BCV-Uniandes/ESCAPE. Dataset DOI: https://doi.org/10.7910/DVN/C69MCD Sebastian Ojeda Rafael Velasquez Nicolás Aparicio Juanita Puentes Paula Cárdenas Nicolás Andrade Gabriel González Sergio Rincón Carolina Muñoz-Camargo Pablo Arbeláez http://arxiv.org/abs/2511.14781v1 Quantifying the Role of OpenFold Components in Protein Structure Prediction 2025-11-06T20:41:34Z

Models such as AlphaFold2 and OpenFold have transformed protein structure prediction, yet their inner workings remain poorly understood. We present a methodology to systematically evaluate the contribution of individual OpenFold components to structure prediction accuracy. We identify several components that are critical for most proteins, while others vary in importance across proteins. We further show that the contribution of several components is correlated with protein length. These findings provide insight into how OpenFold achieves accurate predictions and highlight directions for interpreting protein prediction networks more broadly.

2025-11-06T20:41:34Z Accepted to the NeurIPS 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences Tyler L. Hayes Giri P. Krishnan http://arxiv.org/abs/2511.04174v1 Protein aggregation in Huntington's disease 2025-11-06T08:25:58Z

The presence of an expanded polyglutamine produces a toxic gain of function in huntingtin. Protein aggregation resulting from this gain of function is likely to be the cause of neuronal death. Two main mechanisms of aggregation have been proposed: hydrogen bonding by polar-zipper formation and covalent bonding by transglutaminase-catalyzed cross-linking. In cell culture models of Huntington's disease, aggregates are mostly stabilized by hydrogen bonds, but covalent bonds are also likely to occur. Nothing is known about the nature of the bonds that stabilize the aggregates in the brain of patients with Huntington's disease. It seems that the nature of the bond stabilizing the aggregates is one of the most important questions, as the answer would condition the therapeutic approach to Huntington's disease.

2025-11-06T08:25:58Z Biochimie, 2002, 84 (4), pp.273-278 Guylaine Hoffner UNICOG-U992, NEUROSPIN Philippe Djian http://arxiv.org/abs/2511.04040v1 Enhancing Multimodal Protein Function Prediction Through Dual-Branch Dynamic Selection with Reconstructive Pre-Training 2025-11-06T04:19:42Z

Multimodal protein features play a crucial role in protein function prediction. However, these features encompass a wide range of information, ranging from structural data and sequence features to protein attributes and interaction networks, making it challenging to decipher their complex interconnections. In this work, we propose a multimodal protein function prediction method (DSRPGO) by utilizing dynamic selection and reconstructive pre-training mechanisms. To acquire complex protein information, we introduce reconstructive pre-training to mine more fine-grained information with low semantic levels. Moreover, we put forward the Bidirectional Interaction Module (BInM) to facilitate interactive learning among multimodal features. Additionally, to address the difficulty of hierarchical multi-label classification in this task, a Dynamic Selection Module (DSM) is designed to select the feature representation that is most conducive to current protein function prediction. Our proposed DSRPGO model improves significantly in BPO, MFO, and CCO on human datasets, thereby outperforming other benchmark models.

2025-11-06T04:19:42Z Proceedings of the IJCAI-25, 7598--7606 (2025) Xiaoling Luo Peng Chen Chengliang Liu Xiaopeng Jin Jie Wen Yumeng Liu Junsong Wang http://arxiv.org/abs/2508.14233v5 Excitonic Coupling and Photon Antibunching in Venus Yellow Fluorescent Protein Dimers: A Lindblad Master Equation Approach 2025-11-05T09:58:35Z

Strong excitonic coupling and photon antibunching (AB) have been observed together in Venus yellow fluorescent protein dimers and currently lack a cohesive theoretical explanation. In 2019, Kim et al. demonstrated Davydov splitting in circular dichroism spectra, revealing strong J-like coupling, while antibunched fluorescence emission was confirmed by combined antibunching--fluorescence correlation spectroscopy (AB/FCS fingerprinting). To investigate the implications of this coexistence, Venus yellow fluorescent protein (YFP) dimer population dynamics are modeled within a Lindblad master equation framework, testing its ability to cope with typical, data-informed, Venus YFP dimer time and energy values. Simulations predict multiple-femtosecond (fs) decoherence, yielding bright/dark state mixtures consistent with antibunched fluorescence emission at room temperature. Thus, excitonic coupling and photon AB in Venus YFP dimers are reconciled without invoking long-lived quantum coherence. However, clear violations of several Lindblad approximation validity conditions appear imminent, calling for careful modifications to choices of standard system and bath definitions and parameter values.
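
For orientation, the general form of a Lindblad master equation is given below. This is the standard textbook (GKSL) expression, not the specific model of the paper: the density matrix $\rho$, system Hamiltonian $H$, jump operators $L_k$, and rates $\gamma_k$ are generic symbols, and the two-site dimer Hamiltonian with site energies $\varepsilon_1, \varepsilon_2$ and coupling $J$ is an assumed minimal illustration of excitonic coupling.

```latex
% Generic Lindblad (GKSL) master equation for the density matrix \rho
\frac{d\rho}{dt}
  = -\frac{i}{\hbar}\,[H, \rho]
  + \sum_k \gamma_k \left( L_k \rho L_k^\dagger
  - \frac{1}{2} \left\{ L_k^\dagger L_k, \rho \right\} \right)

% Assumed minimal two-site dimer Hamiltonian (illustrative only)
H = \varepsilon_1 \lvert 1 \rangle\langle 1 \rvert
  + \varepsilon_2 \lvert 2 \rangle\langle 2 \rvert
  + J \left( \lvert 1 \rangle\langle 2 \rvert + \lvert 2 \rangle\langle 1 \rvert \right)

% Its exciton eigenenergies; the gap between them is the excitonic splitting
E_\pm = \frac{\varepsilon_1 + \varepsilon_2}{2}
  \pm \sqrt{ \left( \frac{\varepsilon_1 - \varepsilon_2}{2} \right)^2 + J^2 }
```

The commutator term generates coherent evolution, while the sum over $k$ describes dissipation and dephasing from the bath; fast dephasing in this second term is the kind of mechanism by which coherences can decay on femtosecond timescales while populations continue to evolve.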

2025-08-19T19:44:59Z 25 pages, 4 figures, 7 appendices. Minor technical corrections and consistency updates from v4. Discusses fluorescent proteins, excitonic coupling, photon antibunching, open quantum systems modeling, Lindblad formalism, thermodynamics, information theory, evolutionary biology, photosynthetic energy transfer, quantum biophotonics, and quantum technology Ian T. Abrahams http://arxiv.org/abs/2502.09860v2 Gradient GA: Gradient Genetic Algorithm for Drug Molecular Design 2025-11-04T18:02:45Z

Molecular discovery has brought great benefits to the chemical industry. Various molecule design techniques are developed to identify molecules with desirable properties. Traditional optimization methods, such as genetic algorithms, continue to achieve state-of-the-art results across multiple molecular design benchmarks. However, these techniques rely solely on random walk exploration, which hinders both the quality of the final solution and the convergence speed. To address this limitation, we propose a novel approach called Gradient Genetic Algorithm (Gradient GA), which incorporates gradient information from the objective function into genetic algorithms. Instead of random exploration, each proposed sample iteratively progresses toward an optimal solution by following the gradient direction. We achieve this by designing a differentiable objective function parameterized by a neural network and utilizing the Discrete Langevin Proposal to enable gradient guidance in discrete molecular spaces. Experimental results demonstrate that our method significantly improves both convergence speed and solution quality, outperforming cutting-edge techniques. For example, it achieves up to a 25% improvement in the top-10 score over the vanilla genetic algorithm. The code is publicly available at https://github.com/debadyuti23/GradientGA.

2025-02-14T02:03:39Z Chris Zhuang Debadyuti Mukherjee Yingzhou Lu Tianfan Fu Ruqi Zhang http://arxiv.org/abs/2511.02769v1 STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation 2025-11-04T17:56:00Z

The chemical space of drug-like molecules is vast, motivating the development of generative models that must learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and provide fast molecular generation. Meeting the objectives depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address the challenges, we present STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation enables conditional generation: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient finetuning with low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with principled conditioning and parameter-efficient finetuning.

2025-11-04T17:56:00Z 16 pages, 3 figures, 2 tables Bum Chul Kwon Ben Shapira Moshiko Raboh Shreyans Sethi Shruti Murarka Joseph A Morrone Jianying Hu Parthasarathy Suryanarayanan http://arxiv.org/abs/2511.02622v1 Machine Learning for RNA Secondary Structure Prediction: a review of current methods and challenges 2025-11-04T14:52:11Z

Predicting the secondary structure of RNA is a core challenge in computational biology, essential for understanding molecular function and designing novel therapeutics. The field has evolved from foundational but accuracy-limited thermodynamic approaches to a new data-driven paradigm dominated by machine learning and deep learning. These models learn folding patterns directly from data, leading to significant performance gains. This review surveys the modern landscape of these methods, covering single-sequence, evolutionary-based, and hybrid models that blend machine learning with biophysics. A central theme is the field's "generalization crisis," where powerful models were found to fail on new RNA families, prompting a community-wide shift to stricter, homology-aware benchmarking. In response to the underlying challenge of data scarcity, RNA foundation models have emerged, learning from massive, unlabeled sequence corpora to improve generalization. Finally, we look ahead to the next set of major hurdles, including the accurate prediction of complex motifs like pseudoknots, scaling to kilobase-length transcripts, incorporating the chemical diversity of modified nucleotides, and shifting the prediction target from static structures to the dynamic ensembles that better capture biological function. We also highlight the need for a standardized, prospective benchmarking system to ensure unbiased validation and accelerate progress.

2025-11-04T14:52:11Z Giuseppe Sacco Giovanni Bussi Guido Sanguinetti