http://arxiv.org/abs/2506.03237v3
UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
2025-11-11T18:16:03Z
The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection.
The dataset and code will be made publicly available at https://github.com/quanlin-wu/unisite.
2025-06-03T17:49:41Z
Accepted by NeurIPS 2025 as a Spotlight paper
NeurIPS 2025 (Spotlight)
Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang

http://arxiv.org/abs/2511.08648v1
Compact Artificial Neural Network Models for Predicting Protein Residue -- RNA Base Binding
2025-11-11T01:38:52Z
Large Artificial Neural Network (ANN) models have demonstrated success in various domains, including general text and image generation, drug discovery, and protein-RNA (ribonucleic acid) binding tasks. However, these models typically demand substantial computational resources, time, and data for effective training. Given that such extensive resources are often inaccessible to many researchers and that life sciences data sets are frequently limited, we investigated whether small ANN models could achieve acceptable accuracy in protein-RNA prediction. We experimented with shallow feed-forward ANNs comprising two hidden layers and various non-linearities. These models did not utilize explicit structural information; instead, a sliding window approach was employed to implicitly consider the context of neighboring residues and bases. We explored different training techniques to address the issue of highly unbalanced data. Among the seven most popular non-linearities for feed-forward ANNs, only three, Rectified Linear Unit (ReLU), Gated Linear Unit (GLU), and Hyperbolic Tangent (Tanh), yielded converging models. Common re-balancing techniques, such as under- and over-sampling of training sets, proved ineffective, whereas increasing the volume of training data and using model ensembles significantly improved performance. The optimal context window size, balancing both false negative and false positive errors, was found to be approximately 30 residues and bases.
Our findings indicate that high-accuracy protein-RNA binding prediction is achievable using computing hardware accessible to most educational and research institutions.
2025-11-11T01:38:52Z
Stanislav Selitskiy
DOI: 10.1007/978-3-031-82484-5_11

http://arxiv.org/abs/2511.07406v1
Entangled Schrödinger Bridge Matching
2025-11-10T18:55:35Z
Simulating trajectories of multi-particle systems on complex energy landscapes is a central task in molecular dynamics (MD) and drug discovery, but remains challenging at scale due to computationally expensive and long simulations. Previous approaches leverage techniques such as flow or Schrödinger bridge matching to implicitly learn joint trajectories through data snapshots. However, many systems, including biomolecular systems and heterogeneous cell populations, undergo dynamic interactions that evolve over their trajectory and cannot be captured through static snapshots. To close this gap, we introduce Entangled Schrödinger Bridge Matching (EntangledSBM), a framework that learns the first- and second-order stochastic dynamics of interacting, multi-particle systems where the direction and magnitude of each particle's path depend dynamically on the paths of the other particles. We define the Entangled Schrödinger Bridge (EntangledSB) problem as solving a coupled system of bias forces that entangle particle velocities.
We show that our framework accurately simulates heterogeneous cell populations under perturbations and rare transitions in high-dimensional biomolecular systems.
2025-11-10T18:55:35Z
Sophia Tang, Yinuo Zhang, Pranam Chatterjee

http://arxiv.org/abs/2511.07264v1
Edible polysaccharides as stabilizers and carriers for the delivery of phenolic compounds and pigments in food formulations
2025-11-10T16:08:58Z
Food polysaccharides have emerged as suitable carriers of active substances and as additives to food and nutraceutical formulations, showing potential to stabilize bioactive compounds during the storage of microencapsulated preparations, and even in the gastrointestinal tract following the intake of bioactive compounds, thereby improving their bioaccessibility and bioavailability. This review provides a comprehensive overview of the main polysaccharides employed as wall materials, including starch, maltodextrin, alginate, pectin, inulin, chitosan, and gum arabic, and discusses how structural interactions and physicochemical properties can benefit the microencapsulation of polyphenols and pigments. The main findings and principles of the major encapsulation techniques related to the production of microparticles, including spray drying, freeze drying, extrusion, emulsification, and coacervation, are briefly described. Polysaccharides can entrap hydrophilic and hydrophobic compounds by physical interactions, forming a barrier around the nucleus or binding to the bioactive compound. Intermolecular binding between polysaccharides in the wall matrix and polyphenols and pigments in the nucleus can confer up to 90% encapsulation efficiency, governed mainly by hydrogen bonds and electrostatic interactions. The mixture of wall polysaccharides in microparticle synthesis favors the encapsulation solubility, storage stability, bioaccessibility, and bioactivity of the microencapsulated compounds.
Clinical trials on the bioefficacy of polyphenols and pigments loaded in polysaccharide microparticles are scarce and require further evidence to reinforce the use of this technology.
2025-11-10T16:08:58Z
Liliane Siqueira de Oliveira, Davi Vieira Teixeira da Silva, Lucileno Rodrigues da Trindade, Diego dos Santos Baião, Cristine Couto de Almeida, Vitor Francisco Ferreira, Vania Margaret Flosi Paschoalin

http://arxiv.org/abs/2511.06930v1
De Novo Design of SIK3 Inhibitors via Feedback-Driven Fine-Tuning of Seq2Seq-VAE
2025-11-10T10:22:07Z
Alzheimer's disease (AD), a progressive neurodegenerative disorder, currently lacks effective therapeutic strategies that can modify disease progression. Recent studies have highlighted the critical role of circadian rhythm in AD pathophysiology, implicating circadian clock kinases, such as the Salt-Inducible Kinase 3 (SIK3), as promising therapeutic targets. Generative AI models have surpassed traditional methods of drug discovery, tapping into the vast unexplored chemical space of drug-like molecules. We present a sequence-to-sequence Variational Autoencoder (Seq2Seq-VAE) model guided by an Active Learning (AL) approach to optimize molecular generation. Our pipeline iteratively guided a pre-trained Seq2Seq-VAE model towards the pharmacological landscape relevant to SIK3 using a two-step framework: an inner loop that iteratively improves the physicochemical property profile, drug-likeness, and synthesizability, followed by an outer loop that steers the latent space towards high-affinity ligands for SIK3. Our approach introduces feedback-driven optimization without requiring large labeled datasets, making it particularly suited for early-stage drug discovery in under-explored therapeutic targets. Our results demonstrate the model's convergence toward SIK3-specific small molecules with desired properties and high binding affinity.
This work highlights the use of generative AI combined with AL for rational drug discovery that can be extended to other protein targets with minimal modifications, offering a scalable solution to the molecular design bottleneck in drug design.
2025-11-10T10:22:07Z
ShahZeb Khan, Chiara Pallara, Barbara Monti, Alexis Molina

http://arxiv.org/abs/2511.06585v1
Learning Biomolecular Motion: The Physics-Informed Machine Learning Paradigm
2025-11-10T00:24:06Z
The convergence of statistical learning and molecular physics is transforming our approach to modeling biomolecular systems. Physics-informed machine learning (PIML) offers a systematic framework that integrates data-driven inference with physical constraints, resulting in models that are accurate, mechanistic, generalizable, and able to extrapolate beyond observed domains. This review surveys recent advances in physics-informed neural networks and operator learning, differentiable molecular simulation, and hybrid physics-ML potentials, with emphasis on long-timescale kinetics, rare events, and free-energy estimation. We frame these approaches as solutions to the "biomolecular closure problem", recovering unresolved interactions beyond classical force fields while preserving thermodynamic consistency and mechanistic interpretability. We examine theoretical foundations, tools and frameworks, computational trade-offs, and unresolved issues, including model expressiveness and stability. We outline prospective research avenues at the intersection of machine learning, statistical physics, and computational chemistry, contending that future advancements will depend on mechanistic inductive biases and integrated differentiable physical learning frameworks for biomolecular simulation and discovery.
2025-11-10T00:24:06Z
31 pages, 4 figures, 3 tables.
Review article
Aaryesh Deshpande

http://arxiv.org/abs/2511.04892v1
LG-NuSegHop: A Local-to-Global Self-Supervised Pipeline For Nuclei Instance Segmentation
2025-11-07T00:34:10Z
Nuclei segmentation is the cornerstone task in histology image reading, shedding light on the underlying molecular patterns and leading to disease or cancer diagnosis. Yet, it is a laborious task that requires expertise from trained physicians. The large nuclei variability across different organ tissues and acquisition processes challenges the automation of this task. On the other hand, data annotations are expensive to obtain, and thus, Deep Learning (DL) models are challenged to generalize to unseen organs or different domains. This work proposes Local-to-Global NuSegHop (LG-NuSegHop), a self-supervised pipeline developed on prior knowledge of the problem and molecular biology. There are three distinct modules: (1) a set of local processing operations to generate a pseudolabel, (2) NuSegHop, a novel data-driven feature extraction model, and (3) a set of global operations to post-process the predictions of NuSegHop. Notably, even though the proposed pipeline uses no manually annotated training data or domain adaptation, it maintains a good generalization performance on other datasets. Experiments on three publicly available datasets show that our method outperforms other self-supervised and weakly supervised methods while having a competitive standing among fully supervised methods. Remarkably, every module within LG-NuSegHop is transparent and explainable to physicians.
2025-11-07T00:34:10Z
42 pages, 8 figures, 7 tables
Asia Pacific Signal and Information Processing Association (APSIPA), 2025 http://www.apsipa.org
Vasileios Magoulianitis, Catherine A. Alexander, Jiaxin Yang, C.-C.
Jay Kuo

http://arxiv.org/abs/2511.04814v1
A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification
2025-11-06T21:10:48Z
Antimicrobial peptides have emerged as promising molecules to combat antimicrobial resistance. However, fragmented datasets, inconsistent annotations, and the lack of standardized benchmarks hinder computational approaches and slow down the discovery of new candidates. To address these challenges, we present the Expanded Standardized Collection for Antimicrobial Peptide Evaluation (ESCAPE), an experimental framework integrating over 80,000 peptides from 27 validated repositories. Our dataset separates antimicrobial peptides from negative sequences and incorporates their functional annotations into a biologically coherent multilabel hierarchy, capturing activities across antibacterial, antifungal, antiviral, and antiparasitic classes. Building on ESCAPE, we propose a transformer-based model that leverages sequence and structural information to predict multiple functional activities of peptides. Our method achieves up to a 2.56% relative average improvement in mean Average Precision over the second-best method adapted for this task, establishing a new state of the art in multilabel peptide classification. ESCAPE provides a comprehensive and reproducible evaluation framework to advance AI-driven antimicrobial peptide research.
2025-11-06T21:10:48Z
39th Conference on Neural Information Processing Systems (NeurIPS 2025). Camera-ready version. Code: https://github.com/BCV-Uniandes/ESCAPE.
Dataset DOI: https://doi.org/10.7910/DVN/C69MCD
Sebastian Ojeda, Rafael Velasquez, Nicolás Aparicio, Juanita Puentes, Paula Cárdenas, Nicolás Andrade, Gabriel González, Sergio Rincón, Carolina Muñoz-Camargo, Pablo Arbeláez

http://arxiv.org/abs/2511.14781v1
Quantifying the Role of OpenFold Components in Protein Structure Prediction
2025-11-06T20:41:34Z
Models such as AlphaFold2 and OpenFold have transformed protein structure prediction, yet their inner workings remain poorly understood. We present a methodology to systematically evaluate the contribution of individual OpenFold components to structure prediction accuracy. We identify several components that are critical for most proteins, while others vary in importance across proteins. We further show that the contribution of several components is correlated with protein length. These findings provide insight into how OpenFold achieves accurate predictions and highlight directions for interpreting protein prediction networks more broadly.
2025-11-06T20:41:34Z
Accepted to the NeurIPS 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences
Tyler L. Hayes, Giri P. Krishnan

http://arxiv.org/abs/2511.04174v1
Protein aggregation in Huntington's disease
2025-11-06T08:25:58Z
The presence of an expanded polyglutamine produces a toxic gain of function in huntingtin. Protein aggregation resulting from this gain of function is likely to be the cause of neuronal death. Two main mechanisms of aggregation have been proposed: hydrogen bonding by polar-zipper formation and covalent bonding by transglutaminase-catalyzed cross-linking. In cell culture models of Huntington's disease, aggregates are mostly stabilized by hydrogen bonds, but covalent bonds are also likely to occur. Nothing is known about the nature of the bonds that stabilize the aggregates in the brain of patients with Huntington's disease.
It seems that the nature of the bond stabilizing the aggregates is one of the most important questions, as the answer would condition the therapeutic approach to Huntington's disease.
2025-11-06T08:25:58Z
Biochimie, 2002, 84 (4), pp. 273-278
Guylaine Hoffner (UNICOG-U992, NEUROSPIN), Philippe Djian

http://arxiv.org/abs/2511.04040v1
Enhancing Multimodal Protein Function Prediction Through Dual-Branch Dynamic Selection with Reconstructive Pre-Training
2025-11-06T04:19:42Z
Multimodal protein features play a crucial role in protein function prediction. However, these features encompass a wide range of information, ranging from structural data and sequence features to protein attributes and interaction networks, making it challenging to decipher their complex interconnections. In this work, we propose a multimodal protein function prediction method (DSRPGO) by utilizing dynamic selection and reconstructive pre-training mechanisms. To acquire complex protein information, we introduce reconstructive pre-training to mine more fine-grained information with low semantic levels. Moreover, we put forward the Bidirectional Interaction Module (BInM) to facilitate interactive learning among multimodal features. Additionally, to address the difficulty of hierarchical multi-label classification in this task, a Dynamic Selection Module (DSM) is designed to select the feature representation that is most conducive to current protein function prediction.
Our proposed DSRPGO model improves significantly in BPO, MFO, and CCO on human datasets, thereby outperforming other benchmark models.
2025-11-06T04:19:42Z
Proceedings of the IJCAI-25, 7598-7606 (2025)
Xiaoling Luo, Peng Chen, Chengliang Liu, Xiaopeng Jin, Jie Wen, Yumeng Liu, Junsong Wang

http://arxiv.org/abs/2508.14233v5
Excitonic Coupling and Photon Antibunching in Venus Yellow Fluorescent Protein Dimers: A Lindblad Master Equation Approach
2025-11-05T09:58:35Z
Strong excitonic coupling and photon antibunching (AB) have been observed together in Venus yellow fluorescent protein dimers and currently lack a cohesive theoretical explanation. In 2019, Kim et al. demonstrated Davydov splitting in circular dichroism spectra, revealing strong J-like coupling, while antibunched fluorescence emission was confirmed by combined antibunching--fluorescence correlation spectroscopy (AB/FCS fingerprinting). To investigate the implications of this coexistence, Venus yellow fluorescent protein (YFP) dimer population dynamics are modeled within a Lindblad master equation framework, testing its ability to cope with typical, data-informed, Venus YFP dimer time and energy values. Simulations predict multiple-femtosecond (fs) decoherence, yielding bright/dark state mixtures consistent with antibunched fluorescence emission at room temperature. Thus, excitonic coupling and photon AB in Venus YFP dimers are reconciled without invoking long-lived quantum coherence. However, clear violations of several Lindblad approximation validity conditions appear imminent, calling for careful modifications to choices of standard system and bath definitions and parameter values.
2025-08-19T19:44:59Z
25 pages, 4 figures, 7 appendices. Minor technical corrections and consistency updates from v4.
Discusses fluorescent proteins, excitonic coupling, photon antibunching, open quantum systems modeling, Lindblad formalism, thermodynamics, information theory, evolutionary biology, photosynthetic energy transfer, quantum biophotonics, and quantum technology
Ian T. Abrahams

http://arxiv.org/abs/2502.09860v2
Gradient GA: Gradient Genetic Algorithm for Drug Molecular Design
2025-11-04T18:02:45Z
Molecular discovery has brought great benefits to the chemical industry. Various molecule design techniques have been developed to identify molecules with desirable properties. Traditional optimization methods, such as genetic algorithms, continue to achieve state-of-the-art results across multiple molecular design benchmarks. However, these techniques rely solely on random walk exploration, which hinders both the quality of the final solution and the convergence speed. To address this limitation, we propose a novel approach called Gradient Genetic Algorithm (Gradient GA), which incorporates gradient information from the objective function into genetic algorithms. Instead of random exploration, each proposed sample iteratively progresses toward an optimal solution by following the gradient direction. We achieve this by designing a differentiable objective function parameterized by a neural network and utilizing the Discrete Langevin Proposal to enable gradient guidance in discrete molecular spaces. Experimental results demonstrate that our method significantly improves both convergence speed and solution quality, outperforming cutting-edge techniques. For example, it achieves up to a 25% improvement in the top-10 score over the vanilla genetic algorithm.
The code is publicly available at https://github.com/debadyuti23/GradientGA.
2025-02-14T02:03:39Z
Chris Zhuang, Debadyuti Mukherjee, Yingzhou Lu, Tianfan Fu, Ruqi Zhang

http://arxiv.org/abs/2511.02769v1
STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation
2025-11-04T17:56:00Z
The chemical space of drug-like molecules is vast, motivating the development of generative models that must learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and provide fast molecular generation. Meeting these objectives depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address these challenges, we present STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation enables conditional generation: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient finetuning with low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation.
On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with principled conditioning and parameter-efficient finetuning.
2025-11-04T17:56:00Z
16 pages, 3 figures, 2 tables
Bum Chul Kwon, Ben Shapira, Moshiko Raboh, Shreyans Sethi, Shruti Murarka, Joseph A Morrone, Jianying Hu, Parthasarathy Suryanarayanan

http://arxiv.org/abs/2511.02622v1
Machine Learning for RNA Secondary Structure Prediction: a review of current methods and challenges
2025-11-04T14:52:11Z
Predicting the secondary structure of RNA is a core challenge in computational biology, essential for understanding molecular function and designing novel therapeutics. The field has evolved from foundational but accuracy-limited thermodynamic approaches to a new data-driven paradigm dominated by machine learning and deep learning. These models learn folding patterns directly from data, leading to significant performance gains. This review surveys the modern landscape of these methods, covering single-sequence, evolutionary-based, and hybrid models that blend machine learning with biophysics. A central theme is the field's "generalization crisis," where powerful models were found to fail on new RNA families, prompting a community-wide shift to stricter, homology-aware benchmarking. In response to the underlying challenge of data scarcity, RNA foundation models have emerged, learning from massive, unlabeled sequence corpora to improve generalization. Finally, we look ahead to the next set of major hurdles, including the accurate prediction of complex motifs like pseudoknots, scaling to kilobase-length transcripts, incorporating the chemical diversity of modified nucleotides, and shifting the prediction target from static structures to the dynamic ensembles that better capture biological function.
We also highlight the need for a standardized, prospective benchmarking system to ensure unbiased validation and accelerate progress.
2025-11-04T14:52:11Z
Giuseppe Sacco, Giovanni Bussi, Guido Sanguinetti