https://arxiv.org/api/lN0Cx9fGLown2odsuHPaguBOJts 2026-03-22T11:59:11Z 664216515

http://arxiv.org/abs/2512.08367v1
Integrating Coarse-Grained Simulations and Deep Learning to Unveil Entropy-Driven dsRNA Unwinding by DDX3X
Updated: 2025-12-09T08:46:20Z
DEAD-box RNA helicases (DDXs) are essential RNA metabolism regulators that typically unwind dsRNA in an ATP-dependent manner. However, recent studies show some DDXs can also unwind dsRNA without ATP, a phenomenon that remains poorly understood. Here, we developed HelixTriad, a coarse-grained RNA model that incorporates Watson-Crick base pairing, base stacking, and electrostatics within a three-bead-per-nucleotide scheme to accurately reproduce experimental RNA melting curves. Molecular dynamics simulations showed that weak, specific DDX3X-dsRNA interactions drive stochastic strand separation without ATP. Free energy analysis revealed that successful unwinding proceeds via high-entropy, strand-displacing intermediates. Furthermore, we introduced Entropy-Unet, a deep learning framework for entropy prediction, which corroborated theoretical estimates and uncovered a hierarchical pattern of entropy contributions. Together, our findings suggest that ATP-independent dsRNA unwinding by DDXs is predominantly entropy-driven, offering new mechanistic insights into RNA helicase versatility.
Published: 2025-12-09T08:46:20Z
Comment: 18 pages, 4 figures
Authors: Kang Wang, Chun-Lai Ren, Yu-Qiang Ma

http://arxiv.org/abs/2507.17876v2
Look the Other Way: Designing 'Positive' Molecules with Negative Data via Task Arithmetic
Updated: 2025-12-08T14:26:37Z
The scarcity of molecules with desirable properties (i.e., 'positive' molecules) is an inherent bottleneck for generative molecule design. To sidestep this obstacle, here we propose molecular task arithmetic: training a model on diverse and abundant negative examples to learn 'property directions', without accessing any positively labeled data, and moving models in the opposite property directions to generate positive molecules.
When analyzed on 33 design experiments with distinct molecular entities (small molecules, proteins), model architectures, and scales, molecular task arithmetic generated more diverse and successful designs than models trained on positive molecules in general. Moreover, we employed molecular task arithmetic in dual-objective and few-shot design tasks. We find that molecular task arithmetic can consistently increase the diversity of designs while maintaining desirable complex design properties, such as good docking scores to a protein. With its simplicity, data efficiency, and performance, molecular task arithmetic bears the potential to become the de facto transfer learning strategy for de novo molecule design.
Published: 2025-07-23T19:05:37Z
Authors: Rıza Özçelik, Sarah de Ruiter, Francesca Grisoni

http://arxiv.org/abs/2512.06592v1
On fine-tuning Boltz-2 for protein-protein affinity prediction
Updated: 2025-12-06T23:07:10Z
Accurate prediction of protein-protein binding affinity is vital for understanding molecular interactions and designing therapeutics. We adapt Boltz-2, a state-of-the-art structure-based protein-ligand affinity predictor, for protein-protein affinity regression and evaluate it on two datasets, TCR3d and PPB-affinity. Despite high structural accuracy, Boltz-2-PPI underperforms relative to sequence-based alternatives in both small- and larger-scale data regimes. Combining embeddings from Boltz-2-PPI with sequence-based embeddings yields complementary improvements, particularly for weaker sequence models, suggesting different signals are learned by sequence- and structure-based models.
Our results echo known biases associated with training with structural data and suggest that current structure-based representations are not primed for performant affinity prediction.
Published: 2025-12-06T23:07:10Z
Comment: MLSB 2025
Authors: James King, Lewis Cornwall, Andrei Cristian Nica, James Day, Aaron Sim, Neil Dalchau, Lilly Wollman, Joshua Meyers

http://arxiv.org/abs/2512.06496v1
PRIMRose: Insights into the Per-Residue Energy Metrics of Proteins with Double InDel Mutations using Deep Learning
Updated: 2025-12-06T16:57:56Z
Understanding how protein mutations affect protein structure is essential for advancements in computational biology and bioinformatics. We introduce PRIMRose, a novel approach that predicts energy values for each residue given a mutated protein sequence. Unlike previous models that assess global energy shifts, our method analyzes the localized energetic impact of double amino acid insertions or deletions (InDels) at the individual residue level, enabling residue-specific insights into structural and functional disruption. We implement a Convolutional Neural Network architecture to predict the energy changes of each residue in a protein mutation. We train our model on datasets constructed from nine proteins, grouped into three categories: one set with exhaustive double InDel mutations, another with approximately 145k randomly sampled double InDel mutations, and a third with approximately 80k randomly sampled double InDel mutations. Our model achieves high predictive accuracy across a range of energy metrics as calculated by the Rosetta molecular modeling suite and reveals localized patterns that influence model performance, such as solvent accessibility and secondary structure context.
This per-residue analysis offers new insights into the mutational tolerance of specific regions within proteins and provides more interpretable and biologically meaningful predictions of InDels' effects.
Published: 2025-12-06T16:57:56Z
Comment: Presented at Computational Structural Bioinformatics Workshop 2025
Journal reference: BCB Companion 2025: Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Authors: Stella Brown, Nicolas Preisig, Autumn Davis, Brian Hutchinson, Filip Jagodzinski
DOI: 10.1145/3768322.3769032

http://arxiv.org/abs/2506.19532v4
Toward the Explainability of Protein Language Models
Updated: 2025-12-05T15:53:27Z
Protein language models (pLMs) excel in a variety of tasks that range from structure prediction to the design of functional enzymes. However, these models operate as black boxes, and their underlying working principles remain unclear. Here, we survey emerging applications of explainable artificial intelligence (XAI) to pLMs and describe the potential of XAI in protein research. We divide the workflow of protein AI modeling into four information contexts: (i) training sequences, (ii) input prompt, (iii) model architecture, and (iv) input-output pairs. For each, we describe existing methods and applications of XAI. Additionally, from published studies we distil five (potential) roles that XAI can play in protein research: Evaluator, Multitasker, Engineer, Coach, and Teacher, with the Evaluator role being the only one widely adopted so far. These roles aim to help both protein scientists and model developers understand the possibilities and limitations of implementing XAI for predictive and generative tasks. While our analysis focuses on pLMs, both this categorization and these roles are broadly applicable to other model architectures.
We conclude by highlighting critical areas of application for the future, including risks related to security, trustworthiness, and bias, and we call for community benchmarks, open-source tooling, domain-specific visualizations, and wet-lab characterization to advance the interpretability of protein AI.
Published: 2025-06-24T11:36:24Z
Comment: 15 pages, 6 figures; version 4: Additional revision of the manuscript
Authors: Andrea Hunklinger, Noelia Ferruz

http://arxiv.org/abs/2512.05245v1
STAR-GO: Improving Protein Function Prediction by Learning to Hierarchically Integrate Ontology-Informed Semantic Embeddings
Updated: 2025-12-04T20:48:08Z
Accurate prediction of protein function is essential for elucidating molecular mechanisms and advancing biological and therapeutic discovery. Yet experimental annotation lags far behind the rapid growth of protein sequence data. Computational approaches address this gap by associating proteins with Gene Ontology (GO) terms, which encode functional knowledge through hierarchical relations and textual definitions. However, existing models often emphasize one modality over the other, which limits their ability to generalize, particularly to unseen or newly introduced GO terms that frequently arise as the ontology evolves, and leaves previously trained models outdated. We present STAR-GO, a Transformer-based framework that jointly models the semantic and structural characteristics of GO terms to enhance zero-shot protein function prediction. STAR-GO integrates textual definitions with ontology graph structure to learn unified GO representations, which are processed in hierarchical order to propagate information from general to specific terms. These representations are then aligned with protein sequence embeddings to capture sequence-function relationships. STAR-GO achieves state-of-the-art performance and superior zero-shot generalization, demonstrating the utility of integrating semantics and structure for robust and adaptable protein function prediction.
Code is available at https://github.com/boun-tabi-lifelu/stargo.
Published: 2025-12-04T20:48:08Z
Comment: 14 pages, 2 figures, 6 tables
Authors: Mehmet Efe Akça, Gökçe Uludoğan, Arzucan Özgür, İnci M. Baytaş

http://arxiv.org/abs/2512.04340v1
Collective adsorption of pheromones at the water-air interface
Updated: 2025-12-04T00:03:58Z
Understanding the phase behaviour of pheromones and other messaging molecules remains a significant and largely unexplored challenge, even though it plays a central role in chemical communication. Here, we present all-atom molecular dynamics simulations to investigate the behavior of bombykol, a model insect pheromone, adsorbed at the water-air interface. This system serves as a proxy for studying the amphiphilic nature of pheromones and their interactions with aerosol particles in the atmosphere. Our simulations reveal the molecular organization of the bombykol monolayer and its adsorption isotherm. A soft-sticky particle equation of state accurately describes the monolayer's behavior. The analysis uncovers a two-dimensional liquid-gas phase transition within the monolayer. Collective adsorption stabilises the molecules at the interface and the calculated free energy gain is approximately $2\:k_\mathrm{B}T$. This value increases under lower estimates of the condensing surface concentration, thereby enhancing pheromone adsorption onto aerosols.
Overall, our findings hold broad relevance for molecular interface science, atmospheric chemistry, and organismal chemical communication, particularly in highlighting the critical role of phase transition phenomena.
Published: 2025-12-04T00:03:58Z
Comment: 15 pages, 5 figures
Authors: Ludovic Jami (Aix Marseille Univ CNRS Centrale Med IRPHE Marseille France); Bertrand Siboulet (ICSM CEA CNRS ENSCM Univ Montpellier Marcoule France); Thomas Zemb (ICSM CEA CNRS ENSCM Univ Montpellier Marcoule France); Jérôme Casas (Institut de Recherche sur la Biologie de l'Insecte CNRS-Université de Tours Tours France); Jean-François Dufrêche (ICSM CEA CNRS ENSCM Univ Montpellier Marcoule France)

http://arxiv.org/abs/2403.02706v2
DeepBioisostere: Discovering Bioisosteres with Deep Learning for a Fine Control of Multiple Molecular Properties
Updated: 2025-12-03T15:04:59Z
Optimizing molecular properties while preserving biological activity is a central challenge in drug design. Bioisosteric replacement, which substitutes a molecular fragment with a chemically or biologically analogous moiety, offers a powerful strategy for fine-tuning properties without disrupting target binding. However, existing in silico approaches often rely on expert-defined modification sites or struggle to modulate multiple molecular properties simultaneously. Here, we present DeepBioisostere, a deep generative model that performs end-to-end bioisosteric replacement by autonomously selecting and substituting molecular fragments to satisfy multiple target properties. The model captures complex relationships across the molecular graph, enabling the optimization of sophisticated properties such as drug-likeness and synthetic accessibility. By learning from experimental bio-assay data, DeepBioisostere proposes replacements that maintain biological activities, even generating potential bioisosteres beyond the training data.
We demonstrate the effectiveness of the model in computational hit-to-lead optimization scenarios, highlighting its potential to accelerate rational molecular design without relying on expert heuristics or pre-established substitution rules.
Published: 2024-03-05T06:55:43Z
Comment: 31 pages, 7 figures, and 3 tables for main text
Authors: Hyeongwoo Kim, Seokhyun Moon, Wonho Zhung, Shinwoo Kim, Jaechang Lim, Woo Youn Kim

http://arxiv.org/abs/2505.17478v2
ConfRover: Simultaneous Modeling of Protein Conformation and Dynamics via Autoregression
Updated: 2025-12-02T22:16:19Z
Understanding protein dynamics is critical for elucidating their biological functions. The increasing availability of molecular dynamics (MD) data enables the training of deep generative models to efficiently explore the conformational space of proteins. However, existing approaches either fail to explicitly capture the temporal dependencies between conformations or do not support direct generation of time-independent samples. To address these limitations, we introduce ConfRover, an autoregressive model that simultaneously learns protein conformation and dynamics from MD trajectories, supporting both time-dependent and time-independent sampling. At the core of our model is a modular architecture comprising: (i) an encoding layer, adapted from protein folding models, that embeds protein-specific information and conformation at each time frame into a latent space; (ii) a temporal module, a sequence model that captures conformational dynamics across frames; and (iii) an SE(3) diffusion model as the structure decoder, generating conformations in continuous space. Experiments on ATLAS, a large-scale protein MD dataset of diverse structures, demonstrate the effectiveness of our model in learning conformational dynamics and supporting a wide range of downstream tasks. ConfRover is the first model to sample both protein conformations and trajectories within a single framework, offering a novel and flexible approach for learning from protein MD data.
Project website: https://bytedance-seed.github.io/ConfRover.
Published: 2025-05-23T05:00:15Z
Comment: 35 pages, 17 figures; Camera ready for NeurIPS 2025; Website: https://bytedance-seed.github.io/ConfRover
Authors: Yuning Shen, Lihao Wang, Huizhuo Yuan, Yan Wang, Bangji Yang, Quanquan Gu

http://arxiv.org/abs/2512.02864v1
Modulation of DNA rheology by a transcription factor that forms aging microgels
Updated: 2025-12-02T15:25:55Z
Proteins and nucleic acids form non-Newtonian liquids with complex rheological properties that contribute to their function in vivo. Here we investigate the rheology of the transcription factor NANOG, a key protein in sustaining embryonic stem cell self-renewal. We discover that at high concentrations NANOG forms macroscopic aging gels through its intrinsically disordered tryptophan-rich domain. By combining molecular dynamics simulations, mass photometry and Cryo-EM, we also discover that NANOG forms self-limiting micelle-like clusters which expose their DNA-binding domains. In dense solutions of DNA, NANOG micelle-like structures stabilize intermolecular entanglements and crosslinks, forming microgel-like structures. Our findings suggest that NANOG may contribute to regulating gene expression in an unconventional way: by restricting and stabilizing genome dynamics at key transcriptional sites through the formation of an aging microgel-like structure, potentially enabling mechanical memory in the gene network.
Published: 2025-12-02T15:25:55Z
Authors: Amandine Hong-Minh, Yair Augusto Gutiérrez Fosado, Abbie Guild, Nicholas Mullin, Laura Spagnolo, Ian Chambers, Davide Michieletto

http://arxiv.org/abs/2512.02315v1
Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training
Updated: 2025-12-02T01:20:40Z
Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering.
We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins and assays without large task-specific datasets. By encoding sequence information, auxiliary zero-shot predictions, and sparse experimental labels from many assays as a unified token set in a pre-training masked-language modeling paradigm, PRIMO learns to prioritize promising variants through a preference-based loss function. Across diverse protein families and properties (including both substitution and indel mutations), PRIMO outperforms zero-shot and fully supervised baselines. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.
Published: 2025-12-02T01:20:40Z
Comment: AI for Science Workshop (NeurIPS 2025)
Authors: Felix Teufel, Aaron W. Kollasch, Yining Huang, Ole Winther, Kevin K. Yang, Pascal Notin, Debora S. Marks

http://arxiv.org/abs/2512.02303v1
Training Dynamics of Learning 3D-Rotational Equivariance
Updated: 2025-12-02T00:48:09Z
While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation on 3D-rotation equivariance in high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models reduce equivariance error quickly to $\leq 2\%$ of held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task.
For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant "efficiency gap" is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.
Published: 2025-12-02T00:48:09Z
Comment: Accepted to Transactions on Machine Learning Research (TMLR)
Authors: Max W. Shen, Ewa Nowara, Michael Maser, Kyunghyun Cho

http://arxiv.org/abs/2512.02204v1
MoRSAIK: Sequence Motif Reactor Simulation, Analysis and Inference Kit in Python
Updated: 2025-12-01T20:52:23Z
Origins of life research investigates how life could emerge from prebiotic chemistry alone. One possible explanation is provided by the RNA world hypothesis. It states that life could emerge from RNA strands only, storing and transferring biological information, as well as catalyzing reactions as ribozymes. Before this state could have emerged, however, the prebiotic world was probably a purely chemical pool of short RNA strands with random sequences and without biological function, undergoing hybridization and dehybridization, as well as ligation and cleavage. In this context, the relevant questions are: what conditions allow longer RNA strands to be built, and how can information-carrying RNA sequences emerge?
To investigate such RNA reactors, efficient simulations are needed because the space of possible RNA sequences increases exponentially with the length of the strands, as does the number of reactions between two strands. In addition, simulations have to be compared to experimental data for validation and parameter calibration. Here, we present the MoRSAIK Python package for sequence motif (or k-mer) reactor simulation, analysis and inference. It enables users to simulate RNA sequence motif dynamics in the mean field approximation, to infer the reaction parameters from data with Bayesian methods, and to analyze results by computing observables and plotting. MoRSAIK simulates an RNA reactor by following the reactions and the concentrations of all strands inside up to a certain length (four nucleotides by default). Longer strands are followed indirectly, by tracking the concentrations of the sequence motifs of that maximum length that they contain.
Published: 2025-12-01T20:52:23Z
Comment: 5 pages, 1 figure
Authors: Johannes Harth-Kitzerow, Ulrich Gerland, Torsten A. Enßlin

http://arxiv.org/abs/2512.01903v1
Realistic Transition Paths for Large Biomolecular Systems: A Langevin Bridge Approach
Updated: 2025-12-01T17:24:14Z
We introduce a computational framework for generating realistic transition paths between distinct conformations of large bio-molecular systems. The method is built on a stochastic integro-differential formulation derived from the Langevin bridge formalism, which constrains molecular trajectories to reach a prescribed final state within a finite time and yields an efficient low-temperature approximation of the exact bridge equation. To obtain physically meaningful protein transitions, we couple this formulation to a new coarse-grained potential combining a Go-like term that preserves native backbone geometry with a Rouse-type elastic energy term from polymer physics; we refer to the resulting approach as SIDE.
We evaluate SIDE on several proteins undergoing large-scale conformational changes and compare its performance with established methods such as MinActionPath and EBDIMS. SIDE generates smooth, low-energy trajectories that maintain molecular geometry and frequently recover experimentally supported intermediate states. Although challenges remain for highly complex motions (largely due to the simplified coarse-grained potential), our results demonstrate that SIDE offers a powerful and computationally efficient strategy for modeling bio-molecular conformational transitions.
Published: 2025-12-01T17:24:14Z
Comment: 20 pages, 11 figures
Authors: Patrice Koehl, Marc Delarue, Henri Orland

http://arxiv.org/abs/2512.01320v1
A Review of Wearable Sweat Monitoring Platforms: From Biomarker Detection to Signal Processing Systems
Updated: 2025-12-01T06:24:32Z
Wearable electronics hold great potential in defining new paradigms of modern healthcare, including personalized health management, precision medicine, and athletic performance optimization. This stems from their ability to enable continuous, real-time health monitoring. To enable molecular-level analysis, biofluids rich in molecular analytes have become one of the most important target samples for wearable sensors. Among them, sweat stands out as an ideal candidate for next-generation wearable health monitoring platforms due to its completely noninvasive nature and ease of acquisition. In recent years, several studies have demonstrated feasible prototype designs for sweat-based wearable sensors. However, one of the major gaps toward large-scale commercialization is the development of clinically validated standards for sweat analysis. One key requirement is to establish the relationship between sweat analytes and those of blood, the latter serving as the gold standard in modern diagnostics.
This review provides an overview of sweat biomarkers, with a particular focus on their partitioning mechanisms, which reveal the underlying connections between sweat analytes and their counterparts within systemic metabolic pathways. In addition, this review offers a mechanistic-level examination of biosensors employed in sweat sensing, addressing a gap that has not been adequately covered in prior reviews. Given the critical role of electronic systems in constructing highly integrated wearable sweat-monitoring platforms, this review also analyzes the electronic architectures used for sensor signal processing from an interdisciplinary perspective, with particular emphasis on the analog circuitry that interfaces with electrochemical sensors.
Published: 2025-12-01T06:24:32Z
Comment: 30 pages, 8 figures. Review article. Intended for submission to Biosensors and Bioelectronics
Authors: Yuhan Zheng