http://arxiv.org/abs/2510.24736v2 RNAGenScape: Property-Guided, Optimized Generation of mRNA Sequences with Manifold Langevin Dynamics 2026-01-29T16:34:46Z

Generating property-optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence-function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property-guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property-organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property-guided Langevin dynamics procedure that performs optimization along the manifold. Across three real-world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.
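The update loop described in component (3), with the projection of component (2), can be illustrated on a toy problem. This is a minimal sketch under stated assumptions, not the authors' implementation: the "manifold" is the unit circle in a 2-D latent space, `project_to_manifold` stands in for the denoising autoencoder, and `property_grad` is the gradient of a hypothetical property predictor f(z) = z[0]. All names, step sizes, and the temperature are illustrative.

```python
import numpy as np

def property_grad(z):
    # Hypothetical property predictor f(z) = z[0]; its gradient is constant.
    return np.array([1.0, 0.0])

def project_to_manifold(z):
    # Stand-in for the denoising autoencoder: snap the point back to the unit circle.
    return z / np.linalg.norm(z)

def manifold_langevin(z0, steps=200, step_size=0.05, temperature=0.01, seed=0):
    """Property-guided Langevin steps, each followed by a projection onto the toy manifold."""
    rng = np.random.default_rng(seed)
    z = project_to_manifold(np.asarray(z0, dtype=float))
    for _ in range(steps):
        noise = rng.standard_normal(z.shape)
        # Gradient ascent on the property plus a Langevin noise term.
        z = z + step_size * property_grad(z) + np.sqrt(2.0 * step_size * temperature) * noise
        # Pull the updated point back onto the manifold before the next step.
        z = project_to_manifold(z)
    return z

z_start = np.array([-0.6, 0.8])
z_final = manifold_langevin(z_start)
```

In this toy setting the final point stays on the circle while its first coordinate (the hypothetical property) increases; a real system would replace both stand-in functions with learned networks.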

2025-10-14T19:55:41Z ICML 2025 Generative AI and Biology (GenBio) Workshop, Oral presentation (top 9.7%) Danqi Liao Chen Liu Xingzhi Sun Dié Tang Haochen Wang Scott Youlten Srikar Krishna Gopinath Haejeong Lee Ethan C. Strayer Antonio J. Giraldez Smita Krishnaswamy http://arxiv.org/abs/2601.21216v1 Multiple binding modes of AKT on PIP$_3$-containing membranes 2026-01-29T03:26:27Z

The PI3K/AKT signaling pathway is triggered by recruitment of AKT to cellular membranes. Although AKT is a multidomain serine/threonine kinase composed of an N-terminal pleckstrin homology (PH) domain and a C-terminal kinase domain, how these domains cooperate to regulate AKT activation on membranes remains unclear at the molecular level. Here, using molecular dynamics simulations of full-length AKT on PIP$_3$-containing lipid bilayers, we identify four distinct membrane-binding modes that differ in the orientations and membrane contacts of the PH and kinase domains. In addition to PIP$_3$ binding to the PH domain, we observe specific PIP$_3$ interactions with basic residues in the kinase domain. In the most stable mode, PIP$_3$ interacts with both the canonical and a secondary binding site in the PH domain, while the kinase domain adopts an orientation in which the activation-loop phosphorylation site is exposed to the solvent. Interestingly, the populations of these binding modes depend on the PIP$_3$ concentration in the membrane, leading to changes in the preferred orientation of AKT. These findings shed light on how lipid recognition by the PH domain and the kinase domain of AKT cooperatively shape its membrane-bound conformations.

2026-01-29T03:26:27Z Yuki Nakagaki Eiji Yamamoto http://arxiv.org/abs/2601.17138v2 AI Developments for T and B Cell Receptor Modeling and Therapeutic Design 2026-01-28T19:03:26Z

Artificial intelligence (AI) is accelerating progress in modeling T and B cell receptors by enabling predictive and generative frameworks grounded in sequence data and immune context. This chapter surveys recent advances in the use of protein language models, machine learning, and multimodal integration for immune receptor modeling. We highlight emerging strategies to leverage single-cell and repertoire-scale datasets, and optimize immune receptor candidates for therapeutic design. These developments point toward a new generation of data-efficient, generalizable, and clinically relevant models that better capture the diversity and complexity of adaptive immunity.

2026-01-23T19:28:08Z Linhui Xie Aurelien Pelissier Yanjun Shao Maria Rodriguez Martinez http://arxiv.org/abs/2601.11013v2 De novo emergence of metabolically active protocells 2026-01-27T16:28:47Z

A continuous route from a disordered soup of simple chemical feedstocks to a functional protocell -- a compartment that metabolizes, grows, and propagates -- remains elusive. Here, we show that a homogeneous aqueous chemical mixture containing phosphorus, iron, molybdenum salts and formaldehyde spontaneously self-organizes into compartments that couple robust non-equilibrium chemical dynamics to their own growth. These structures mature to a sustained, dissipative steady state and support an organic synthetic engine, producing diverse molecular species including many core biomolecular classes. Internal spherules that are themselves growth-competent are produced within the protocells, establishing a rudimentary mode of self-perpetuation. The chemical dynamics we observe in controlled laboratory conditions also occur in reaction mixtures exposed to natural day-night cycles. Strikingly, the morphology and chemical composition of the protocells in our experiments closely resemble molybdenum-rich microspheres recently discovered in current oceanic environments. Our work establishes a robust, testable route to de novo protocell formation. The emergence of life-like spatiotemporal organization and chemical dynamics from minimal initial conditions is more facile than previously thought and could be a recurring natural phenomenon.

2026-01-16T06:08:36Z Nayan Chakraborty Shashi Thutupalli http://arxiv.org/abs/2601.19257v1 PCEvo: Path-Consistent Molecular Representation via Virtual Evolutionary 2026-01-27T06:40:11Z

Molecular representation learning aims to learn vector embeddings that capture molecular structure and geometry, thereby enabling property prediction and downstream scientific applications. In many AI for science tasks, labeled data are expensive to obtain and therefore limited in availability. Under the few-shot setting, models trained with scarce supervision often learn brittle structure-property relationships, resulting in substantially higher prediction errors and reduced generalization to unseen molecules. To address this limitation, we propose PCEvo, a path-consistent representation method that learns from virtual paths through dynamic structural evolution. PCEvo enumerates multiple chemically feasible edit paths between retrieved similar molecular pairs under topological dependency constraints. It transforms the labels of the two molecules into stepwise supervision along each virtual evolutionary path. It introduces a path-consistency objective that enforces prediction invariance across alternative paths connecting the same two molecules. Comprehensive experiments on the QM9 and MoleculeNet datasets demonstrate that PCEvo substantially improves the few-shot generalization performance of baseline methods. The code is available at https://anonymous.4open.science/r/PCEvo-4BF2.
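One way to picture stepwise supervision and a path-consistency penalty is sketched below. The abstract does not specify how labels are distributed along a path or how invariance is measured, so both choices here are assumptions: `stepwise_targets` interpolates linearly between the two endpoint labels, and `path_consistency_loss` penalizes the variance of the total predicted change across alternative paths. Both function names are hypothetical.

```python
import numpy as np

def stepwise_targets(y_src, y_dst, n_steps):
    # Assumption: the label changes linearly along the edit path,
    # giving one target per intermediate structure (endpoints included).
    return np.linspace(y_src, y_dst, n_steps + 1)

def path_consistency_loss(path_totals):
    # Assumption: consistency is measured as the mean squared deviation of
    # each path's total predicted change from the average over all paths.
    totals = np.asarray(path_totals, dtype=float)
    return float(np.mean((totals - totals.mean()) ** 2))

targets = stepwise_targets(1.0, 3.0, 4)
consistent = path_consistency_loss([2.0, 2.0, 2.0])
inconsistent = path_consistency_loss([1.0, 3.0])
```

The loss is zero when every path between the same two molecules predicts the same overall change and grows as the paths disagree.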

2026-01-27T06:40:11Z 10 pages, 4 figures, 5 tables Kun Li Longtao Hu Yida Xiong Jiajun Yu Hongzhi Zhang Jiameng Chen Xiantao Cai Jia Wu Wenbin Hu http://arxiv.org/abs/2601.19205v1 EnzyPGM: Pocket-conditioned Generative Model for Substrate-specific Enzyme Design 2026-01-27T05:07:55Z

Designing enzymes with substrate-binding pockets is a critical challenge in protein engineering, as catalytic activity depends on the precise interaction between pockets and substrates. Currently, generative models dominate functional protein design but cannot model pocket-substrate interactions, which limits the generation of enzymes with precise catalytic environments. To address this issue, we propose EnzyPGM, a unified framework that jointly generates enzymes and substrate-binding pockets conditioned on functional priors and substrates, with a particular focus on learning accurate pocket-substrate interactions. At its core, EnzyPGM includes two main modules: a Residue-atom Bi-scale Attention (RBA) that jointly models intra-residue dependencies and fine-grained interactions between pocket residues and substrate atoms, and a Residue Function Fusion (RFF) that incorporates enzyme function priors into residue representations. Also, we curate EnzyPock, an enzyme-pocket dataset comprising 83,062 enzyme-substrate pairs across 1,036 four-level enzyme families. Extensive experiments demonstrate that EnzyPGM achieves state-of-the-art performance on EnzyPock. Notably, EnzyPGM lowers the average binding energy by 0.47 kcal/mol relative to EnzyGen, showing its superior performance on substrate-specific enzyme design. The code and dataset will be released later.

2026-01-27T05:07:55Z 9 pages, 4 figures, under review Zefeng Lin Zhihang Zhang Weirong Zhu Tongchang Han Xianyong Fang Tianfan Fu Xiaohua Xu http://arxiv.org/abs/2601.18716v1 Conditioned Generative Modeling of Molecular Glues: A Realistic AI Approach for Synthesizable Drug-like Molecules 2026-01-26T17:39:59Z

Alzheimer's disease (AD) is marked by the pathological accumulation of amyloid beta-42 (Abeta-42), contributing to synaptic dysfunction and neurodegeneration. While extracellular amyloid plaques are well-studied, increasing evidence highlights intracellular Abeta-42 as an early and toxic driver of disease progression. In this study, we present a novel, AI-assisted drug design approach to promote targeted degradation of Abeta-42 via the ubiquitin-proteasome system (UPS), using E3 ligase-directed molecular glues. We systematically evaluated the ternary complex formation potential of Abeta-42 with three E3 ligases: CRBN, VHL, and MDM2, through structure-based modeling, ADMET screening, and docking. We then developed a Ligase-Conditioned Junction Tree Variational Autoencoder (LC-JT-VAE) to generate ligase-specific small molecules, incorporating protein sequence embeddings and torsional angle-aware molecular graphs. Our results demonstrate that this generative model can produce chemically valid, novel, and target-specific molecular glues capable of facilitating Abeta-42 degradation. This integrated approach offers a promising framework for designing UPS-targeted therapies for neurodegenerative diseases.

2026-01-26T17:39:59Z 30 pages, 8 figures Biomolecules 2025, 15, 849 Naeyma N. Islam Thomas R. Caulfield 10.3390/biom15060849 http://arxiv.org/abs/2509.14788v2 Structure-Aware Contrastive Learning with Fine-Grained Binding Representations for Drug Discovery 2026-01-26T08:19:42Z

Accurate identification of drug-target interactions (DTI) remains a central challenge in computational pharmacology, where sequence-based methods offer scalability. This work introduces a sequence-based drug-target interaction framework that integrates structural priors into protein representations while maintaining high-throughput screening capability. Evaluated across multiple benchmarks, the model achieves state-of-the-art performance on Human and BioSNAP datasets and remains competitive on BindingDB. In virtual screening tasks, it surpasses prior methods on LIT-PCBA, yielding substantial gains in AUROC and BEDROC. Ablation studies confirm the critical role of learned aggregation, bilinear attention, and contrastive alignment in enhancing predictive robustness. Embedding visualizations reveal improved spatial correspondence with known binding pockets and highlight interpretable attention patterns over ligand-residue contacts. These results validate the framework's utility for scalable and structure-aware DTI prediction.

2025-09-18T09:38:46Z Accepted by 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026) Jing Lan Hexiao Ding Hongzhao Chen Yufeng Jiang Nga-Chun Ng Gwing Kei Yip Gerald W. Y. Cheng Yunlin Mao Jing Cai Liang-ting Lin Jung Sun Yoo http://arxiv.org/abs/2409.02588v2 Multiview Random Vector Functional Link Network for Predicting DNA-Binding Proteins 2026-01-26T06:22:52Z

The identification of DNA-binding proteins (DBPs) is essential due to their significant impact on various biological activities. Understanding the mechanisms underlying protein-DNA interactions is essential for elucidating various life activities. In recent years, machine learning-based models have been prominently utilized for DBP prediction. In this paper, to predict DBPs, we propose a novel framework termed a multiview random vector functional link (MvRVFL) network, which fuses neural network architecture with multiview learning. The MvRVFL model integrates both late and early fusion advantages, enabling separate regularization parameters for each view, while utilizing a closed-form solution for efficiently determining unknown parameters. The primal objective function incorporates a coupling term aimed at minimizing a composite of errors stemming from all views. From each of the three protein views of the DBP datasets, we extract five features. These features are then fused together by incorporating a hidden feature during the model training process. The performance of the proposed MvRVFL model on the DBP dataset surpasses that of baseline models, demonstrating its superior effectiveness. We further validate the practicality of the proposed model across diverse benchmark datasets, and both theoretical analysis and empirical results consistently demonstrate its superior generalization performance over baseline models.
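The closed-form solution mentioned above is a general property of random vector functional link networks: the hidden weights are random and fixed, so only the output weights are fitted, by ridge regression. The sketch below shows this for a single view on synthetic data. It is not the MvRVFL model (the multiview coupling term and per-view regularization are not reproduced), and the function names, hidden size, and regularization strength are illustrative assumptions.

```python
import numpy as np

def fit_rvfl(X, y, n_hidden=32, reg=1e-2, seed=0):
    """Single-view RVFL: random fixed hidden layer, closed-form ridge output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random, never trained
    b = rng.standard_normal(n_hidden)
    # Direct links: concatenate the raw inputs with the hidden features.
    D = np.hstack([X, np.tanh(X @ W + b)])
    # beta = (D^T D + reg * I)^{-1} D^T y
    beta = np.linalg.solve(D.T @ D + reg * np.eye(D.shape[1]), D.T @ y)
    return W, b, beta

def predict_rvfl(X, W, b, beta):
    D = np.hstack([X, np.tanh(X @ W + b)])
    return D @ beta

# Synthetic two-class data with labels +1 / -1 (not a DBP dataset).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

W, b, beta = fit_rvfl(X, y)
train_accuracy = float(np.mean(np.sign(predict_rvfl(X, W, b, beta)) == y))
```

Because no iterative optimization is involved, fitting costs a single linear solve in the number of input plus hidden features.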

2024-09-04T10:14:17Z A. Quadir M. Sajid M. Tanveer http://arxiv.org/abs/2601.15771v1 Rethinking Drug-Drug Interaction Modeling as Generalizable Relation Learning 2026-01-22T09:00:30Z

Drug-drug interaction (DDI) prediction is central to drug discovery and clinical development, particularly in the context of increasingly prevalent polypharmacy. Although existing computational methods achieve strong performance on standard benchmarks, they often fail to generalize to realistic deployment scenarios, where most candidate drug pairs involve previously unseen drugs and validated interactions are scarce. We demonstrate that proximity in the embedding spaces of prevailing molecule-centric DDI models does not reliably correspond to interaction labels, and that simply scaling up model capacity therefore fails to improve generalization. To address these limitations, we propose GenRel-DDI, a generalizable relation learning framework that reformulates DDI prediction as a relation-centric learning problem, in which interaction representations are learned independently of drug identities. This relation-level abstraction enables the capture of transferable interaction patterns that generalize to unseen drugs and novel drug pairs. Extensive experiments across multiple benchmarks demonstrate that GenRel-DDI consistently and significantly outperforms state-of-the-art methods, with particularly large gains on strict entity-disjoint evaluations, highlighting the effectiveness and practical utility of relation learning for robust DDI prediction. The code is available at https://github.com/SZU-ADDG/GenRel-DDI.
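An entity-disjoint evaluation can be set up by holding out drugs rather than pairs. The sketch below is one plausible reading, assumed here rather than taken from the paper: test pairs contain only held-out drugs, training pairs contain none of them, and mixed pairs are discarded. The function name, the drug identifiers, and the split fraction are hypothetical.

```python
import random
from itertools import combinations

def entity_disjoint_split(pairs, test_fraction=0.3, seed=0):
    """Split drug pairs so that no drug appears in both the training and test sets."""
    drugs = sorted({d for pair in pairs for d in pair[:2]})
    rng = random.Random(seed)
    rng.shuffle(drugs)
    n_test = max(1, int(len(drugs) * test_fraction))
    test_drugs = set(drugs[:n_test])
    # Training pairs: both drugs seen. Test pairs: both drugs unseen.
    train = [p for p in pairs if p[0] not in test_drugs and p[1] not in test_drugs]
    test = [p for p in pairs if p[0] in test_drugs and p[1] in test_drugs]
    # Pairs mixing a seen and an unseen drug are dropped in this strict variant.
    return train, test, test_drugs

# Toy example: every unordered pair among ten placeholder drug IDs.
all_pairs = list(combinations([f"drug_{i}" for i in range(10)], 2))
train_pairs, test_pairs, held_out = entity_disjoint_split(all_pairs)
train_drugs = {d for p in train_pairs for d in p}
```

A random split over pairs would instead let most test drugs appear in training, which is the setting the abstract argues overstates generalization.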

2026-01-22T09:00:30Z 9 pages, 5 figures Dong Xu Jiantao Wu Qihua Pan Sisi Yuan Zexuan Zhu Junkai Ji http://arxiv.org/abs/2601.14574v1 De novo design of protein binders targeting the human sweet taste receptor as potential sweet proteins 2026-01-21T01:18:55Z

Excessive consumption of dietary sugars is a major contributor to metabolic disorders, driving global interest in finding alternative sweeteners with reduced caloric impact. Natural sweet proteins, such as brazzein, offer exceptional sweetness intensity with little caloric contribution. However, their widespread use is limited by restricted natural diversity, low stability, and high production costs. Recent advances in structural biology and de novo protein design provide new opportunities to overcome these limitations through rational engineering. In this study, we report an integrated computational pipeline for the de novo design of protein binders targeting the human sweet taste receptor subunit TAS1R2, a key component of the heterodimeric class C G protein-coupled receptor mediating sweetness perception. The workflow combines diffusion-based backbone generation (RFdiffusion), neural network-guided sequence design (ProteinMPNN), structure-based filtering using Boltz-1, and binding energy evaluation via MM/GBSA calculations. Using the recently resolved cryo-EM structure of the TAS1R2 receptor, protein binders were designed to target both the Venus Flytrap Domain and the cysteine-rich domain of TAS1R2. A few designed binders exhibited favorable structural confidence and predicted binding energetics. In particular, Binder2 exhibited brazzein-like structural plausibility through specific short-range CRD contacts, while Binder1 displayed the strongest predicted binding affinity. Structural analyses of the binder-receptor complex revealed distinct binding modes and secondary structure profiles among the designs. This study demonstrates the feasibility of de novo designing protein binders that emulate key functional properties of natural sweet proteins, establishing a computational framework for the rational development of next-generation protein-based sweeteners.

2026-01-21T01:18:55Z Saisai Ding Yi Zhang http://arxiv.org/abs/2601.13693v1 End-to-End Reverse Screening Identifies Protein Targets of Small Molecules Using HelixFold3 2026-01-20T07:45:53Z

Identifying protein targets for small molecules, or reverse screening, is essential for understanding drug action, guiding compound repurposing, predicting off-target effects, and elucidating the molecular mechanisms of bioactive compounds. Despite its critical role, reverse screening remains challenging because accurately capturing interactions between a small molecule and structurally diverse proteins is inherently complex, and conventional step-wise workflows often propagate errors across decoupled steps such as target structure modeling, pocket identification, docking, and scoring. Here, we present an end-to-end reverse screening strategy leveraging HelixFold3, a high-accuracy biomolecular structure prediction model akin to AlphaFold3, which simultaneously models the folding of proteins from a protein library and the docking of small-molecule ligands within a unified framework. We validate this approach on a diverse and representative set of approximately one hundred small molecules. Compared with conventional reverse docking, our method improves screening accuracy and demonstrates enhanced structural fidelity, binding-site precision, and target prioritization. By systematically linking small molecules to their protein targets, this framework establishes a scalable and straightforward platform for dissecting molecular mechanisms, exploring off-target interactions, and supporting rational drug discovery.

2026-01-20T07:45:53Z Shengjie Xu Xianbin Ye Mengran Zhu Xiaonan Zhang Shanzhuo Zhang Xiaomin Fang http://arxiv.org/abs/2601.13564v1 Multi-objective fluorescent molecule design with a data-physics dual-driven generative framework 2026-01-20T03:41:02Z

Designing fluorescent small molecules with tailored optical and physicochemical properties requires navigating vast, underexplored chemical space while satisfying multiple objectives and constraints. Conventional generate-score-screen approaches become impractical under such realistic design specifications, owing to their low search efficiency, unreliable generalizability of machine-learning prediction, and the prohibitive cost of quantum chemical calculation. Here we present LUMOS, a data-and-physics driven framework for inverse design of fluorescent molecules. LUMOS couples generator and predictor within a shared latent representation, enabling direct specification-to-molecule design and efficient exploration. Moreover, LUMOS combines neural networks with a fast time-dependent density functional theory (TD-DFT) calculation workflow to build a suite of complementary predictors spanning different trade-offs in speed, accuracy, and generalizability, enabling reliable property prediction across diverse scenarios. Finally, LUMOS employs a property-guided diffusion model integrated with multi-objective evolutionary algorithms, enabling de novo design and molecular optimization under multiple objectives and constraints. Across comprehensive benchmarks, LUMOS consistently outperforms baseline models in terms of accuracy, generalizability and physical plausibility for fluorescence property prediction, and demonstrates superior performance in multi-objective scaffold- and fragment-level molecular optimization. Further validation using TD-DFT and molecular dynamics (MD) simulations demonstrates that LUMOS can generate valid fluorophores that meet various target specifications. Overall, these results establish LUMOS as a data-physics dual-driven framework for general fluorophore inverse design.
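Multi-objective optimization of this kind typically relies on Pareto dominance to decide which candidates to keep. The sketch below shows a generic non-dominated filter, a common ingredient of multi-objective evolutionary algorithms. It is not LUMOS's algorithm, and the two objective values per candidate are made-up numbers where larger is assumed to be better.

```python
def dominates(a, b):
    # a dominates b if it is at least as good in every objective
    # and strictly better in at least one (maximization assumed).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    # Keep every candidate that no other candidate dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (objective_1, objective_2) scores for five candidates.
candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
front = pareto_front(candidates)
```

Here `(2, 2)` and `(1, 1)` are dropped because `(3, 3)` beats them on both objectives, while the remaining three represent different trade-offs.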

2026-01-20T03:41:02Z Total 43 pages: 32 pages Main Text + 11 pages SI Yanheng Li Zhichen Pu Lijiang Yang Zehao Zhou Yi Qin Gao http://arxiv.org/abs/2601.12381v1 Multimodal Spatial Omics: From Data Acquisition to Computational Integration 2026-01-18T12:28:14Z

Recent developments in spatial omics technologies have enabled the generation of high dimensional molecular data, such as transcriptomes, proteomes, and epigenomes, within their spatial tissue context, either through coprofiling on the same slice or through serial tissue sections. These datasets, which are often complemented by images, have given rise to multimodal frameworks that capture both the cellular and architectural complexity of tissues across multiple molecular layers. Integration in such multimodal data poses significant computational challenges due to differences in scale, resolution, and data modality. In this review, we present a comprehensive overview of computational methods developed to integrate multimodal spatial omics and imaging datasets. We highlight key algorithmic principles underlying these methods, ranging from probabilistic to the latest deep learning approaches.

2026-01-18T12:28:14Z Esra Busra Isik Yusuf Hakan Usta Haozhe Liu Maryam Riazi William Roach Hongpeng Zhou Magnus Rattray Sokratia Georgaka http://arxiv.org/abs/2408.16245v6 Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions 2026-01-17T22:19:06Z

The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. To date, most biosequence transformers have been trained on single-omic data - either proteins or nucleic acids - and have seen incredible success in downstream tasks in each domain, with particularly noteworthy breakthroughs in protein structural modeling. However, single-omic pretraining limits the ability of these models to capture cross-modal interactions. Here we present OmniBioTE, the largest open-source multi-omic model trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that despite only being trained on unlabeled sequence data, OmniBioTE learns joint representations mapping genes to their corresponding protein sequences. We further demonstrate that OmniBioTE achieves state-of-the-art results predicting the change in Gibbs free energy ($\Delta G$) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Compared to single-omic controls trained with identical compute, OmniBioTE also demonstrates superior performance-per-FLOP across both multi-omic and single-omic benchmarks. Together, these results highlight the power of a unified modeling approach for biological sequences and establish OmniBioTE as a foundation model for multi-omic discovery.

2024-08-29T03:56:40Z 47 pages, 5 figures Sully F. Chen Robert J. Steele Glen M. Hocky Beakal Lemeneh Shivanand P. Lad Eric K. Oermann