https://arxiv.org/api/mvzcnndmbzXi9FvXLJO+8B/f6r4 2026-06-14T22:54:14Z 3848 495 15 http://arxiv.org/abs/2412.01124v2 SUICA: Learning Super-high Dimensional Sparse Implicit Neural Representations for Spatial Transcriptomics 2025-05-07T04:41:24Z

Spatial Transcriptomics (ST) is a method that captures gene expression profiles aligned with spatial coordinates. The discrete spatial distribution and the super-high dimensional sequencing results make ST data challenging to be modeled effectively. In this paper, we manage to model ST in a continuous and compact manner by the proposed tool, SUICA, empowered by the great approximation capability of Implicit Neural Representations (INRs) that can enhance both the spatial density and the gene expression. Concretely within the proposed SUICA, we incorporate a graph-augmented Autoencoder to effectively model the context information of the unstructured spots and provide informative embeddings that are structure-aware for spatial mapping. We also tackle the extremely skewed distribution in a regression-by-classification fashion and enforce classification-based loss functions for the optimization of SUICA. By extensive experiments of a wide range of common ST platforms under varying degradations, SUICA outperforms both conventional INR variants and SOTA methods regarding numerical fidelity, statistical correlation, and bio-conservation. The prediction by SUICA also showcases amplified gene signatures that enriches the bio-conservation of the raw data and benefits subsequent analysis. The code is available at https://github.com/Szym29/SUICA.

2024-12-02T05:02:18Z ICML 2025 Qingtian Zhu Yumin Zheng Yuling Sang Yifan Zhan Ziyan Zhu Jun Ding Yinqiang Zheng http://arxiv.org/abs/2505.03377v1 Gene finding revisited: improved robustness through structured decoding from learned embeddings 2025-05-06T09:53:15Z

Gene finding is the task of identifying the locations of coding sequences within the vast amount of genetic code contained in the genome. With an ever increasing quantity of raw genome sequences, gene finding is an important avenue towards understanding the genetic information of (novel) organisms, as well as learning shared patterns across evolutionarily diverse species. The current state of the art are graphical models usually trained per organism and requiring manually curated datasets. However, these models lack the flexibility to incorporate deep learning representation learning techniques that have in recent years been transformative in the analysis of pro tein sequences, and which could potentially help gene finders exploit the growing number of the sequenced genomes to expand performance across multiple organisms. Here, we propose a novel approach, combining learned embeddings of raw genetic sequences with exact decoding using a latent conditional random field. We show that the model achieves performance matching the current state of the art, while increasing training robustness, and removing the need for manually fitted length distributions. As language models for DNA improve, this paves the way for more performant cross-organism gene-finders.

2025-05-06T09:53:15Z 8 pages, 3 figures, 2 tables Frederikke I. Marin Dennis Pultz Wouter Boomsma http://arxiv.org/abs/2505.03853v1 GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype 2025-05-06T03:35:24Z

Predicting genetic perturbations enables the identification of potentially crucial genes prior to wet-lab experiments, significantly improving overall experimental efficiency. Since genes are the foundation of cellular life, building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. However, current methods fail to fully leverage gene-related information, and solely rely on simple evaluation metrics to construct coarse-grained GRN. More importantly, they ignore functional differences between biotypes, limiting the ability to capture potential gene interactions. In this work, we leverage pre-trained large language model and DNA sequence model to extract features from gene descriptions and DNA sequence data, respectively, which serve as the initialization for gene representations. Additionally, we introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes, while capturing implicit gene relationships through graph structure learning (GSL). We propose GRAPE, a heterogeneous graph neural network (HGNN) that leverages gene representations initialized with features from descriptions and sequences, models the distinct roles of genes with different biotypes, and dynamically refines the GRN through GSL. The results on publicly available datasets show that our method achieves state-of-the-art performance.

2025-05-06T03:35:24Z Changxi Chi Jun Xia Jingbo Zhou Jiabei Cheng Chang Yu Stan Z. Li http://arxiv.org/abs/2505.02033v1 Quantum-Enhanced Classification of Brain Tumors Using DNA Microarray Gene Expression Profiles 2025-05-04T08:43:31Z

DNA microarray technology enables the simultaneous measurement of expression levels of thousands of genes, thereby facilitating the understanding of the molecular mechanisms underlying complex diseases such as brain tumors and the identification of diagnostic genetic signatures. To derive meaningful biological insights from the high-dimensional and complex gene features obtained through this technology and to analyze gene properties in detail, classical AI-based approaches such as machine learning and deep learning are widely employed. However, these methods face various limitations in managing high-dimensional vector spaces and modeling the intricate relationships among genes. In particular, challenges such as hyperparameter tuning, computational costs, and high processing power requirements can hinder their efficiency. To overcome these limitations, quantum computing and quantum AI approaches are gaining increasing attention. Leveraging quantum properties such as superposition and entanglement, quantum methods enable more efficient parallel processing of high-dimensional data and offer faster and more effective solutions to problems that are computationally demanding for classical methods. In this study, a novel model called "Deep VQC" is proposed, based on the Variational Quantum Classifier approach. Developed using microarray data containing 54,676 gene features, the model successfully classified four different types of brain tumors-ependymoma, glioblastoma, medulloblastoma, and pilocytic astrocytoma-alongside healthy samples with high accuracy. Furthermore, compared to classical ML algorithms, our model demonstrated either superior or comparable classification performance. These results highlight the potential of quantum AI methods as an effective and promising approach for the analysis and classification of complex structures such as brain tumors based on gene expression features.

2025-05-04T08:43:31Z Emine Akpinar Batuhan Hangun Murat Oduncuoglu Oguz Altun Onder Eyecioglu Zeynel Yalcin 10.1109/ISVLSI65124.2025.11130207 http://arxiv.org/abs/2505.01696v1 Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking 2025-05-03T05:00:38Z

Integrating heterogeneous biomedical data including imaging, omics, and clinical records supports accurate diagnosis and personalised care. Graph-based models fuse such non-Euclidean data by capturing spatial and relational structure, yet clinical uptake requires regulator-ready interpretability. We present the first technical survey of interpretable graph based models for multimodal biomedical data, covering 26 studies published between Jan 2019 and Sep 2024. Most target disease classification, notably cancer and rely on static graphs from simple similarity measures, while graph-native explainers are rare; post-hoc methods adapted from non-graph domains such as gradient saliency, and SHAP predominate. We group existing approaches into four interpretability families, outline trends such as graph-in-graph hierarchies, knowledge-graph edges, and dynamic topology learning, and perform a practical benchmark. Using an Alzheimer disease cohort, we compare Sensitivity Analysis, Gradient Saliency, SHAP and Graph Masking. SHAP and Sensitivity Analysis recover the broadest set of known AD pathways and Gene-Ontology terms, whereas Gradient Saliency and Graph Masking surface complementary metabolic and transport signatures. Permutation tests show all four beat random gene sets, but with distinct trade-offs: SHAP and Graph Masking offer deeper biology at higher compute cost, while Gradient Saliency and Sensitivity Analysis are quicker though coarser. We also provide a step-by-step flowchart covering graph construction, explainer choice and resource budgeting to help researchers balance transparency and performance. This review synthesises the state of interpretable graph learning for multimodal medicine, benchmarks leading techniques, and charts future directions, from advanced XAI tools to under-studied diseases, serving as a concise reference for method developers and translational scientists.

2025-05-03T05:00:38Z 41 pages Alireza Sadeghi Farshid Hajati Ahmadreza Argha Nigel H Lovell Min Yang Hamid Alinejad-Rokny http://arxiv.org/abs/2505.00934v1 Bayesian Forensic DNA Mixture Deconvolution Using a Novel String Similarity Measure 2025-05-02T00:33:37Z

Mixture interpretation is a central challenge in forensic science, where evidence often contains contributions from multiple sources. In the context of DNA analysis, biological samples recovered from crime scenes may include genetic material from several individuals, necessitating robust statistical tools to assess whether a specific person of interest (POI) is among the contributors. Methods based on capillary electrophoresis (CE) are currently in use worldwide, but offer limited resolution in complex mixtures. Advancements in massively parallel sequencing (MPS) technologies provide a richer, more detailed representation of DNA mixtures, but require new analytical strategies to fully leverage this information. In this work, we present a Bayesian framework for evaluating whether a POIs DNA is present in an MPS-based forensic sample. The model accommodates known contributors, such as the victim, and uses a novel string edit distance to quantify similarity between observed alleles and sequencing artifacts. The resulting Bayes factors enable effective discrimination between samples that do and do not contain the POIs DNA, demonstrating strong performance in both hypothesis testing and classification settings.

2025-05-02T00:33:37Z Taylor Petty Jan Hannig Hari Iyer http://arxiv.org/abs/2505.00650v1 OmicsCL: Unsupervised Contrastive Learning for Cancer Subtype Discovery and Survival Stratification 2025-05-01T16:51:48Z

Unsupervised learning of disease subtypes from multi-omics data presents a significant opportunity for advancing personalized medicine. We introduce OmicsCL, a modular contrastive learning framework that jointly embeds heterogeneous omics modalities-such as gene expression, DNA methylation, and miRNA expression-into a unified latent space. Our method incorporates a survival-aware contrastive loss that encourages the model to learn representations aligned with survival-related patterns, without relying on labeled outcomes. Evaluated on the TCGA BRCA dataset, OmicsCL uncovers clinically meaningful clusters and achieves strong unsupervised concordance with patient survival. The framework demonstrates robustness across hyperparameter configurations and can be tuned to prioritize either subtype coherence or survival stratification. Ablation studies confirm that integrating survival-aware loss significantly enhances the predictive power of learned embeddings. These results highlight the promise of contrastive objectives for biological insight discovery in high-dimensional, heterogeneous omics data.

2025-05-01T16:51:48Z Code available at: https://github.com/Atahanka/OmicsCL Atahan Karagoz http://arxiv.org/abs/2505.00572v1 A Bioinformatic Study of Genetics Involved in Determining Mild Traumatic Brain Injury Severity and Recovery 2025-05-01T14:56:46Z

Aim: This in silico study sought to identify specific biomarkers for mild traumatic brain injury (mTBI) through the analysis of publicly available gene and miRNA databases, hypothesizing their influence on neuronal structure, axonal integrity, and regeneration. Methods: This study implemented a three-step process: (1) Data searching for mTBI-related genes in Gene and MalaCard databases and literature review ; (2) Data analysis involved performing functional annotation through GO and KEGG, identifying hub genes using Cytoscape, mapping protein-protein interactions via DAVID and STRING, and predicting miRNA targets using miRSystem, miRWalk2.0, and mirDIP (3) RNA-sequencing analysis applied to the mTBI dataset GSE123336. Results: Eleven candidate hub genes associated with mTBI outcome were identified: APOE, S100B, GFAP, BDNF, AQP4, COMT, MBP, UCHL1, DRD2, ASIC1, and CACNA1A. Enrichment analysis linked these genes to neuron projection regeneration and synaptic plasticity. miRNAs linked to the mTBI candidate genes were hsa-miR-9-5p, hsa-miR-204-5p, hsa-miR-1908-5p, hsa-miR-16-5p, hsa-miR-10a-5p, has-miR-218-5p, has-miR-34a-5p, and has-miR-199b-5p. The RNA sequencing revealed 2664 differentially expressed miRNAs post-mTBI, with 17 showing significant changes at the time of injury and 48 hours post-injury. Two miRNAs were positively correlated with direct head hits. Conclusion: Our study indicates that specific genes and miRNAs, particularly hsa-miR-10a-5p, may influence mTBI outcomes. Our research may guide future mTBI diagnostics, emphasizing the need to measure and track these specific genes and miRNAs in diverse cohorts.

2025-05-01T14:56:46Z This paper consists of 43 pages, including references. It contains seven figures and four tables. Details regarding coding and analysis are available upon request Mahnaz Tajik Michael D Noseworthy http://arxiv.org/abs/2504.20328v1 Mantodea phylogenomics provides new insights into X-chromosome progression and evolutionary radiation 2025-04-29T00:36:14Z

Background: Praying mantises, members of the order Mantodea, play important roles in agriculture, medicine, bionics, and entertainment. However, the scarcity of genomic resources has hindered extensive studies on mantis evolution and behaviour. Results: Here, we present the chromosome-scale reference genomes of five mantis species: the European mantis (Mantis religiosa), Chinese mantis (Tenodera sinensis), triangle dead leaf mantis (Deroplatys truncata), orchid mantis (Hymenopus coronatus), and metallic mantis (Metallyticus violaceus). We found that transposable element expansion is the major force governing genome size in Mantodea. Based on whole-alignments, we deduced that the Mantodea ancestor may have had only one X chromosome and that translocations between the X chromosome and an autosome may have occurred in the lineage of the superfamily Mantoidea. Furthermore, we found a lower evolutionary rate for the metallic mantis than for the other mantises. We also found that Mantodea underwent rapid radiation after the K-Pg mass extinction event, which could have contributed to the confusion in species classification. Conclusions: We present the chromosome-scale reference genomes of five mantis species to reveal the X-chromosome evolution, clarify the phylogeny relationship, and transposable element expansion.

2025-04-29T00:36:14Z 52 pages, 1 table, and 5 figures Hangwei Liu Lihong Lei Fan Jiang Bo Zhang Hengchao Wang Yutong Zhang Anqi Wang Hanbo Zhao Guirong Wang Wei Fan http://arxiv.org/abs/2504.20034v1 Pan-genome Analysis of Angiosperm Plastomes using PGR-TK 2025-04-28T17:56:13Z

We present a novel approach for taxonomic analysis of chloroplast genomes in angiosperms using the Pan-genome Research Toolkit (PGR-TK). Comparative plots generated by PGR-TK across diverse angiosperm genera reveal a wide range of structural complexity, from straightforward to highly intricate patterns. Notably, the characteristic quadripartite plastome structure, comprising the large single copy (LSC), small single copy (SSC), and inverted repeat (IR) regions, is clearly identifiable in over 75% of the genera analyzed. Our findings also underscore several occurrences of species mis-annotations in public genomic databases, which are readily detected through visual anomalies in the PGR-TK plots. While more complex plot patterns remain difficult to interpret, they likely reflect underlying biological variation or technical inconsistencies in genome assembly. Overall, this approach effectively integrates classical botanical visualization with modern molecular taxonomy, providing a powerful tool for genome-based classification in plant systematics.

2025-04-28T17:56:13Z Manoj P. Samanta http://arxiv.org/abs/2504.17860v1 An Integrated Genomics Workflow Tool: Simulating Reads, Evaluating Read Alignments, and Optimizing Variant Calling Algorithms 2025-04-24T18:05:08Z

Next-generation sequencing (NGS) is a pivotal technique in genome sequencing due to its high throughput, rapid results, cost-effectiveness, and enhanced accuracy. Its significance extends across various domains, playing a crucial role in identifying genetic variations and exploring genomic complexity. NGS finds applications in diverse fields such as clinical genomics, comparative genomics, functional genomics, and metagenomics, contributing substantially to advancements in research, medicine, and scientific disciplines. Within the sphere of genomics data science, the execution of read simulation, mapping, and variant calling holds paramount importance for obtaining precise and dependable results. Given the plethora of tools available for these purposes, each employing distinct methodologies and options, a nuanced understanding of their intricacies becomes imperative for optimization. This research, situated at the intersection of data science and genomics, involves a meticulous assessment of various tools, elucidating their individual strengths and weaknesses through rigorous experimentation and analysis. This comprehensive evaluation has enabled the researchers to pinpoint the most accurate tools, reinforcing the alignment between the established workflow and the demonstrated efficacy of specific tools in the context of genomics data analysis. To meet these requirements, "VarFind", an open-source and freely accessible pipeline tool designed to automate the entire process has been introduced (VarFind GitHub repository: https://github.com/shanikawm/varfinder)

2025-04-24T18:05:08Z Fathima Nuzla Ismail Shanika Amarasoma http://arxiv.org/abs/2504.17162v1 A Comprehensive Review on RNA Subcellular Localization Prediction 2025-04-24T00:47:31Z

The subcellular localization of RNAs, including long non-coding RNAs (lncRNAs), messenger RNAs (mRNAs), microRNAs (miRNAs) and other smaller RNAs, plays a critical role in determining their biological functions. For instance, lncRNAs are predominantly associated with chromatin and act as regulators of gene transcription and chromatin structure, while mRNAs are distributed across the nucleus and cytoplasm, facilitating the transport of genetic information for protein synthesis. Understanding RNA localization sheds light on processes like gene expression regulation with spatial and temporal precision. However, traditional wet lab methods for determining RNA localization, such as in situ hybridization, are often time-consuming, resource-demanding, and costly. To overcome these challenges, computational methods leveraging artificial intelligence (AI) and machine learning (ML) have emerged as powerful alternatives, enabling large-scale prediction of RNA subcellular localization. This paper provides a comprehensive review of the latest advancements in AI-based approaches for RNA subcellular localization prediction, covering various RNA types and focusing on sequence-based, image-based, and hybrid methodologies that combine both data types. We highlight the potential of these methods to accelerate RNA research, uncover molecular pathways, and guide targeted disease treatments. Furthermore, we critically discuss the challenges in AI/ML approaches for RNA subcellular localization, such as data scarcity and lack of benchmarks, and opportunities to address them. This review aims to serve as a valuable resource for researchers seeking to develop innovative solutions in the field of RNA subcellular localization and beyond.

2025-04-24T00:47:31Z Cece Zhang Xuehuan Zhu Nick Peterson Jieqiong Wang Shibiao Wan http://arxiv.org/abs/2504.16961v1 A Novel Graph Transformer Framework for Gene Regulatory Network Inference 2025-04-23T06:24:26Z

The inference of gene regulatory networks (GRNs) is a foundational stride towards deciphering the fundamentals of complex biological systems. Inferring a possible regulatory link between two genes can be formulated as a link prediction problem. Inference of GRNs via gene coexpression profiling data may not always reflect true biological interactions, as its susceptibility to noise and misrepresenting true biological regulatory relationships. Most GRN inference methods face several challenges in the network reconstruction phase. Therefore, it is important to encode gene expression values, leverege the prior knowledge gained from the available inferred network structures and positional informations of the input network nodes towards inferring a better and more confident GRN network reconstruction. In this paper, we explore the integration of multiple inferred networks to enhance the inference of Gene Regulatory Networks (GRNs). Primarily, we employ autoencoder embeddings to capture gene expression patterns directly from raw data, preserving intricate biological signals. Then, we embed the prior knowledge from GRN structures transforming them into a text-like representation using random walks, which are then encoded with a masked language model, BERT, to generate global embeddings for each gene across all networks. Additionally, we embed the positional encodings of the input gene networks to better identify the position of each unique gene within the graph. These embeddings are integrated into graph transformer-based model, termed GT-GRN, for GRN inference. The GT-GRN model effectively utilizes the topological structure of the ground truth network while incorporating the enriched encoded information. Experimental results demonstrate that GT-GRN significantly outperforms existing GRN inference methods, achieving superior accuracy and highlighting the robustness of our approach.

2025-04-23T06:24:26Z Binon Teji Swarup Roy http://arxiv.org/abs/2411.00614v2 Fast and scalable Wasserstein-1 neural optimal transport solver for single-cell perturbation prediction 2025-04-21T16:33:30Z

\textbf{Motivation:} Predicting single-cell perturbation responses requires mapping between two unpaired single-cell data distributions. Optimal transport (OT) theory provides a principled framework for constructing such mappings by minimizing transport cost. Recently, Wasserstein-2 ($W_2$) neural optimal transport solvers (\textit{e.g.}, CellOT) have been employed for this prediction task. However, $W_2$ OT relies on the general Kantorovich dual formulation, which involves optimizing over two conjugate functions, leading to a complex min-max optimization problem that converges slowly. \\ \textbf{Results:} To address these challenges, we propose a novel solver based on the Wasserstein-1 ($W_1$) dual formulation. Unlike $W_2$, the $W_1$ dual simplifies the optimization to a maximization problem over a single 1-Lipschitz function, thus eliminating the need for time-consuming min-max optimization. While solving the $W_1$ dual only reveals the transport direction and does not directly provide a unique optimal transport map, we incorporate an additional step using adversarial training to determine an appropriate transport step size, effectively recovering the transport map. Our experiments demonstrate that the proposed $W_1$ neural optimal transport solver can mimic the $W_2$ OT solvers in finding a unique and ``monotonic" map on 2D datasets. Moreover, the $W_1$ OT solver achieves performance on par with or surpasses $W_2$ OT solvers on real single-cell perturbation datasets. Furthermore, we show that $W_1$ OT solver achieves $25 \sim 45\times$ speedup, scales better on high dimensional transportation task, and can be directly applied on single-cell RNA-seq dataset with highly variable genes. \\ \textbf{Availability and Implementation:} Our implementation and experiments are open-sourced at https://github.com/poseidonchan/w1ot.

2024-11-01T14:23:19Z ISMB/ECCB 2025 Yanshuo Chen Zhengmian Hu Wei Chen Heng Huang http://arxiv.org/abs/2504.13049v1 Multi-modal single-cell foundation models via dynamic token adaptation 2025-04-17T16:05:59Z

Recent advances in applying deep learning in genomics include DNA-language and single-cell foundation models. However, these models take only one data type as input. We introduce dynamic token adaptation and demonstrate how it combines these models to predict gene regulation at the single-cell level in different genetic contexts. Although the method is generalisable, we focus on an illustrative example by training an adapter from DNA-sequence embeddings to a single-cell foundation model's token embedding space. As a qualitative evaluation, we assess the impact of DNA sequence changes on the model's learned gene regulatory networks by mutating the transcriptional start site of the transcription factor GATA4 in silico, observing predicted expression changes in its target genes in fetal cardiomyocytes.

2025-04-17T16:05:59Z Wenmin Zhao Ana Solaguren-Beascoa Grant Neilson Louwai Muhammed Liisi Laaniste Sera Aylin Cakiroglu