https://arxiv.org/api/VucjOq83zYe9X4StCdLfWmONkWM
2026-06-14T11:43:17Z
384833015

http://arxiv.org/abs/2510.07653v2
Large-scale spatial variable gene atlas for spatial transcriptomics
2025-10-19T00:10:16Z
Spatial variable genes (SVGs) reveal critical information about tissue architecture, cellular interactions, and disease microenvironments. As spatial transcriptomics (ST) technologies proliferate, accurately identifying SVGs across diverse platforms, tissue types, and disease contexts has become both a major opportunity and a significant computational challenge. Here, we present a comprehensive benchmarking study of 20 state-of-the-art SVG detection methods using human slides from STimage-1K4M, a large-scale resource of ST data comprising 662 slides from more than 18 tissue types. We evaluate each method across a range of biologically and technically meaningful criteria, including recovery of pathologist-annotated domain-specific markers, cross-slide reproducibility, scalability to high-resolution data, and robustness to technical variation. Our results reveal marked differences in performance depending on tissue type, spatial resolution, and study design. Beyond benchmarking, we construct the first cross-tissue atlas of SVGs, enabling comparative analysis of spatial gene programs across cancer and normal tissues. We observe similarities between pairs of tissues that reflect developmental and functional relationships, such as high overlap between thymus and lymph node, and uncover spatial gene programs associated with metastasis, immune infiltration, and tissue-of-origin identity in cancer.
Together, our work defines a framework for evaluating and interpreting spatial gene expression and establishes a reference resource for the ST community.
2025-10-09T01:03:58Z
Jiawen Chen, Jinwei Zhang, Dongshen Peng, Yutong Song, Aitong Ruan, Yun Li, Didong Li

http://arxiv.org/abs/2510.16093v1
Identifying multi-omics interactions for lung cancer drug targets discovery using Kernel Machine Regression
2025-10-17T17:13:39Z
Cancer exhibits diverse and complex phenotypes driven by multifaceted molecular interactions. Recent biomedical research has emphasized the comprehensive study of such diseases by integrating multi-omics datasets (genome, proteome, transcriptome, epigenome). This approach provides an efficient method for identifying genetic variants associated with cancer and offers a deeper understanding of how the disease develops and spreads. However, it is challenging to comprehend complex interactions among the features of multi-omics datasets compared to single omics. In this paper, we analyze lung cancer multi-omics datasets from The Cancer Genome Atlas (TCGA). Using four statistical methods, LIMMA, the t-test, Canonical Correlation Analysis (CCA), and the Wilcoxon test, we identified differentially expressed genes across gene expression, DNA methylation, and miRNA expression data. We then integrated these multi-omics data using the Kernel Machine Regression (KMR) approach. Our findings reveal significant interactions among the three omics: gene expression, miRNA expression, and DNA methylation in lung cancer. From our data analysis, we identified 38 genes significantly associated with lung cancer. Among these, the eight highest-ranking genes (PDGFRB, PDGFRA, SNAI1, ID1, FGF11, TNXB, ITGB1, ZIC1) were highlighted by rigorous statistical analysis.
Furthermore, in silico studies identified three top-ranked potential candidate drugs (Selinexor, Orapred, and Capmatinib) that could play a crucial role in the treatment of lung cancer. These proposed drugs are also supported by the findings of other independent studies, which underscore their potential efficacy in the fight against lung cancer.
2025-10-17T17:13:39Z
Md. Imtyaz Ahmed, Md. Delwar Hossain, Md Mostafizer Rahman, Md. Ahsan Habib, Md. Mamunur Rashid, Md. Selim Reza, Md Ashad Alam

http://arxiv.org/abs/2510.10950v1
ChloroScan: Recovering plastid genome bins from metagenomic data
2025-10-13T03:01:31Z
Genome-resolved metagenomics has contributed largely to discovering prokaryotic genomes. When applied to microscopic eukaryotes, challenges such as the high number of introns and repeat regions found in nuclear genomes have hampered the mining and discovery of novel protistan lineages. Organellar genomes are simpler, smaller, have higher abundance than their nuclear counterparts and contain valuable phylogenetic information, but are yet to be widely used to identify new protist lineages from metagenomes. Here we present "ChloroScan", a new bioinformatics pipeline to extract eukaryotic plastid genomes from metagenomes. It incorporates a deep learning contig classifier to identify putative plastid contigs and an automated binning module to recover bins with guidance from a curated marker gene database. Additionally, ChloroScan summarizes the results in different user-friendly formats, including annotated coding sequences and proteins for each bin. We show that ChloroScan recovers more high-quality plastid bins than MetaBAT2 for simulated metagenomes.
The practical utility of ChloroScan is illustrated by recovering 16 medium to high-quality metagenome assembled genomes from four protist-size fractioned metagenomes, with several bins showing high taxonomic novelty.
2025-10-13T03:01:31Z
Yuhao Tong, Vanessa Rossetto Marcelino, Robert Turnbull, Heroen Verbruggen

http://arxiv.org/abs/2510.08655v1
Knowledge Graph Sparsification for GNN-based Rare Disease Diagnosis
2025-10-09T09:05:06Z
Rare genetic disease diagnosis faces critical challenges: insufficient patient data, inaccessible full genome sequencing, and the immense number of possible causative genes. These limitations cause prolonged diagnostic journeys, inappropriate treatments, and critical delays, disproportionately affecting patients in resource-limited settings where diagnostic tools are scarce. We propose RareNet, a subgraph-based Graph Neural Network that requires only patient phenotypes to identify the most likely causal gene and retrieve focused patient subgraphs for targeted clinical investigation. RareNet can function as a standalone method or serve as a pre-processing or post-processing filter for other candidate gene prioritization methods, consistently enhancing their performance while potentially enabling explainable insights. Through comprehensive evaluation on two biomedical datasets, we demonstrate competitive and robust causal gene prediction and significant performance gains when integrated with other frameworks.
By requiring only phenotypic data, which is readily available in any clinical setting, RareNet democratizes access to sophisticated genetic analysis, offering particular value for underserved populations lacking advanced genomic infrastructure.
2025-10-09T09:05:06Z
Premt Cara, Kamilia Zaripova, David Bani-Harouni, Nassir Navab, Azade Farshad

http://arxiv.org/abs/2409.04922v2
Nearest Neighbor CCP-Based Molecular Sequence Analysis
2025-10-08T18:09:05Z
Molecular sequence analysis is crucial for comprehending several biological processes, including protein-protein interactions, functional annotation, and disease classification. The large number of sequences and the inherently complicated nature of protein structures make it challenging to analyze such data. Finding patterns and enhancing subsequent research requires the use of dimensionality reduction and feature selection approaches. Recently, a method called Correlated Clustering and Projection (CCP) has been proposed as an effective method for biological sequencing data. Although the CCP technique is effective for sequence visualization, it is still costly to compute. Furthermore, its utility for classifying molecular sequences is still uncertain. To solve these two problems, we present a Nearest Neighbor Correlated Clustering and Projection (CCP-NN)-based technique for efficiently preprocessing molecular sequence data. To group related molecular sequences and produce representative supersequences, CCP makes use of sequence-to-sequence correlations. As opposed to conventional methods, CCP doesn't rely on matrix diagonalization, so it can be applied to a range of machine-learning problems. We estimate the density map and compute the correlation using a nearest-neighbor search technique. We performed molecular sequence classification using CCP and CCP-NN representations to assess the efficacy of our proposed approach.
Our findings show that CCP-NN considerably improves classification accuracy and significantly outperforms CCP in computational runtime.
2024-09-07T22:06:00Z
Accepted at IEEE Transactions on Computational Biology and Bioinformatics (TCBB 2025)
IEEE Transactions on Computational Biology and Bioinformatics 2025
Sarwan Ali, Prakash Chourasia, Bipin Koirala, Murray Patterson

http://arxiv.org/abs/2510.00353v2
The Multivariate SEM-PGS Model: Using Polygenic Scores to Investigate Cross-Trait Genetic Nurture and Assortative Mating
2025-10-07T21:05:49Z
Genetic nurture effects and assortative mating (AM) occur across many human behaviors and can bias estimates from traditional genetic models. These influences are typically studied univariately, within the same trait. However, estimation of cross-trait genetic nurture effects and cross-trait AM remains underexplored due to the absence of suitable approaches. To address this, we developed a multivariate extension of the SEM-PGS model for datasets with genotyped and phenotyped parents and offspring, enabling joint estimation of within-trait and cross-trait genetic and environmental influences. By integrating haplotypic polygenic scores (PGS) into a structural equation modeling framework, the model simultaneously estimates same-trait and cross-trait direct effects, genetic nurture, vertical transmission, and assortative mating. We also provide the first formal description of how copaths can be used to model multivariate assortative mating and derive the corresponding parameter expectations in matrix form. Forward-time Monte Carlo simulations under varying conditions of r^2_PGS and N_trio demonstrate that the model yields unbiased estimates of both within-trait and cross-trait effects when assumptions are met. The precision of estimates was adequate with large sample sizes (N_trio > 16k) and improved as PGS predictive power increased.
In addition, our simulation results show that failing to model cross-trait effects biases within-trait estimates, underscoring the importance of incorporating cross-trait effects. The multivariate SEM-PGS model offers a powerful and flexible tool for disentangling gene-environment interplay and advancing the understanding of familial influences on human traits.
2025-09-30T23:34:02Z
Xuanyu Lyu, Jared Balbona, Tong Chen, Matthew C. Keller

http://arxiv.org/abs/2510.05777v1
DP-SNP-TIHMM: Differentially Private, Time-Inhomogeneous Hidden Markov Models for Synthesizing Genome-Wide Association Datasets
2025-10-07T10:47:29Z
Single nucleotide polymorphism (SNP) datasets are fundamental to genetic studies but pose significant privacy risks when shared. The correlation of SNPs with each other makes strong adversarial attacks such as masked-value reconstruction, kin, and membership inference attacks possible. Existing privacy-preserving approaches either apply differential privacy to statistical summaries of these datasets or offer complex methods that require post-processing and the usage of a publicly available dataset to suppress or selectively share SNPs.
In this study, we introduce an innovative framework for generating synthetic SNP sequence datasets using samples derived from time-inhomogeneous hidden Markov models (TIHMMs). To preserve the privacy of the training data, we ensure that each SNP sequence contributes only a bounded influence during training, enabling strong differential privacy guarantees. Crucially, by operating on full SNP sequences and bounding their gradient contributions, our method directly addresses the privacy risks introduced by their inherent correlations.
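The idea of bounding each SNP sequence's gradient contribution can be illustrated with a minimal sketch. The abstract does not say which mechanism the authors use, so the clip-then-add-Gaussian-noise recipe below (as in DP-SGD) is an assumption, and the names `clip_gradient`, `private_mean_gradient`, `max_norm`, and `noise_multiplier` are hypothetical. Mapping a noise multiplier to a concrete $(\varepsilon, \delta)$ budget needs a privacy accountant, which is not shown.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def private_mean_gradient(per_sequence_grads, max_norm, noise_multiplier, rng):
    """Clip each sequence's gradient, sum, add Gaussian noise, then average.

    Assumed DP-SGD-style step, not the paper's confirmed procedure.
    """
    n = len(per_sequence_grads)
    dim = len(per_sequence_grads[0])
    total = [0.0] * dim
    for grad in per_sequence_grads:
        # After clipping, no single sequence can move the sum by more than max_norm.
        for i, g in enumerate(clip_gradient(grad, max_norm)):
            total[i] += g
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * max_norm
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


# Toy per-sequence gradients (made-up numbers, for illustration only).
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
rng = random.Random(0)
noisy = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping is what makes the sensitivity of the summed gradient known in advance, which is why the noise scale can be set relative to `max_norm` alone.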
Through experiments conducted on the real-world 1000 Genomes dataset, we demonstrate the efficacy of our method using privacy budgets of $\varepsilon \in [1, 10]$ at $\delta = 10^{-4}$. Notably, by allowing the transition models of the HMM to be dependent on the location in the sequence, we significantly enhance performance, enabling the synthetic datasets to closely replicate the statistical properties of non-private datasets. This framework facilitates the private sharing of genomic data while offering researchers exceptional flexibility and utility.
2025-10-07T10:47:29Z
Shadi Rahimian, Mario Fritz

http://arxiv.org/abs/2510.06290v1
Soft-Evidence Fused Graph Neural Network for Cancer Driver Gene Identification across Multi-View Biological Graphs
2025-10-07T05:20:57Z
Identifying cancer driver genes (CDGs) is essential for understanding cancer mechanisms and developing targeted therapies. Graph neural networks (GNNs) have recently been employed to identify CDGs by capturing patterns in biological interaction networks. However, most GNN-based approaches rely on a single protein-protein interaction (PPI) network, ignoring complementary information from other biological networks. Some studies integrate multiple networks by aligning features with consistency constraints to learn unified gene representations for CDG identification. However, such representation-level fusion often assumes congruent gene relationships across networks, which may overlook network heterogeneity and introduce conflicting information. To address this, we propose Soft-Evidence Fusion Graph Neural Network (SEFGNN), a novel framework for CDG identification across multiple networks at the decision level. Instead of enforcing feature-level consistency, SEFGNN treats each biological network as an independent evidence source and performs uncertainty-aware fusion at the decision level using Dempster-Shafer Theory (DST).
To alleviate the risk of overconfidence from DST, we further introduce a Soft Evidence Smoothing (SES) module that improves ranking stability while preserving discriminative performance. Experiments on three cancer datasets show that SEFGNN consistently outperforms state-of-the-art baselines and exhibits strong potential in discovering novel CDGs.
2025-10-07T05:20:57Z
8 pages
Bang Chen, Lijun Guo, Houli Fan, Wentao He, Rong Zhang

http://arxiv.org/abs/2509.01001v2
Generalized promotion time cure model: A new modeling framework to identify cell-type-specific genes and improve survival prognosis
2025-10-04T09:18:13Z
Single-cell technologies provide an unprecedented opportunity for dissecting the interplay between the cancer cells and the associated tumor microenvironment, and the produced high-dimensional omics data should also augment existing survival modeling approaches for identifying tumor cell type-specific genes predictive of cancer patient survival. However, there is no statistical model to integrate multiscale data including individual-level survival data, multicellular-level cell composition data and cellular-level single-cell omics covariates. We propose a class of Bayesian generalized promotion time cure models (GPTCMs) for the multiscale data integration to identify cell-type-specific genes and improve cancer prognosis.
We demonstrate with simulations in both low- and high-dimensional settings that the proposed Bayesian GPTCMs are able to identify cell-type-associated covariates and improve survival prediction.
2025-08-31T21:35:57Z
22 pages, 6 figures, 3 tables; supplement 14 pages
Zhi Zhao, Fatih Kızılaslan, Shixiong Wang, Manuela Zucknick

http://arxiv.org/abs/2510.03629v1
RawBench: A Comprehensive Benchmarking Framework for Raw Nanopore Signal Analysis Techniques
2025-10-04T02:26:30Z
Nanopore sequencing technologies continue to advance rapidly, offering critical benefits such as real-time analysis, the ability to sequence extremely long DNA fragments (up to millions of bases in a single read), and the option to selectively stop sequencing a molecule before completion. Traditionally, the raw electrical signals generated during sequencing are converted into DNA sequences through a process called basecalling, which typically relies on large neural network models. Raw signal analysis has emerged as a promising alternative to these resource-intensive approaches. While attempts have been made to benchmark conventional basecalling methods, existing evaluation frameworks 1) overlook raw signal analysis techniques, 2) lack the flexibility to accommodate new raw signal analysis tools easily, and 3) fail to include the latest improvements in nanopore datasets. Our goal is to provide an extensible benchmarking framework that enables designing and comparing new methods for raw signal analysis. To this end, we introduce RawBench, the first flexible framework for evaluating raw nanopore signal analysis techniques. RawBench provides modular evaluation of three core pipeline components: 1) reference genome encoding (using different pore models), 2) signal encoding (through various segmentation methods), and 3) representation matching (via different data structures).
We extensively evaluate raw signal analysis techniques in terms of 1) quality and performance for read mapping, 2) quality and performance for read classification, and 3) quality of raw signal analysis-assisted basecalling. Our evaluations show that raw signal analysis can achieve competitive quality while significantly reducing resource requirements, particularly in settings where real-time processing or edge deployment is necessary.
2025-10-04T02:26:30Z
Accepted in ACM BCB 2025
Furkan Eris, Ulysse McConnell, Can Firtina, Onur Mutlu

http://arxiv.org/abs/2510.03359v1
Predicting cell-specific gene expression profile and knockout impact through deep learning
2025-10-03T00:00:47Z
Gene expression data is essential for understanding how genes are regulated and interact within biological systems, providing insights into disease pathways and potential therapeutic targets. Gene knockout has proven to be a fundamental technique in molecular biology, allowing the investigation of the function of specific genes in an organism, as well as in specific cell types. However, gene expression patterns are quite heterogeneous in single-cell transcriptional data from a uniform environment, representing different cell states, which produce cell-type and cell-specific gene knockout impacts. A computational method that can predict the single-cell resolution knockout impact is still lacking. Here, we present a data-driven framework for learning the mapping between gene expression profiles derived from gene assemblages, enabling the accurate prediction of perturbed expression profiles following knockout (KO) for any cell, without relying on prior perturbed data. We systematically validated our framework using synthetic data generated from gene regulatory dynamics models, two mouse knockout single-cell datasets, and high-throughput in vitro CRISPRi Perturb-seq data. Our results demonstrate that the framework can accurately predict both expression profiles and KO effects at the single-cell level.
Our approach provides a generalizable tool for inferring gene function at single-cell resolution, offering new opportunities to study genetic perturbations in contexts where large-scale experimental screens are infeasible.
2025-10-03T00:00:47Z
Yongjian He, Vered Klein, Orr Levy, Xu-Wen Wang

http://arxiv.org/abs/2510.02416v1
Cross-Platform DNA Methylation Classifier for the Eight Molecular Subtypes of Group 3 & 4 Medulloblastoma
2025-10-02T14:53:38Z
Medulloblastoma is a malignant pediatric brain cancer, and the discovery of molecular subgroups is enabling personalized treatment strategies. In 2019, a consensus identified eight novel subtypes within Groups 3 and 4, each displaying heterogeneous characteristics. Classifiers are essential for translating these findings into clinical practice by supporting clinical trials, personalized therapy development and application, and patient monitoring. This study presents a DNA methylation-based, cross-platform machine learning classifier capable of distinguishing these subtypes on both HM450 and EPIC methylation array samples. Across two independent test sets, the model achieved weighted F1 = 0.95 and balanced accuracy = 0.957, consistent across platforms. As the first cross-platform solution, it provides backward compatibility while extending applicability to a newer platform, also enhancing accessibility. It also has the potential to become the first publicly available classifier for these subtypes once deployed through a web application, as planned in the future.
Overall, this work takes steps toward advancing precision medicine and improving clinical outcomes for patients within the most prevalent medulloblastoma subgroups, Groups 3 and 4.
2025-10-02T14:53:38Z
9 pages, 5 figures, 5 tables
Omer Abid, Gholamreza Rafiee

http://arxiv.org/abs/2505.12626v3
scSiameseClu: A Siamese Clustering Framework for Interpreting single-cell RNA Sequencing Data
2025-10-02T02:55:07Z
Single-cell RNA sequencing (scRNA-seq) reveals cell heterogeneity, with cell clustering playing a key role in identifying cell types and marker genes. Recent advances, especially graph neural network (GNN)-based methods, have significantly improved clustering performance. However, the analysis of scRNA-seq data remains challenging due to noise, sparsity, and high dimensionality. Compounding these challenges, GNNs often suffer from over-smoothing, limiting their ability to capture complex biological information. In response, we propose scSiameseClu, a novel Siamese Clustering framework for interpreting single-cell RNA-seq data, comprising three key steps: (1) Dual Augmentation Module, which applies biologically informed perturbations to the gene expression matrix and cell graph relationships to enhance representation robustness; (2) Siamese Fusion Module, which combines cross-correlation refinement and adaptive information fusion to capture complex cellular relationships while mitigating over-smoothing; and (3) Optimal Transport Clustering, which utilizes Sinkhorn distance to efficiently align cluster assignments with predefined proportions while maintaining balance.
Comprehensive evaluations on seven real-world datasets demonstrate that scSiameseClu outperforms state-of-the-art methods in single-cell clustering, cell type annotation, and cell type classification, providing a powerful tool for scRNA-seq data interpretation.
2025-05-19T02:17:09Z
Ping Xu, Zhiyuan Ning, Pengjiang Li, Wenhao Liu, Pengyang Wang, Jiaxu Cui, Yuanchun Zhou, Pengfei Wang

http://arxiv.org/abs/2510.00392v1
A Deep Learning Pipeline for Epilepsy Genomic Analysis Using GPT-2 XL and NVIDIA H100
2025-10-01T01:07:35Z
Epilepsy is a chronic neurological condition characterized by recurrent seizures, with an estimated global prevalence of 50 million people. While progress in high-throughput sequencing has allowed for broad-based transcriptomic profiling of brain tissues, deciphering these highly complex datasets remains a challenge. To address this issue, we propose a new analysis pipeline that combines deep learning strategies with GPU-accelerated computation for investigating gene expression patterns in epilepsy. Specifically, our proposed approach employs GPT-2 XL, a transformer-based Large Language Model (LLM) with 1.5 billion parameters, for genomic sequence analysis on the latest NVIDIA H100 Tensor Core GPUs based on the Hopper architecture. Our proposed method enables efficient preprocessing of RNA sequence data, gene sequence encoding, and subsequent pattern identification. We conducted experiments on two epilepsy datasets, GEO accessions GSE264537 and GSE275235. The obtained results reveal several significant transcriptomic modifications, including reduced hippocampal astrogliosis after ketogenic diet treatment as well as restored excitatory-inhibitory signaling equilibrium in a zebrafish epilepsy model.
Moreover, our results highlight the effectiveness of leveraging LLMs in combination with advanced hardware acceleration for transcriptomic characterization in neurological diseases.
2025-10-01T01:07:35Z
12 pages
Muhammad Omer Latif, Hayat Ullah, Muhammad Ali Shafique, Zhihua Dong

http://arxiv.org/abs/2507.06113v3
A Statistical Framework for Co-Mediators of Zero-Inflated Single-Cell RNA-Seq Data
2025-09-30T14:09:02Z
Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, enabling detailed molecular profiling at the individual cell level. However, integrating high-dimensional single-cell data into causal mediation analysis remains challenging due to zero inflation and complex mediator structures. We propose a novel mediation framework leveraging zero-inflated negative binomial models to characterize cell-level mediator distributions and beta regression for zero-inflation proportions. The model can identify both the expression level and the expressed proportion that could mediate causal pathways leading to disease. Extensive simulation studies demonstrate improved power and controlled false discovery rates. We further illustrate the utility of this approach through application to ROSMAP single-cell transcriptomic data, uncovering biologically meaningful mediation effects that enhance understanding of disease mechanisms.
2025-07-08T15:52:47Z
24 pages and 3 figures
Seungjun Ahn, Li Chen, Maaike van Gerwen, Panos Roussos, Zhigang Li
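
The zero-inflated negative binomial (ZINB) distribution named in the last abstract can be sketched as a probability mass function. This is a generic textbook form, not the authors' code: the mean/dispersion parameterization and the names `nb_pmf`, `zinb_pmf`, `mu`, `theta`, and `pi` are assumptions made for illustration.

```python
import math


def nb_pmf(k, mu, theta):
    """Negative binomial pmf with mean mu > 0 and dispersion (size) theta > 0."""
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, theta, pi):
    """ZINB pmf: with probability pi the count is a structural zero,
    otherwise it is drawn from the negative binomial."""
    nb = nb_pmf(k, mu, theta)
    if k == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


# Illustrative parameters only (not taken from the paper).
probs = [zinb_pmf(k, mu=2.0, theta=1.5, pi=0.3) for k in range(500)]
```

Under this parameterization the mean count is `(1 - pi) * mu`, so the expressed proportion (`1 - pi`) and the expression level (`mu`) enter as separate quantities, which matches the abstract's distinction between the two.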