Predicting Gene Disease Associations in Type 2 Diabetes Using Machine Learning on Single-Cell RNA-Seq Data

2026-01-30T03:27:06Z

Diabetes is a chronic metabolic disorder characterized by elevated blood glucose levels due to impaired insulin production or function. Two main forms are recognized: type 1 diabetes (T1D), which involves autoimmune destruction of insulin-producing β-cells, and type 2 diabetes (T2D), which arises from insulin resistance and progressive β-cell dysfunction. Understanding the molecular mechanisms underlying these diseases is essential for the development of improved therapeutic strategies, particularly those targeting β-cell dysfunction. To investigate these mechanisms in a controlled and biologically interpretable setting, mouse models have played a central role in diabetes research. Owing to their genetic and physiological similarity to humans, together with the ability to precisely manipulate their genome, mice enable detailed investigation of disease progression and gene function. In particular, mouse models have provided critical insights into β-cell development, cellular heterogeneity, and functional failure under diabetic conditions. Building on these experimental advances, this study applies machine learning methods to single-cell transcriptomic data from mouse pancreatic islets. Specifically, we evaluate two supervised approaches identified in the literature, Extra Trees Classifier (ETC) and Partial Least Squares Discriminant Analysis (PLS-DA), to assess their ability to identify T2D-associated gene expression signatures at single-cell resolution. Model performance is evaluated using standard classification metrics, with an emphasis on interpretability and biological relevance.

Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram

2026-01-29T17:43:10Z

Current genomic foundation models (GFMs) rely on extensive neural computation to implicitly approximate conserved biological motifs from single-nucleotide inputs. We propose Gengram, a conditional memory module that introduces an explicit and highly efficient lookup primitive for multi-base motifs via a genomic-specific hashing scheme, establishing genomic "syntax". Integrated into the backbone of state-of-the-art GFMs, Gengram achieves substantial gains (up to 14%) across several functional genomics tasks. The module demonstrates robust architectural generalization, while further inspection of Gengram's latent space reveals the emergence of meaningful representations that align closely with fundamental biological knowledge. By establishing structured motif memory as a modeling primitive, Gengram simultaneously boosts empirical performance and mechanistic interpretability, providing a scalable and biology-aligned pathway for the next generation of GFMs. The code is available at https://github.com/zhejianglab/Genos, and the model checkpoint is available at https://huggingface.co/ZhejiangLab/Gengram.
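
The idea of an explicit lookup primitive for multi-base motifs can be illustrated with a hashed k-mer table. The sketch below is not the Gengram implementation: the base-4 encoding, the modulo fold, the table size, and the names `kmer_hash`, `build_table`, and `lookup_motifs` are all assumptions made for illustration.

```python
# Hypothetical illustration of a hashed multi-base motif lookup.
# NOT the Gengram hashing scheme: k, table size, and the hash are assumptions.
import random

BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer, num_buckets):
    """Encode a k-mer as a base-4 integer, then fold it into a bucket index."""
    code = 0
    for base in kmer:
        code = code * 4 + BASE_TO_INT[base]
    return code % num_buckets


def build_table(num_buckets, dim, seed=0):
    """Create a toy memory table with one random vector per bucket."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_buckets)]


def lookup_motifs(seq, k, table):
    """Return one memory vector for every position that starts a full k-mer."""
    num_buckets = len(table)
    return [
        table[kmer_hash(seq[i:i + k], num_buckets)]
        for i in range(len(seq) - k + 1)
    ]
```

In this toy version, identical motifs at different positions retrieve the same table row, which is the property that makes the lookup a shared memory rather than a per-position computation.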

scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics

2026-01-29T15:07:40Z

Training deep learning models on single-cell datasets with hundreds of millions of cells requires loading data from disk, as these datasets exceed available memory. While random sampling provides the data diversity needed for effective training, it is prohibitively slow due to the random access pattern overhead, whereas sequential streaming achieves high throughput but introduces biases that degrade model performance. We present scDataset, a PyTorch data loader that enables efficient training from on-disk data with seamless integration across diverse storage formats. Our approach combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. On Tahoe-100M, a dataset of 100 million cells, scDataset achieves more than two orders of magnitude speedup compared to true random sampling while working directly with AnnData files. We provide theoretical bounds on minibatch diversity and empirically show that scDataset matches the performance of true random sampling across multiple classification tasks.
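
The combination of block sampling and batched fetching described above can be sketched on plain index lists. This is a minimal illustration under stated assumptions, not the scDataset API: the function name `block_shuffled_batches` and the parameters `block_size` and `fetch_factor` are hypothetical, and a real loader would read the selected rows from disk instead of yielding indices.

```python
# Minimal sketch of quasi-random sampling via block sampling + batched fetching.
# Function and parameter names are hypothetical, not the scDataset API.
import random


def block_shuffled_batches(n_items, block_size, fetch_factor, batch_size, seed=0):
    """Yield minibatches of indices.

    1. Split [0, n_items) into contiguous blocks (each block is a cheap sequential read).
    2. Shuffle the order of the blocks.
    3. Fetch several blocks at once, shuffle the indices inside the fetched
       buffer, and cut the buffer into minibatches.
    """
    rng = random.Random(seed)
    blocks = [
        range(start, min(start + block_size, n_items))
        for start in range(0, n_items, block_size)
    ]
    rng.shuffle(blocks)

    # Number of blocks needed to fill roughly `fetch_factor` minibatches.
    blocks_per_fetch = max(1, (batch_size * fetch_factor) // block_size)

    for i in range(0, len(blocks), blocks_per_fetch):
        buffer = [idx for block in blocks[i:i + blocks_per_fetch] for idx in block]
        rng.shuffle(buffer)
        for j in range(0, len(buffer), batch_size):
            yield buffer[j:j + batch_size]
```

Larger blocks mean fewer random seeks, while a larger fetch buffer mixes more blocks into each minibatch; the two parameters trade I/O efficiency against minibatch diversity.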

WFR-MFM: One-Step Inference for Dynamic Unbalanced Optimal Transport

2026-01-28T13:41:52Z

Reconstructing dynamical evolution from limited observations is a fundamental challenge in single-cell biology, where dynamic unbalanced optimal transport provides a principled framework for modeling coupled transport and mass variation. However, existing approaches rely on trajectory simulation at inference time, making inference a key bottleneck for scalable applications. In this work, we propose a mean-flow framework for unbalanced flow matching that summarizes both transport and mass-growth dynamics over arbitrary time intervals using mean velocity and mass-growth fields, enabling fast one-step generation without trajectory simulation. To solve dynamic unbalanced optimal transport under the Wasserstein-Fisher-Rao geometry, we further build on this framework to develop Wasserstein-Fisher-Rao Mean Flow Matching (WFR-MFM). Across synthetic and real single-cell RNA sequencing datasets, WFR-MFM achieves orders-of-magnitude faster inference than a range of existing baselines while maintaining high predictive accuracy, and enables efficient perturbation response prediction on large synthetic datasets with thousands of conditions.

LaCoGSEA: Unsupervised deep learning for pathway analysis via latent correlation

2026-01-27T15:08:11Z

Motivation: Pathway enrichment analysis is widely used to interpret gene expression data. Standard approaches, such as GSEA, rely on predefined phenotypic labels and pairwise comparisons, which limits their applicability in unsupervised settings. Existing unsupervised extensions, including single-sample methods, provide pathway-level summaries but primarily capture linear relationships and do not explicitly model gene-pathway associations. More recently, deep learning models have been explored to capture non-linear transcriptomic structure. However, their interpretation has typically relied on generic explainable AI (XAI) techniques designed for feature-level attribution. As these methods are not designed for pathway-level interpretation in unsupervised transcriptomic analyses, their effectiveness in this setting remains limited. Results: To bridge this gap, we introduce LaCoGSEA (Latent Correlation GSEA), an unsupervised framework that integrates deep representation learning with robust pathway statistics. LaCoGSEA employs an autoencoder to capture non-linear manifolds and proposes a global gene-latent correlation metric as a proxy for differential expression, generating dense gene rankings without prior labels. We demonstrate that LaCoGSEA offers three key advantages: (i) it achieves improved clustering performance in distinguishing cancer subtypes compared to existing unsupervised baselines; (ii) it recovers a broader range of biologically meaningful pathways at higher ranks compared with linear dimensionality reduction and gradient-based XAI methods; and (iii) it maintains high robustness and consistency across varying experimental protocols and dataset sizes. Overall, LaCoGSEA provides state-of-the-art performance in unsupervised pathway enrichment analysis. Availability and implementation: https://github.com/willyzzz/LaCoGSEA
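
The idea of using gene-latent correlation as a label-free ranking can be sketched as follows. This is an illustration only, not the LaCoGSEA code: it assumes the latent coordinates have already been produced by some encoder, ranks genes against a single latent dimension with Pearson correlation, and does not reproduce how the paper aggregates correlations into a global metric. The names `pearson` and `rank_genes_by_latent` are hypothetical.

```python
# Hypothetical sketch: rank genes by correlation with one latent dimension.
# The encoder, the choice of Pearson correlation, and the per-dimension
# ranking are assumptions; the paper's global metric is not reproduced here.
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_genes_by_latent(expr, latent, gene_names, dim):
    """Rank genes from most positively to most negatively correlated.

    expr:   samples x genes matrix (list of rows)
    latent: samples x latent-dims matrix (list of rows)
    """
    z = [row[dim] for row in latent]
    scores = {
        gene: pearson([row[j] for row in expr], z)
        for j, gene in enumerate(gene_names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The resulting dense ranking has the same shape as a differential-expression ranking, so it could in principle be passed to a preranked enrichment procedure.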

GenPairX: A Hardware-Algorithm Co-Designed Accelerator for Paired-End Read Mapping

2026-01-27T09:16:27Z

Genome sequencing has become a central focus in computational biology. A genome study typically begins with sequencing, which produces millions to billions of short DNA fragments known as reads. Read mapping aligns these reads to a reference genome. Read mapping for short reads comes in two forms: single-end and paired-end, with the latter being more prevalent due to its higher accuracy and support for advanced analysis. Read mapping remains a major performance bottleneck in genome analysis due to expensive dynamic programming. Prior efforts have attempted to mitigate this cost by employing filters to identify and potentially discard computationally expensive matches and leveraging hardware accelerators to speed up the computations. While partially effective, these approaches have limitations. In particular, existing filters are often ineffective for paired-end reads, as they evaluate each read independently and exhibit relatively low filtering ratios. In this work, we propose GenPairX, a hardware-algorithm co-designed accelerator that efficiently minimizes the computational load of paired-end read mapping while enhancing the throughput of memory-intensive operations. GenPairX introduces: (1) a novel filtering algorithm that jointly considers both reads in a pair to improve filtering effectiveness, and a lightweight alignment algorithm to replace most of the computationally expensive dynamic programming operations, and (2) two specialized hardware mechanisms to support the proposed algorithms. Our evaluations show that GenPairX delivers substantial performance improvements over state-of-the-art solutions, achieving 1575x and 1.43x higher throughput per watt compared to leading CPU-based and accelerator-based read mappers, respectively, all without compromising accuracy.
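
To illustrate why considering both reads of a pair jointly can filter more candidates than evaluating each read alone, the sketch below keeps only candidate location pairs whose distance is compatible with an expected insert-size window. It is not the GenPairX algorithm: the function name `paired_filter`, the insert-size bounds, and the simplification that the mate always lies downstream are assumptions.

```python
# Hypothetical sketch of a pair-aware candidate filter.
# NOT the GenPairX filter: orientation handling is simplified (mate assumed
# downstream) and the insert-size window is an illustrative parameter.


def paired_filter(cands_r1, cands_r2, min_insert, max_insert):
    """Return (pos1, pos2) pairs with min_insert <= pos2 - pos1 <= max_insert.

    cands_r1, cands_r2: ascending lists of candidate reference positions,
    e.g. produced by seeding each read independently.
    """
    kept = []
    lo = 0  # first index in cands_r2 that can still pair with the current pos1
    for pos1 in cands_r1:
        while lo < len(cands_r2) and cands_r2[lo] < pos1 + min_insert:
            lo += 1
        j = lo
        while j < len(cands_r2) and cands_r2[j] <= pos1 + max_insert:
            kept.append((pos1, cands_r2[j]))
            j += 1
    return kept
```

For example, `paired_filter([100, 5000, 90000], [350, 400, 5300, 70000], 200, 500)` discards the candidates at 90000 and 70000 because neither has a compatible mate, so they never reach the expensive alignment step.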

FASTR: Reimagining FASTQ via Compact Image-inspired Representation

2026-01-23T21:24:05Z

Motivation: High-throughput sequencing (HTS) enables population-scale genomics but generates massive datasets, creating bottlenecks in storage, transfer, and analysis. FASTQ, the standard format for over two decades, stores one byte per base and one byte per quality score, leading to inefficient I/O, high storage costs, and redundancy. Existing compression tools can mitigate some issues, but often introduce costly decompression or complex dependency issues. Results: We introduce FASTR, a lossless, computation-native successor to FASTQ that encodes each nucleotide together with its base quality score into a single 8-bit value. FASTR reduces file size by at least 2x while remaining fully reversible and directly usable for downstream analyses. Applying general-purpose compression tools on FASTR consistently yields higher compression ratios, 2.47, 3.64, and 4.8x faster compression, and 2.34, 1.96, 1.75x faster decompression than on FASTQ across Illumina, HiFi, and ONT reads. FASTR is machine-learning-ready, allowing reads to be consumed directly as numerical vectors or image-like representations. We provide a highly parallel software ecosystem for FASTQ-FASTR conversion and show that FASTR integrates with existing tools, such as minimap2, with minimal interface changes and no performance overhead. By eliminating decompression costs and reducing data movement, FASTR lays the foundation for scalable genomics analyses and real-time sequencing workflows. Availability and Implementation: https://github.com/ALSER-Lab/FASTR
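
One way to fit a nucleotide and its quality score into a single 8-bit value is to use 2 bits for the base and 6 bits for the Phred score. The sketch below shows that packing; it is an assumption for illustration, not the FASTR specification. In particular, the bit layout, the restriction to A/C/G/T, the Phred range 0 to 63, and the names `pack`, `unpack`, `encode_read`, and `decode_read` are hypothetical, and the abstract does not say how ambiguous bases such as N are handled.

```python
# Hypothetical 2-bit base + 6-bit quality packing.
# NOT the FASTR format: the bit layout and the lack of N support are assumptions.

BASES = "ACGT"


def pack(base, phred):
    """Pack one base (A/C/G/T) and one Phred score (0..63) into an int 0..255."""
    if base not in BASES:
        raise ValueError(f"unsupported base: {base!r}")
    if not 0 <= phred < 64:
        raise ValueError(f"Phred score out of range: {phred}")
    return (BASES.index(base) << 6) | phred


def unpack(value):
    """Inverse of pack: return (base, phred)."""
    return BASES[value >> 6], value & 0x3F


def encode_read(seq, qual_ascii, offset=33):
    """Encode a FASTQ sequence line and quality line into one byte per position."""
    if len(seq) != len(qual_ascii):
        raise ValueError("sequence and quality lengths differ")
    return bytes(pack(b, ord(q) - offset) for b, q in zip(seq, qual_ascii))


def decode_read(data, offset=33):
    """Recover the original sequence and quality strings."""
    pairs = [unpack(v) for v in data]
    seq = "".join(base for base, _ in pairs)
    qual = "".join(chr(phred + offset) for _, phred in pairs)
    return seq, qual
```

Because the mapping is a bijection on the supported inputs, the encoding is lossless and halves the two bytes per position that FASTQ spends on base plus quality.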

Ultrafast topological data analysis reveals pandemic-scale dynamics of convergent evolution

2026-01-23T18:33:54Z

Genome variants which re-occur independently across evolutionary lineages are key molecular signatures of adaptation. Inferring the dynamics of such genetic changes from pandemic-scale genomic datasets is now possible, which opens up unprecedented insight into evolutionary processes. However, existing approaches depend on the construction of accurate phylogenetic trees, which remains challenging at scale. Here we present EVOtRec, an organism-agnostic, fast and scalable Topological Data Analysis approach that enables the inference of convergently evolving genomic variants over time directly from topological patterns in the dataset, without requiring the construction of a phylogenetic tree. Using data from both simulations and published experiments, we show that EVOtRec can robustly identify variants under positive selection and performs orders of magnitude faster than state-of-the-art phylogeny-based approaches, with comparable results. We apply EVOtRec to three large viral genome datasets: SARS-CoV-2, influenza virus A subtype H5N1 and HIV-1. We identify key convergent genome variants and demonstrate how EVOtRec facilitates the real-time tracking of high fitness variants in large datasets with millions of genomes, including effects modulated by varying genomic backgrounds. We envision our Topological Data Analysis approach as a new framework for efficient comparative genomics.

SAGe: A Lightweight Algorithm-Architecture Co-Design for Mitigating the Data Preparation Bottleneck in Large-Scale Genome Sequence Analysis

2026-01-22T22:10:09Z

Genome sequence analysis, which examines the DNA sequences of organisms, drives advances in many critical medical and biotechnological fields. Given its importance and the exponentially growing volumes of genomic sequence data, there are extensive efforts to accelerate genome sequence analysis. In this work, we demonstrate a major bottleneck that greatly limits and diminishes the benefits of state-of-the-art genome sequence analysis accelerators: the data preparation bottleneck, where genomic sequence data is stored in compressed form and needs to be first decompressed and formatted before an accelerator can operate on it. To mitigate this bottleneck, we propose SAGe, an algorithm-architecture co-design for highly-compressed storage and high-performance access of large-scale genomic sequence data. The key challenge is to improve data preparation performance while maintaining high compression ratios (comparable to genomic-specific compression algorithms) at low hardware cost. We address this challenge by leveraging key properties of genomic datasets to co-design (i) a lossless (de)compression algorithm, (ii) hardware that decompresses data with lightweight operations and efficient streaming accesses, (iii) storage data layout, and (iv) interface commands to access data. SAGe is highly versatile, as it supports datasets from different sequencing technologies and species. Due to its lightweight design, SAGe can be seamlessly integrated with a broad range of hardware accelerators for genome sequence analysis to mitigate their data preparation bottlenecks. Our results demonstrate that SAGe improves the average end-to-end performance and energy efficiency of two state-of-the-art genome sequence analysis accelerators by 3.0x-32.1x and 13.0x-34.0x, respectively, compared to when the accelerators rely on state-of-the-art software and hardware decompression tools.

PhageMind: Generalized Strain-level Phage Host Range Prediction via Meta-learning

2026-01-22T12:03:50Z

Bacteriophages (phages) are key regulators of bacterial populations and hold great promise for applications such as phage therapy, biocontrol, and industrial fermentation. The success of these applications depends on accurately determining phage host range, which is often specific at the strain level rather than the species level. However, existing computational approaches face major limitations: many rely on genus-specific features that do not generalize across taxa, while others require large amounts of training data that are unavailable for most bacterial lineages. These challenges create a critical need for methods that can accurately predict strain-level phage-host interactions across diverse bacterial genera, particularly under data-limited conditions. We present PhageMind, a learning framework designed to address this challenge by enabling efficient transfer of knowledge across bacterial genera. PhageMind is trained to identify shared principles of phage-bacterium interactions from well-studied systems and to rapidly adapt these principles to new genera using only a small number of known interactions. To reflect the biological basis of infection, we represent phage-host relationships using a knowledge graph that explicitly incorporates phage tail fiber proteins and bacterial O-antigen biosynthesis gene clusters, and we use this representation to guide interaction prediction. Across four bacterial genera (Escherichia, Klebsiella, Vibrio, and Alteromonas), PhageMind achieves high prediction accuracy and shows strong adaptability to new lineages. In particular, in leave-one-genus-out evaluations, the model maintains robust performance when only limited reference data are available, demonstrating its potential as a scalable and practical tool for studying phage-host interactions across the global phageome.

SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model

2026-01-21T22:22:38Z

Spatial transcriptomics enables spatial gene expression profiling, motivating computational models that capture spatially conditioned regulatory relationships. We introduce SAGE-FM, a lightweight spatial transcriptomics foundation model based on graph convolutional networks (GCNs) trained with a masked central spot prediction objective. Trained on 416 human Visium samples spanning 15 organs, SAGE-FM learns spatially coherent embeddings that robustly recover masked genes, with 91% of masked genes showing significant correlations (p < 0.05). The embeddings generated by SAGE-FM outperform MOFA and existing spatial transcriptomics methods in unsupervised clustering and preservation of biological heterogeneity. SAGE-FM generalizes to downstream tasks, enabling 81% accuracy in pathologist-defined spot annotation in oropharyngeal squamous cell carcinoma and improving glioblastoma subtype prediction relative to MOFA. In silico perturbation experiments further demonstrate that the model captures directional ligand-receptor and upstream-downstream regulatory effects consistent with ground truth. These results demonstrate that simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics.

Biological Sequence Clustering: A Survey

2026-01-21T03:52:44Z

The rapid development of high-throughput sequencing technologies has led to an explosive increase in biological sequence data, making sequence clustering a fundamental task in large-scale bioinformatics analyses. Unlike traditional clustering problems, biological sequence clustering faces unique challenges due to the lack of direct similarity measures, strict biological constraints, and demanding requirements for both scalability and accuracy. Over the past decades, a wide variety of methods have been developed, differing in how they model sequence similarity, construct clusters, and prioritize optimization objectives. In this review, we provide a comprehensive methodological overview of biological sequence clustering algorithms. We begin by summarizing the main strategies for modeling sequence similarity, which can be divided into three stages: sequence encoding, feature generation, and similarity measurement. Next, we discuss the major clustering paradigms, including greedy incremental, hierarchical, graph-based, model-based, partitional, and deep learning approaches, highlighting their methodological characteristics and practical trade-offs. We then discuss clustering objectives from three key perspectives: scalability and resource efficiency, biological interpretability, and robustness and clustering quality. Organizing existing methods along these dimensions allows us to explore the trade-offs in biological sequence clustering and clarify the contexts in which different approaches are most appropriate. Finally, we identify current limitations and challenges, providing guidance for researchers and directions for future method development.
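
Of the paradigms listed above, greedy incremental clustering is the simplest to sketch: sequences are processed from longest to shortest, and each one either joins the first existing representative it is similar enough to or becomes a new representative. The version below is a toy illustration, not any specific tool surveyed: it uses Jaccard similarity over k-mer sets as an alignment-free stand-in for sequence identity, and `k` and `threshold` are arbitrary illustrative values.

```python
# Toy greedy incremental clustering with k-mer Jaccard similarity.
# The similarity measure, k, and threshold are illustrative assumptions.


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_incremental_cluster(seqs, k=3, threshold=0.5):
    """Return {representative_index: [member_indices]}.

    Sequences are visited longest first; each joins the first representative
    whose k-mer Jaccard similarity reaches the threshold, else starts a cluster.
    """
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]), reverse=True)
    representatives = []  # list of (index, k-mer set)
    clusters = {}
    for i in order:
        ks = kmers(seqs[i], k)
        for rep_idx, rep_ks in representatives:
            if jaccard(ks, rep_ks) >= threshold:
                clusters[rep_idx].append(i)
                break
        else:
            representatives.append((i, ks))
            clusters[i] = [i]
    return clusters
```

The single pass makes the method scalable, but the result depends on input order and on the first-match rule, which is one of the practical trade-offs the survey discusses.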

engGNN: A Dual-Graph Neural Network for Omics-Based Disease Classification and Feature Selection

2026-01-20T23:18:56Z

Omics data, such as transcriptomics, proteomics, and metabolomics, provide critical insights into disease mechanisms and clinical outcomes. However, their high dimensionality, small sample sizes, and intricate biological networks pose major challenges for reliable prediction and meaningful interpretation. Graph Neural Networks (GNNs) offer a promising way to integrate prior knowledge by encoding feature relationships as graphs. Yet, existing methods typically rely solely on either an externally curated feature graph or a data-driven generated one, which limits their ability to capture complementary information. To address this, we propose the external and generated Graph Neural Network (engGNN), a dual-graph framework that jointly leverages both external known biological networks and data-driven generated graphs. Specifically, engGNN constructs a biologically informed undirected feature graph from established network databases and complements it with a directed feature graph derived from tree-ensemble models. This dual-graph design produces more comprehensive embeddings, thereby improving predictive performance and interpretability. Through extensive simulations and real-world applications to gene expression data, engGNN consistently outperforms state-of-the-art baselines. Beyond classification, engGNN provides interpretable feature importance scores that facilitate biologically meaningful discoveries, such as pathway enrichment analysis. Taken together, these results highlight engGNN as a robust, flexible, and interpretable framework for disease classification and biomarker discovery in high-dimensional omics contexts.

Generative Language Models on Nucleotide Sequences of Human Genes

2026-01-20T17:19:40Z

Language models, especially transformer-based ones, have achieved great success in NLP, with BERT for natural language understanding (NLU) and GPT-3 for natural language generation (NLG) as prominent examples. If we consider DNA sequences as text written in an alphabet of four letters representing the nucleotides, they are similar in structure to natural languages. This similarity has led to the development of discriminative language models such as DNABert in DNA-related bioinformatics. To our knowledge, however, the generative side is still largely unexplored. We therefore focus on developing an autoregressive generative language model, in the style of GPT-3, for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we conduct our study on a smaller scale and focus on nucleotide sequences of human genes rather than the whole DNA. This decision does not change the structure of the problem, as both DNA and genes can be treated as 1D sequences consisting of four different nucleotides without losing much information and without oversimplification. First, we systematically studied this almost entirely unexplored problem and observed that RNNs perform best, while simple techniques such as N-grams are also promising. A further benefit was learning how to work with generative models on a language that, unlike natural languages, we do not understand. We also note the importance of evaluating on real-world tasks beyond classical metrics such as perplexity. In addition, we examined whether the data-hungry nature of these models changes when the language has a minimal vocabulary, here four symbols for the four nucleotide types, on the grounds that such a language might make the problem easier. However, we found that this did not substantially change the amount of data required.
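
The N-gram baseline and the perplexity metric mentioned above can be sketched on a four-letter alphabet. This is a minimal illustration, not the authors' setup: the add-one smoothing, the toy training string, and the function names `train_ngram`, `prob`, and `perplexity` are assumptions.

```python
# Minimal N-gram model over the nucleotide alphabet with add-one smoothing.
# Smoothing choice and function names are illustrative assumptions.
import math
from collections import Counter

ALPHABET = "ACGT"


def train_ngram(seqs, n):
    """Count n-grams and their (n-1)-length contexts in the training sequences."""
    ngram_counts, context_counts = Counter(), Counter()
    for seq in seqs:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def prob(gram, ngram_counts, context_counts):
    """P(last symbol | context) with add-one smoothing over four symbols."""
    return (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + len(ALPHABET))


def perplexity(seq, n, ngram_counts, context_counts):
    """Exponentiated average negative log-probability of the n-grams in seq."""
    total = len(seq) - n + 1
    if total <= 0:
        raise ValueError("sequence is shorter than n")
    log_sum = sum(
        math.log(prob(seq[i:i + n], ngram_counts, context_counts))
        for i in range(total)
    )
    return math.exp(-log_sum / total)
```

With no training data every symbol has probability 1/4, so the perplexity equals the vocabulary size of four; a trained model should score below that on sequences resembling its training data.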

Multimodal Spatial Omics: From Data Acquisition to Computational Integration

2026-01-18T12:28:14Z

Recent developments in spatial omics technologies have enabled the generation of high-dimensional molecular data, such as transcriptomes, proteomes, and epigenomes, within their spatial tissue context, either through co-profiling on the same slice or through serial tissue sections. These datasets, which are often complemented by images, have given rise to multimodal frameworks that capture both the cellular and architectural complexity of tissues across multiple molecular layers. Integrating such multimodal data poses significant computational challenges due to differences in scale, resolution, and data modality. In this review, we present a comprehensive overview of computational methods developed to integrate multimodal spatial omics and imaging datasets. We highlight key algorithmic principles underlying these methods, ranging from probabilistic to the latest deep learning approaches.