http://arxiv.org/abs/2511.11717v1 Multiscale Grassmann Manifolds for Single-Cell Data Analysis 2025-11-12T19:47:10Z

Single-cell data analysis seeks to characterize cellular heterogeneity based on high-dimensional gene expression profiles. Conventional approaches represent each cell as a vector in Euclidean space, which limits their ability to capture intrinsic correlations and multiscale geometric structures. We propose a multiscale framework based on Grassmann manifolds that integrates machine learning with subspace geometry for single-cell data analysis. By generating embeddings under multiple representation scales, the framework combines their features from different geometric views into a unified Grassmann manifold. A power-based scale sampling function is introduced to control the selection of scales and balance information across resolutions. Experiments on nine benchmark single-cell RNA-seq datasets demonstrate that the proposed approach effectively preserves meaningful structures and provides stable clustering performance, particularly for small to medium-sized datasets. These results suggest that Grassmann manifolds offer a coherent and informative foundation for analyzing single-cell data.
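
As a rough illustration of the subspace geometry mentioned above, the sketch below computes the standard geodesic distance between two points on a Grassmann manifold from their principal angles. It is not taken from the paper: the function name `grassmann_distance` and the toy inputs are assumptions made here, and the paper's multiscale embedding and power-based scale sampling are not reproduced.

```python
import numpy as np


def grassmann_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Geodesic distance between the column spaces of a and b.

    Both inputs are (n_features, k) matrices with linearly independent
    columns. (Hypothetical helper, for illustration only.)
    """
    # Orthonormal bases for the two subspaces.
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    # The geodesic distance is the 2-norm of the principal angles.
    return float(np.linalg.norm(angles))


# Two lines in the plane: the x-axis and the y-axis span orthogonal subspaces.
x_axis = np.array([[1.0], [0.0]])
y_axis = np.array([[0.0], [1.0]])
print(grassmann_distance(x_axis, y_axis))
```

Because the distance depends only on the spanned subspace, rescaling or re-mixing the basis columns leaves it unchanged, which is the property that distinguishes this view from treating each cell as a single Euclidean vector.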

2025-11-12T19:47:10Z Xiang Xiang Wang Sean Cottrell Guo-Wei Wei http://arxiv.org/abs/2511.09590v1 A Graphical Method for Identifying Gene Clusters from RNA Sequencing Data 2025-11-12T13:52:28Z

The identification of disease-gene associations is instrumental in understanding the mechanisms of diseases and developing novel treatments. Besides identifying genes from RNA-Seq datasets, it is often necessary to identify gene clusters that have relationships with a disease. In this work, we propose a graph-based method that takes an RNA-Seq dataset of genes known to be related to a disease and performs a robust clustering analysis to identify clusters of genes. Our method involves the construction of a gene co-expression network, followed by the computation of gene embeddings leveraging Node2Vec+, an algorithm applying weighted biased random walks and skipgram with negative sampling to compute node embeddings from undirected graphs with weighted edges. Finally, we perform spectral clustering to identify clusters of genes. All steps of our method are jointly optimized for stability, robustness, and optimality by applying the Tree-structured Parzen Estimator. Our method was applied to an RNA-Seq dataset of known genes that have associations with Age-related Macular Degeneration (AMD). We also performed tests to validate the robustness and statistical significance of our method, given the stochastic nature of the processes involved. Our results show that our method is capable of generating consistent and robust clustering results. Our method can be seamlessly applied to other RNA-Seq datasets because the joint optimization ensures the stability and optimality of each step, including the construction of a gene co-expression network, computation of gene embeddings, and clustering of genes. Our work will aid in the discovery of natural structures in RNA-Seq data and in understanding gene regulation and gene functions, not just for AMD but for any disease in general.

2025-11-12T13:52:28Z Jake R. Patock Rinki Ratnapriya Arko Barman http://arxiv.org/abs/2511.09026v1 DeepVRegulome: DNABERT-based deep-learning framework for predicting the functional impact of short genomic variants on the human regulome 2025-11-12T06:25:31Z

Whole-genome sequencing (WGS) has revealed numerous non-coding short variants whose functional impacts remain poorly understood. Despite recent advances in deep-learning genomic approaches, accurately predicting and prioritizing clinically relevant mutations in gene regulatory regions remains a major challenge. Here we introduce DeepVRegulome, a deep-learning method for prediction and interpretation of functionally disruptive variants in the human regulome, which combines 700 DNABERT fine-tuned models, trained on vast amounts of ENCODE gene regulatory regions, with variant scoring, motif analysis, attention-based visualization, and survival analysis. We showcase its application on the TCGA glioblastoma WGS dataset in prioritizing survival-associated mutations and regulatory regions. The analysis identified 572 splice-disrupting and 9,837 transcription-factor binding site altering mutations occurring in greater than 10% of glioblastoma samples. Survival analysis linked 1,352 mutations and 563 disrupted regulatory regions to patient outcomes, enabling stratification via non-coding mutation signatures. All the code, fine-tuned models, and an interactive data portal are publicly available.

2025-11-12T06:25:31Z Pratik Dutta Matthew Obusan Rekha Sathian Max Chao Pallavi Surana Nimisha Papineni Yanrong Ji Zhihan Zhou Han Liu Alisa Yurovsky Ramana V Davuluri http://arxiv.org/abs/2511.07219v1 Integrating Epigenetic and Phenotypic Features for Biological Age Estimation in Cancer Patients via Multimodal Learning 2025-11-10T15:42:14Z

Biological age, which may be older or younger than chronological age due to factors such as genetic predisposition and environmental exposures, serves as a meaningful biomarker of aging processes and can inform risk stratification, treatment planning, and survivorship care in cancer patients. We propose EpiCAge, a multimodal framework that integrates epigenetic and phenotypic data to improve biological age prediction. Evaluated on eight internal and four external cancer cohorts, EpiCAge consistently outperforms existing epigenetic and phenotypic age clocks. Our analyses show that EpiCAge identifies biologically relevant markers, and its derived age acceleration is significantly associated with mortality risk. These results highlight EpiCAge as a promising multimodal machine learning tool for biological age assessment in oncology.

2025-11-10T15:42:14Z In Proceedings of The 19th IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2025) Shuyue Jiang Wenjing Ma Shaojun Yu Chang Su Runze Yan Jiaying Lu http://arxiv.org/abs/2511.04637v2 Advancing Risk Gene Discovery Across the Allele Frequency Spectrum 2025-11-10T14:54:48Z

The discovery of genetic risk factors has transformed human genetics, yet the pace of new gene identification has slowed despite the exponential expansion of sequencing and biobank resources. Current approaches are optimized for the extremes of the allele frequency spectrum: rare, high-penetrance variants identified through burden testing, and common, low-effect variants mapped by genome-wide association studies. Between these extremes lie variants of intermediate frequency and effect size where statistical power is limited, pathogenicity is often misclassified, and gene discovery lags behind empirical evidence of heritable contribution. This 'missing middle' represents a critical blind spot across disease areas, from neurodevelopmental and psychiatric disorders to cancer and aging. In this review, we organize strategies for risk gene identification by variant frequency class, highlighting methodological strengths and constraints at each scale. We draw on lessons across fields to illustrate how innovations in variant annotation, joint modeling, phenotype refinement, and network-based inference can extend discovery into the intermediate range. By framing the frequency spectrum as a unifying axis, we provide a conceptual map of current capabilities, their limitations, and emerging directions toward more comprehensive risk gene discovery.

2025-11-06T18:32:19Z Review; 31 pages Madison Caballero Behrang Mahjani http://arxiv.org/abs/2509.25884v2 scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis 2025-11-10T03:55:13Z

Single-cell RNA sequencing (scRNA-seq) technology enables systematic delineation of cellular states and interactions, providing crucial insights into cellular heterogeneity. Building on this potential, numerous computational methods have been developed for tasks such as cell clustering, cell type annotation, and marker gene identification. To fully assess and compare these methods, standardized, analysis-ready datasets are essential. However, such datasets remain scarce, and variations in data formats, preprocessing workflows, and annotation strategies hinder reproducibility and complicate systematic evaluation of existing methods. To address these challenges, we present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data that consolidates 13 high-quality datasets spanning two species (human and mouse) and nine tissue types. All datasets undergo standardized quality control and preprocessing and are stored in a uniform format to enable direct application in diverse computational analyses without additional data cleaning. We further demonstrate the utility of scUnified through experimental analyses of representative biological tasks, providing a reproducible foundation for the standardized evaluation of computational methods on a unified dataset.

2025-09-30T07:23:01Z Ping Xu Zaitian Wang Zhirui Wang Pengjiang Li Ran Zhang Gaoyang Li Hanyu Xie Jiajia Wang Yuanchun Zhou Pengfei Wang http://arxiv.org/abs/2511.05992v1 Shared and distinct exonic parts in alternative paths of splicing bubbles 2025-11-08T12:42:13Z

Alternative splicing creates complex bubbles in splicing graphs where more than two transcript paths compete, challenging methods designed for simple binary events. We present a unified framework that compares paths using distinct exonic parts observed directly from reads. We build a GrASE splicing graph (DAG) per gene, enumerate bubbles, and quantify shared and distinct exonic parts across three comparison structures: (i) all-pairwise contrasts, (ii) a multinomial n-way comparison, and (iii) valid bipartitions of paths. For (iii) we introduce lower-set bipartitioning, which respects subset relations among paths by enumerating downward-closed sets in a containment graph, yielding valid two-group splits with nonempty distinguishing parts. Our test statistic is the fraction of reads mapped to distinct parts relative to distinct + shared parts, enabling differential usage across samples. Applied to genome annotations, the approach examines more bubbles than prior tools while remaining tractable and interpretable.
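
The test statistic described above is a simple ratio, which the following minimal sketch spells out. The function name `distinct_fraction` and the read counts are hypothetical, chosen only for illustration; how reads are assigned to distinct and shared exonic parts is not specified here and is left out.

```python
def distinct_fraction(distinct_reads: int, shared_reads: int) -> float:
    """Fraction of reads on distinct parts relative to distinct + shared parts.

    Hypothetical helper: the inputs are read counts already assigned to the
    distinct and shared exonic parts of one comparison in one sample.
    """
    if distinct_reads < 0 or shared_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = distinct_reads + shared_reads
    if total == 0:
        raise ValueError("no reads mapped to distinct or shared parts")
    return distinct_reads / total


# Made-up counts for two samples of the same bubble comparison.
sample_a = distinct_fraction(distinct_reads=30, shared_reads=70)
sample_b = distinct_fraction(distinct_reads=60, shared_reads=40)
print(sample_a, sample_b)
```

Comparing this fraction between samples or conditions is what would indicate differential usage; the statistical test applied to it is not stated in the abstract and is therefore not sketched.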

2025-11-08T12:42:13Z 7 pages, 2 figures Daniel Witoslawski Jelard Aquino Chuanchuan He Mira V. Han http://arxiv.org/abs/2511.03976v1 PETRA: Pretrained Evolutionary Transformer for SARS-CoV-2 Mutation Prediction 2025-11-06T01:58:23Z

Since its emergence, SARS-CoV-2 has demonstrated a rapid and unpredictable evolutionary trajectory, characterized by the continual emergence of immune-evasive variants. This poses persistent challenges to public health and vaccine development. While large-scale generative pre-trained transformers (GPTs) have revolutionized the modeling of sequential data, their direct applications to noisy viral genomic sequences are limited. In this paper, we introduce PETRA (Pretrained Evolutionary TRAnsformer), a novel transformer approach based on evolutionary trajectories derived from phylogenetic trees rather than raw RNA sequences. This method effectively mitigates sequencing noise and captures the hierarchical structure of viral evolution. With a weighted training framework to address substantial geographical and temporal imbalances in global sequence data, PETRA excels in predicting future SARS-CoV-2 mutations, achieving a weighted recall@1 of 9.45% for nucleotide mutations and 17.10% for spike amino-acid mutations, compared to 0.49% and 6.64%, respectively, for the best baseline. PETRA also demonstrates its ability to aid in the real-time mutation prediction of major clades like 24F(XEC) and 25A(LP.8.1). The code is open-sourced at https://github.com/xz-keg/PETra

2025-11-06T01:58:23Z preprint Xu Zou http://arxiv.org/abs/2411.06635v4 scMEDAL for the interpretable analysis of single-cell transcriptomics data with batch effect visualization using a deep mixed effects autoencoder 2025-11-05T22:46:50Z

Single-cell RNA sequencing enables high-resolution analysis of cellular heterogeneity, yet disentangling biological signal from batch effects remains a major challenge. Existing batch-correction algorithms suppress or discard batch-related variation rather than modeling it. We propose scMEDAL, single-cell Mixed Effects Deep Autoencoder Learning, a framework that separately models batch-invariant and batch-specific effects using two complementary subnetworks. The principal innovation, scMEDAL-RE, is a random-effects Bayesian autoencoder that learns batch-specific representations while preserving biologically meaningful information confounded with batch effects, a signal often lost under standard correction. Complementing it, the fixed-effects subnetwork, scMEDAL-FE, trained via adversarial learning, provides a default batch-correction component. Evaluations across diverse conditions (autism, leukemia, cardiovascular), cell types, and technical and biological effects show that scMEDAL-RE produces interpretable, batch-specific embeddings that complement both scMEDAL-FE and established correction methods (scVI, Scanorama, Harmony, SAUCIE), yielding more accurate prediction of disease status, donor group, and tissue. scMEDAL also provides generative visualizations, including counterfactual reconstructions of a cell's expression as if acquired in another batch. The framework allows substitution of the fixed-effects component with other correction methods, while retaining scMEDAL-RE's enhanced predictive power and visualization. Overall, scMEDAL is a versatile, interpretable framework that complements existing correction methods, providing enhanced insight into cellular heterogeneity and data acquisition.

2024-11-11T00:10:48Z Main manuscript: 32 pages, including 8 figures and 1 table. Supplemental material: 23 pages Aixa X. Andrade Son Nguyen Austin Marckx Albert Montillo http://arxiv.org/abs/2408.08503v2 Computational strategies for cross-species knowledge transfer 2025-11-04T18:03:08Z

Research organisms provide invaluable insights into human biology and diseases, serving as essential tools for functional experiments, disease modeling, and drug testing. However, evolutionary divergence between humans and research organisms hinders effective knowledge transfer across species. Here, we review state-of-the-art methods for computationally transferring knowledge across species, primarily focusing on methods that utilize transcriptome data and/or molecular networks. Our review addresses four key areas: (1) transferring disease and gene annotation knowledge across species, (2) identifying functionally equivalent molecular components, (3) inferring equivalent perturbed genes or gene sets, and (4) identifying equivalent cell types. We conclude with an outlook on future directions and several key challenges that remain in cross-species knowledge transfer, including introducing the concept of "agnology" to describe functional equivalence of biological entities, regardless of their evolutionary origins. This concept is becoming pervasive in integrative data-driven models where evolutionary origins of functions can remain unresolved.

2024-08-16T03:01:35Z Hao Yuan Christopher A. Mancuso Kayla Johnson Ingo Braasch Arjun Krishnan http://arxiv.org/abs/2511.02888v1 NABench: Large-Scale Benchmarks of Nucleotide Foundation Models for Fitness Prediction 2025-11-04T14:28:01Z

Nucleotide sequence variation can induce significant shifts in functional fitness. Recent nucleotide foundation models promise to predict such fitness effects directly from sequence, yet heterogeneous datasets and inconsistent preprocessing make it difficult to compare methods fairly across DNA and RNA families. Here we introduce NABench, a large-scale, systematic benchmark for nucleic acid fitness prediction. NABench aggregates 162 high-throughput assays and curates 2.6 million mutated sequences spanning diverse DNA and RNA families, with standardized splits and rich metadata. We show that NABench surpasses prior nucleotide fitness benchmarks in scale, diversity, and data quality. Under a unified evaluation suite, we rigorously assess 29 representative foundation models across zero-shot, few-shot prediction, transfer learning, and supervised settings. The results quantify performance heterogeneity across tasks and nucleic-acid types, demonstrating clear strengths and failure modes for different modeling choices and establishing strong, reproducible baselines. We release NABench to advance nucleic acid modeling, supporting downstream applications in RNA/DNA design, synthetic biology, and biochemistry. Our code is available at https://github.com/mrzzmrzz/NABench.

2025-11-04T14:28:01Z Zhongmin Li Runze Ma Jiahao Tan Chengzi Tan Shuangjia Zheng http://arxiv.org/abs/2410.09964v2 Lower-dimensional projections of cellular expression improves cell type classification from single-cell RNA sequencing 2025-11-04T14:20:11Z

Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at the single-cell level. It provides a global view of cell-type specification during the onset of biological mechanisms such as developmental processes and human organogenesis. Various statistical, machine learning, and deep learning-based methods have been proposed for cell-type classification. Most of these methods utilize unsupervised lower-dimensional projections obtained from a large reference dataset. In this work, we propose a reference-based method for cell type classification, called EnProCell. EnProCell first computes lower-dimensional projections that capture both high variance and class separability through an ensemble of principal component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower-dimensional representation of the data to classify cell types. The proposed method outperformed existing state-of-the-art methods when tested on four different datasets produced by different single-cell sequencing technologies. EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods for predicting reference from reference datasets. Similarly, EnProCell also showed better performance than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy: 99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require more computational resources and time. EnProCell is available at https://github.com/umar1196/EnProCell.

2024-10-13T19:01:38Z Muhammad Umar Andras Lakatos Muhammad Asif Arif Mahmood http://arxiv.org/abs/2509.24655v2 HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling 2025-11-04T10:26:57Z

Language models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry provides a better alternative for accommodating hierarchical data, it has yet to find its way into language modeling for mRNA sequences. In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop a Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 tasks involving property prediction, with 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.
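
To make the geometric intuition concrete, the sketch below computes distances in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which hyperbolic model HyperHELM uses, so the choice of model, the function name `poincare_distance`, and the example points are all assumptions made for illustration.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Hyperbolic distance between two points inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Two pairs of points separated by the same angle, near the origin and near the rim.
near_origin = poincare_distance((0.1, 0.0), (0.0, 0.1))
near_rim = poincare_distance((0.9, 0.0), (0.0, 0.9))
print(near_origin, near_rim)
```

Distances grow rapidly toward the boundary of the ball, so there is much more room near the rim than near the origin. That is why tree-like hierarchies, whose number of nodes grows quickly with depth, tend to embed with lower distortion here than in Euclidean space.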

2025-09-29T12:04:15Z Max van Spengler Artem Moskalev Tommaso Mansi Mangal Prakash Rui Liao http://arxiv.org/abs/2507.09378v3 Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis 2025-11-04T02:06:37Z

Transformers have revolutionized nucleotide sequence analysis, yet capturing long-range dependencies remains challenging. Recent studies show that autoregressive transformers often exhibit Markovian behavior by relying on fixed-length context windows for next-token prediction. However, standard self-attention mechanisms are computationally inefficient for long sequences due to their quadratic complexity and do not explicitly enforce global transition consistency. We introduce CARMANIA (Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis), a self-supervised pretraining framework that augments next-token (NT) prediction with a transition-matrix (TM) loss. The TM loss aligns predicted token transitions with empirically derived n-gram statistics from each input sequence, encouraging the model to capture higher-order dependencies beyond local context. This integration enables CARMANIA to learn organism-specific sequence structures that reflect both evolutionary constraints and functional organization. We evaluate CARMANIA across diverse genomic tasks, including regulatory element prediction, functional gene classification, taxonomic inference, antimicrobial resistance detection, and biosynthetic gene cluster classification. CARMANIA outperforms the previous best long-context model by at least 7 percent, matches state-of-the-art on shorter sequences (exceeding prior results on 20 out of 40 tasks while running approximately 2.5 times faster), and shows particularly strong improvements on enhancer and housekeeping gene classification tasks, including up to a 34 percent absolute gain in Matthews correlation coefficient (MCC) for enhancer prediction. The TM loss boosts accuracy in 33 of 40 tasks, especially where local motifs or regulatory patterns drive prediction.
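
The empirical statistics that the TM loss aligns against can be illustrated for the simplest case, first-order (bigram) transitions. The sketch below is an assumption-laden illustration, not the CARMANIA implementation: `transition_matrix` and `tm_mismatch` are hypothetical names, and the mean squared difference is only a stand-in, since the abstract does not give the functional form of the loss.

```python
from collections import Counter

ALPHABET = "ACGT"


def transition_matrix(sequence: str, alphabet: str = ALPHABET) -> list[list[float]]:
    """Row-normalized bigram transition frequencies observed in one sequence."""
    counts = Counter(zip(sequence, sequence[1:]))
    matrix = []
    for a in alphabet:
        row_total = sum(counts[(a, b)] for b in alphabet)
        if row_total == 0:
            # Symbol a is never followed by an alphabet symbol in this sequence.
            matrix.append([0.0] * len(alphabet))
        else:
            matrix.append([counts[(a, b)] / row_total for b in alphabet])
    return matrix


def tm_mismatch(predicted: list[list[float]], empirical: list[list[float]]) -> float:
    """Mean squared difference between two matrices (illustrative stand-in only)."""
    diffs = [
        (p - e) ** 2
        for p_row, e_row in zip(predicted, empirical)
        for p, e in zip(p_row, e_row)
    ]
    return sum(diffs) / len(diffs)


empirical = transition_matrix("ACGTACGTAACC")
uniform = [[0.25] * 4 for _ in range(4)]
print(tm_mismatch(uniform, empirical))
```

In a training setup of this kind, the predicted matrix would be aggregated from the model's next-token distributions, and the mismatch term would be added to the usual next-token loss.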

2025-07-12T19:03:28Z Mohammadsaleh Refahi Mahdi Abavisani Bahrad A. Sokhansanj James R. Brown Gail Rosen http://arxiv.org/abs/2510.16013v3 AGNES: Adaptive Graph Neural Network and Dynamic Programming Hybrid Framework for Real-Time Nanopore Seed Chaining 2025-11-04T00:15:28Z

Nanopore sequencing enables real-time long-read DNA sequencing with reads exceeding 10 kilobases, but inherent error rates of 12-15 percent present significant computational challenges for read alignment. The critical seed chaining step must connect exact k-mer matches between reads and reference genomes while filtering spurious matches, yet state-of-the-art methods rely on fixed gap penalty functions unable to adapt to varying genomic contexts including tandem repeats and structural variants. This paper presents RawHash3, a hybrid framework combining graph neural networks with classical dynamic programming for adaptive seed chaining that maintains real-time performance while providing statistical guarantees. We formalize seed chaining as graph learning where seeds constitute nodes with 12-dimensional feature vectors and edges encode 8-dimensional spatial relationships including gap consistency. Our architecture employs a three-layer EdgeConv GNN with confidence-based method selection that dynamically switches between learned guidance and algorithmic fallback. Comprehensive evaluation on 1,000 synthetic nanopore reads with 5,200 test seeds demonstrates that RawHash3 achieves 99.94 percent precision and 40.07 percent recall, representing a statistically significant 25.0 percent relative improvement over baseline with p less than 0.001. The system maintains a median inference latency of 1.59 ms, meeting real-time constraints, while demonstrating superior robustness with a 100 percent success rate under 20 percent label corruption versus baseline degradation to 30.3 percent. Cross-validation confirms stability, establishing graph neural networks as a viable approach for production genomics pipelines.
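
For context, the classical baseline that the abstract contrasts against, dynamic-programming seed chaining with a fixed gap penalty, can be sketched as follows. This is a generic textbook-style formulation under assumptions made here (the `Seed` tuple, the linear penalty on diagonal drift, and the default `gap_penalty` value are all hypothetical); it is not the scoring function of the paper or of any particular aligner.

```python
from typing import NamedTuple


class Seed(NamedTuple):
    read_pos: int  # start of the exact match on the read
    ref_pos: int   # start of the exact match on the reference
    length: int    # match length


def chain_seeds(seeds: list[Seed], gap_penalty: float = 0.5) -> tuple[float, list[Seed]]:
    """Best-scoring colinear chain under a fixed linear gap penalty (O(n^2) DP)."""
    ordered = sorted(seeds, key=lambda s: (s.read_pos, s.ref_pos))
    n = len(ordered)
    if n == 0:
        return 0.0, []
    score = [float(s.length) for s in ordered]  # best chain score ending at each seed
    prev = [-1] * n                             # back-pointers for traceback
    for j in range(n):
        for i in range(j):
            a, b = ordered[i], ordered[j]
            # b must start after a ends on both the read and the reference.
            if b.read_pos < a.read_pos + a.length or b.ref_pos < a.ref_pos + a.length:
                continue
            gap_read = b.read_pos - (a.read_pos + a.length)
            gap_ref = b.ref_pos - (a.ref_pos + a.length)
            # Fixed penalty proportional to how far the pair drifts off the diagonal.
            candidate = score[i] + b.length - gap_penalty * abs(gap_read - gap_ref)
            if candidate > score[j]:
                score[j], prev[j] = candidate, i
    best = max(range(n), key=score.__getitem__)
    chain = []
    k = best
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    chain.reverse()
    return score[best], chain


seeds = [Seed(0, 100, 10), Seed(12, 112, 10), Seed(15, 500, 10), Seed(30, 130, 10)]
best_score, best_chain = chain_seeds(seeds)
print(best_score, best_chain)
```

In this toy input, the seed at reference position 500 lies far off the diagonal of the others, so it is left out of the best chain. Because the penalty is the same constant everywhere, the sketch cannot distinguish a spurious match from a genuine large gap such as a structural variant, which is the limitation the abstract attributes to fixed gap penalty functions.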

2025-10-15T08:05:43Z 31 pages, 12 figures, 6 tables. Submitted to ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB). Includes comprehensive evaluation with statistical validation, ablation studies, and open-source implementation Jahidul Arafat Sanjaya Poudel