https://arxiv.org/api/rWMpw64NAD22gESYSDggTWmopKA2026-06-14T09:25:00Z384830015http://arxiv.org/abs/2511.11717v1Multiscale Grassmann Manifolds for Single-Cell Data Analysis2025-11-12T19:47:10ZSingle-cell data analysis seeks to characterize cellular heterogeneity based on high-dimensional gene expression profiles. Conventional approaches represent each cell as a vector in Euclidean space, which limits their ability to capture intrinsic correlations and multiscale geometric structures. We propose a multiscale framework based on Grassmann manifolds that integrates machine learning with subspace geometry for single-cell data analysis. By generating embeddings under multiple representation scales, the framework combines their features from different geometric views into a unified Grassmann manifold. A power-based scale sampling function is introduced to control the selection of scales and balance information across resolutions. Experiments on nine benchmark single-cell RNA-seq datasets demonstrate that the proposed approach effectively preserves meaningful structures and provides stable clustering performance, particularly for small to medium-sized datasets. These results suggest that Grassmann manifolds offer a coherent and informative foundation for analyzing single cell data.2025-11-12T19:47:10ZXiang Xiang WangSean CottrellGuo-Wei Weihttp://arxiv.org/abs/2511.09590v1A Graphical Method for Identifying Gene Clusters from RNA Sequencing Data2025-11-12T13:52:28ZThe identification of disease-gene associations is instrumental in understanding the mechanisms of diseases and developing novel treatments. Besides identifying genes from RNA-Seq datasets, it is often necessary to identify gene clusters that have relationships with a disease. In this work, we propose a graph-based method for using an RNA-Seq dataset with known genes related to a disease and perform a robust clustering analysis to identify clusters of genes. 
Our method involves the construction of a gene co-expression network, followed by the computation of gene embeddings leveraging Node2Vec+, an algorithm applying weighted biased random walks and skipgram with negative sampling to compute node embeddings from undirected graphs with weighted edges. Finally, we perform spectral clustering to identify clusters of genes. All processes in our entire method are jointly optimized for stability, robustness, and optimality by applying Tree-structured Parzen Estimator. Our method was applied to an RNA-Seq dataset of known genes that have associations with Age-related Macular Degeneration (AMD). We also performed tests to validate and verify the robustness and statistical significance of our methods due to the stochastic nature of the involved processes. Our results show that our method is capable of generating consistent and robust clustering results. Our method can be seamlessly applied to other RNA-Seq datasets due to our process of joint optimization, ensuring the stability and optimality of the several steps in our method, including the construction of a gene co-expression network, computation of gene embeddings, and clustering of genes. Our work will aid in the discovery of natural structures in the RNA-Seq data, and understanding gene regulation and gene functions not just for AMD but for any disease in general.2025-11-12T13:52:28ZJake R. PatockRinki RatnapriyaArko Barmanhttp://arxiv.org/abs/2511.09026v1DeepVRegulome: DNABERT-based deep-learning framework for predicting the functional impact of short genomic variants on the human regulome2025-11-12T06:25:31ZWhole-genome sequencing (WGS) has revealed numerous non-coding short variants whose functional impacts remain poorly understood. Despite recent advances in deep-learning genomic approaches, accurately predicting and prioritizing clinically relevant mutations in gene regulatory regions remains a major challenge. 
Here we introduce DeepVRegulome, a deep-learning method for prediction and interpretation of functionally disruptive variants in the human regulome, which combines 700 DNABERT fine-tuned models, trained on vast amounts of ENCODE gene regulatory regions, with variant scoring, motif analysis, attention-based visualization, and survival analysis. We showcase its application on TCGA glioblastoma WGS dataset in prioritizing survival-associated mutations and regulatory regions. The analysis identified 572 splice-disrupting and 9,837 transcription-factor binding site altering mutations occurring in greater than 10% of glioblastoma samples. Survival analysis linked 1352 mutations and 563 disrupted regulatory regions to patient outcomes, enabling stratification via non-coding mutation signatures. All the code, fine-tuned models, and an interactive data portal are publicly available.2025-11-12T06:25:31ZPratik DuttaMatthew ObusanRekha SathianMax ChaoPallavi SuranaNimisha PapineniYanrong JiZhihan ZhouHan LiuAlisa YurovskyRamana V Davulurihttp://arxiv.org/abs/2511.07219v1Integrating Epigenetic and Phenotypic Features for Biological Age Estimation in Cancer Patients via Multimodal Learning2025-11-10T15:42:14ZBiological age, which may be older or younger than chronological age due to factors such as genetic predisposition, environmental exposures, serves as a meaningful biomarker of aging processes and can inform risk stratification, treatment planning, and survivorship care in cancer patients. We propose EpiCAge, a multimodal framework that integrates epigenetic and phenotypic data to improve biological age prediction. Evaluated on eight internal and four external cancer cohorts, EpiCAge consistently outperforms existing epigenetic and phenotypic age clocks. Our analyses show that EpiCAge identifies biologically relevant markers, and its derived age acceleration is significantly associated with mortality risk. 
These results highlight EpiCAge as a promising multimodal machine learning tool for biological age assessment in oncology.2025-11-10T15:42:14ZIn Proceedings of The 19th IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2025)Shuyue JiangWenjing MaShaojun YuChang SuRunze YanJiaying Luhttp://arxiv.org/abs/2511.04637v2Advancing Risk Gene Discovery Across the Allele Frequency Spectrum2025-11-10T14:54:48ZThe discovery of genetic risk factors has transformed human genetics, yet the pace of new gene identification has slowed despite the exponential expansion of sequencing and biobank resources. Current approaches are optimized for the extremes of the allele frequency spectrum: rare, high-penetrance variants identified through burden testing, and common, low-effect variants mapped by genome-wide association studies. Between these extremes lies variants of intermediate frequency and effect size where statistical power is limited, pathogenicity is often misclassified, and gene discovery lags behind empirical evidence of heritable contribution. This 'missing middle' represents a critical blind spot across disease areas, from neurodevelopmental and psychiatric disorders to cancer and aging. In this review, we organize strategies for risk gene identification by variant frequency class, highlighting methodological strengths and constraints at each scale. We draw on lessons across fields to illustrate how innovations in variant annotation, joint modeling, phenotype refinement, and network-based inference can extend discovery into the intermediate range. 
By framing the frequency spectrum as a unifying axis, we provide a conceptual map of current capabilities, their limitations, and emerging directions toward more comprehensive risk gene discovery.2025-11-06T18:32:19ZReview; 31 pagesMadison CaballeroBehrang Mahjanihttp://arxiv.org/abs/2509.25884v2scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis2025-11-10T03:55:13ZSingle-cell RNA sequencing (scRNA-seq) technology enables systematic delineation of cellular states and interactions, providing crucial insights into cellular heterogeneity. Building on this potential, numerous computational methods have been developed for tasks such as cell clustering, cell type annotation, and marker gene identification. To fully assess and compare these methods, standardized, analysis-ready datasets are essential. However, such datasets remain scarce, and variations in data formats, preprocessing workflows, and annotation strategies hinder reproducibility and complicate systematic evaluation of existing methods. To address these challenges, we present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data that consolidates 13 high-quality datasets spanning two species (human and mouse) and nine tissue types. All datasets undergo standardized quality control and preprocessing and are stored in a uniform format to enable direct application in diverse computational analyses without additional data cleaning. 
We further demonstrate the utility of scUnified through experimental analyses of representative biological tasks, providing a reproducible foundation for the standardized evaluation of computational methods on a unified dataset.2025-09-30T07:23:01ZPing XuZaitian WangZhirui WangPengjiang LiRan ZhangGaoyang LiHanyu XieJiajia WangYuanchun ZhouPengfei Wanghttp://arxiv.org/abs/2511.05992v1Shared and distinct exonic parts in alternative paths of splicing bubbles2025-11-08T12:42:13ZAlternative splicing creates complex bubbles in splicing graphs where more than two transcript paths compete, challenging methods designed for simple binary events. We present a unified framework that compares paths using distinct exonic parts observed directly from reads. We build a GrASE splicing graph (DAG) per gene, enumerate bubbles, and quantify shared and distinct exonic parts across three comparison structures. (i) all-pairwise contrasts (ii) a multinomial n-way comparison and (iii) valid bipartitions of paths. For (iii) we introduce lower-set bipartitioning, which respects subset relations among paths by enumerating downward-closed sets in a containment graph, yielding valid two-group splits with nonempty distinguishing parts. Our test statistic is the fraction of reads mapped to distinct parts relative to distinct + shared parts, enabling differential usage across samples. Applied to genome annotations, the approach examines more bubbles than prior tools while remaining tractable and interpretable.2025-11-08T12:42:13Z7 pages, 2 figuresDaniel WitoslawskiJelard AquinoChuanchuan HeMira V. Hanhttp://arxiv.org/abs/2511.03976v1PETRA: Pretrained Evolutionary Transformer for SARS-CoV-2 Mutation Prediction2025-11-06T01:58:23ZSince its emergence, SARS-CoV-2 has demonstrated a rapid and unpredictable evolutionary trajectory, characterized by the continual emergence of immune-evasive variants. This poses persistent challenges to public health and vaccine development.
While large-scale generative pre-trained transformers (GPTs) have revolutionized the modeling of sequential data, their direct applications to noisy viral genomic sequences are limited. In this paper, we introduce PETRA (Pretrained Evolutionary TRAnsformer), a novel transformer approach based on evolutionary trajectories derived from phylogenetic trees rather than raw RNA sequences. This method effectively mitigates sequencing noise and captures the hierarchical structure of viral evolution.
With a weighted training framework to address substantial geographical and temporal imbalances in global sequence data, PETRA excels in predicting future SARS-CoV-2 mutations, achieving a weighted recall@1 of 9.45% for nucleotide mutations and 17.10% for spike amino-acid mutations, compared to 0.49% and 6.64% respectively for the best baseline. PETRA also demonstrates its ability to aid in the real-time mutation prediction of major clades like 24F(XEC) and 25A(LP.8.1). The code is open sourced on https://github.com/xz-keg/PETra2025-11-06T01:58:23ZpreprintXu Zouhttp://arxiv.org/abs/2411.06635v4scMEDAL for the interpretable analysis of single-cell transcriptomics data with batch effect visualization using a deep mixed effects autoencoder2025-11-05T22:46:50ZSingle-cell RNA sequencing enables high-resolution analysis of cellular heterogeneity, yet disentangling biological signal from batch effects remains a major challenge. Existing batch-correction algorithms suppress or discard batch-related variation rather than modeling it. We propose scMEDAL, single-cell Mixed Effects Deep Autoencoder Learning, a framework that separately models batch-invariant and batch-specific effects using two complementary subnetworks. The principal innovation, scMEDAL-RE, is a random-effects Bayesian autoencoder that learns batch-specific representations while preserving biologically meaningful information confounded with batch effects signal often lost under standard correction. Complementing it, the fixed-effects subnetwork, scMEDAL-FE, trained via adversarial learning provides a default batch-correction component. Evaluations across diverse conditions (autism, leukemia, cardiovascular), cell types, and technical and biological effects show that scMEDAL-RE produces interpretable, batch-specific embeddings that complement both scMEDAL-FE and established correction methods (scVI, Scanorama, Harmony, SAUCIE), yielding more accurate prediction of disease status, donor group, and tissue. 
scMEDAL also provides generative visualizations, including counterfactual reconstructions of a cell's expression as if acquired in another batch. The framework allows substitution of the fixed-effects component with other correction methods, while retaining scMEDAL-RE's enhanced predictive power and visualization. Overall, scMEDAL is a versatile, interpretable framework that complements existing correction, providing enhanced insight into cellular heterogeneity and data acquisition.2024-11-11T00:10:48ZMain manuscript: 32 pages, including 8 figures and 1 table. Supplemental material: 23 pagesAixa X. AndradeSon NguyenAustin MarckxAlbert Montillohttp://arxiv.org/abs/2408.08503v2Computational strategies for cross-species knowledge transfer2025-11-04T18:03:08ZResearch organisms provide invaluable insights into human biology and diseases, serving as essential tools for functional experiments, disease modeling, and drug testing. However, evolutionary divergence between humans and research organisms hinders effective knowledge transfer across species. Here, we review state-of-the-art methods for computationally transferring knowledge across species, primarily focusing on methods that utilize transcriptome data and/or molecular networks. Our review addresses four key areas: (1) transferring disease and gene annotation knowledge across species, (2) identifying functionally equivalent molecular components, (3) inferring equivalent perturbed genes or gene sets, and (4) identifying equivalent cell types. We conclude with an outlook on future directions and several key challenges that remain in cross-species knowledge transfer, including introducing the concept of "agnology" to describe functional equivalence of biological entities, regardless of their evolutionary origins. This concept is becoming pervasive in integrative data-driven models where evolutionary origins of functions can remain unresolved.2024-08-16T03:01:35ZHao YuanChristopher A. 
MancusoKayla JohnsonIngo BraaschArjun Krishnanhttp://arxiv.org/abs/2511.02888v1NABench: Large-Scale Benchmarks of Nucleotide Foundation Models for Fitness Prediction2025-11-04T14:28:01ZNucleotide sequence variation can induce significant shifts in functional fitness. Recent nucleotide foundation models promise to predict such fitness effects directly from sequence, yet heterogeneous datasets and inconsistent preprocessing make it difficult to compare methods fairly across DNA and RNA families. Here we introduce NABench, a large-scale, systematic benchmark for nucleic acid fitness prediction. NABench aggregates 162 high-throughput assays and curates 2.6 million mutated sequences spanning diverse DNA and RNA families, with standardized splits and rich metadata. We show that NABench surpasses prior nucleotide fitness benchmarks in scale, diversity, and data quality. Under a unified evaluation suite, we rigorously assess 29 representative foundation models across zero-shot, few-shot prediction, transfer learning, and supervised settings. The results quantify performance heterogeneity across tasks and nucleic-acid types, demonstrating clear strengths and failure modes for different modeling choices and establishing strong, reproducible baselines. We release NABench to advance nucleic acid modeling, supporting downstream applications in RNA/DNA design, synthetic biology, and biochemistry. Our code is available at https://github.com/mrzzmrzz/NABench.2025-11-04T14:28:01ZZhongmin LiRunze MaJiahao TanChengzi TanShuangjia Zhenghttp://arxiv.org/abs/2410.09964v2Lower-dimensional projections of cellular expression improves cell type classification from single-cell RNA sequencing2025-11-04T14:20:11ZSingle-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at single cell level. It provides a global view of cell-type specification during the onset of biological mechanisms such as developmental processes and human organogenesis. 
Various statistical, machine and deep learning-based methods have been proposed for cell-type classification. Most of these methods utilize unsupervised lower-dimensional projections obtained from a large reference dataset. In this work, we propose a reference-based method for cell type classification, called EnProCell. EnProCell first computes lower-dimensional projections that capture both the high variance and class separability through an ensemble of principal component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower-dimensional representation of the data to classify cell types. The proposed method outperformed the existing state-of-the-art methods when tested on four different data sets produced from different single-cell sequencing technologies. EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods for predicting reference from reference datasets. Similarly, EnProCell also showed better performance than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy: 99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require more computational resources and time. EnProCell is available at https://github.com/umar1196/EnProCell.2024-10-13T19:01:38ZMuhammad UmarAndras LakatosMuhammad AsifArif Mahmoodhttp://arxiv.org/abs/2509.24655v2HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling2025-11-04T10:26:57ZLanguage models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry provides a better alternative for accommodating hierarchical data, it has yet to find a way into language modeling for mRNA sequences. 
In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 tasks involving property prediction, with 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.2025-09-29T12:04:15ZMax van SpenglerArtem MoskalevTommaso MansiMangal PrakashRui Liaohttp://arxiv.org/abs/2507.09378v3Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis2025-11-04T02:06:37ZTransformers have revolutionized nucleotide sequence analysis, yet capturing long-range dependencies remains challenging. Recent studies show that autoregressive transformers often exhibit Markovian behavior by relying on fixed-length context windows for next-token prediction. However, standard self-attention mechanisms are computationally inefficient for long sequences due to their quadratic complexity and do not explicitly enforce global transition consistency.
We introduce CARMANIA (Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis), a self-supervised pretraining framework that augments next-token (NT) prediction with a transition-matrix (TM) loss. The TM loss aligns predicted token transitions with empirically derived n-gram statistics from each input sequence, encouraging the model to capture higher-order dependencies beyond local context. This integration enables CARMANIA to learn organism-specific sequence structures that reflect both evolutionary constraints and functional organization.
We evaluate CARMANIA across diverse genomic tasks, including regulatory element prediction, functional gene classification, taxonomic inference, antimicrobial resistance detection, and biosynthetic gene cluster classification. CARMANIA outperforms the previous best long-context model by at least 7 percent, matches state-of-the-art on shorter sequences (exceeding prior results on 20 out of 40 tasks while running approximately 2.5 times faster), and shows particularly strong improvements on enhancer and housekeeping gene classification tasks, including up to a 34 percent absolute gain in Matthews correlation coefficient (MCC) for enhancer prediction. The TM loss boosts accuracy in 33 of 40 tasks, especially where local motifs or regulatory patterns drive prediction.2025-07-12T19:03:28ZMohammadsaleh RefahiMahdi AbavisaniBahrad A. SokhansanjJames R. BrownGail Rosenhttp://arxiv.org/abs/2510.16013v3AGNES: Adaptive Graph Neural Network and Dynamic Programming Hybrid Framework for Real-Time Nanopore Seed Chaining2025-11-04T00:15:28ZNanopore sequencing enables real-time long-read DNA sequencing with reads exceeding 10 kilobases, but inherent error rates of 12-15 percent present significant computational challenges for read alignment. The critical seed chaining step must connect exact k-mer matches between reads and reference genomes while filtering spurious matches, yet state-of-the-art methods rely on fixed gap penalty functions unable to adapt to varying genomic contexts including tandem repeats and structural variants. This paper presents RawHash3, a hybrid framework combining graph neural networks with classical dynamic programming for adaptive seed chaining that maintains real-time performance while providing statistical guarantees. We formalize seed chaining as graph learning where seeds constitute nodes with 12-dimensional feature vectors and edges encode 8-dimensional spatial relationships including gap consistency. 
Our architecture employs three-layer EdgeConv GNN with confidence-based method selection that dynamically switches between learned guidance and algorithmic fallback. Comprehensive evaluation on 1,000 synthetic nanopore reads with 5,200 test seeds demonstrates RawHash3 achieves 99.94 percent precision and 40.07 percent recall, representing statistically significant 25.0 percent relative improvement over baseline with p less than 0.001. The system maintains median inference latency of 1.59ms meeting real-time constraints, while demonstrating superior robustness with 100 percent success rate under 20 percent label corruption versus baseline degradation to 30.3 percent. Cross-validation confirms stability establishing graph neural networks as viable approach for production genomics pipelines.2025-10-15T08:05:43Z31 pages, 12 figures, 6 tables. Submitted to ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB). Includes comprehensive evaluation with statistical validation, ablation studies, and open-source implementationJahidul ArafatSanjaya Poudel