A New Framework for Explainable Rare Cell Identification in Single-Cell Transcriptomics Data

2026-01-04T04:04:58Z

The detection of rare cell types in single-cell transcriptomics data is crucial for elucidating disease pathogenesis and tissue development dynamics. However, a critical gap persists in current methods: they cannot provide a gene-based explanation for each cell they detect as rare. We identify three primary sources of this deficiency. First, anomaly detectors often function as "black boxes", designed to detect anomalies but unable to explain why a cell is anomalous. Second, the standard analytical framework hinders interpretability by relying on dimensionality reduction techniques, such as Principal Component Analysis (PCA), which transform meaningful gene expression data into abstract, uninterpretable features. Finally, existing explanation algorithms cannot be readily applied to this domain, as single-cell data is characterized by high dimensionality, noise, and substantial sparsity. To overcome these limitations, we introduce a framework for explainable anomaly detection in single-cell transcriptomics data that not only identifies individual anomalies but also provides a visual, gene-based explanation of what makes each instance anomalous. The framework has two key ingredients that are absent from current methods in this domain. First, it eliminates the PCA step, which previous studies have deemed essential. Second, it employs a state-of-the-art anomaly detector and explainer as an efficient and effective means of finding each rare cell and its relevant gene subspace, so that explanations can be provided for each rare cell as well as for the typical normal cell associated with the rare cell's closest normal cells.

From In Silico to In Vitro: A Comprehensive Guide to Validating Bioinformatics Findings

2026-01-03T20:59:43Z

The integration of bioinformatics predictions and experimental validation plays a pivotal role in advancing biological research, from understanding molecular mechanisms to developing therapeutic strategies. Bioinformatics tools and methods offer powerful means for predicting gene functions, protein interactions, and regulatory networks, but these predictions must be validated through experimental approaches to ensure their biological relevance. This review explores the various methods and technologies used for experimental validation, including gene expression analysis, protein-protein interaction verification, and pathway validation. We also discuss the challenges involved in translating computational predictions to experimental settings and highlight the importance of collaboration between bioinformatics and experimental research. Finally, emerging technologies, such as CRISPR gene editing, next-generation sequencing, and artificial intelligence, are shaping the future of bioinformatics validation and driving more accurate and efficient biological discoveries.

MethConvTransformer: A Deep Learning Framework for Cross-Tissue Alzheimer's Disease Detection

2026-01-01T00:18:33Z

Alzheimer's disease (AD) is a multifactorial neurodegenerative disorder characterized by progressive cognitive decline and widespread epigenetic dysregulation in the brain. DNA methylation, as a stable yet dynamic epigenetic modification, holds promise as a noninvasive biomarker for early AD detection. However, methylation signatures vary substantially across tissues and studies, limiting reproducibility and translational utility. To address these challenges, we develop MethConvTransformer, a transformer-based deep learning framework that integrates DNA methylation profiles from both brain and peripheral tissues to enable biomarker discovery. The model couples a CpG-wise linear projection with convolutional and self-attention layers to capture local and long-range dependencies among CpG sites, while incorporating subject-level covariates and tissue embeddings to disentangle shared and region-specific methylation effects. In experiments across six GEO datasets and an independent ADNI validation cohort, our model consistently outperforms conventional machine-learning baselines, achieving superior discrimination and generalization. Moreover, interpretability analyses using linear projection, SHAP, and Grad-CAM++ reveal biologically meaningful methylation patterns aligned with AD-associated pathways, including immune receptor signaling, glycosylation, lipid metabolism, and endomembrane (ER/Golgi) organization. Together, these results indicate that MethConvTransformer delivers robust, cross-tissue epigenetic biomarkers for AD while providing multi-resolution interpretability, thereby advancing reproducible methylation-based diagnostics and offering testable hypotheses on disease mechanisms.

UnPaSt: unsupervised patient stratification by biclustering of omics data

2025-12-30T01:18:09Z

Unsupervised patient stratification is essential for disease subtype discovery. Yet, despite growing evidence of the molecular heterogeneity of non-oncological diseases, popular methods are benchmarked primarily on cancers with mutually exclusive molecular subtypes that are well differentiated by numerous biomarkers. Evaluating 22 unsupervised methods, including clustering and biclustering, on simulated and real transcriptomics data revealed their inefficiency in scenarios with non-mutually exclusive subtypes or subtypes discriminated by only a few biomarkers. To address these limitations and advance precision medicine, we developed UnPaSt, a novel biclustering algorithm for unsupervised patient stratification based on differentially expressed biclusters. UnPaSt outperformed widely used patient stratification approaches in the de novo identification of known subtypes of breast cancer and asthma. In addition, it detected many biologically insightful patterns across bulk transcriptomics, proteomics, single-cell, spatial transcriptomics, and multi-omics datasets, enabling a more nuanced and interpretable view of high-throughput data heterogeneity than traditionally used methods.

Epigenetic state encodes locus-specific chromatin mechanics

2025-12-28T07:15:13Z

Chromatin is repeatedly deformed in vivo during transcription, nuclear remodeling, and confined migration - yet how mechanical response varies from locus to locus, and how it relates to epigenetic state, remains unclear. We develop a theory to infer locus-specific viscoelasticity from three-dimensional genome organization. Using chromatin structures derived from contact maps, we calculate frequency-dependent storage and loss moduli for individual loci and establish that the mechanical properties are determined both by chromatin epigenetic marks and organization. On large length scales, chromatin exhibits Rouse-like viscoelastic scaling, but this coarse behavior masks extensive heterogeneity at the single-locus level. Loci segregate into two mechanical subpopulations with distinct longest relaxation times: one characterized by single-timescale and another by multi-timescale relaxation. The multi-timescale loci are strongly enriched in active marks, and the longest relaxation time for individual loci correlates inversely with effective local stiffness. Pull-release simulations further predict a time-dependent susceptibility: H3K27ac-rich loci deform more under sustained forcing yet can resist brief, large impulses. At finer genomic scales, promoters, enhancers, and gene bodies emerge as "viscoelastic islands" aligned with their focal interactions. Together, these results suggest that chromatin viscoelasticity is an organized, epigenetically coupled property of the 3D genome, providing a mechanistic layer that may influence enhancer-promoter communication, condensate-mediated organization, and response to cellular mechanical stress. The prediction that locus-specific mechanics in chromatin are controlled by 3D structures as well as the epigenetic states is amenable to experimental test.

Hypergraph Representations of scRNA-seq Data for Improved Clustering with Random Walks

2025-12-23T03:45:48Z

Analysis of single-cell RNA sequencing data is often conducted through network projections such as coexpression networks, primarily due to the abundant availability of network analysis tools for downstream tasks. However, this approach has several limitations: loss of higher-order information, inefficient data representation caused by converting a sparse dataset to a fully connected network, and overestimation of coexpression due to zero-inflation. To address these limitations, we propose conceptualizing scRNA-seq expression data as hypergraphs, which are generalized graphs in which the hyperedges can connect more than two vertices. In the context of scRNA-seq data, the hypergraph nodes represent cells and the edges represent genes. Each hyperedge connects all cells where its corresponding gene is actively expressed and records the expression of the gene across different cells. This hypergraph conceptualization enables us to explore multi-way relationships beyond the pairwise interactions in coexpression networks without loss of information. We propose two novel clustering methods: (1) the Dual-Importance Preference Hypergraph Walk (DIPHW) and (2) the Coexpression and Memory-Integrated Dual-Importance Preference Hypergraph Walk (CoMem-DIPHW). They outperform established methods on both simulated and real scRNA-seq datasets. The improvement brought by our proposed methods is especially significant when data modularity is weak. Furthermore, CoMem-DIPHW incorporates the gene coexpression network, cell coexpression network, and the cell-gene expression hypergraph from the single-cell abundance counts data altogether for embedding computation. This approach accounts for both the local level information from single-cell level gene expression and the global level information from the pairwise similarity in the two coexpression networks.
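
The hypergraph construction described above can be sketched in a few lines. The following is a minimal illustration, not the DIPHW or CoMem-DIPHW algorithm: it assumes a small dense cell-by-gene count matrix, treats a gene as "actively expressed" when its count exceeds a hypothetical `threshold`, and builds a plain expression-weighted two-step walk (cell to gene, then gene to cell). The function names are made up for this example.

```python
import numpy as np


def build_hypergraph(expr, threshold=0.0):
    """Build a cell-gene hypergraph from a (cells x genes) expression matrix.

    Nodes are cells and each gene is one hyperedge. Returns the boolean
    incidence matrix and the expression values kept as hyperedge weights.
    """
    expr = np.asarray(expr, dtype=float)
    incidence = expr > threshold             # cell i belongs to hyperedge g
    weights = np.where(incidence, expr, 0.0)
    return incidence, weights


def hyperedges(incidence):
    """List the member cells of every hyperedge (gene)."""
    return {g: np.flatnonzero(incidence[:, g]).tolist()
            for g in range(incidence.shape[1])}


def walk_transition_matrix(weights):
    """Cell-to-cell transition matrix of a simple two-step hypergraph walk.

    Step 1: from a cell, pick a gene with probability proportional to its
    expression in that cell. Step 2: from that gene, pick a cell with
    probability proportional to the gene's expression in that cell.
    """
    cell_totals = weights.sum(axis=1, keepdims=True)
    gene_totals = weights.sum(axis=0, keepdims=True)
    cell_to_gene = np.divide(weights, cell_totals,
                             out=np.zeros_like(weights), where=cell_totals > 0)
    gene_to_cell = np.divide(weights, gene_totals,
                             out=np.zeros_like(weights), where=gene_totals > 0).T
    return cell_to_gene @ gene_to_cell


# Toy data: 3 cells (rows) x 3 genes (columns).
expr = np.array([[5, 0, 1],
                 [4, 0, 0],
                 [0, 3, 2]])
incidence, weights = build_hypergraph(expr)
transition = walk_transition_matrix(weights)
```

In this toy example, gene 0 connects cells 0 and 1, gene 1 contains only cell 2, and gene 2 connects cells 0 and 2. Because zeros never become hyperedge members, the sparse structure of the data is preserved instead of being converted into a fully connected network.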

Single-cell 3D genome reconstruction in the haploid setting using rigidity theory

2025-12-22T17:34:27Z

This article considers the problem of 3-dimensional genome reconstruction for single-cell data, and the uniqueness of such reconstructions in the setting of haploid organisms. We consider multiple graph models as representations of this problem, and use techniques from graph rigidity theory to determine identifiability. Biologically, our models come from Hi-C data, microscopy data, and combinations thereof. Mathematically, we use unit ball and sphere packing models, as well as models consisting of distance and inequality constraints. In each setting, we describe and/or derive new results on realisability and uniqueness. We then propose a 3D reconstruction method based on semidefinite programming and apply it to synthetic and real data sets using our models.

DNAMotifTokenizer: Towards Biologically Informed Tokenization of Genomic Sequences

2025-12-22T15:33:22Z

DNA language models have advanced genomics, but their downstream performance varies widely due to differences in tokenization, pretraining data, and architecture. We argue that a major bottleneck lies in tokenizing sparse and unevenly distributed DNA sequence motifs, which are critical for accurate and interpretable models. To investigate, we systematically benchmark k-mer and Byte-Pair Encoding (BPE) tokenizers under a controlled pretraining budget, evaluating across multiple downstream tasks from five datasets. We find that tokenizer choice induces task-specific trade-offs, and that vocabulary size and tokenizer training data strongly influence the biological knowledge captured. Notably, BPE tokenizers achieve strong performance when trained on smaller but biologically significant data. Building on these insights, we introduce DNAMotifTokenizer, which directly incorporates domain knowledge of DNA sequence motifs into the tokenization process. DNAMotifTokenizer consistently outperforms BPE across diverse benchmarks, demonstrating that knowledge-infused tokenization is crucial for learning powerful, interpretable, and generalizable genomic representations.
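
To make the tokenization contrast concrete, here is a minimal sketch of a k-mer tokenizer next to a toy motif-aware tokenizer. The second function is only an illustration of the general idea of keeping known motifs intact; it is an assumption for this example and not the DNAMotifTokenizer algorithm, whose details are not given in the abstract. The function names and the example motif list are hypothetical.

```python
def kmer_tokenize(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones
    (any trailing bases shorter than k are dropped).
    """
    sequence = sequence.upper()
    return [sequence[i:i + k]
            for i in range(0, len(sequence) - k + 1, stride)]


def motif_first_tokenize(sequence, motifs):
    """Greedy longest-match tokenization against a motif vocabulary.

    Hypothetical illustration: at each position, emit the longest motif that
    starts there, otherwise fall back to a single nucleotide.
    """
    sequence = sequence.upper()
    ordered = sorted(motifs, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(sequence):
        match = next((m for m in ordered if sequence.startswith(m, i)), None)
        token = match if match else sequence[i]
        tokens.append(token)
        i += len(token)
    return tokens


overlapping = kmer_tokenize("ACGTAC", k=3)               # ACG, CGT, GTA, TAC
non_overlapping = kmer_tokenize("ACGTAC", k=3, stride=3)  # ACG, TAC
motif_tokens = motif_first_tokenize("GGTATAAACC", ["TATAAA"])
```

With fixed k-mers the TATAAA motif in the last example would be split across several tokens, whereas the motif-first variant keeps it as one token surrounded by single-base tokens.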

Trajectory Inference for Single Cell Omics

2025-12-22T15:16:15Z

Trajectory inference is used to order single-cell omics data along a path that reflects a continuous transition between cells. This approach is useful for studying processes like cell differentiation, where a stem cell matures into a specialized cell type, or investigating state changes in pathological conditions. In the current article, we provide a general introduction to trajectory inference, explaining the concepts and assumptions underlying the different methods. We then briefly discuss the strengths and weaknesses of different trajectory inference methods. We also describe best practices for using trajectory inference, such as how to validate the results and how to interpret them in the context of biological knowledge. Finally, the article highlights some applications of trajectory inference in single-cell omics research. These applications include studying cell differentiation, development, and disease.

Quantum Generative Modeling of Single-Cell transcriptomes: Capturing Gene-Gene and Cell-Cell Interactions

2025-12-19T16:30:37Z

Single-cell RNA sequencing (scRNA-seq) data simulation is limited by classical methods that rely on linear correlations, failing to capture the intrinsic, nonlinear dependencies. No existing simulator jointly models gene-gene and cell-cell interactions. We introduce qSimCells, a novel quantum computing-based simulator that employs entanglement to model intra- and inter-cellular interactions, generating realistic single-cell transcriptomes with cellular heterogeneity. The core innovation is a quantum kernel that uses a parameterized quantum circuit with CNOT gates to encode complex, nonlinear gene regulatory network (GRN) as well as cell-cell communication topologies with explicit causal directionality. The resulting synthetic data exhibits non-classical dependencies: standard correlation-based analyses (Pearson and Spearman) fail to recover the programmed causal pathways and instead report spurious associations driven by high baseline gene-expression probabilities. Furthermore, applying cell-cell communication detection to the simulated data validates the true mechanistic links, revealing a robust, up to 75-fold relative increase in inferred communication probability only when quantum entanglement is active. These results demonstrate that the quantum kernel is essential for producing high-fidelity ground-truth datasets and highlight the need for advanced inference techniques to capture the complex, non-classical dependencies inherent in gene regulation.

BHiCect 2.0: Multi-resolution clustering of Hi-C data

2025-12-19T12:26:28Z

Chromatin conformation capture technologies such as Hi-C have revealed that the genome is organized in a hierarchy of structures spanning multiple scales observed at different resolutions. Current algorithms often focus on specific interaction patterns found at a specific Hi-C resolution. We present BHiCect 2.0, a method that leverages Hi-C data at multiple resolutions to describe chromosome architecture as nested, preferentially self-interacting clusters using spectral clustering. This new version describes the hierarchical configuration of chromosomes by integrating multiple Hi-C data resolutions, offering a more comprehensive description of the multi-scale architecture of chromosomes. We further provide these functionalities as an R package to assist their integration with other computational pipelines. The BHiCect 2.0 R package is available on GitHub at https://github.com/princeps091-binf/BHiCect2 with the version used for this manuscript on Zenodo at https://doi.org/10.5281/zenodo.17985844.
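
As a rough illustration of how spectral clustering can split a contact map into preferentially self-interacting groups, the sketch below performs a single spectral bipartition on a toy symmetric contact matrix. It is written in Python with NumPy rather than R, uses the unnormalized graph Laplacian, and is an assumption-laden example, not the BHiCect 2.0 implementation; the function name and the toy matrix are made up.

```python
import numpy as np


def spectral_bipartition(contacts):
    """Split bins into two groups using the Fiedler vector.

    `contacts` is a symmetric, non-negative (bins x bins) matrix with a zero
    diagonal. Returns a boolean label per bin.
    """
    contacts = np.asarray(contacts, dtype=float)
    degree = contacts.sum(axis=1)
    laplacian = np.diag(degree) - contacts
    # eigh returns eigenvalues in ascending order for symmetric matrices;
    # column 1 is the eigenvector of the second-smallest eigenvalue.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return fiedler >= 0


# Two densely interacting triplets of bins joined by one weak contact.
toy_contacts = np.array([
    [0.0, 5.0, 5.0, 0.1, 0.0, 0.0],
    [5.0, 0.0, 5.0, 0.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 5.0, 5.0],
    [0.0, 0.0, 0.0, 5.0, 0.0, 5.0],
    [0.0, 0.0, 0.0, 5.0, 5.0, 0.0],
])
labels = spectral_bipartition(toy_contacts)
```

Applying such a split recursively to each resulting group would yield a nested hierarchy of clusters; how the actual package chooses resolutions and stopping criteria is not covered by this sketch.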

An algorithm to align a chain of sequences to paths in a pangenome graph

2025-12-18T06:29:07Z

Affordable, high-quality whole-genome assemblies have made it possible to construct rich pangenomes that capture haplotype diversity across many species. As these datasets grow, they motivate the development of specialized techniques capable of handling the dense sequence variation found in large groups of related genomes. A common strategy is to encode pangenomic information in graph form, which provides a flexible substrate for improving algorithms in areas such as alignment, visualization, and functional analysis. Methods built on these graph models have already shown clear advantages in core bioinformatics workflows, including read mapping, variant discovery, and genotyping. By integrating multiple sequence and coordinate representations into a single structure, pangenome graphs offer a unified and expressive framework for comparative genomics. Although it remains unclear whether graph-based references will ultimately supplant traditional linear genomes, their versatility ensures that they will play a central role in emerging pangenomic approaches. This paper introduces an algorithm to mine a chain of sequences in pangenome graphs that might be useful in the functional analysis of pangenome graphs. Specifically, the algorithm calculates all maximal paths in a pangenome graph aligning with a given chain of sequences in the segments of the path vertices, possibly with some maximal gap as specified by the user.
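
One simplified reading of the problem statement can be sketched as a depth-first search. The code below assumes a directed graph given as a dictionary of vertex segments and an adjacency dictionary, treats "aligning" as exact substring containment, and interprets the gap as the number of consecutive non-matching vertices allowed between two matched chain elements. These are assumptions for illustration only; the paper's actual algorithm, its notion of maximality, and its gap definition may differ.

```python
def find_chain_paths(segments, edges, chain, max_gap=0):
    """Enumerate simple paths whose vertex segments contain `chain` in order.

    segments: dict mapping vertex id -> segment string
    edges:    dict mapping vertex id -> list of successor vertex ids
    chain:    non-empty list of sequences to be matched in order
    max_gap:  maximum number of consecutive unmatched vertices allowed
              between two matched chain elements (hypothetical definition)
    """
    results = []

    def extend(path, next_idx, gap):
        if next_idx == len(chain):
            results.append(list(path))
            return
        for successor in edges.get(path[-1], ()):
            if successor in path:          # keep paths simple (no cycles)
                continue
            if chain[next_idx] in segments[successor]:
                extend(path + [successor], next_idx + 1, 0)
            elif gap < max_gap:
                extend(path + [successor], next_idx, gap + 1)

    for vertex, segment in segments.items():
        if chain[0] in segment:
            extend([vertex], 1, 0)
    return results


# Toy graph: a -> b -> c and a -> d -> c.
segments = {"a": "ACGT", "b": "TTTT", "c": "GGCA", "d": "CCCC"}
edges = {"a": ["b", "d"], "b": ["c"], "d": ["c"]}
chain = ["CG", "GC"]

strict = find_chain_paths(segments, edges, chain, max_gap=0)
gapped = find_chain_paths(segments, edges, chain, max_gap=1)
```

In the toy graph no path matches with `max_gap=0`, because the vertex containing "GC" is never a direct successor of the vertex containing "CG"; allowing one gap vertex recovers both routes from `a` to `c`.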

Primer C-VAE: An interpretable deep learning primer design method to detect emerging virus variants

2025-12-16T21:36:12Z

Motivation: PCR is more economical and quicker than Next Generation Sequencing for detecting target organisms, with primer design being a critical step. In epidemiology with rapidly mutating viruses, designing effective primers is challenging. Traditional methods require substantial manual intervention and struggle to ensure effective primer design across different strains. For organisms with large, similar genomes like Escherichia coli and Shigella flexneri, differentiating between species is also difficult but crucial. Results: We developed Primer C-VAE, a model based on a Variational Auto-Encoder framework with Convolutional Neural Networks to identify variants and generate specific primers. Using SARS-CoV-2, our model classified variants (alpha, beta, gamma, delta, omicron) with 98% accuracy and generated variant-specific primers. These primers appeared with >95% frequency in target variants and <5% in others, showing good performance in in-silico PCR tests. For Alpha, Delta, and Omicron, our primer pairs produced fragments <200 bp, suitable for qPCR detection. The model also generated effective primers for organisms with longer gene sequences like E. coli and S. flexneri. Conclusion: Primer C-VAE is an interpretable deep learning approach for developing specific primer pairs for target organisms. This flexible, semi-automated and reliable tool works regardless of sequence completeness and length, allowing for qPCR applications and can be applied to organisms with large and highly similar genomes.
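
The specificity criterion quoted above (more than 95% frequency in the target variant, less than 5% elsewhere) can be expressed as a short check. This is a minimal sketch that assumes exact substring matching on the forward strand; a realistic in-silico PCR would also consider reverse complements, mismatches, and primer pairing. The function names and toy sequences are hypothetical.

```python
def primer_frequency(primer, sequences):
    """Fraction of sequences that contain the primer as an exact substring."""
    if not sequences:
        return 0.0
    return sum(primer in sequence for sequence in sequences) / len(sequences)


def is_variant_specific(primer, target_sequences, other_sequences,
                        min_target=0.95, max_other=0.05):
    """Apply the >95% in-target and <5% off-target frequency thresholds."""
    return (primer_frequency(primer, target_sequences) > min_target
            and primer_frequency(primer, other_sequences) < max_other)


# Toy sequences standing in for genomes of a target variant and other variants.
target = ["AAACCCGGG", "TTACCCGTT", "ACCCGAAAA"]
others = ["TTTTTTTT", "GGGGAAAA"]

specific = is_variant_specific("ACCCG", target, others)
unspecific = is_variant_specific("AAAA", target, others)
```

Here "ACCCG" occurs in every target sequence and in none of the others, so it passes, while "AAAA" occurs in only one of three targets and in one of two off-target sequences, so it fails both thresholds.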

Dy-mer: An Explainable DNA Sequence Representation Scheme using Dictionary Learning

2025-12-13T12:52:48Z

DNA sequences encode critical genetic information, yet their variable length and discrete nature impede direct utilization in deep learning models. Existing DNA representation schemes convert sequences into numerical vectors but fail to capture structural features of local subsequences and often suffer from limited interpretability and poor generalization on small datasets. To address these limitations, we propose Dy-mer, an interpretable and robust DNA representation scheme based on dictionary learning. Dy-mer formulates an optimization problem in tensor format, which ensures computational efficiency in batch processing. Our scheme reconstructs DNA sequences as concatenations of dynamic-length subsequences (dymers) through a convolution operation and simultaneously optimizes a learnable dymer dictionary and sparse representations. Our method achieves state-of-the-art performance in downstream tasks such as DNA promoter classification and motif detection. Experiments further show that the learned dymers match known DNA motifs and that clustering using Dy-mer yields semantically meaningful phylogenetic trees. These results demonstrate that the proposed approach achieves both strong predictive performance and high interpretability, making it well suited for biological research applications.

Prediction of PLX-4720 Sensitivity in Cancer Cell Lines through Multi-Omics Integration and Attention-Based Fusion Modeling

2025-12-13T01:15:28Z

Predicting the sensitivity of cancer cell lines to PLX-4720, a preclinical BRAF inhibitor, requires models capable of capturing the multilayered regulation of oncogenic signaling. Single-omics predictors are often insufficient because drug response is shaped by interactions among genomic alterations, epigenetic regulation, transcriptional activity, protein signaling, metabolic state, and network-level context. In this study we develop an attention-based multi-omics integration framework using genomic, epigenomic, transcriptomic, proteomic, metabolomic, and protein interaction data from the GDSC1 panel. Each modality is encoded into a latent representation using feed-forward neural networks or graph convolutional networks, and fused through an attention mechanism that assigns modality-specific importance weights. A regression model is then used to predict PLX-4720 response. Across single- and multi-omics configurations, the best performance is achieved by integrating genomics and transcriptomics, which yields validation R² values above 0.92. This reflects the complementary roles of mutational status and downstream transcriptional activation in shaping sensitivity to BRAF inhibition. Epigenomics is the strongest single-omics predictor, while metabolomics and PPI data contribute additional context when combined with other modalities. Integration of three to five omics layers improves stability but does not surpass the accuracy of the best two-modality combinations, likely due to information redundancy and sample-size imbalance. These findings highlight the importance of modality selection rather than maximal data depth. The proposed framework provides an efficient and biologically grounded strategy for drug response prediction and supports the development of precision pharmacogenomics.
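
The fusion step described above, in which per-modality latent vectors are combined using modality-specific importance weights, can be sketched as follows. The abstract does not specify the form of the attention mechanism, so this example assumes scaled dot-product scoring against a single query vector (which would be learned in a real model) and uses random arrays in place of encoder outputs. All names are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=-1, keepdims=True)


def attention_fuse(embeddings, query):
    """Fuse modality embeddings with attention weights.

    embeddings: (n_samples, n_modalities, dim) latent vectors per modality
    query:      (dim,) scoring vector (assumed learned elsewhere)
    Returns the fused (n_samples, dim) representation and the
    (n_samples, n_modalities) importance weights.
    """
    dim = embeddings.shape[-1]
    scores = embeddings @ query / np.sqrt(dim)    # one score per modality
    weights = softmax(scores)                      # rows sum to 1
    fused = (weights[..., None] * embeddings).sum(axis=1)
    return fused, weights


# Stand-in data: 4 cell lines, 3 modalities, 8-dimensional latent space.
rng = np.random.default_rng(0)
modality_embeddings = rng.normal(size=(4, 3, 8))
query_vector = rng.normal(size=8)
fused, modality_weights = attention_fuse(modality_embeddings, query_vector)
```

The fused vector would then be passed to a regression head to predict drug response, and the per-sample weights give a direct readout of how much each modality contributed.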