A Hybrid Computational Intelligence Framework for scRNA-seq Imputation: Integrating scRecover and Random Forests

2025-11-21T03:23:47Z

Single-cell RNA sequencing (scRNA-seq) enables transcriptomic profiling at cellular resolution but suffers from pervasive dropout events that obscure biological signals. We present SCR-MF, a modular two-stage workflow that combines principled dropout detection using scRecover with robust non-parametric imputation via missForest. Across public and simulated datasets, SCR-MF achieves robust and interpretable performance comparable to or exceeding existing imputation methods in most cases, while preserving biological fidelity and transparency. Runtime analysis demonstrates that SCR-MF provides a competitive balance between accuracy and computational efficiency, making it suitable for mid-scale single-cell datasets.

Effect Size-Driven Pathway Meta-Analysis for Gene Expression Data

2025-11-20T11:58:36Z

The proliferation of omics datasets in public repositories has created unprecedented opportunities for biomedical research but has also posed significant challenges for their integration, particularly due to missing genes and platform-specific discrepancies. Traditional gene expression metaanalysis often focuses on individual genes, leading to data loss and limited biological insights when there are missing genes across different studies. To address these limitations, we propose GSEMA (Gene Set Enrichment Meta-Analysis), a novel methodology that leverages singlesample enrichment scoring to aggregate gene expression data into pathway-level matrices. By applying meta-analysis techniques to enrichment scores, GSEMA preserves the magnitude and directionality of effects, enabling the definition of pathway activity across datasets. Using simulated data and case studies on Systemic Lupus Erythematosus (SLE) and Parkinson's Disease (PD), we demonstrate that GSEMA outperforms other methods in controlling false positive rates while providing meaningful biological interpretations. GSEMA methodology is implemented as an R package available on CRAN repository

Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows

2025-11-20T02:14:56Z

Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.

BaGGLS: A Bayesian Shrinkage Framework for Interpretable Modeling of Interactions in High-Dimensional Biological Data

2025-11-19T10:48:30Z

Biological data sets are often high-dimensional, noisy, and governed by complex interactions among sparse signals. This poses major challenges for interpretability and reliable feature selection. Tasks such as identifying motif interactions in genomics exemplify these difficulties, as only a small subset of biologically relevant features (e.g., motifs) are typically active, and their effects are often non-linear and context-dependent. While statistical approaches often result in more interpretable models, deep learning models have proven effective in modeling complex interactions and prediction accuracy, yet their black-box nature limits interpretability. We introduce BaGGLS, a flexible and interpretable probabilistic binary regression model designed for high-dimensional biological inference involving feature interactions. BaGGLS incorporates a Bayesian group global-local shrinkage prior, aligned with the group structure introduced by interaction terms. This prior encourages sparsity while retaining interpretability, helping to isolate meaningful signals and suppress noise. To enable scalable inference, we employ a partially factorized variational approximation that captures posterior skewness and supports efficient learning even in large feature spaces. In extensive simulations, we can show that BaGGLS outperforms the other methods with regard to interaction detection and is many times faster than MCMC sampling under the horseshoe prior. We also demonstrate the usefulness of BaGGLS in the context of interaction discovery from motif scanner outputs and noisy attribution scores from deep learning models. This shows that BaGGLS is a promising approach for uncovering biologically relevant interaction patterns, with potential applicability across a range of high-dimensional tasks in computational biology.

CASPER: Cross-modal Alignment of Spatial and single-cell Profiles for Expression Recovery

2025-11-19T05:36:05Z

Spatial Transcriptomics enables mapping of gene expression within its native tissue context, but current platforms measure only a limited set of genes due to experimental constraints and excessive costs. To overcome this, computational models integrate Single-Cell RNA Sequencing data with Spatial Transcriptomics to predict unmeasured genes. We propose CASPER, a cross-attention based framework that predicts unmeasured gene expression in Spatial Transcriptomics by leveraging centroid-level representations from Single-Cell RNA Sequencing. We performed rigorous testing over four state-of-the-art Spatial Transcriptomics/Single-Cell RNA Sequencing dataset pairs across four existing baseline models. CASPER shows significant improvement in nine out of the twelve metrics for our experiments. This work paves the way for further work in Spatial Transcriptomics to Single-Cell RNA Sequencing modality translation. The code for CASPER is available at https://github.com/AI4Med-Lab/CASPER.

Deep Pathomic Learning Defines Prognostic Subtypes and Molecular Drivers in Colorectal Cancer

2025-11-19T03:19:43Z

Precise prognostic stratification of colorectal cancer (CRC) remains a major clinical challenge due to its high heterogeneity. The conventional TNM staging system is inadequate for personalized medicine. We aimed to develop and validate a novel multiple instance learning model TDAM-CRC using histopathological whole-slide images for accurate prognostic prediction and to uncover its underlying molecular mechanisms. We trained the model on the TCGA discovery cohort (n=581), validated it in an independent external cohort (n=1031), and further we integrated multi-omics data to improve model interpretability and identify novel prognostic biomarkers. The results demonstrated that the TDAM-CRC achieved robust risk stratification in both cohorts. Its predictive performance significantly outperformed the conventional clinical staging system and multiple state-of-the-art models. The TDAM-CRC risk score was confirmed as an independent prognostic factor in multivariable analysis. Multi-omics analysis revealed that the high-risk subtype is closely associated with metabolic reprogramming and an immunosuppressive tumor microenvironment. Through interaction network analysis, we identified and validated Mitochondrial Ribosomal Protein L37 (MRPL37) as a key hub gene linking deep pathomic features to clinical prognosis. We found that high expression of MRPL37, driven by promoter hypomethylation, serves as an independent biomarker of favorable prognosis. Finally, we constructed a nomogram incorporating the TDAM-CRC risk score and clinical factors to provide a precise and interpretable clinical decision-making tool for CRC patients. Our AI-driven pathological model TDAM-CRC provides a robust tool for improved CRC risk stratification, reveals new molecular targets, and facilitates personalized clinical decision-making.

Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models

2025-11-18T17:29:39Z

Trained on massive cross-species DNA corpora, DNA large language models (LLMs) learn the fundamental "grammar" and evolutionary patterns of genomic sequences. This makes them powerful priors for DNA sequence modeling, particularly over long ranges. However, two major constraints hinder their use in practice: the quadratic computational cost of self-attention and the growing memory required for key-value (KV) caches during autoregressive decoding. These constraints force the use of heuristics such as fixed-window truncation or sliding windows, which compromise fidelity on ultra-long sequences by discarding distant information. We introduce FOCUS (Feature-Oriented Compression for Ultra-long Self-attention), a progressive context-compression module that can be plugged into pretrained DNA LLMs. FOCUS combines the established k-mer representation in genomics with learnable hierarchical compression: it inserts summary tokens at k-mer granularity and progressively compresses attention key and value activations across multiple Transformer layers, retaining only the summary KV states across windows while discarding ordinary-token KV. A shared-boundary windowing scheme yields a stationary cross-window interface that propagates long-range information with minimal loss. We validate FOCUS on an Evo-2-based DNA LLM fine-tuned on GRCh38 chromosome 1 with self-supervised training and randomized compression schedules to promote robustness across compression ratios. On held-out human chromosomes, FOCUS achieves near-lossless fidelity: compressing a 1 kb context into only 10 summary tokens (about 100x) shifts the average per-nucleotide probability by only about 0.0004. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from O(N^2) to near-linear O(N), enabling about 100x longer inference windows on commodity GPUs with near-lossless fidelity.

MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

2025-11-17T19:27:41Z

Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying on either four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. As for network structures, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of the differentiable token merging blocks with local-window constraints, then a Latent Encoder captures the global context of these merged words by full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative contents. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.

Rare Genomic Subtype Discovery from RNA-seq via Autoencoder Embeddings and Stability-Aware Clustering

2025-11-17T18:53:43Z

Unsupervised learning on high-dimensional RNA-seq data can reveal molecular subtypes beyond standard labels. We combine an autoencoder-based representation with clustering and stability analysis to search for rare but reproducible genomic subtypes. On the UCI "Gene Expression Cancer RNA-Seq" dataset (801 samples, 20,531 genes; BRCA, COAD, KIRC, LUAD, PRAD), a pan-cancer analysis shows clusters aligning almost perfectly with tissue of origin (Cramer's V = 0.887), serving as a negative control. We therefore reframe the problem within KIRC (n = 146): we select the top 2,000 highly variable genes, standardize them, train a feed-forward autoencoder (128-dimensional latent space), and run k-means for k = 2-10. While global indices favor small k, scanning k with a pre-specified discovery rule (rare < 10 percent and stable with Jaccard >= 0.60 across 20 seeds after Hungarian alignment) yields a simple solution at k = 5 (silhouette = 0.129, DBI = 2.045) with a rare cluster C0 (6.85 percent of patients) that is highly stable (Jaccard = 0.787). Cluster-vs-rest differential expression (Welch's t-test, Benjamini-Hochberg FDR) identifies coherent markers. Overall, pan-cancer clustering is dominated by tissue of origin, whereas a stability-aware within-cancer approach reveals a rare, reproducible KIRC subtype.

Bizard: A Community-Driven Platform for Accelerating and Enhancing Biomedical Data Visualization

2025-11-17T05:41:32Z

Biomedical research increasingly relies on heterogeneous, high-dimensional datasets, yet effective visualization remains hindered by fragmented code resources, steep programming barriers, and limited domain-specific guidance. Bizard is an open-source visualization code repository engineered to streamline data analysis in biomedical research. It aggregates a diverse array of executable visualization scripts, empowering researchers to select and tailor optimal graphical methods for their specific investigative demands. The platform features an intuitive interface equipped with sophisticated browsing and filtering capabilities, exhaustive tutorials, and interactive discussion forums that foster knowledge dissemination. Through its community-driven paradigm, Bizard promotes continual refinement and functional expansion, establishing itself as an essential resource for elevating biomedical data visualization and analytical standards. By harnessing Bizard's infrastructure, researchers can augment their visualization proficiency, propel methodological progress, and enhance interpretive rigor, ultimately accelerating precision medicine and personalized therapeutics. Bizard is freely accessible at https://openbiox.github.io/Bizard/.

Finding low-complexity DNA sequences with longdust

2025-11-16T23:17:23Z

Motivation: Low-complexity (LC) DNA sequences are compositionally repetitive sequences that are often associated with spurious homologous matches and variant calling artifacts. While algorithms for identifying LC sequences exist, they either do not define complexity mathematically or are inefficient with long or variable context windows. Results: Longdust is a new algorithm that efficiently identifies long LC sequences including centromeric satellite and tandem repeats with moderately long motifs. It defines string complexity by statistically modeling the k-mer count distribution with the parameters: the k-mer length, the context window size and a threshold on complexity. Longdust exhibits high performance on real data and high consistency with existing methods. Availability and implementation: https://github.com/lh3/longdust

CellStream: Dynamical Optimal Transport Informed Embeddings for Reconstructing Cellular Trajectories from Snapshots Data

2025-11-16T13:18:45Z

Single-cell RNA sequencing (scRNA-seq), especially temporally resolved datasets, enables genome-wide profiling of gene expression dynamics at single-cell resolution across discrete time points. However, current technologies provide only sparse, static snapshots of cell states and are inherently influenced by technical noise, complicating the inference and representation of continuous transcriptional dynamics. Although embedding methods can reduce dimensionality and mitigate technical noise, the majority of existing approaches typically treat trajectory inference separately from embedding construction, often neglecting temporal structure. To address this challenge, here we introduce CellStream, a novel deep learning framework that jointly learns embedding and cellular dynamics from single-cell snapshot data by integrating an autoencoder with unbalanced dynamical optimal transport. Compared to existing methods, CellStream generates dynamics-informed embeddings that robustly capture temporal developmental processes while maintaining high consistency with the underlying data manifold. We demonstrate CellStream's effectiveness on both simulated datasets and real scRNA-seq data, including spatial transcriptomics. Our experiments indicate significant quantitative improvements over state-of-the-art methods in representing cellular trajectories with enhanced temporal coherence and reduced noise sensitivity. Overall, CellStream provides a new tool for learning and representing continuous streams from the noisy, static snapshots of single-cell gene expression.

Combining multiplexed functional data to improve variant classification

2025-11-16T00:24:22Z

With the surge in the number of variants of uncertain significance (VUS) reported in ClinVar in recent years, there is an imperative to resolve VUS at scale. Multiplexed assays of variant effect (MAVEs), which allow the functional consequence of 100s to 1000s of genetic variants to be measured in a single experiment, are emerging as a powerful source of evidence which can be used in clinical gene variant classification. Increasingly, multiple published MAVEs are available for the same gene, sometimes measuring different aspects of variant impact. When multiple functional roles of a gene need to be considered, combining data from multiple MAVEs may provide a more comprehensive measure of the consequence of a genetic variant, which could impact variant classifications. Here, we provide guidance for combining such multiplexed functional data, incorporating a stepwise process from data curation and collection to model generation and validation. We demonstrate the potential and pitfalls of this approach by showing the integration of multiplexed functional data from five MAVEs for the gene TP53, two MAVEs for the gene LDLR and two MAVEs for PTEN. We also present a web applet that allows users to test various methods for combining score sets from multiple assays, calculate integrated functional scores for all variants, and assess whether combining data enables the application of stronger evidence for pathogenicity or benignity. By following these steps with appropriate guardrails, researchers can maximize the value of MAVEs, strengthen the functional evidence for clinical variant classification, and potentially uncover novel mechanisms of pathogenicity for clinically relevant genes.

Phospho-Proteomics Method Optimization and Application to Stimulated Jurkat Cells

2025-11-15T17:52:31Z

In clinical proteomics, available input is often limited. In addition, phospho-proteomics is of particular interest since the dysregulation of these post-translational modifications (PTMs) has been implicated in various diseases such as cancer. We therefore assessed the feasibility of low input phospho-proteomics via phospho-bulk titration and low-input starting material. We found that there was identification of more phospho-peptides through phospho-bulk titration because of sample loss during preparation of low input starting material. Additionally, we explored various lysis buffers and boiling times for efficiency of decrosslinking formalin-fixed cells since cells and tissues are often fixed for preservation and sorting via FACS. We found that boiling in 0.05M Tris pH 7.6 with 5% SDS for 60 min yielded the highest number of phospho-peptides. Lastly, we applied Evotips Pure and phospho-bulk titration to treated Jurkat cells and identified 7 phospho-sites involved in T-cell stimulation.

Gene Incremental Learning for Single-Cell Transcriptomics

2025-11-14T10:54:03Z

Classes, as fundamental elements of Computer Vision, have been extensively studied within incremental learning frameworks. In contrast, tokens, which play essential roles in many research fields, exhibit similar characteristics of growth, yet investigations into their incremental learning remain significantly scarce. This research gap primarily stems from the holistic nature of tokens in language, which imposes significant challenges on the design of incremental learning frameworks for them. To overcome this obstacle, in this work, we turn to a type of token, gene, for a large-scale biological dataset--single-cell transcriptomics--to formulate a pipeline for gene incremental learning and establish corresponding evaluations. We found that the forgetting problem also exists in gene incremental learning, thus we adapted existing class incremental learning methods to mitigate the forgetting of genes. Through extensive experiments, we demonstrated the soundness of our framework design and evaluations, as well as the effectiveness of our method adaptations. Finally, we provide a complete benchmark for gene incremental learning in single-cell transcriptomics.