What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses

2026-03-05T20:03:46Z

When biological foundation models such as scGPT and Geneformer process single-cell gene expression, what geometric and topological structure forms in their internal representations? Is that structure biologically meaningful or a training artifact, and how confident should we be in such claims? We address these questions through autonomous large-scale hypothesis screening: an AI-driven executor-brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations, covering persistent homology, manifold distances, cross-model alignment, community structure, and directed topology, all with explicit null controls and disjoint gene-pool splits. Three principal findings emerge. First, the models learn genuine geometric structure. Gene embedding neighborhoods exhibit non-trivial topology, with persistent homology significant in 11 of 12 transformer layers at p < 0.05 in the weakest domain and 12 of 12 in the other two. A multi-level distance hierarchy shows that manifold-aware metrics outperform Euclidean distance for identifying regulatory gene pairs, and graph community partitions track known transcription factor target relationships. Second, this structure is shared across independently trained models. CCA alignment between scGPT and Geneformer yields canonical correlation of 0.80 and gene retrieval accuracy of 72 percent, yet none of 19 tested methods reliably recover gene-level correspondences. The models agree on the global shape of gene space but not on precise gene placement. Third, the structure is more localized than it first appears. Under stringent null controls applied across all null families, robust signal concentrates in immune tissue, while lung and external lung signals weaken substantially.

Machine Learning for analysis of Multiple Sclerosis cross-tissue bulk and single-cell transcriptomics data

2026-03-05T16:09:12Z

Multiple Sclerosis (MS) is a chronic autoimmune disease of the central nervous system whose molecular mechanisms remain incompletely understood. In this study, we developed an end-to-end machine learning pipeline to analyze transcriptomic data from peripheral blood mononuclear cells and cerebrospinal fluid, integrating both bulk microarray and single-cell RNA sequencing datasets (concentrating on CD4+ and B-cells). After rigorous preprocessing, batch correction, and gene declustering, XGBoost classifiers were trained to distinguish MS patients from healthy controls. Explainable AI tools, namely SHapley Additive exPlanations (SHAP), were employed to identify key genes driving classification, and results were compared with Differential Expression Analysis (DEA). SHAP-prioritized genes were further investigated through interaction networks and pathway enrichment analyses. The models achieved strong performance, particularly in CSF B-cells (AUC=0.94) and microarray (AUC=0.86). SHAP gene selection proved to be complementary to classical DEA. Gene clusters identified across multiple datasets highlighted immune activation, non-canonical immune checkpoints (ITK, CLEC2D, KLRG1, CEACAM1), ribosomal and translational programs, ubiquitin-proteasome regulation, lipid trafficking, and Epstein-Barr virus-related pathways. Our integrative and explainable framework reveals complementary insights beyond conventional analysis and provides novel mechanistic hypotheses and potential biomarkers for MS pathogenesis.

In vitro binding energies capture Klf4 occupancy across the human genome

2026-03-04T09:32:15Z

Transcription factors (TFs) regulate gene expression by binding to specific genomic loci determined by DNA sequence. Their sequence specificity is commonly summarized by a consensus binding motif. However, eukaryotic genomes contain billions of low-affinity DNA sequences to which TFs associate with a sequence-dependent binding energy. We currently lack insight into how the genomic sequence defines this spectrum of binding energies and the resulting pattern of TF localization. Here, we set out to obtain a quantitative understanding of sequence-dependent TF binding to both motif and non-motif sequences. We achieve this by first pursuing accurate measurements of physical binding energies of the human TF Klf4 to a library of short DNA sequences in a fluorescence-anisotropy-based bulk competitive binding assay. Second, we show that the highly non-linear sequence dependence of Klf4 binding energies can be captured by combining a linear model of binding energies with an Ising model of the coupled recognition of nucleotides by a TF. We find that this statistical mechanics model parametrized by our in vitro measurements captures Klf4 binding patterns on individual long DNA molecules stretched in the optical tweezer, and is predictive for Klf4 occupancy across the entire human genome without additional fit parameters.
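
To make the "linear model of binding energies" concrete, here is a minimal sketch of an additive energy model with a two-state occupancy formula. It is an illustration under stated assumptions, not the authors' model. The energy matrix values, the 3-bp site width, the chemical potential `mu`, and all function names are hypothetical, and the Ising coupling between nucleotides described in the abstract is deliberately left out.

```python
import math

# Hypothetical per-position energy contributions in units of kT for a
# 3-bp site. These numbers are illustrative only, not measured Klf4 values.
ENERGY_MATRIX = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 1.5},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 0.5},
]


def site_energy(site):
    """Additive (linear) model: sum independent per-position energies."""
    if len(site) != len(ENERGY_MATRIX):
        raise ValueError("site length must match the energy matrix width")
    return sum(column[base] for column, base in zip(ENERGY_MATRIX, site))


def occupancy(energy, mu):
    """Two-state occupancy for binding energy and chemical potential in kT."""
    return 1.0 / (1.0 + math.exp(energy - mu))


def occupancy_profile(sequence, mu):
    """Slide the site window along the sequence and return one occupancy per start."""
    width = len(ENERGY_MATRIX)
    return [
        occupancy(site_energy(sequence[i:i + width]), mu)
        for i in range(len(sequence) - width + 1)
    ]


if __name__ == "__main__":
    for start, occ in enumerate(occupancy_profile("TACGT", mu=0.0)):
        print(start, round(occ, 4))
```

In this toy setting the lowest-energy window ("ACG") gets the highest occupancy, while the other windows still receive a small, sequence-dependent occupancy, which is the kind of low-affinity binding the abstract is concerned with.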

Causal Circuit Tracing Reveals Distinct Computational Architectures in Single-Cell Foundation Models: Inhibitory Dominance, Biological Coherence, and Cross-Model Convergence

2026-03-04T03:03:56Z

Motivation: Sparse autoencoders (SAEs) decompose foundation model activations into interpretable features, but causal feature-to-feature interactions across network depth remain unknown for biological foundation models. Results: We introduce causal circuit tracing by ablating SAE features and measuring downstream responses, and apply it to Geneformer V2-316M and scGPT whole-human across four conditions (96,892 edges, 80,191 forward passes). Both models show approximately 53 percent biological coherence and 65 to 89 percent inhibitory dominance, invariant to architecture and cell type. scGPT produces stronger effects (mean absolute d = 1.40 vs. 1.05) with more balanced dynamics. Cross-model consensus yields 1,142 conserved domain pairs (10.6x enrichment, p < 0.001). Disease-associated domains are 3.59x more likely to be consensus. Gene-level CRISPRi validation shows 56.4 percent directional accuracy, confirming co-expression rather than causal encoding.

Learning functional groups in complex microbiomes

2026-03-03T22:03:59Z

From soil to the gut, communities composed of thousands of microbes perform functions such as carbon sequestration and immune system regulation. Here, we introduce a data-driven approach that explains how community function can be traced to just a few groups of microbes or genes. In gut communities, our neural-network based clustering algorithm correctly recovers known functional groups. In the ocean metagenome, it distills ~500 gene modules down to three sparse groups highlighting survival strategies at different depths. In soils, it distills ~4400 bacterial species into two groups that enter a mathematical model of nitrate metabolism. By combining interpretable ML with strain isolation and sequencing experiments, we connect the metabolic specialization of each group to community-wide responses to perturbations. This integrated approach yields simple structure-function maps of microbiomes, allowing the discovery of molecular mechanisms underlying human and environmental health. More broadly, we illustrate how to do function-informed dimensionality reduction in biology.

Sparse autoencoders reveal organized biological knowledge but minimal regulatory logic in single-cell foundation models: a comparative atlas of Geneformer and scGPT

2026-03-03T13:05:11Z

Background: Single-cell foundation models such as Geneformer and scGPT encode rich biological information, but whether this includes causal regulatory logic rather than statistical co-expression remains unclear. Sparse autoencoders (SAEs) can resolve superposition in neural networks by decomposing dense activations into interpretable features, yet they have not been systematically applied to biological foundation models. Results: We trained TopK SAEs on residual stream activations from all layers of Geneformer V2-316M (18 layers, d=1152) and scGPT whole-human (12 layers, d=512), producing atlases of 82525 and 24527 features, respectively. Both atlases confirm massive superposition, with 99.8 percent of features invisible to SVD. Systematic characterization reveals rich biological organization: 29 to 59 percent of features annotate to Gene Ontology, KEGG, Reactome, STRING, or TRRUST, with U-shaped layer profiles reflecting hierarchical abstraction. Features organize into co-activation modules (141 in Geneformer, 76 in scGPT), exhibit causal specificity (median 2.36x), and form cross-layer information highways (63 to 99.8 percent). When tested against genome-scale CRISPRi perturbation data, only 3 of 48 transcription factors (6.2 percent) show regulatory-target-specific feature responses. A multi-tissue control yields marginal improvement (10.4 percent, 5 of 48 TFs), establishing model representations as the bottleneck. Conclusions: These models have internalized organized biological knowledge, including pathway membership, protein interactions, functional modules, and hierarchical abstraction, yet they encode minimal causal regulatory logic. We release both feature atlases as interactive web platforms enabling exploration of more than 107000 features across 30 layers of two leading single-cell foundation models.
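
As a rough illustration of the TopK sparsity constraint used in TopK SAEs, the sketch below keeps the k largest pre-activations of a feature vector and zeroes the rest. The function name and the toy input are hypothetical, and the encoder and decoder weights, the training loop, and the values of k used in the paper are not shown here.

```python
def topk_activation(pre_activations, k):
    """Keep the k largest pre-activations and set every other entry to zero."""
    if not 0 < k <= len(pre_activations):
        raise ValueError("k must be between 1 and the number of features")
    # Indices of the k largest values; ties keep the earlier index.
    ranked = sorted(
        range(len(pre_activations)),
        key=lambda i: pre_activations[i],
        reverse=True,
    )
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


if __name__ == "__main__":
    # Toy pre-activations for five hypothetical SAE features.
    sparse = topk_activation([0.1, 2.0, -1.0, 0.7, 1.5], k=2)
    print(sparse)
```

Because exactly k features are active per input, sparsity is fixed by construction rather than tuned through a penalty term.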

GPU-accelerated single-cell analysis at scale with rapids-singlecell

2026-03-02T21:23:25Z

Single-cell sequencing technologies reveal cellular heterogeneity at high resolution, advancing our understanding of biological complexity. As datasets start to scale to tens of millions of cells, computational workflows face substantial bottlenecks, with CPU-based analytical pipelines requiring hours or days for routine processing steps like filtering, normalization, and clustering. These scalability limitations fundamentally restrict common interactive data exploration and iterative hypothesis testing. Here we introduce rapids-singlecell, a GPU-accelerated framework that integrates natively with the scverse ecosystem and operates directly on the AnnData data structure, delivering orders-of-magnitude speedups for single-cell workflows. Built on CuPy arrays and the NVIDIA CUDA-X Data Science (RAPIDS) ecosystem, rapids-singlecell provides near drop-in GPU replacements for core scanpy-based analysis steps. Across standard single-cell workflows such as preprocessing, dimensionality reduction, neighborhood graph construction, clustering, and batch correction, rapids-singlecell achieves speedups of up to several hundred-fold compared to optimized CPU baselines. This reduces analysis time from hours to minutes on standard hardware, while maintaining consistent biological interpretations. These performance improvements make it possible to analyze large datasets in close to real time, without the need for data splitting. Together with real-time parameter tuning and iterative workflows, rapids-singlecell makes interactive large-scale single-cell analysis possible.

D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation

2026-03-02T12:05:21Z

Early DNA foundation models adopted BERT-style training, achieving good performance on DNA understanding tasks but lacking generative capabilities. Recent autoregressive models enable DNA generation, but employ left-to-right causal modeling that is suboptimal for DNA where regulatory relationships are inherently bidirectional. We present D3LM (Discrete DNA Diffusion Language Model), which unifies bidirectional representation learning and DNA generation through masked diffusion. D3LM directly adopts the Nucleotide Transformer (NT) v2 architecture but reformulates the training objective as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation capabilities within a single model. Compared to NT v2 of the same size, D3LM achieves improved performance on understanding tasks. Notably, on regulatory element generation, D3LM achieves an SFID of 10.92, closely approaching real DNA sequences (7.85) and substantially outperforming the previous best result of 29.16 from autoregressive models. Our work suggests diffusion language models as a promising paradigm for unified DNA foundation models. We further present the first systematic study of masked diffusion models in the DNA domain, investigating practical design choices such as tokenization schemes and sampling strategies, thereby providing empirical insights and a solid foundation for future research. D3LM has been released at https://huggingface.co/collections/Hengchang-Liu/d3lm.

Large Language Models in Bioinformatics: A Survey

2026-03-01T13:29:26Z

Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single-cell data. This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics. We also discuss several key challenges, including data scarcity, computational complexity, and cross-omics integration, and explore future directions such as multimodal learning, hybrid AI models, and clinical applications. By offering a comprehensive perspective, this paper underscores the transformative potential of LLMs in driving innovations in bioinformatics and precision medicine.

Graph-Based Multi-Omics Integration Improves Subtype Recovery and Survival Prediction Over Classical Integration Strategies in TCGA-BRCA

2026-02-27T13:56:30Z

Background. Breast cancer comprises at least five molecular subtypes with distinct prognoses, yet PAM50 classification relies on transcriptomics alone. Whether integrating DNA methylation and copy number data improves subtype recovery and survival prediction over single-omic baselines remains an open question. Methods. We applied Similarity Network Fusion (SNF) to n = 644 TCGA-BRCA patients with matched RNA-seq, 450k DNA methylation, and GISTIC2 copy number profiles. Per-modality patient similarity networks were iteratively fused (K = 20, T = 20, u = 0.5) and partitioned by spectral clustering; k = 2 was pre-specified on eigengap and silhouette criteria. SNF was benchmarked against RNA-only, CNV-only, methylation-only, and early concatenation baselines using PAM50 NMI for subtype recovery and out-of-fold concordance index (OOF C-index) from a Ridge Cox model with N = 1,000 bootstrap CIs for pairwise comparisons. Results. SNF produced a stable two-cluster partition (stability ARI = 1.00, silhouette = 0.228), with NMI = 0.495 versus PAM50, exceeding RNA-only (0.428) and early concatenation (0.175). IHC receptor data confirmed cluster biology independently (ER+: 92.8% vs 15.6%; triple-negative: 1.0% vs 45.4%; both p < 10^-100). SNF achieved an OOF C-index of 0.681 (95% CI 0.610-0.760), significantly outperforming CNV-only (Delta = +0.122, CI 0.020-0.211); the advantage over RNA-only (Delta = +0.049, CI -0.036-0.144) did not exclude zero. Conclusion. Graph-based multi-omics fusion recovers breast cancer subtype biology more faithfully than feature concatenation and outperforms the weakest unimodal baselines in survival prediction. The improvement over RNA-seq alone is positive in direction but not yet statistically conclusive at this cohort size, pointing to the trade-off between integration complexity and the sample sizes needed to quantify its marginal benefit.
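
For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of Harrell's C-index on right-censored data. It is a simplified illustration, not the paper's evaluation code. The function name and the toy data are hypothetical, pairs with tied survival times are skipped, and the out-of-fold splitting, Ridge Cox fitting, and bootstrap CIs are not reproduced.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event and a
    strictly shorter time than subject j. It is concordant when the model
    assigns i the higher risk. Tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy cohort: survival times, event indicators (1 = event), predicted risks.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.8, 0.1]
    print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.681 and the pairwise Delta values should be read.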

BEDCrypt: Privacy-preserving interval analytics with homomorphic encryption

2026-02-25T15:15:09Z

Motivation. Genomic data and derived interval datasets can carry sensitive information, and the analysis itself can reveal an analyst's intent. As genomic workloads are increasingly outsourced to third-party infrastructure, there is a need for privacy-preserving technologies that protect both the data and the queried loci. Results. We present BEDCrypt, a privacy-preserving system for genomic interval analytics based on homomorphic encryption in an honest-but-curious server setting. The server operates only on encrypted data and returns encrypted answers that the client decrypts locally, enabling core functionalities such as coverage summaries, interval intersections, proximity (window-style) queries, and set-similarity statistics, without revealing plaintext intervals or query genomic locations to the server.

Effects of Training Data Quality on Classifier Performance

2026-02-25T00:29:51Z

We describe extensive numerical experiments assessing and quantifying how classifier performance depends on the quality of the training data, a frequently neglected component of the analysis of classifiers. More specifically, in the scientific context of metagenomic assembly of short DNA reads into "contigs," we examine the effects of degrading the quality of the training data by multiple mechanisms, and for four classifiers -- Bayes classifiers, neural nets, partition models and random forests. We investigate both individual behavior and congruence among the classifiers. We find breakdown-like behavior that holds for all four classifiers, as degradation increases and they move from being mostly correct to only coincidentally correct, because they are wrong in the same way. In the process, a picture of spatial heterogeneity emerges: as the training data move farther from analysis data, classifier decisions degenerate, the boundary becomes less dense, and congruence increases.

Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations

2026-02-24T17:57:59Z

Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening (183 hypotheses tested), revealing that the model organizes genes into a structured biological coordinate system rather than an opaque feature space. The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one pole and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein-protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000 across n = 5 STRING confidence quintiles, p = 0.017). In a compact six-dimensional spectral subspace, the model distinguishes transcription factors from their target genes (AUROC = 0.744, all 12 layers significant). Early layers preserve which specific genes regulate which targets, while deeper layers compress this into a coarser regulator versus regulated distinction. Repression edges are geometrically more prominent than activation edges, and B-cell master regulators BATF and BACH2 show convergence toward the B-cell identity anchor PAX5 across transformer depth. Cell-type marker genes cluster with high fidelity (AUROC = 0.851). Residual-stream geometry encodes biological structure complementary to attention patterns. These results indicate that biological transformers learn an interpretable internal model of cellular organization, with implications for regulatory network inference, drug target prioritization, and model auditing.
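
The reported p = 0.017 for a perfect rank correlation over n = 5 quintiles is consistent with an exact two-sided permutation test, where only 2 of the 5! = 120 orderings give |rho| = 1. The abstract does not say which test was used, so the sketch below is one plausible reading rather than the authors' procedure, and the function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def spearman_rho(x_ranks, y_ranks):
    """Spearman correlation for untied ranks via the sum of squared rank differences."""
    n = len(x_ranks)
    d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def exact_two_sided_p(observed_rho, n):
    """Share of all n! rank orderings whose |rho| is at least the observed |rho|."""
    base = tuple(range(1, n + 1))
    hits = sum(
        1
        for perm in permutations(base)
        if abs(spearman_rho(base, perm)) >= abs(observed_rho) - 1e-12
    )
    return hits / factorial(n)


if __name__ == "__main__":
    p = exact_two_sided_p(1.0, 5)
    print(round(p, 3))
```

With only five points, 2/120 is the smallest two-sided p-value this test can produce, so the significance level reflects the sample size as much as the strength of the trend.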

VAE-MS: An Asymmetric Variational Autoencoder for Mutational Signature Extraction

2026-02-24T08:08:38Z

Mutational signature analysis has emerged as a powerful method for uncovering the underlying biological processes driving cancer development. However, the signature extraction process, typically performed using non-negative matrix factorization (NMF), often lacks reliability and clinical applicability. To address these limitations, several solutions have been introduced, including the use of neural networks to achieve more accurate estimates and probabilistic methods to better capture natural variation in the data. In this work, we introduce a Variational Autoencoder for Mutational Signatures (VAE-MS), a novel model that leverages both an asymmetric architecture and probabilistic methods for the extraction of mutational signatures. VAE-MS is compared with three state-of-the-art models for mutational signature extraction: SigProfilerExtractor, the NMF-based gold standard; MUSE-XAE, an autoencoder that employs an asymmetric design without probabilistic components; and SigneR, a Bayesian NMF model, to illustrate the strength of combining a nonlinear extraction with a probabilistic model. In the ability to reconstruct input data and generalize to unseen data, models with probabilistic components (VAE-MS, SigneR) dramatically outperformed models without (SigProfilerExtractor, MUSE-XAE). The NMF-based models (SigneR, SigProfilerExtractor) had the most accurate reconstructions in simulated data, while VAE-MS reconstructed more accurately on real cancer data. Upon evaluating the ability to extract signatures consistently, no model exhibited a clear advantage over the others. Software for VAE-MS is available at https://github.com/CLINDA-AAU/VAE-MS.

CrossLLM-Mamba: Multimodal State Space Fusion of LLMs for RNA Interaction Prediction

2026-02-23T19:57:11Z

Accurate prediction of RNA-associated interactions is essential for understanding cellular regulation and advancing drug discovery. While Biological Large Language Models (BioLLMs) such as ESM-2 and RiNALMo provide powerful sequence representations, existing methods rely on static fusion strategies that fail to capture the dynamic, context-dependent nature of molecular binding. We introduce CrossLLM-Mamba, a novel framework that reformulates interaction prediction as a state-space alignment problem. By leveraging bidirectional Mamba encoders, our approach enables deep "crosstalk" between modality-specific embeddings through hidden state propagation, modeling interactions as dynamic sequence transitions rather than static feature overlaps. The framework maintains linear computational complexity, making it scalable to high-dimensional BioLLM embeddings. We further incorporate Gaussian noise injection and Focal Loss to enhance robustness against hard-negative samples. Comprehensive experiments across three interaction categories, RNA-protein, RNA-small molecule, and RNA-RNA, demonstrate that CrossLLM-Mamba achieves state-of-the-art performance. On the RPI1460 benchmark, our model attains an MCC of 0.892, surpassing the previous best by 5.2%. For binding affinity prediction, we achieve Pearson correlations exceeding 0.95 on riboswitch and repeat RNA subtypes. These results establish state-space modeling as a powerful paradigm for multi-modal biological interaction prediction.