A Phylogenetic Approach to Genomic Language Modeling

2026-03-20T00:45:19Z

Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.

What You Read is What You Classify: Highlighting Attributions to Text and Text-Like Inputs

2026-03-19T17:25:16Z

At present, there are no easily understood explainable artificial intelligence (AI) methods for discrete token inputs, like text. Most explainable AI techniques do not extend well to token sequences, where both local and global features matter, because state-of-the-art models, like transformers, tend to focus on global connections. Therefore, existing explainable AI algorithms fail by (i) identifying disparate tokens of importance, or (ii) assigning a large number of tokens a low value of importance. This method for explainable AI for token-based classifiers generalizes a mask-based explainable AI algorithm for images. It starts with an Explainer neural network that is trained to create masks to hide information not relevant for classification. Then, the Hadamard product of the mask and the continuous values of the classifier's embedding layer is taken and passed through the classifier, changing the magnitude of the embedding vector but keeping the orientation unchanged. The Explainer is trained for a taxonomic classifier for nucleotide sequences and it is shown that the masked segments are less relevant to classification than the unmasked ones. This method focuses on the importance of the token as a whole (i.e., a segment of the input sequence), producing a human-readable explanation.
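
The masking step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the mask holds one positive scalar per token, broadcast across the embedding dimension, and the names `apply_token_mask` and `cosine` are hypothetical. Under that assumption, each token's embedding is rescaled but keeps its direction.

```python
import numpy as np

def apply_token_mask(embeddings, mask):
    """Scale each token's embedding vector by its mask value.

    embeddings: array of shape (num_tokens, dim)
    mask: array of shape (num_tokens,), assumed to lie in (0, 1]
    """
    embeddings = np.asarray(embeddings, dtype=float)
    mask = np.asarray(mask, dtype=float)
    # Broadcasting the per-token mask over the embedding dimension is an
    # element-wise (Hadamard) product with a row-constant mask matrix.
    return embeddings * mask[:, None]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))            # toy embeddings: 4 tokens, 8 dims
mask = np.array([1.0, 0.5, 0.1, 0.9])    # hypothetical Explainer output
masked = apply_token_mask(emb, mask)

for i in range(len(mask)):
    # Norm shrinks by the mask value; cosine similarity stays at 1.
    ratio = np.linalg.norm(masked[i]) / np.linalg.norm(emb[i])
    print(i, round(ratio, 3), round(cosine(emb[i], masked[i]), 3))
```

A mask value of exactly zero would leave the direction undefined, which is why the sketch assumes strictly positive values.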

Genomic Next-Token Predictors are In-Context Learners

2026-03-17T23:52:42Z

In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.

Moonshine.jl: a Julia package for genome-scale model-based ancestral recombination graph inference

2026-03-17T19:24:12Z

The ancestral recombination graph (ARG) is the model of choice in statistical genetics to model population ancestries. Software capable of simulating ARGs on a genome scale within a reasonable amount of time is now widely available for most practical use cases. While the inverse problem of inferring ancestries from a sample of haplotypes has seen major progress in the last decade, it does not enjoy the same level of advancement as its counterpart. Up until recently, even moderately sized samples could only be handled using heuristics. In recent years, the possibility of model-based inference for datasets closer to "real world" scenarios has become a reality, largely due to the development of threading-based samplers. This article introduces Moonshine.jl, a Julia package that has the ability, among other things, to infer ARGs for samples of thousands of human haplotypes of sizes on the order of hundreds of megabases within a reasonable amount of time. On recent hardware, our package is able to infer an ARG for samples of densely haplotyped (over one marker/kilobase) human chromosomes of sizes up to 10000 in well under a day on data simulated by msprime. Scaling up simulation on a compute cluster is straightforward thanks to a strictly single-threaded implementation. While model-based, it does not resort to threading but rather places restrictions on probability distributions typically used in simulation software in order to enforce sample consistency. In addition to being efficient, a strong emphasis is placed on ease of use and integration into the biostatistical software ecosystem.

Making Multi-Axis Gaussian Graphical Models Scalable to Millions of Cells

2026-03-17T10:19:17Z

Motivation: Networks underlie the generation and interpretation of many biological datasets: gene networks shed light on the regulatory structure of the genome, and cell networks can capture structure of the tumor micro-environment. However, most methods that learn such networks make the faulty 'independence assumption'; to learn the gene network, they assume that no cell network exists. 'Multi-axis' methods, which do not make this assumption, fail to scale beyond a few thousand cells or genes. This limits their applicability to only the smallest datasets. Results: We develop a multi-axis method capable of processing million-cell datasets within minutes. This was previously impossible, and unlocks the use of such methods on modern scRNA-seq datasets, as well as more complex datasets. We show that our method yields novel biological insights from real single-cell data, and compares favorably to the existing hdWGCNA methodology. In particular, it identifies long non-coding RNA genes that potentially have a regulatory or functional role in neuronal development. Availability and implementation: Our methodology is available as a Python package GmGM on PyPI (https://pypi.org/project/GmGM/0.5.3/). The code for all experiments performed in this paper is available on GitHub (https://github.com/BaileyAndrew/GmGM-Bioinformatics). Contact: sceba@leeds.ac.uk Supplementary information: Our proofs, and some additional experiments, are available in the supplementary material. Keywords: gaussian graphical models, multi-axis models, transcriptomics, multi-omics, scalability

TPMM: Three-component Posterior Mixture Model Enables Robust Inverton Detection in Low-Depth Metagenomes and Suggests Potential Viral Invertons

2026-03-17T07:21:07Z

Bacterial phase variation enables reversible, locus-specific phenotypic switching, often driven by DNA inversion (invertons). To identify these events, researchers commonly rely on sequencing reads that provide orientation-specific support. Metagenomic sequencing, which captures total genetic material independent of cultivation, offers a powerful platform for the comprehensive study of invertons. However, computational inverton calling from metagenomic data is difficult at low sequencing depth: hard read-support cutoffs can miss true events, while sequence-only predictors lack read-backed interpretability and uncertainty quantification. To address this, we present TPMM, a three-component posterior mixture model for inverton calling in metagenomic data. TPMM explicitly incorporates sequencing depth to formulate inverton detection as a probabilistic mixture problem. Starting from candidates flanked by inverted repeats, the model classifies the candidates into noise, low-probability, or high-probability inversion signals using read evidence. Finally, TPMM assigns posterior probabilities as soft labels and applies cumulative Bayesian False Discovery Rate control to robustly identify true invertons. On two real gut metagenomic datasets, TPMM agrees well with PhaseFinder at high depth but recovers substantially more invertons under systematic downsampling, demonstrating superior performance in sparse-data regimes. We further examine potential reversible inversion elements in viral genomes and provide supporting analyses, suggesting a broader scope for inversion-mediated regulation.
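
The final selection step can be sketched with a common formulation of cumulative Bayesian FDR control. This is an assumption about the general technique, not TPMM's actual code: candidates are ranked by posterior probability, and the largest prefix whose mean posterior error stays at or below a threshold is accepted. The function name `bayesian_fdr_select` and the threshold `alpha` are hypothetical.

```python
import numpy as np

def bayesian_fdr_select(posteriors, alpha=0.05):
    """Return a boolean mask of candidates accepted at Bayesian FDR <= alpha.

    posteriors: posterior probability that each candidate is a true event.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    order = np.argsort(-posteriors)            # most confident first
    local_error = 1.0 - posteriors[order]      # posterior error per call
    # Expected false discovery proportion among the top-k calls.
    cumulative_fdr = np.cumsum(local_error) / np.arange(1, len(order) + 1)
    passing = np.nonzero(cumulative_fdr <= alpha)[0]
    selected = np.zeros(len(posteriors), dtype=bool)
    if passing.size:
        # Errors are sorted ascending, so the passing set is a prefix.
        selected[order[: passing[-1] + 1]] = True
    return selected

calls = bayesian_fdr_select([0.99, 0.98, 0.60, 0.20, 0.97], alpha=0.05)
print(calls)
```

In this toy input, the three candidates with posteriors 0.99, 0.98, and 0.97 are accepted (cumulative FDR 0.02); adding the 0.60 candidate would push the cumulative FDR above 0.05.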

ELISA: An Interpretable Hybrid Generative AI Agent for Expression-Grounded Discovery in Single-Cell Genomics

2026-03-16T18:08:47Z

Translating single-cell RNA sequencing (scRNA-seq) data into mechanistic biological hypotheses remains a critical bottleneck, as agentic AI systems lack direct access to transcriptomic representations while expression foundation models remain opaque to natural language. Here we introduce ELISA (Embedding-Linked Interactive Single-cell Agent), an interpretable framework that unifies scGPT expression embeddings with BioBERT-based semantic retrieval and LLM-mediated interpretation for interactive single-cell discovery. An automatic query classifier routes inputs to gene marker scoring, semantic matching, or reciprocal rank fusion pipelines depending on whether the query is a gene signature, a natural language concept, or a mixture of both. Integrated analytical modules perform pathway activity scoring across 60+ gene sets, ligand--receptor interaction prediction using 280+ curated pairs, condition-aware comparative analysis, and cell-type proportion estimation, all operating directly on embedded data without access to the original count matrix. Benchmarked across six diverse scRNA-seq datasets spanning inflammatory lung disease, pediatric and adult cancers, organoid models, healthy tissue, and neurodevelopment, ELISA significantly outperforms CellWhisperer in cell type retrieval (combined permutation test, $p < 0.001$), with particularly large gains on gene-signature queries (Cohen's $d = 5.98$ for MRR). ELISA replicates published biological findings (mean composite score 0.90) with near-perfect pathway alignment and theme coverage (0.98 each), and generates candidate hypotheses through grounded LLM reasoning, bridging the gap between transcriptomic data exploration and biological discovery. Code available at: https://github.com/omaruno/ELISA-An-AI-Agent-for-Expression-Grounded-Discovery-in-Single-Cell-Genomics.git (If you use ELISA in your research, please cite this work).
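
Reciprocal rank fusion, named above as the route for mixed queries, can be sketched in its generic form. This is not ELISA's code: the function name, the toy cell-type rankings, and the smoothing constant `k=60` (a widely used default for the technique) are assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists into one ordering.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1; higher fused scores rank first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical outputs of a gene-marker pipeline and a semantic pipeline.
gene_ranking = ["T cell", "NK cell", "B cell"]
semantic_ranking = ["T cell", "monocyte", "NK cell"]

fused = reciprocal_rank_fusion([gene_ranking, semantic_ranking])
print(fused)
```

Items that appear in both lists accumulate score from each, so "NK cell" (ranked second and third) ends up ahead of "monocyte" (ranked second in only one list).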

Hecate: A Modular Genomic Compressor

2026-03-16T15:06:28Z

We present Hecate, a modular lossless genomic compression framework. It is designed around uncommon but practical source-coding choices. Unlike many single-method compressors, Hecate treats compression as a conditional coding problem over coupled FASTA/FASTQ streams (control, headers, nucleotides, case, quality, extras). It uses per-stream codecs under a shared indexed block container. Codecs include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler pipeline with custom arithmetic coding, and a blockwise Markov mixture coder with explicit model-competition signaling. This architecture yields high throughput, exact random-access slicing, and referential mode through streamwise binary differencing. In a comprehensive benchmark suite, Hecate provides the best compression vs. speed trade-offs against state-of-the-art established tools (MFCompress, NAF, bzip3, AGC), with notably stronger behaviour on large genomes and high-similarity referential settings. For the same compression ratio, Hecate is 2 to 10 times faster. When given the same time budget as other algorithms, Hecate achieves up to 5% to 10% better compression.
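
The idea of alphabet-aware packing with a side channel for out-of-alphabet residues can be illustrated as follows. This is a minimal sketch, not Hecate's codec or container format: it assumes an uppercase A/C/G/T alphabet packed at 2 bits per base, with every other symbol recorded as a (position, symbol) exception. The names `pack` and `unpack` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack a nucleotide string at 2 bits per base.

    Returns (packed bytes, exceptions, length). Symbols outside ACGT are
    stored in the exceptions side channel and packed as a placeholder.
    """
    packed = bytearray((len(seq) + 3) // 4)
    exceptions = []                       # side channel: (position, symbol)
    for i, ch in enumerate(seq):
        code = CODE.get(ch)
        if code is None:
            exceptions.append((i, ch))
            code = 0                      # placeholder, overwritten on unpack
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed), exceptions, len(seq)

def unpack(packed, exceptions, length):
    """Invert pack(): decode the 2-bit codes, then restore exceptions."""
    out = [BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for pos, ch in exceptions:
        out[pos] = ch
    return "".join(out)

seq = "ACGTNNACGTRY"
packed, exceptions, length = pack(seq)
restored = unpack(packed, exceptions, length)
print(len(seq), len(packed), exceptions, restored == seq)
```

The sketch stays lossless because the side channel carries exactly the symbols the 2-bit code cannot represent; it pays off when such symbols are rare.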

Benchmarking LLM-based agents for single-cell omics analysis

2026-03-16T09:24:30Z

Background: The surge in single-cell omics data exposes limitations in traditional, manually defined analysis workflows. AI agents offer a paradigm shift, enabling adaptive planning, executable code generation, traceable decisions, and real-time knowledge fusion. However, the lack of a comprehensive benchmark critically hinders progress. Results: We introduce a novel benchmarking evaluation system to rigorously assess agent capabilities in single-cell omics analysis. This system comprises: a unified platform compatible with diverse agent frameworks and LLMs; multidimensional metrics assessing cognitive program synthesis, collaboration, execution efficiency, bioinformatics knowledge integration, and task completion quality; and 50 diverse real-world single-cell omics analysis tasks spanning multi-omics, species, and sequencing technologies. Our evaluation reveals that Grok3-beta achieves state-of-the-art performance among tested agent frameworks. Multi-agent frameworks significantly enhance collaboration and execution efficiency over single-agent approaches through specialized role division. Attribution analyses of agent capabilities identify that high-quality code generation is crucial for task success, and self-reflection has the most significant overall impact, followed by retrieval-augmented generation (RAG) and planning. Conclusions: This work highlights persistent challenges in code generation, long-context handling, and context-aware knowledge retrieval, providing a critical empirical foundation and best practices for developing robust AI agents in computational biology.

SeekRBP: Leveraging Sequence-Structure Integration with Reinforcement Learning for Receptor-Binding Protein Identification

2026-03-16T07:03:12Z

Motivation: Receptor-binding proteins (RBPs) initiate viral infection and determine host specificity, serving as key targets for phage engineering and therapy. However, the identification of RBPs is complicated by their extreme sequence divergence, which often renders traditional homology-based alignment methods ineffective. While machine learning offers a promising alternative, such approaches struggle with severe class imbalance and the difficulty of selecting informative negative samples from heterogeneous tail proteins. Existing methods often fail to balance learning from these ``hard negatives'' while maintaining generalization. Results: We present SeekRBP, a sequence--structure framework that models negative sampling as a sequential decision-making problem. By employing a multi-armed bandit strategy, SeekRBP dynamically prioritizes informative non-RBP sequences based on real-time training feedback, complemented by a multimodal fusion of protein language and structural embeddings. Benchmarking demonstrates that SeekRBP consistently outperforms static sampling strategies. Furthermore, a case study on Vibrio phages validates that SeekRBP effectively identifies RBPs to improve host prediction, highlighting its potential for large-scale annotation and synthetic biology applications.
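
To make the bandit framing concrete, the sketch below uses the textbook UCB1 rule to choose which pool of negative samples to draw from next. It is an illustration under stated assumptions, not SeekRBP's algorithm: the abstract does not say which bandit strategy is used, and the arms (hypothetical groups of non-RBP sequences), the reward (a stand-in for how informative a sampled negative was), and the class name are all invented here.

```python
import math
import random

class UCB1NegativeSampler:
    """Textbook UCB1 over a fixed set of arms (pools of negatives)."""

    def __init__(self, num_arms, exploration=2.0):
        self.counts = [0] * num_arms
        self.mean_reward = [0.0] * num_arms
        self.exploration = exploration

    def select_arm(self):
        # Play every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)

        def ucb(arm):
            bonus = math.sqrt(
                self.exploration * math.log(total) / self.counts[arm]
            )
            return self.mean_reward[arm] + bonus

        return max(range(len(self.counts)), key=ucb)

    def update(self, arm, reward):
        # Incremental update of the running mean reward for this arm.
        self.counts[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]

# Simulated feedback: pool 1 yields the most informative negatives.
rng = random.Random(0)
hardness = [0.1, 0.8, 0.3]                # hypothetical mean rewards
sampler = UCB1NegativeSampler(num_arms=len(hardness))
for _ in range(300):
    arm = sampler.select_arm()
    reward = hardness[arm] + rng.gauss(0.0, 0.05)
    sampler.update(arm, reward)

print(sampler.counts)
```

In a real training loop the reward would come from feedback such as the loss on the sampled negatives, so the sampler shifts toward pools that currently challenge the model while still revisiting the others occasionally.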

Omics Data Discovery Agents

2026-03-13T00:03:32Z

The biomedical literature contains a vast collection of omics studies, yet most published data remain functionally inaccessible for computational reuse. When raw data are deposited in public repositories, essential information for reproducing reported results is dispersed across main text, supplementary files, and code repositories. In rarer instances where intermediate data are made available (e.g. protein abundance files), their location is irregular. In this article, we present an agentic framework that fetches omics-related articles and transforms the unstructured information into searchable research objects. Our system employs large language model (LLM) agents with access to tools for fetching omics studies, extracting article metadata, identifying and downloading published data, executing containerized quantification pipelines, and running analyses to address novel questions. We demonstrate automated metadata extraction from PubMed Central articles, achieving 80% precision for dataset identification from standard data repositories. Using model context protocol (MCP) servers to expose containerized analysis tools, our set of agents was able to identify a set of relevant articles, download the associated datasets, and re-quantify the proteomics data. The results had a 63% overlap in differentially expressed proteins when matching reported preprocessing methods. Furthermore, we show that agents can identify semantically similar studies, determine data compatibility, and perform cross-study comparisons, revealing consistent protein regulation patterns in liver fibrosis. This work establishes a foundation for converting the static biomedical literature into an executable, queryable resource that enables automated data reuse at scale.

MetaHQ: Harmonized, high-quality metadata annotations of public omics samples and studies

2026-03-12T23:01:47Z

Public omics databases like the Gene Expression Omnibus and the Sequence Read Archive offer substantial opportunities for data reuse to address novel biomedical questions. However, it is still difficult to find samples and studies of interest since they are described by free-text metadata and lack standardized annotations. To address this issue, multiple research groups have undertaken curation efforts to add standardized annotations to large collections of these data, but these annotations are fragmented across online resources and are stored in different formats subject to varying standardization criteria, hindering the integration of annotations across sources. We developed MetaHQ to harmonize and distribute standardized metadata for public omics samples. MetaHQ comprises a database with nearly 200,000 annotations from 13 sources and a user-friendly command-line interface (CLI) to query the database and retrieve annotations. The MetaHQ CLI is deployed as a Python package on PyPI at https://pypi.org/project/metahq-cli and accesses the MetaHQ database available at https://doi.org/10.5281/zenodo.18462463. Project source code and documentation are available at https://github.com/krishnanlab/meta-hq.

A Multi-Label Temporal Convolutional Framework for Transcription Factor Binding Characterization

2026-03-12T15:42:37Z

Transcription factors (TFs) regulate gene expression through complex and co-operative mechanisms. While many TFs act together, the logic underlying TF binding and their interactions is not yet fully understood. Most current approaches for TF binding site prediction focus on individual TFs and binary classification tasks, without a full analysis of the possible interactions among various TFs. In this paper we investigate DNA TF binding site recognition as a multi-label classification problem, achieving reliable predictions for multiple TFs on DNA sequences retrieved from public repositories. Our deep learning models are based on Temporal Convolutional Networks (TCNs), which are able to predict multiple TF binding profiles, capturing correlations among TFs and their cooperative regulatory mechanisms. Our results suggest that multi-label learning leading to reliable predictive performances can reveal biologically meaningful motifs and co-binding patterns consistent with known TF interactions, while also suggesting novel relationships and cooperation among TFs.

Extending Sequence Length is Not All You Need: Effective Integration of Multimodal Signals for Gene Expression Prediction

2026-03-12T10:31:50Z

Gene expression prediction, which predicts mRNA expression levels from DNA sequences, presents significant challenges. Previous works often focus on extending input sequence length to locate distal enhancers, which may influence target genes from hundreds of kilobases away. Our work first reveals that for current models, long sequence modeling can decrease performance. Even carefully designed algorithms only mitigate the performance degradation caused by long sequences. Instead, we find that proximal multimodal epigenomic signals near target genes prove more essential. Hence we focus on how to better integrate these signals, which has been overlooked. We find that different signal types serve distinct biological roles, with some directly marking active regulatory elements while others reflect background chromatin patterns that may introduce confounding effects. Simple concatenation may lead models to develop spurious associations with these background patterns. To address this challenge, we propose Prism, a framework that learns multiple combinations of high-dimensional epigenomic features to represent distinct background chromatin states and uses backdoor adjustment to mitigate confounding effects. Our experimental results demonstrate that proper modeling of multimodal epigenomic signals achieves state-of-the-art performance using only short sequences for gene expression prediction.
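
The backdoor adjustment mentioned above follows the standard formula P(y | do(x)) = sum over z of P(y | x, z) P(z), where z is the confounder. The toy numbers below are invented purely to show how the adjusted estimate differs from the naive conditional one; they are not from the paper, and Prism's learned chromatin states are replaced here by a single binary confounder.

```python
import numpy as np

def backdoor_adjust(p_y_given_xz, p_z):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z), for each x."""
    return np.asarray(p_y_given_xz, dtype=float) @ np.asarray(p_z, dtype=float)

def naive_conditional(p_y_given_xz, p_z_given_x):
    """P(Y=1 | X=x) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z | X=x), for each x."""
    p_y = np.asarray(p_y_given_xz, dtype=float)
    return (p_y * np.asarray(p_z_given_x, dtype=float)).sum(axis=1)

# Rows index x (signal absent/present); columns index z (background state).
p_y_given_xz = [[0.2, 0.6],
                [0.3, 0.7]]
p_z = [0.5, 0.5]                    # marginal distribution of the confounder
p_z_given_x = [[0.9, 0.1],          # z is strongly correlated with x
               [0.1, 0.9]]

adjusted = backdoor_adjust(p_y_given_xz, p_z)
naive = naive_conditional(p_y_given_xz, p_z_given_x)
print("adjusted effect:", adjusted[1] - adjusted[0])
print("naive effect:   ", naive[1] - naive[0])
```

Here the naive contrast (about 0.42) overstates the adjusted effect (about 0.10) because the background state z drives both the signal and the outcome, which is the kind of spurious association the abstract attributes to simple concatenation.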

A Standardized Framework For Evaluating Gene Expression Generative Models

2026-03-11T19:11:03Z

The rapid development of generative models for single-cell gene expression data has created an urgent need for standardized evaluation frameworks. Current evaluation practices suffer from inconsistent metric implementations, incomparable hyperparameter choices, and a lack of biologically grounded metrics. We present Generated Genetic Expression Evaluator (GGE), an open-source Python framework that addresses these challenges by providing a comprehensive suite of distributional metrics with explicit computation space options and biologically motivated evaluation through differentially expressed gene (DEG)-focused analysis and perturbation-effect correlation, enabling standardized reporting and reproducible benchmarking. Through extensive analysis of the single-cell generative modeling literature, we identify that no standardized evaluation protocol exists. Methods report incomparable metrics computed in different spaces with different hyperparameters. We demonstrate that metric values vary substantially depending on implementation choices, highlighting the critical need for standardization. GGE enables fair comparison across generative approaches and accelerates progress in perturbation response prediction, cellular identity modeling, and counterfactual inference.