http://arxiv.org/abs/2505.08341v2 Benchmarking AI scientists for omics data driven biological discovery 2026-01-18T09:39:51Z

Recent advances in large language models have enabled the emergence of AI scientists that aim to autonomously analyze biological data and assist scientific discovery. Despite rapid progress, it remains unclear to what extent these systems can extract meaningful biological insights from real experimental data. Existing benchmarks either evaluate reasoning in the absence of data or focus on predefined analytical outputs, failing to reflect realistic, data-driven biological research. Here, we introduce BAISBench (Biological AI Scientist Benchmark), a benchmark for evaluating AI scientists on real single-cell transcriptomic datasets. BAISBench comprises two tasks: cell type annotation across 15 expert-labeled datasets, and scientific discovery through 193 multiple-choice questions derived from biological conclusions reported in 41 published single-cell studies. We evaluated several representative AI scientists using BAISBench and, to provide a human performance baseline, invited six graduate-level bioinformaticians to collectively complete the same tasks. The results show that while current AI scientists fall short of fully autonomous biological discovery, they already demonstrate substantial potential in supporting data-driven biological research. These results position BAISBench as a practical benchmark for characterizing the current capabilities and limitations of AI scientists in biological research. We expect BAISBench to guide the development of more capable AI scientists and to help biologists identify AI systems that can effectively support real-world research workflows. BAISBench is available at https://github.com/EperLuo/BAISBench and https://huggingface.co/datasets/EperLuo/BaisBench.

2025-05-13T08:33:54Z Erpai Luo Jinmeng Jia Yifan Xiong Xiangyu Li Xiaobo Guo Baoqi Yu Minsheng Hao Lei Wei Xuegong Zhang http://arxiv.org/abs/2510.24987v2 scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration 2026-01-17T05:05:58Z

Advances in single-cell sequencing have enabled high-resolution profiling of diverse molecular modalities, but integrating unpaired multi-omics single-cell data remains challenging. Existing approaches either rely on pair information or prior correspondences, or require computing a global pairwise coupling matrix, limiting their scalability and flexibility. In this paper, we introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration. Specifically, we disentangle each cell's latent representations into modality-shared and modality-specific components using a well-designed $\beta$-VAE architecture, augmented with isometric regularization to preserve intra-omics biological heterogeneity, an adversarial objective to encourage cross-modal alignment, and a masked reconstruction loss to address missing features across modalities. Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation. Crucially, it scales effectively to large-scale datasets and supports integration of more than two omics, offering a powerful and flexible solution for large-scale multi-omics data integration and downstream biological discovery.

2025-10-28T21:28:39Z Accepted at NeurIPS 2025 (Spotlight) Jianle Sun Chaoqi Liang Ran Wei Peng Zheng Lei Bai Wanli Ouyang Hongliang Yan Peng Ye http://arxiv.org/abs/2601.10995v1 GP-DHT: A Dual-Head Transformer with Contrastive Learning for Predicting Gene Regulatory Relationships across Species from Single-Cell Data 2026-01-16T05:04:57Z

Gene regulatory networks (GRNs) are essential for understanding cell fate decisions and disease mechanisms, yet cross-species GRN inference from single-cell RNA-seq data remains challenging due to noise, sparsity, and cross-species distribution shifts. We propose GP-DHT (GenePair DualHeadTransformer), a cross-species single-cell GRN inference framework that models genes and cells in a heterogeneous graph with multi-level expression relations and learns structured regulatory representations via multi-relational graph attention. A dual-head Transformer further captures local gene pair regulatory dependencies and global cross-cell interaction patterns. To improve robustness under sparse and cross-species settings, GP-DHT introduces gene pair level supervised contrastive learning. Experiments on seven BEELINE benchmark datasets show consistent gains over representative baselines, improving AUROC and AUPRC by approximately 5 to 7 percent on most datasets. GP-DHT also recovers known regulatory modules and helps distinguish conserved from species-specific regulations.

2026-01-16T05:04:57Z Shuai Yan Qingzhi Yu Wengfeng Dai Xiang Cheng http://arxiv.org/abs/2601.10464v1 MitoFREQ: A Novel Approach for Mitogenome Frequency Estimation from Top-level Haplogroups and Single Nucleotide Variants 2026-01-15T14:51:17Z

Lineage marker population frequencies can serve as one way to express evidential value in forensic genetics. However, for high-quality whole mitochondrial DNA genome sequences (mitogenomes), population data remain limited. In this paper, we offer a new method, MitoFREQ, for estimating the population frequencies of mitogenomes. MitoFREQ uses the mitogenome resources HelixMTdb and gnomAD, harbouring information from 195,983 and 56,406 mitogenomes, respectively. Neither HelixMTdb nor gnomAD can be queried directly for individual mitogenome frequencies, but both offer single nucleotide variant (SNV) allele frequencies for each of 30 "top-level" haplogroups (TLHG). We propose using the HelixMTdb and gnomAD resources by classifying a given mitogenome within the TLHG scheme and then taking the frequency of its rarest SNV within that TLHG, weighted by the TLHG frequency. We show that this method is guaranteed to provide a higher population frequency estimate than if a refined haplogroup and its SNV frequencies were used. Further, we show that top-level haplogrouping can be achieved by using only 227 specific positions for 99.9% of the tested mitogenomes, potentially making the method available for low-quality samples. The method was tested on two types of datasets: high-quality forensic reference datasets and a diverse collection of scrutinised mitogenomes from GenBank. This dual evaluation demonstrated that the approach is robust across both curated forensic data and broader population-level sequences. The method produced likelihood ratios in the range of 100-100,000, demonstrating its potential to strengthen the statistical evaluation of forensic mtDNA evidence. We have developed an open-source R package `mitofreq` that implements our method, including a Shiny app where custom TLHG frequencies can be supplied.
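
The estimate described above (rarest SNV frequency within the TLHG, weighted by the TLHG frequency) can be illustrated with a minimal Python sketch. This is not the `mitofreq` package API: the function name, the SNV labels, and all numbers are hypothetical, and taking the likelihood ratio as the reciprocal of the estimated frequency is an assumption made here for illustration only.

```python
def mitogenome_frequency(tlhg_freq, snv_freqs_in_tlhg):
    """Estimate a mitogenome population frequency (illustrative sketch).

    tlhg_freq: population frequency of the assigned top-level haplogroup.
    snv_freqs_in_tlhg: allele frequency of each SNV carried by the
        mitogenome, conditional on that top-level haplogroup.
    """
    if not snv_freqs_in_tlhg:
        # Assumption: with no SNV information, fall back to the TLHG frequency.
        return tlhg_freq
    # Weight the rarest within-haplogroup SNV frequency by the TLHG frequency.
    return tlhg_freq * min(snv_freqs_in_tlhg.values())


# Hypothetical inputs, not taken from HelixMTdb or gnomAD.
tlhg_freq = 0.10
snv_freqs = {"snv_a": 0.95, "snv_b": 0.60, "snv_c": 0.02}

estimate = mitogenome_frequency(tlhg_freq, snv_freqs)
# Assumption: likelihood ratio taken as 1 / estimated frequency.
likelihood_ratio = 1.0 / estimate
print(estimate, likelihood_ratio)
```

Using only the single rarest SNV, rather than a product over all SNVs, keeps the estimate from shrinking with every additional variant, which fits the abstract's description of the estimate as higher than one based on a refined haplogroup.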

2026-01-15T14:51:17Z Mikkel Meyer Andersen Nicole Huber Kimberly S Andreaggi Tóra Oluffa Stenberg Olsen Walther Parson Charla Marshall 10.1016/j.fsigen.2026.103526 http://arxiv.org/abs/2511.12205v4 LCPan: efficient variation graph construction using Locally Consistent Parsing 2026-01-15T07:54:33Z

Efficient and consistent string processing is critical in an era of exponentially growing genomic data. Locally Consistent Parsing (LCP) addresses this need by partitioning an input genome string into short, exactly matching substrings (e.g., "cores"), ensuring consistency across partitions. Labeling the cores of an input string consistently not only provides a compact representation of the input but also enables the reapplication of LCP to refine the cores over multiple iterations, providing a progressively longer and more informative set of substrings for downstream analyses. We present the first iterative implementation of LCP with Lcptools and demonstrate its effectiveness in identifying cores with minimal collisions. Experimental results show that the number of cores at the $i$-th iteration is $O(n/c^i)$ for $c \approx 2.34$, while the average length and the average distance between consecutive cores are $O(c^i)$. Compared to the popular sketching techniques, LCP produces significantly fewer cores, enabling a more compact representation and faster analyses. To demonstrate the advantages of LCP in genomic string processing in terms of computation and memory efficiency, we also introduce LCPan, an efficient variation graph constructor. We show that LCPan generates variation graphs >10x faster than vg, while using >13x less memory.
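
To get a feel for the reported scaling, the short Python sketch below evaluates n/c^i and c^i for a few iterations. It only illustrates the growth rates: the asymptotic bounds hide constant factors, so these are not predicted core counts, and the genome length is an assumed round figure rather than a value from the paper.

```python
C = 2.34  # empirical constant reported in the abstract


def approx_core_count(n, iteration, c=C):
    """Order-of-magnitude core count after `iteration` rounds: n / c^i."""
    return n / c ** iteration


def approx_core_span(iteration, c=C):
    """Order-of-magnitude core length or spacing after `iteration` rounds: c^i."""
    return c ** iteration


# Assumption: a roughly human-sized genome of 3.1 billion bases.
genome_length = 3_100_000_000
for i in range(1, 9):
    print(i, round(approx_core_count(genome_length, i)), round(approx_core_span(i), 1))
```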

2025-11-15T13:17:51Z Akmuhammet Ashyralyyev Zülal Bingöl Begüm Filiz Öz Kaiyuan Zhu Salem Malikic Uzi Vishkin S. Cenk Sahinalp Can Alkan http://arxiv.org/abs/2601.09758v1 Detecting Batch Heterogeneity via Likelihood Clustering 2026-01-14T01:49:21Z

Batch effects represent a major confounder in genomic diagnostics. In copy number variant (CNV) detection from NGS, many algorithms compare read depth between test samples and a reference sample, assuming they are process-matched. When this assumption is violated, with causes ranging from reagent lot changes to multi-site processing, the reference becomes inappropriate, introducing false CNV calls or masking true pathogenic variants. Detecting such heterogeneity before downstream analysis is critical for reliable clinical interpretation. Existing batch effect detection methods either cluster samples based on raw features, risking conflation of biological signal with technical variation, or require known batch labels that are frequently unavailable. We introduce a method that addresses both limitations by clustering samples according to their Bayesian model evidence. The central insight is that evidence quantifies compatibility between data and model assumptions: technical artifacts violate these assumptions and reduce evidence, whereas biological variation, including CNV status, is anticipated by the model and yields high evidence. This asymmetry provides a discriminative signal that separates batch effects from biology. We formalize heterogeneity detection as a likelihood ratio test for mixture structure in evidence space, using parametric bootstrap calibration to ensure conservative false positive rates. We validate our approach on synthetic data demonstrating proper Type I error control, three clinical targeted sequencing panels (liquid biopsy, BRCA, and thalassemia) exhibiting distinct batch effect mechanisms, and mouse electrophysiology recordings demonstrating cross-modality generalization. Our method achieves superior clustering accuracy compared to standard correlation-based and dimensionality-reduction approaches while maintaining the conservativeness required for clinical usage.
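
The idea of a likelihood ratio test for mixture structure with parametric bootstrap calibration can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each sample is assumed to be summarised by one scalar log-evidence value, the null model is a single Gaussian, the alternative is a two-component Gaussian mixture fitted by a basic EM loop, and all function names and simulated values are hypothetical.

```python
import numpy as np


def _component_density(x, w, mu, sd):
    # Weighted Gaussian density of every point under every component: (n, 2).
    return w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))


def loglik_one(x):
    # Maximised log-likelihood of a single Gaussian (null: homogeneous batch).
    sd = x.std() + 1e-9
    return float(np.sum(-0.5 * np.log(2 * np.pi * sd ** 2) - 0.5 * ((x - x.mean()) / sd) ** 2))


def loglik_two(x, n_iter=100):
    # Basic EM for a two-component 1D Gaussian mixture (alternative).
    floor = 0.05 * x.std() + 1e-9  # variance floor to avoid degenerate spikes
    w = np.array([0.5, 0.5])
    mu = np.quantile(x, [0.25, 0.75])
    sd = np.full(2, x.std() + 1e-9)
    for _ in range(n_iter):
        dens = _component_density(x, w, mu, sd)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        nk = resp.sum(axis=0) + 1e-9
        w = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.maximum(np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk), floor)
    return float(np.sum(np.log(_component_density(x, w, mu, sd).sum(axis=1) + 1e-300)))


def lrt_stat(x):
    return max(0.0, 2.0 * (loglik_two(x) - loglik_one(x)))


def bootstrap_pvalue(x, n_boot=49, seed=0):
    # Parametric bootstrap: simulate from the fitted null and refit both models.
    rng = np.random.default_rng(seed)
    observed = lrt_stat(x)
    null = [lrt_stat(rng.normal(x.mean(), x.std(), size=len(x))) for _ in range(n_boot)]
    # The +1 terms keep the p-value strictly positive and slightly conservative.
    return (1 + sum(s >= observed for s in null)) / (n_boot + 1)


# Hypothetical log-evidence values from two clearly separated batches.
rng = np.random.default_rng(0)
log_evidence = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])
p_value = bootstrap_pvalue(log_evidence)
print(p_value)
```

Bootstrap calibration is used here because the usual chi-squared approximation for likelihood ratio statistics does not hold when testing the number of mixture components.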

2026-01-14T01:49:21Z Austin Talbot Yue Ke http://arxiv.org/abs/2509.09923v2 Engineering Spatial and Molecular Features from Cellular Niches to Inform Predictions of Inflammatory Bowel Disease 2026-01-13T02:56:55Z

Differentiating between the two main subtypes of Inflammatory Bowel Disease (IBD), Crohn's disease (CD) and ulcerative colitis (UC), is a persistent clinical challenge due to overlapping presentations. This study introduces a novel computational framework that employs spatial transcriptomics (ST) to create an explainable machine learning model for IBD classification. We analyzed ST data from the colonic mucosa of healthy controls (HC), UC, and CD patients. Using Non-negative Matrix Factorization (NMF), we first identified four recurring cellular niches, representing distinct functional microenvironments within the tissue. From these niches, we systematically engineered 44 features capturing three key aspects of tissue pathology: niche composition, neighborhood enrichment, and niche-gene signals. A multilayer perceptron (MLP) classifier trained on these features achieved an accuracy of $0.774 \pm 0.161$ for the more challenging three-class problem (HC, UC, and CD) and $0.916 \pm 0.118$ in the two-class problem of distinguishing IBD from healthy tissue. Crucially, model explainability analysis revealed that disruptions in the spatial organization of niches were the strongest predictors of general inflammation, while the classification between UC and CD relied on specific niche-gene expression signatures. This work provides a robust, proof-of-concept pipeline that transforms descriptive spatial data into an accurate and explainable predictive tool, offering not only a potential new diagnostic paradigm but also deeper insights into the distinct biological mechanisms that drive IBD subtypes.

2025-09-12T02:10:41Z 20 pages, 8 figures, 7 tables. Presented at the 37th Benelux Conference on Artificial Intelligence / 34th Belgian Dutch Conference on Machine Learning, Namur, Belgium, November 19 - 21, 2025 Myles Joshua Toledo Tan Maria Kapetanaki Panayiotis V. Benos http://arxiv.org/abs/2601.07826v1 Histopathology-centered Computational Evolution of Spatial Omics: Integration, Mapping, and Foundation Models 2026-01-12T18:58:28Z

Spatial omics (SO) technologies enable spatially resolved molecular profiling, while hematoxylin and eosin (H&E) imaging remains the gold standard for morphological assessment in clinical pathology. Recent computational advances increasingly place H&E images at the center of SO analysis, bridging morphology with transcriptomic, proteomic, and other spatial molecular modalities, and pushing resolution toward the single-cell level. In this survey, we systematically review the computational evolution of SO from a histopathology-centered perspective and organize existing methods into three paradigms: integration, which jointly models paired multimodal data; mapping, which infers molecular profiles from H&E images; and foundation models, which learn generalizable representations from large-scale spatial datasets. We analyze how the role of H&E images evolves across these paradigms from spatial context to predictive anchor and ultimately to representation backbone in response to practical constraints such as limited paired data and increasing resolution demands. We further summarize actionable modeling directions enabled by current architectures and delineate persistent gaps driven by data, biology, and technology that are unlikely to be resolved by model design alone. Together, this survey provides a histopathology-centered roadmap for developing and applying computational frameworks in SO.

2026-01-12T18:58:28Z 30 pages, 5 figures Ninghui Hao Xinxing Yang Boshen Yan Dong Li Junzhou Huang Xintao Wu Emily S. Ruiz Arlene Ruiz de Luzuriaga Chen Zhao Guihong Wan http://arxiv.org/abs/2601.07546v1 Estimators for Substitution Rates in Genomes from Read Data 2026-01-12T13:50:42Z

We study the problem of estimating the mutation rate between two sequences from noisy sequencing reads. Existing alignment-free methods typically assume direct access to the full sequences. We extend these methods to the sequencing framework, where only noisy reads from the sequences are observed. We use a simple model in which both mutations and sequencing errors are substitutions. We propose multiple estimators, provide theoretical guarantees for one of them, and evaluate the others through simulations.

2026-01-12T13:50:42Z Shiv Pratap Singh Rathore Navin Kashyap http://arxiv.org/abs/2601.06381v1 Hierarchical Pooling and Explainability in Graph Neural Networks for Tumor and Tissue-of-Origin Classification Using RNA-seq Data 2026-01-10T01:33:56Z

This study explores the use of graph neural networks (GNNs) with hierarchical pooling and multiple convolution layers for cancer classification based on RNA-seq data. We combine gene expression data from The Cancer Genome Atlas (TCGA) with a precomputed STRING protein-protein interaction network to classify tissue origin and distinguish between normal and tumor samples. The model employs Chebyshev graph convolutions (K=2) and weighted pooling layers, aggregating gene clusters into 'supernodes' across multiple coarsening levels. This approach enables dimensionality reduction while preserving meaningful interactions. Saliency methods were applied to interpret the model by identifying key genes and biological processes relevant to cancer. Our findings reveal that increasing the number of convolution and pooling layers did not enhance classification performance: the highest F1-macro score (0.978) was achieved with a single pooling layer, and adding more layers resulted in over-smoothing and performance degradation. Nevertheless, the model proved highly interpretable through gradient methods, identifying known cancer-related genes and highlighting enriched biological processes, and its hierarchical structure can be used to develop new explainable architectures. Overall, while deeper GNN architectures did not improve performance, the hierarchical pooling structure provided valuable insights into tumor biology, making GNNs a promising tool for cancer biomarker discovery and interpretation.

2026-01-10T01:33:56Z Thomas Vaitses Fontanari Mariana Recamonde-Mendoza http://arxiv.org/abs/2601.01089v2 Central Dogma Transformer: Towards Mechanism-Oriented AI for Cellular Understanding 2026-01-10T00:37:48Z

Understanding cellular mechanisms requires integrating information across DNA, RNA, and protein - the three molecular systems linked by the Central Dogma of molecular biology. While domain-specific foundation models have achieved success for each modality individually, they remain isolated, limiting our ability to model integrated cellular processes. Here we present the Central Dogma Transformer (CDT), an architecture that integrates pre-trained language models for DNA, RNA, and protein following the directional logic of the Central Dogma. CDT employs directional cross-attention mechanisms - DNA-to-RNA attention models transcriptional regulation, while RNA-to-Protein attention models translational relationships - producing a unified Virtual Cell Embedding that integrates all three modalities. We validate CDT v1 - a proof-of-concept implementation using fixed (non-cell-specific) RNA and protein embeddings - on CRISPRi enhancer perturbation data from K562 cells, achieving a Pearson correlation of 0.503, representing 63% of the theoretical ceiling set by cross-experiment variability (r = 0.797). Attention and gradient analyses provide complementary interpretive windows: in detailed case studies, these approaches highlight largely distinct genomic regions, with gradient analysis identifying a CTCF binding site that Hi-C data showed as physically contacting both enhancer and target gene. These results suggest that AI architectures aligned with biological information flow can achieve both predictive accuracy and mechanistic interpretability.
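
The directional cross-attention idea can be sketched with plain NumPy. This is a simplified illustration, not the CDT architecture: it omits learned projections, multiple heads, and training, and it assumes that the downstream modality supplies the queries while the upstream modality supplies keys and values. The token counts, embedding width, and mean-pooled concatenation into a single vector are hypothetical choices.

```python
import numpy as np


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


# Hypothetical token embeddings standing in for pre-trained model outputs.
rng = np.random.default_rng(0)
dna = rng.normal(size=(6, 8))      # 6 DNA tokens, width 8
rna = rng.normal(size=(4, 8))      # 4 RNA tokens
protein = rng.normal(size=(3, 8))  # 3 protein tokens

# DNA -> RNA step: RNA tokens attend over DNA tokens (assumed direction).
rna_ctx, w_dna_rna = cross_attention(rna, dna, dna)
# RNA -> protein step: protein tokens attend over the DNA-informed RNA tokens.
protein_ctx, w_rna_protein = cross_attention(protein, rna_ctx, rna_ctx)

# Assumption: pool each modality and concatenate into one cell-level vector.
cell_embedding = np.concatenate([dna.mean(axis=0), rna_ctx.mean(axis=0), protein_ctx.mean(axis=0)])
print(cell_embedding.shape)
```

The attention weight matrices (`w_dna_rna`, `w_rna_protein`) are the kind of quantity an attention-based interpretation would inspect, for example to see which DNA positions each RNA token draws on.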

2026-01-03T06:29:22Z v2: Fixed dropout probability in Table 3 (0.1 -> 0.3); added acknowledgement Nobuyuki Ota http://arxiv.org/abs/2601.05648v1 Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training 2026-01-09T09:10:14Z

Recent advancements in single-cell multi-omics, particularly RNA-seq, have provided profound insights into cellular heterogeneity and gene regulation. While single-cell foundation models based on the pre-trained language model (PLM) paradigm have shown promise, they remain constrained by insufficient integration of in-depth individual profiles and by neglect of the influence of noise within multi-modal data. To address both issues, we propose an Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL). It is built on a cross-modal Cell-Language pre-training framework, which comprises two key innovations: (1) a Large Language Model (LLM)-based workflow with retrieval-augmented generation (RAG) that enriches cell textual descriptions using open-world knowledge; (2) a Cross-modal Robust Alignment (CRA) objective that incorporates sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to strengthen the model's resistance to noisy data. After pretraining on 32M cell-text pairs, OKR-CELL obtains cutting-edge results across 6 evaluation tasks. Beyond standard benchmarks such as cell clustering, cell-type annotation, batch-effect correction, and few-shot annotation, the model also demonstrates superior performance in broader multi-modal applications, including zero-shot cell-type annotation and bidirectional cell-text retrieval.

2026-01-09T09:10:14Z 41 pages Haoran Wang Xuanyi Zhang Shuangsang Fang Longke Ran Ziqing Deng Yong Zhang Yuxiang Li Shaoshuai Li http://arxiv.org/abs/2601.05531v1 DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models 2026-01-09T05:08:17Z

Tokenization sits at the boundary between high-throughput genomic input and GPU compute, posing challenges in both algorithm design and system throughput. Overlapping k-mer tokenization can introduce information leakage under masked language modeling (MLM) and may degrade downstream accuracy. Single-nucleotide tokenization avoids leakage and preserves per-base fidelity, but it greatly increases sequence length for attention-based architectures. Non-overlapping k-mers and byte-pair encoding (BPE) provide compression and avoid leakage, at the cost of boundary sensitivity or reduced interpretability. Empirically, the choice of tokenization interacts strongly with model architecture and task requirements. At the system level, however, standard string tokenizers and host-bound vocabulary lookups dominate wall-clock time once inputs reach billions of bases, regardless of the tokenization algorithm. We present DNATok, a high-performance, GPU-first tokenization system that replaces general-purpose string processing with byte lookup table (LUT)-based identifier streaming and an overlapped host-to-device (H2D)/compute pipeline using pinned memory and architectural parallelism. DNATok is vocabulary-agnostic: it accelerates single-nucleotide, non-overlapping k-mer, and BPE tokenization, and integrates as a drop-in systems layer beneath genomic foundation models. DNATok achieves 84-95x higher encoding throughput than optimized Hugging Face baselines and up to 1.9x higher H2D throughput. End-to-end streaming reaches 1.27-1.84e8 tokens/s depending on configuration, effectively removing tokenization as a bottleneck for production-scale training and inference.
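
The byte lookup table idea can be shown with a CPU-only NumPy sketch. It is not DNATok's implementation and leaves out the GPU, pinned-memory, and overlapped-transfer parts entirely. The ID assignment (A=0, C=1, G=2, T=3), the unknown-base ID, and the function names are assumptions made for illustration.

```python
import numpy as np

UNK_ID = 4  # assumption: any byte other than A/C/G/T (either case) maps here

# 256-entry table indexed by raw byte value; built once, reused for every input.
LUT = np.full(256, UNK_ID, dtype=np.uint8)
for token_id, base in enumerate(b"ACGT"):
    LUT[base] = token_id                     # uppercase byte
    LUT[ord(chr(base).lower())] = token_id   # lowercase byte


def encode_bases(seq: bytes) -> np.ndarray:
    """Single-nucleotide tokenization: one vectorised table lookup per byte."""
    return LUT[np.frombuffer(seq, dtype=np.uint8)]


def encode_kmers(seq: bytes, k: int) -> np.ndarray:
    """Non-overlapping k-mer IDs as base-4 numbers; trailing partial k-mer dropped."""
    ids = encode_bases(seq).astype(np.int64)
    usable = (len(ids) // k) * k
    blocks = ids[:usable].reshape(-1, k)
    place_values = 4 ** np.arange(k - 1, -1, -1)
    kmer_ids = blocks @ place_values
    # Assumption: any k-mer containing an unknown base gets the reserved ID 4**k.
    kmer_ids[(blocks == UNK_ID).any(axis=1)] = 4 ** k
    return kmer_ids


print(encode_bases(b"ACGTN"))
print(encode_kmers(b"ACGTAC", 3))
```

Because the lookup works on raw bytes, no Python string objects or dictionary lookups are created per token, which is the kind of overhead the abstract attributes to general-purpose string tokenizers.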

2026-01-09T05:08:17Z Eliatan Niktab Hardip Patel http://arxiv.org/abs/2601.04335v1 Thermodynamic Constraints Drive Hierarchical Preemption in Cellular Decision-Making: A Hybrid Petri Net Framework with Application to Bacillus subtilis Sporulation 2026-01-07T19:15:23Z

Cellular decision-making under stress involves rapid pathway selection despite energy scarcity. Here we demonstrate that thermodynamic constraints actively drive energy-efficient sporulation, where continuous metabolic sources enable system robustness through dynamic energy management. Using hybrid Petri nets (stochastic transitions with continuous sources) to model Bacillus subtilis sporulation, we show that stress conditions (ATP = 300 mM, 94% depletion) enable sporulation completion with extreme energy efficiency: 0.73 mM ATP per mature spore versus 11.6 mM ATP under normal conditions--a 16-fold efficiency gain. Despite ATP dropping to 1 mM (99.7% depletion) during the crisis, continuous ATP regeneration rescues the system, producing 67 mM mature spores (89% of normal yield) with only 49 mM total ATP consumption. This efficiency emerges from the interplay between stochastic regulatory transitions and continuous metabolic sources, where GTP accumulation (+4974 mM, 166% increase) provides an energy buffer while ATP regeneration (+240 mM) prevents complete depletion. The hybrid Petri net formalism--combining stochastic transitions for regulatory events with continuous sources for metabolic flux--extended with thermodynamic constraints through inhibitor arcs and energy-coupled rate functions, provides the mathematical foundation enabling this discovery by integrating discrete regulatory logic with continuous energy dynamics in a resource-aware concurrency model.
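
The efficiency figures quoted above can be cross-checked with a few lines of Python arithmetic. The sketch uses only numbers stated in the abstract; the variable names are illustrative.

```python
# Values quoted in the abstract (all in mM).
atp_consumed_stress = 49.0    # total ATP consumption under stress
spores_stress = 67.0          # mature spores produced under stress
atp_per_spore_normal = 11.6   # ATP per mature spore under normal conditions
atp_start_stress = 300.0      # ATP level at the start of the stress condition
atp_minimum = 1.0             # lowest ATP level reached during the crisis

# ATP cost per mature spore under stress: 49 / 67.
atp_per_spore_stress = atp_consumed_stress / spores_stress

# Efficiency gain relative to normal conditions.
efficiency_gain = atp_per_spore_normal / atp_per_spore_stress

# Depletion from the stress starting level down to the minimum.
depletion_at_crisis = 1.0 - atp_minimum / atp_start_stress

print(round(atp_per_spore_stress, 2))      # 0.73
print(round(efficiency_gain, 1))           # 15.9, i.e. roughly 16-fold
print(round(100 * depletion_at_crisis, 1)) # 99.7
```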

2026-01-07T19:15:23Z 9 pages, 2 figures, 2 tables. Includes supplementary analysis and data availability statement. Model files and simulation code available at https://github.com/simao-eugenio/shypn Eugenio Simao http://arxiv.org/abs/2601.03295v1 MetagenBERT: a Transformer-based Architecture using Foundational genomic Large Language Models for novel Metagenome Representation 2026-01-05T19:36:36Z

Metagenomic disease prediction commonly relies on species abundance tables derived from large, incomplete reference catalogs, constraining resolution and discarding valuable information contained in DNA reads. To overcome these limitations, we introduce MetagenBERT, a Transformer-based framework that produces end-to-end metagenome embeddings directly from raw DNA sequences, without taxonomic or functional annotations. Reads are embedded using foundational genomic language models (DNABERT2 and the microbiome-specialized DNABERTMS), then aggregated through a scalable clustering strategy based on FAISS-accelerated KMeans. Each metagenome is represented as a cluster abundance vector summarizing the distribution of its embedded reads. We evaluate this approach on five benchmark gut microbiome datasets (Cirrhosis, T2D, Obesity, IBD, CRC). MetagenBERT achieves competitive or superior AUC performance relative to species abundance baselines across most tasks. Concatenating both representations further improves prediction, demonstrating complementarity between taxonomic and embedding-derived signals. Clustering remains robust when applied to as little as 10% of reads, highlighting substantial redundancy in metagenomes and enabling major computational gains. We additionally introduce MetagenBERT Glob Mcardis, a cross-cohort variant trained on the large, phenotypically diverse MetaCardis cohort and transferred to other datasets, retaining predictive signal including for unseen phenotypes, indicating the feasibility of a foundation model for metagenome representation. Robustness analyses (PERMANOVA, PERMDISP, entropy) show consistent separation of different states across subsamples. Overall, MetagenBERT provides a scalable, annotation-free representation of metagenomes, pointing toward future phenotype-aware generalization across heterogeneous cohorts and sequencing technologies.
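
The cluster abundance representation can be sketched in Python as follows. This is an illustration under stated assumptions, not the MetagenBERT pipeline: the read embeddings are tiny made-up vectors rather than language-model outputs, the centroids are given directly instead of being learned with FAISS-accelerated KMeans, and cluster assignment uses a brute-force NumPy nearest-centroid search.

```python
import numpy as np


def cluster_abundance(read_embeddings, centroids):
    """Fraction of a metagenome's reads assigned to each cluster centroid."""
    # Squared Euclidean distance of every read to every centroid: (n_reads, n_clusters).
    diffs = read_embeddings[:, None, :] - centroids[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    counts = np.bincount(labels, minlength=len(centroids))
    return counts / counts.sum()


# Hypothetical 2-D embeddings for four reads and two pre-computed centroids.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
reads = np.array([[0.1, 0.0], [0.0, -0.2], [0.3, 0.1], [9.8, 10.1]])

abundance = cluster_abundance(reads, centroids)
print(abundance)
```

The resulting vector has one entry per cluster and sums to 1, so metagenomes with different read counts become comparable fixed-length inputs for a downstream classifier.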

2026-01-05T19:36:36Z Gaspar Roy Eugeni Belda Baptiste Hennecart Yann Chevaleyre Edi Prifti Jean-Daniel Zucker