http://arxiv.org/abs/2602.17532v1 Systematic Evaluation of Single-Cell Foundation Model Interpretability Reveals Attention Captures Co-Expression Rather Than Unique Regulatory Signal 2026-02-19T16:43:12Z

We present a systematic evaluation framework - thirty-seven analyses, 153 statistical tests, four cell types, two perturbation modalities - for assessing mechanistic interpretability in single-cell foundation models. Applying this framework to scGPT and Geneformer, we find that attention patterns encode structured biological information with layer-specific organisation - protein-protein interactions in early layers, transcriptional regulation in late layers - but this structure provides no incremental value for perturbation prediction: trivial gene-level baselines outperform both attention and correlation edges (AUROC 0.81-0.88 versus 0.70), pairwise edge scores add zero predictive contribution, and causal ablation of regulatory heads produces no degradation. These findings generalise from K562 to RPE1 cells; the attention-correlation relationship is context-dependent, but gene-level dominance is universal. Cell-State Stratified Interpretability (CSSI) addresses an attention-specific scaling failure, improving GRN recovery up to 1.85x. The framework establishes reusable quality-control standards for the field.
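The AUROC values quoted in this abstract measure how well a score ranks positive items above negative ones. As a minimal sketch of that metric (not the paper's evaluation code; the function name `auroc` and the pairwise formulation are choices made here for illustration), it can be computed directly from its definition:

```python
def auroc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    scores: one number per item, higher meaning "more likely positive".
    labels: 1 for positive items, 0 for negative items.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

Scoring the same labelled set once with a gene-level baseline and once with an edge-level score, then comparing the two AUROCs, is the kind of comparison the abstract reports. An AUROC of 0.5 corresponds to random ranking.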

2026-02-19T16:43:12Z Ihor Kendiukhov http://arxiv.org/abs/2602.17747v1 AgriVariant: Variant Effect Prediction using DeepChem-Variant for Precision Breeding in Rice 2026-02-19T14:03:37Z

Predicting functional consequences of genetic variants in crop genes remains a critical bottleneck for precision breeding programs. We present AgriVariant, an end-to-end pipeline for variant-effect prediction in rice (Oryza sativa) that addresses the lack of crop-specific variant-interpretation tools and can be extended to any crop species with available reference genomes and gene annotations. Our approach integrates deep learning-based variant calling (DeepChem-Variant) with custom plant genomics annotation using RAP-DB gene models and database-independent deleteriousness scoring that combines the Grantham distance and the BLOSUM62 substitution matrix. We validate the pipeline through targeted mutations in stress-response genes (OsDREB2a, OsDREB1F, SKC1), demonstrating correct classification of stop-gained, missense, and synonymous variants with appropriate HIGH / MODERATE / LOW impact assignments. An exhaustive mutagenesis study of OsMT-3a analyzed all 1,509 possible single-nucleotide variants in 10 days, identifying 353 high-impact, 447 medium-impact, and 709 low-impact variants - an analysis that would have required 2-4 years using traditional wet-lab approaches. This computational framework enables breeders to prioritize variants for experimental validation across diverse crop species, reducing screening costs and accelerating development of climate-resilient crop varieties.
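Exhaustive single-nucleotide mutagenesis replaces each position with its three alternative bases, so a region of n nucleotides yields 3n variants (1,509 variants corresponds to 503 positions). The sketch below illustrates that enumeration and the stop-gained / missense / synonymous classification for a purely coding sequence. It is not the AgriVariant or DeepChem-Variant code: the function name `classify_snvs`, the added `stop_lost` category, and its HIGH impact label are assumptions made here, and the Grantham/BLOSUM62 scoring step is not reproduced.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Impact labels follow the HIGH / MODERATE / LOW scheme named in the abstract;
# the stop_lost entry is an assumption added for completeness.
IMPACT = {
    "stop_gained": "HIGH",
    "stop_lost": "HIGH",
    "missense": "MODERATE",
    "synonymous": "LOW",
}


def classify_snvs(cds):
    """Enumerate every single-nucleotide variant of an in-frame coding sequence.

    Returns a list of (1-based position, ref base, alt base, effect, impact).
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    results = []
    for pos, ref in enumerate(cds):
        offset = pos % 3                      # position within the codon
        ref_codon = cds[pos - offset:pos - offset + 3]
        ref_aa = CODON_TABLE[ref_codon]
        for alt in "ACGT":
            if alt == ref:
                continue
            alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
            alt_aa = CODON_TABLE[alt_codon]
            if alt_aa == ref_aa:
                effect = "synonymous"
            elif alt_aa == "*":
                effect = "stop_gained"
            elif ref_aa == "*":
                effect = "stop_lost"
            else:
                effect = "missense"
            results.append((pos + 1, ref, alt, effect, IMPACT[effect]))
    return results
```

For example, `classify_snvs("ATGTGG")` returns 18 variants, two of which (TGG to TAG and TGG to TGA) are stop-gained.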

2026-02-19T14:03:37Z 8 pages, 7 figures, 5 tables Ankita Vaishnobi Bisoi Bharath Ramsundar http://arxiv.org/abs/2601.14969v2 Robust Machine Learning for Regulatory Sequence Modeling under Biological and Technical Distribution Shifts 2026-02-19T12:51:45Z

Robust machine learning for regulatory genomics is studied under biologically and technically induced distribution shifts. Deep convolutional and attention-based models achieve strong in-distribution performance on DNA regulatory sequence prediction tasks but are usually evaluated under i.i.d. assumptions, even though real applications involve cell-type-specific programs, evolutionary turnover, assay protocol changes, and sequencing artifacts. We introduce a robustness framework that combines a mechanistic simulation benchmark with real-data analysis on a massively parallel reporter assay (MPRA) dataset to quantify performance degradation, calibration failures, and uncertainty-based reliability. In simulation, motif-driven regulatory outputs are generated with cell-type-specific programs, PWM perturbations, GC bias, depth variation, batch effects, and heteroscedastic noise, and CNN, BiLSTM, and transformer models are evaluated. Models remain accurate and reasonably calibrated under mild GC-content shifts but show higher error, severe variance miscalibration, and coverage collapse under motif-effect rewiring and noise-dominated regimes, revealing robustness gaps invisible to standard i.i.d. evaluation. Adding simple biological structural priors (motif-derived features in simulation and global GC content in MPRA) improves in-distribution error and yields consistent robustness gains under biologically meaningful genomic shifts, while providing only limited protection against strong assay noise. Uncertainty-aware selective prediction offers an additional safety layer: risk-coverage analyses on simulated and MPRA data show that filtering low-confidence inputs recovers low-risk subsets, including under GC-based out-of-distribution conditions, although reliability gains diminish when noise dominates.
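A risk-coverage analysis of the kind mentioned above can be sketched as follows. This is a generic illustration rather than the authors' implementation: the function name `risk_coverage_curve` is hypothetical, and "risk" is taken here to be the mean per-example error over the retained inputs.

```python
def risk_coverage_curve(errors, uncertainties):
    """Risk (mean error) as a function of coverage (fraction of inputs kept).

    Inputs are retained in order of increasing predicted uncertainty, so the
    first point keeps only the most confident example and the last keeps all.
    Returns a list of (coverage, risk) pairs.
    """
    if len(errors) != len(uncertainties) or not errors:
        raise ValueError("errors and uncertainties must be non-empty and aligned")
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    curve = []
    running_error = 0.0
    for kept, i in enumerate(order, start=1):
        running_error += errors[i]
        curve.append((kept / len(errors), running_error / kept))
    return curve
```

If the uncertainty estimates are informative, risk rises with coverage, and a low-risk subset can be recovered by stopping early. A flat curve indicates that the uncertainty carries little information about the error, which is the situation the abstract describes when noise dominates.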

2026-01-21T13:15:27Z 20 pages, 16 figures Yiyao Yang http://arxiv.org/abs/2602.16696v1 Parameter-free representations outperform single-cell foundation models on downstream benchmarks 2026-02-18T18:42:29Z

Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model for gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without utilizing computationally intensive deep learning-based representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near-SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, including outperforming foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single-cell gene expression data.

2026-02-18T18:42:29Z Huan Souza Pankaj Mehta http://arxiv.org/abs/2503.12286v2 Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes 2026-02-18T16:38:37Z

Background: Several studies show that large language models (LLMs) struggle with phenotype-driven gene prioritization for rare diseases. These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes. However, in real-world settings, foundation models are not optimized for domain-specific tasks like clinical diagnosis, and inputs are unstructured clinical notes rather than standardized terms. How LLMs can be instructed to predict candidate genes or disease diagnosis from unstructured clinical notes remains a major challenge. Methods: We introduce RAG-driven CoT and CoT-driven RAG, two methods that combine Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to analyze clinical notes. A five-question CoT protocol mimics expert reasoning, while RAG retrieves data from sources like HPO and OMIM (Online Mendelian Inheritance in Man). We evaluated these approaches on rare disease datasets, including 5,980 Phenopacket-derived notes, 255 literature-based narratives, and 220 in-house clinical notes from Children's Hospital of Philadelphia. Results: We found that recent foundation models, including Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B, outperformed earlier versions such as Llama 2 and GPT-3.5. We also showed that RAG-driven CoT and CoT-driven RAG both outperform foundation models in candidate gene prioritization from clinical notes; in particular, both methods with a DeepSeek backbone resulted in a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes. RAG-driven CoT works better for high-quality notes, where early retrieval can anchor the subsequent reasoning steps in domain-specific evidence, while CoT-driven RAG has an advantage when processing lengthy and noisy notes.
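The top-10 gene accuracy reported above is the fraction of cases in which the true causal gene appears among the first ten ranked candidates. A minimal sketch of that metric, assuming one true gene per note and exact string matching of gene symbols (both simplifications made here, and the function name `top_k_accuracy` is hypothetical):

```python
def top_k_accuracy(ranked_predictions, true_genes, k=10):
    """Fraction of cases whose true gene is among the top-k ranked candidates.

    ranked_predictions: one list of gene symbols per case, best candidate first.
    true_genes: the reference gene symbol for each case, in the same order.
    """
    if len(ranked_predictions) != len(true_genes) or not true_genes:
        raise ValueError("inputs must be non-empty and aligned")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_genes)
        if truth in ranked[:k]
    )
    return hits / len(true_genes)
```

In practice gene symbols would likely need normalization (aliases, case) before comparison. That step is omitted from this sketch.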

2025-03-15T22:57:31Z Zhanliang Wang Da Wu Quan Nguyen Kai Wang http://arxiv.org/abs/2602.16728v1 FUNGAR: a pipeline for detecting antifungal resistance mutations directly from metagenomic short reads 2026-02-17T17:49:24Z

Motivation: Antifungal resistance has become an increasing global concern in both clinical and environmental health. Detecting known resistance mutations directly from sequencing reads, especially in metagenomic samples, remains a major challenge. As fungal pathogens are often neglected compared with bacterial pathogens, most available tools are designed for bacterial taxa, whereas tools targeting fungi typically require assembled genomes. In metagenomic datasets, assembly-based strategies may result in substantial information loss due to genome fragmentation, low-abundance species, or incomplete recovery of resistance loci. Results: Here, we present FUNGAR, an open-source pipeline for the rapid identification of antifungal resistance genes and mutations directly from short-read data. FUNGAR employs translated alignments with DIAMOND and curated data from the FungAMR database to detect amino acid substitutions across all six reading frames. The pipeline produces structured, reproducible reports linking detected variants to their associated antifungal drugs and can be easily customized for new species or databases.
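Translated alignment compares a nucleotide read against protein references by translating the read in all six reading frames (three offsets on each strand). The sketch below shows only that translation step, using the standard genetic code. It is not FUNGAR's code and does not call DIAMOND; the function names and the frame labels (`+1` to `-3`) are illustrative choices.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Return {frame label: protein string} for the six reading frames."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each translated frame can then be compared against reference protein sequences, which is how an amino acid substitution is located regardless of the read's strand or phase.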

2026-02-17T17:49:24Z Henrique RM Antoniolli Lívia Kmetzsch Charley C Staats http://arxiv.org/abs/2410.17801v3 Rawsamble: Overlapping and Assembling Raw Nanopore Signals using a Hash-based Seeding Mechanism 2026-02-16T17:03:30Z

Raw nanopore signal analysis is a common approach in genomics to provide fast and resource-efficient analysis without translating the signals to bases (i.e., without basecalling). However, existing solutions cannot interpret raw signals directly if a reference genome is unknown due to a lack of accurate mechanisms to handle increased noise in pairwise raw signal comparison. Our goal is to enable the direct analysis of raw signals without a reference genome. To this end, we propose Rawsamble, the first mechanism that can identify regions of similarity between all raw signal pairs, known as all-vs-all overlapping, using a hash-based search mechanism. We use these overlaps to construct de novo assembly graphs with an existing assembler, miniasm, off-the-shelf. To our knowledge, these are the first de novo assemblies ever constructed directly from raw signals without basecalling. Our extensive evaluations across multiple genomes of varying sizes show that Rawsamble provides a significant speedup (on average by 5.01x and up to 23.10x) and reduces peak memory usage (on average by 5.74x and up to 22.00x) compared to a conventional genome assembly pipeline using the state-of-the-art tools for basecalling (Dorado's fastest mode) and overlapping (minimap2) on a CPU. We find that around one-third of Rawsamble's overlapping pairs are also found by minimap2. We find that when we use overlapping reads from Rawsamble, we can construct unitigs that are 1) as accurate as those built from minimap2's overlaps and 2) up to half a chromosome in length (e.g., 2.3 million bases for E. coli). Source code: https://github.com/CMU-SAFARI/RawHash
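The general idea of hash-based seeding for all-vs-all overlapping can be sketched as below. This is a toy illustration under strong assumptions, not Rawsamble's algorithm: signals are assumed to be already normalized, quantization uses fixed equal-width buckets, and every name and parameter (`quantize`, `seeds`, `overlap_candidates`, `n_levels`, `k`, `min_shared`) is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def quantize(signal, n_levels=4, lo=-2.0, hi=2.0):
    """Map each (assumed normalized) sample to one of n_levels integer buckets."""
    width = (hi - lo) / n_levels
    levels = []
    for x in signal:
        clipped = min(max(x, lo), hi - 1e-9)   # keep the top edge in range
        levels.append(int((clipped - lo) / width))
    return levels


def seeds(levels, k=4):
    """Every window of k consecutive buckets, used as a hashable seed."""
    return [tuple(levels[i:i + k]) for i in range(len(levels) - k + 1)]


def overlap_candidates(signals, k=4, min_shared=3):
    """Pairs of read ids sharing at least min_shared distinct seeds.

    signals: {read id: list of normalized samples}.
    The dict below is the hash table: seed -> set of reads containing it.
    """
    index = defaultdict(set)
    for read_id, signal in signals.items():
        for seed in seeds(quantize(signal), k):
            index[seed].add(read_id)
    shared = defaultdict(int)
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}
```

Quantization is what makes noisy signals hashable: two samples that differ slightly fall into the same bucket and so produce the same seed. A real pipeline would also need to check that shared seeds are collinear before reporting an overlap. That step is omitted here.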

2024-10-23T11:59:44Z Accepted to appear in the Bioinformatics journal Can Firtina Maximilian Mordig Harun Mustafa Sayan Goswami Nika Mansouri Ghiasi Stefano Mercogliano Furkan Eris Joël Lindegger Andre Kahles Onur Mutlu http://arxiv.org/abs/2602.13346v1 CellMaster: Collaborative Cell Type Annotation in Single-Cell Analysis 2026-02-12T20:20:22Z

Single-cell RNA-seq (scRNA-seq) enables atlas-scale profiling of complex tissues, revealing rare lineages and transient states. Yet, assigning biologically valid cell identities remains a bottleneck because markers are tissue- and state-dependent, and novel states lack references. We present CellMaster, an AI agent that mimics expert practice for zero-shot cell-type annotation. Unlike existing automated tools, CellMaster leverages LLM-encoded knowledge (e.g., GPT-4o) to perform on-the-fly annotation with interpretable rationales, without pre-training or fixed marker databases. Across 9 datasets spanning 8 tissues, CellMaster improved accuracy by 7.1% over best-performing baselines (including CellTypist and scTab) in automatic mode. With human-in-the-loop refinement, this advantage increased to 18.6%, with a 22.1% gain on subtype populations. The system demonstrates particular strength in rare and novel cell states where baselines often fail. Source code and the web application are available at https://github.com/AnonymousGym/CellMaster.

2026-02-12T20:20:22Z Preprint Zhen Wang Yiming Gao Jieyuan Liu Enze Ma Jefferson Chen Mark Antkowiak Mengzhou Hu JungHo Kong Dexter Pratt Zhiting Hu Wei Wang Trey Ideker Eric P. Xing http://arxiv.org/abs/2602.11609v1 scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery 2026-02-12T06:04:11Z

We present scPilot, the first systematic framework to practice omics-native reasoning: a large language model (LLM) converses in natural language while directly inspecting single-cell RNA-seq data and on-demand bioinformatics tools. scPilot converts core single-cell analyses, i.e., cell-type annotation, developmental-trajectory reconstruction, and transcription-factor targeting, into step-by-step reasoning problems that the model must solve, justify, and, when needed, revise with new evidence. To measure progress, we release scBench, a suite of 9 expertly curated datasets and graders that faithfully evaluate the omics-native reasoning capability of scPilot with respect to various LLMs. Experiments with o1 show that iterative omics-native reasoning lifts average accuracy by 11% for cell-type annotation and Gemini-2.5-Pro cuts trajectory graph-edit distance by 30% versus one-shot prompting, while generating transparent reasoning traces that explain marker gene ambiguity and regulatory logic. By grounding LLMs in raw omics data, scPilot enables auditable, interpretable, and diagnostically informative single-cell analyses. Code, data, and package are available at https://github.com/maitrix-org/scPilot

2026-02-12T06:04:11Z Accepted at NeurIPS 2025 Main Conference Yiming Gao Zhen Wang Jefferson Chen Mark Antkowiak Mengzhou Hu JungHo Kong Dexter Pratt Jieyuan Liu Enze Ma Zhiting Hu Eric P. Xing http://arxiv.org/abs/2508.07465v2 MOTGNN: Interpretable Graph Neural Networks for Multi-Omics Disease Classification 2026-02-11T18:50:44Z

Integrating multi-omics data, such as DNA methylation, mRNA expression, and microRNA (miRNA) expression, offers a comprehensive view of the biological mechanisms underlying disease. However, the high dimensionality of multi-omics data, the heterogeneity across modalities, and the lack of reliable biological interaction networks make meaningful integration challenging. In addition, many existing models rely on handcrafted similarity graphs, are vulnerable to class imbalance, and often lack built-in interpretability, limiting their usefulness in biomedical applications. We propose Multi-Omics integration with Tree-generated Graph Neural Network (MOTGNN), a novel and interpretable framework for binary disease classification. MOTGNN employs eXtreme Gradient Boosting (XGBoost) for omics-specific supervised graph construction, followed by modality-specific Graph Neural Networks (GNNs) for hierarchical representation learning, and a deep feedforward network for cross-omics integration. Across three real-world disease datasets, MOTGNN outperforms state-of-the-art baselines by 5-10% in accuracy, ROC-AUC, and F1-score, and remains robust to severe class imbalance. The model maintains computational efficiency through the use of sparse graphs and provides built-in interpretability, revealing both top-ranked biomarkers and the relative contributions of each omics modality. These results highlight the potential of MOTGNN to improve both predictive accuracy and interpretability in multi-omics disease modeling.

2025-08-10T19:35:53Z 11 pages, 6 figures, 7 tables Tiantian Yang Zhiqian Chen http://arxiv.org/abs/2512.21320v2 An Allele-Centric Pan-Graph-Matrix Representation for Scalable Pangenome Analysis 2026-02-11T18:45:27Z

Population-scale pangenome analysis increasingly requires representations that unify single-nucleotide and structural variation while remaining scalable across large cohorts. Existing formats are typically sequence-centric, path-centric, or sample-centric, and often obscure population structure or fail to exploit carrier sparsity. We introduce the H1 pan-graph-matrix, an allele-centric representation that encodes exact haplotype membership using adaptive per-allele compression. By treating alleles as first-class objects and selecting optimal encodings based on carrier distribution, H1 achieves near-optimal storage across both common and rare variants. We further introduce H2, a path-centric dual representation derived from the same underlying allele-haplotype incidence information that restores explicit haplotype ordering while remaining exactly equivalent in information content. Using real human genome data, we show that this representation yields substantial compression gains, particularly for structural variants, while remaining equivalent in information content to pangenome graphs. H1 provides a unified, population-aware foundation for scalable pangenome analysis and downstream applications such as rare-variant interpretation and drug discovery.
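Adaptive per-allele compression can be illustrated by choosing, for each allele, the cheaper of two exact encodings of its carrier set. The abstract does not name the encodings H1 uses, so the two options here (a sparse list of carrier indices and a dense bitmap) and the bit-cost model are assumptions for illustration, as are the function names.

```python
def encode_allele(carriers, n_haplotypes):
    """Encode one allele's carrier set with whichever option costs fewer bits.

    carriers: haplotype indices (0-based) that carry the allele.
    Returns ("sparse", sorted index list) or ("dense", integer bitmap).
    """
    carriers = sorted(set(carriers))
    index_bits = max(1, (n_haplotypes - 1).bit_length())  # bits per stored index
    sparse_cost = len(carriers) * index_bits
    dense_cost = n_haplotypes                              # one bit per haplotype
    if sparse_cost < dense_cost:
        return ("sparse", carriers)
    bitmap = 0
    for c in carriers:
        bitmap |= 1 << c
    return ("dense", bitmap)


def decode_allele(encoded, n_haplotypes):
    """Recover the sorted carrier list from either encoding."""
    kind, payload = encoded
    if kind == "sparse":
        return list(payload)
    return [i for i in range(n_haplotypes) if (payload >> i) & 1]
```

Under this cost model, a rare variant with two carriers among 1,000 haplotypes is stored as a two-element index list, while a common variant carried by half of them falls back to the bitmap. Both encodings decode to exactly the same haplotype membership.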

2025-12-24T18:44:07Z 11 Pages, 2 Figures, 1 Table Roberto Garrone http://arxiv.org/abs/2602.09649v2 Population-scale Ancestral Recombination Graphs with tskit 1.0 2026-02-11T10:00:02Z

Ancestral recombination graphs (ARGs) are an increasingly important component of population and statistical genetics. The tskit library has become key infrastructure for the field, providing an expressive and general representation of ARGs together with a suite of efficient fundamental operations. In this note, we announce tskit version 1.0, describe its underlying rationale, and document its stability guarantees. These guarantees provide a foundation for durable computational artefacts and support long-term reproducibility of code and analyses.

2026-02-10T10:55:49Z Ben Jeffery Yan Wong Kevin Thornton Georgia Tsambos Gertjan Bisschop Yun Deng E. Castedo Ellerman Thomas B. Forest Halley Fritze Daniel Goldstein Gregor Gorjanc Graham Gower Simon Gravel Jeremy Guez Benjamin C. Haller Andrew D. Kern Lloyd Kirk Ivan Krukov Hanbin Lee Brieuc Lehmann Hossameldin Loay Matthew M. Osmond Duncan S. Palmer Nathaniel S. Pope Aaron P. Ragsdale Duncan Robertson Murillo F. Rodrigues Hugo van Kemenade Clemens L. Weiß Anthony Wilder Wohns Shing H. Zhan Brian C. Zhang Marianne Aspbury Nikolas A. Baya Saurabh Belsare Arjun Biddanda Francisco Campuzano Jiménez Ariella Gladstein Bing Guo Savita Karthikeyan Warren W. Kretzschmar Inés Rebollo Kumar Saunack Ruhollah Shemirani Alexis Simon Chris Smith Jeet Sukumaran Jonathan Terhorst Per Unneberg Ao Zhang Peter Ralph Jerome Kelleher http://arxiv.org/abs/2510.01935v2 scRNA-seq of preeclamptic trophoblasts identifies EBI3, COL17A1, miR-27a-5p, and miR-193b-5p as hypoxia markers: validation of neuradapt as a superior mimetic to cobalt chloride 2026-02-10T09:16:47Z

Background. Preeclampsia (PE) complicates 2-8% of pregnancies and involves placental hypoxia and HIF-pathway activation, especially in early-onset PE (eoPE). Chemical mimetics like cobalt (II) chloride (CoCl2) and oxyquinoline derivatives model trophoblast hypoxia in vitro, yet their fidelity in recapitulating PE gene profiles remains unclear. Integrating patient tissue analyses with experimental models may reveal common markers and validate physiologically relevant paradigms. Methods. We analyzed scRNA-seq data from 10 eoPE, 7 late-onset PE, and matched control placentas, identifying villous cytotrophoblast, syncytiotrophoblast, and extravillous trophoblast (EVT). BeWo b30 cells were treated for 24 h with CoCl2 (300 μM) or the oxyquinoline derivative neuradapt (5 μM) to induce hypoxia. RNA-seq with qPCR validation and small RNA-seq quantified mRNA and microRNA changes; PROGENy inferred pathway activities. Results. scRNA-seq revealed highest hypoxia activation in eoPE, with EVT showing maximum activity. Nine genes were upregulated across all trophoblast types (EBI3, CST6, FN1, RFK, COL17A1, LDHA, PKP2, RPS4Y1, RPS26). In vitro, neuradapt induced more specific hypoxia responses than CoCl2 (1284 vs. 3032 differentially expressed genes). Critically, EBI3, FN1, and COL17A1 showed concordant upregulation in tissue and neuradapt-treated cells, whereas CoCl2 produced opposite patterns. MicroRNAs hsa-miR-27a-5p and hsa-miR-193b-5p were consistently elevated in both models; 3'-isoforms of hsa-miR-9-5p and hsa-miR-92b-3p were identified as hypoxia-associated. Conclusions. EBI3, COL17A1, miR-27a-5p, and miR-193b-5p emerge as trophoblast hypoxia markers. Neuradapt (a selective HIF-prolyl hydroxylase inhibitor) provides a more physiologically relevant in vitro model than CoCl2, recapitulating transcriptomic signatures observed in PE placentas.

2025-10-02T11:57:21Z 31 pages, 5 figures, 1 table Placenta 176 (2026) 1-12 Evgeny Knyazev Faculty of Biology and Biotechnology, HSE University, Moscow, Russia Laboratory of Microfluidic Technologies for Biomedicine, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow, Russia Timur Kulagin Faculty of Biology and Biotechnology, HSE University, Moscow, Russia Ivan Antipenko Faculty of Biology and Biotechnology, HSE University, Moscow, Russia Alexander Tonevitsky Faculty of Biology and Biotechnology, HSE University, Moscow, Russia Laboratory of Microfluidic Technologies for Biomedicine, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow, Russia 10.1016/j.placenta.2026.02.005 http://arxiv.org/abs/2602.10156v1 STRAND: Sequence-Conditioned Transport for Single-Cell Perturbations 2026-02-10T00:57:38Z

Predicting how genetic perturbations change cellular state is a core problem for building controllable models of gene regulation. Perturbations targeting the same gene can produce different transcriptional responses depending on their genomic locus, including different transcription start sites and regulatory elements. Gene-level perturbation models collapse these distinct interventions into the same representation. We introduce STRAND, a generative model that predicts single-cell transcriptional responses by conditioning on regulatory DNA sequence. STRAND represents a perturbation by encoding the sequence at its genomic locus and uses this representation to parameterize a conditional transport process from control to perturbed cell states. Representing perturbations by sequence, rather than by a fixed set of gene identifiers, supports zero-shot inference at loci not seen during training and expands inference-time genomic coverage from ~1.5% for gene-level single-cell foundation models to ~95% of the genome. We evaluate STRAND on CRISPR perturbation datasets in K562, Jurkat, and RPE1 cells. STRAND improves discrimination scores by up to 33% in low-sample regimes, achieves the best average rank on unseen gene perturbation benchmarks, and improves transfer to novel cell lines by up to 0.14 in Pearson correlation. Ablations isolate the gains to sequence conditioning and transport, and case studies show that STRAND resolves functionally alternative transcription start sites missed by gene-level models.

2026-02-10T00:57:38Z 8 pages for main draft, 6 main figures Boyang Fu George Dasoulas Sameer Gabbita Xiang Lin Shanghua Gao Xiaorui Su Soumya Ghosh Marinka Zitnik http://arxiv.org/abs/2602.09067v1 AntigenLM: Structure-Aware DNA Language Modeling for Influenza 2026-02-09T08:52:04Z

Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.

2026-02-09T08:52:04Z Accepted by ICLR 2026 Yue Pei Xuebin Chi Yu Kang