http://arxiv.org/abs/2602.17532v1Systematic Evaluation of Single-Cell Foundation Model Interpretability Reveals Attention Captures Co-Expression Rather Than Unique Regulatory Signal2026-02-19T16:43:12ZWe present a systematic evaluation framework - thirty-seven analyses, 153 statistical tests, four cell types, two perturbation modalities - for assessing mechanistic interpretability in single-cell foundation models. Applying this framework to scGPT and Geneformer, we find that attention patterns encode structured biological information with layer-specific organisation - protein-protein interactions in early layers, transcriptional regulation in late layers - but this structure provides no incremental value for perturbation prediction: trivial gene-level baselines outperform both attention and correlation edges (AUROC 0.81-0.88 versus 0.70), pairwise edge scores add zero predictive contribution, and causal ablation of regulatory heads produces no degradation. These findings generalise from K562 to RPE1 cells; the attention-correlation relationship is context-dependent, but gene-level dominance is universal. Cell-State Stratified Interpretability (CSSI) addresses an attention-specific scaling failure, improving GRN recovery up to 1.85x. The framework establishes reusable quality-control standards for the field.2026-02-19T16:43:12ZIhor Kendiukhovhttp://arxiv.org/abs/2602.17747v1AgriVariant: Variant Effect Prediction using DeepChem-Variant for Precision Breeding in Rice2026-02-19T14:03:37ZPredicting functional consequences of genetic variants in crop genes remains a critical bottleneck for precision breeding programs. We present AgriVariant, an end-to-end pipeline for variant-effect prediction in rice (Oryza sativa) that addresses the lack of crop-specific variant-interpretation tools and can be extended to any crop species with available reference genomes and gene annotations. 
Our approach integrates deep learning-based variant calling (DeepChem-Variant) with custom plant genomics annotation using RAP-DB gene models and database-independent deleteriousness scoring that combines the Grantham distance and the BLOSUM62 substitution matrix. We validate the pipeline through targeted mutations in stress-response genes (OsDREB2a, OsDREB1F, SKC1), demonstrating correct classification of stop-gained, missense, and synonymous variants with appropriate HIGH / MODERATE / LOW impact assignments. An exhaustive mutagenesis study of OsMT-3a analyzed all 1,509 possible single-nucleotide variants in 10 days, identifying 353 high-impact, 447 medium-impact, and 709 low-impact variants - an analysis that would have required 2-4 years using traditional wet-lab approaches. This computational framework enables breeders to prioritize variants for experimental validation across diverse crop species, reducing screening costs and accelerating development of climate-resilient crop varieties.2026-02-19T14:03:37Z8 pages, 7 figures, 5 tablesAnkita Vaishnobi BisoiBharath Ramsundarhttp://arxiv.org/abs/2601.14969v2Robust Machine Learning for Regulatory Sequence Modeling under Biological and Technical Distribution Shifts2026-02-19T12:51:45ZRobust machine learning for regulatory genomics is studied under biologically and technically induced distribution shifts. Deep convolutional and attention based models achieve strong in distribution performance on DNA regulatory sequence prediction tasks but are usually evaluated under i.i.d. assumptions, even though real applications involve cell type specific programs, evolutionary turnover, assay protocol changes, and sequencing artifacts. We introduce a robustness framework that combines a mechanistic simulation benchmark with real data analysis on a massively parallel reporter assay (MPRA) dataset to quantify performance degradation, calibration failures, and uncertainty based reliability. 
In simulation, motif driven regulatory outputs are generated with cell type specific programs, PWM perturbations, GC bias, depth variation, batch effects, and heteroscedastic noise, and CNN, BiLSTM, and transformer models are evaluated. Models remain accurate and reasonably calibrated under mild GC content shifts but show higher error, severe variance miscalibration, and coverage collapse under motif effect rewiring and noise dominated regimes, revealing robustness gaps invisible to standard i.i.d. evaluation. Adding simple biological structural priors (motif derived features in simulation and global GC content in MPRA) improves in distribution error and yields consistent robustness gains under biologically meaningful genomic shifts, while providing only limited protection against strong assay noise. Uncertainty-aware selective prediction offers an additional safety layer: risk coverage analyses on simulated and MPRA data show that filtering low confidence inputs recovers low risk subsets, including under GC-based out-of-distribution conditions, although reliability gains diminish when noise dominates.2026-01-21T13:15:27Z20 pages, 16 figuresYiyao Yanghttp://arxiv.org/abs/2602.16696v1Parameter-free representations outperform single-cell foundation models on downstream benchmarks2026-02-18T18:42:29ZSingle-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model for gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without utilizing computationally intensive deep learning-based representations. 
Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, including outperforming foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single cell gene expression data.2026-02-18T18:42:29ZHuan SouzaPankaj Mehtahttp://arxiv.org/abs/2503.12286v2Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes2026-02-18T16:38:37ZBackground: Several studies show that large language models (LLMs) struggle with phenotype-driven gene prioritization for rare diseases. These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes. However, in real-world settings, foundation models are not optimized for domain-specific tasks like clinical diagnosis, and inputs are unstructured clinical notes rather than standardized terms. How LLMs can be instructed to predict candidate genes or disease diagnosis from unstructured clinical notes remains a major challenge. Methods: We introduce RAG-driven CoT and CoT-driven RAG, two methods that combine Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to analyze clinical notes. A five-question CoT protocol mimics expert reasoning, while RAG retrieves data from sources like HPO and OMIM (Online Mendelian Inheritance in Man). We evaluated these approaches on rare disease datasets, including 5,980 Phenopacket-derived notes, 255 literature-based narratives, and 220 in-house clinical notes from Children's Hospital of Philadelphia. 
Results: We found that recent foundation models, including Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B, outperformed earlier versions such as Llama 2 and GPT-3.5. We also showed that RAG-driven CoT and CoT-driven RAG both outperform foundation models in candidate gene prioritization from clinical notes; in particular, both methods with DeepSeek backbone resulted in a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes. RAG-driven CoT works better for high-quality notes, where early retrieval can anchor the subsequent reasoning steps in domain-specific evidence, while CoT-driven RAG has an advantage when processing lengthy and noisy notes.2025-03-15T22:57:31ZZhanliang WangDa WuQuan NguyenKai Wanghttp://arxiv.org/abs/2602.16728v1FUNGAR: a pipeline for detecting antifungal resistance mutations directly from metagenomic short reads2026-02-17T17:49:24ZMotivation: Antifungal resistance has become an increasing global concern in both clinical and environmental health. Detecting known resistance mutations directly from sequencing reads, especially from metagenomic samples, remains a major challenge. As fungal pathogens are often neglected compared with bacterial pathogens, most available tools are designed for bacterial taxa, whereas tools targeting fungi typically require assembled genomes. In metagenomic datasets, assembly-based strategies may result in substantial information loss due to genome fragmentation, low-abundance species, or incomplete recovery of resistance loci. Results: Here, we present FUNGAR, an open-source pipeline for the rapid identification of antifungal resistance genes and mutations directly from short-read data. FUNGAR employs translated alignments with DIAMOND and curated data from the FungAMR database to detect amino acid substitutions across all six open reading frames. 
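The translated-alignment step mentioned above depends on six-frame translation: a short read is translated in three offsets on each strand before it is compared with protein references. The following is a minimal sketch of that general idea only, assuming the standard genetic code; it is not FUNGAR's or DIAMOND's implementation, and the names `six_frame_translations`, `translate`, and `reverse_complement` are hypothetical helpers introduced for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative fungal codon tables are handled in this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read: str) -> dict:
    """Translate a read in frames +1..+3 and -1..-3."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translations("ATGGCCTAA").items():
        print(frame, peptide)
```

In such a scheme, each of the six peptides would be aligned against reference proteins, and a mismatch at a catalogued position would be reported as a candidate amino acid substitution.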
The pipeline produces structured, reproducible reports linking detected variants to their associated antifungal drugs and can be easily customized for new species or databases.2026-02-17T17:49:24ZHenrique RM AntoniolliLívia KmetzschCharley C Staatshttp://arxiv.org/abs/2410.17801v3Rawsamble: Overlapping and Assembling Raw Nanopore Signals using a Hash-based Seeding Mechanism2026-02-16T17:03:30ZRaw nanopore signal analysis is a common approach in genomics to provide fast and resource-efficient analysis without translating the signals to bases (i.e., without basecalling). However, existing solutions cannot interpret raw signals directly if a reference genome is unknown due to a lack of accurate mechanisms to handle increased noise in pairwise raw signal comparison. Our goal is to enable the direct analysis of raw signals without a reference genome. To this end, we propose Rawsamble, the first mechanism that can identify regions of similarity between all raw signal pairs, known as all-vs-all overlapping, using a hash-based search mechanism.
We use these overlaps to construct de novo assembly graphs with an existing assembler, miniasm, off-the-shelf. To our knowledge, these are the first de novo assemblies ever constructed directly from raw signals without basecalling. Our extensive evaluations across multiple genomes of varying sizes show that Rawsamble provides a significant speedup (on average by 5.01x and up to 23.10x) and reduces peak memory usage (on average by 5.74x and up to 22.00x) compared to a conventional genome assembly pipeline using the state-of-the-art tools for basecalling (Dorado's fastest mode) and overlapping (minimap2) on a CPU. We find that around one-third of Rawsamble's overlapping pairs are also found by minimap2. We find that when we use overlapping reads from Rawsamble, we can construct unitigs that are 1) as accurate as those built from minimap2's overlaps and 2) up to half a chromosome in length (e.g., 2.3 million bases for E. coli). Source code: https://github.com/CMU-SAFARI/RawHash2024-10-23T11:59:44ZAccepted to appear in the Bioinformatics journalCan FirtinaMaximilian MordigHarun MustafaSayan GoswamiNika Mansouri GhiasiStefano MercoglianoFurkan ErisJoël LindeggerAndre KahlesOnur Mutluhttp://arxiv.org/abs/2602.13346v1CellMaster: Collaborative Cell Type Annotation in Single-Cell Analysis2026-02-12T20:20:22ZSingle-cell RNA-seq (scRNA-seq) enables atlas-scale profiling of complex tissues, revealing rare lineages and transient states. Yet, assigning biologically valid cell identities remains a bottleneck because markers are tissue- and state-dependent, and novel states lack references. We present CellMaster, an AI agent that mimics expert practice for zero-shot cell-type annotation. Unlike existing automated tools, CellMaster leverages LLM-encoded knowledge (e.g., GPT-4o) to perform on-the-fly annotation with interpretable rationales, without pre-training or fixed marker databases. 
Across 9 datasets spanning 8 tissues, CellMaster improved accuracy by 7.1% over best-performing baselines (including CellTypist and scTab) in automatic mode. With human-in-the-loop refinement, this advantage increased to 18.6%, with a 22.1% gain on subtype populations. The system demonstrates particular strength in rare and novel cell states where baselines often fail. Source code and the web application are available at https://github.com/AnonymousGym/CellMaster.2026-02-12T20:20:22ZPreprintZhen WangYiming GaoJieyuan LiuEnze MaJefferson ChenMark AntkowiakMengzhou HuJungHo KongDexter PrattZhiting HuWei WangTrey IdekerEric P. Xinghttp://arxiv.org/abs/2602.11609v1scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery2026-02-12T06:04:11ZWe present scPilot, the first systematic framework to practice omics-native reasoning: a large language model (LLM) converses in natural language while directly inspecting single-cell RNA-seq data and on-demand bioinformatics tools. scPilot converts core single-cell analyses, i.e., cell-type annotation, developmental-trajectory reconstruction, and transcription-factor targeting, into step-by-step reasoning problems that the model must solve, justify, and, when needed, revise with new evidence.
To measure progress, we release scBench, a suite of 9 expertly curated datasets and graders that faithfully evaluate the omics-native reasoning capability of scPilot with respect to various LLMs. Experiments with o1 show that iterative omics-native reasoning lifts average accuracy by 11% for cell-type annotation and Gemini-2.5-Pro cuts trajectory graph-edit distance by 30% versus one-shot prompting, while generating transparent reasoning traces that explain marker gene ambiguity and regulatory logic. By grounding LLMs in raw omics data, scPilot enables auditable, interpretable, and diagnostically informative single-cell analyses.
Code, data, and package are available at https://github.com/maitrix-org/scPilot2026-02-12T06:04:11ZAccepted at NeurIPS 2025 Main ConferenceYiming GaoZhen WangJefferson ChenMark AntkowiakMengzhou HuJungHo KongDexter PrattJieyuan LiuEnze MaZhiting HuEric P. Xinghttp://arxiv.org/abs/2508.07465v2MOTGNN: Interpretable Graph Neural Networks for Multi-Omics Disease Classification2026-02-11T18:50:44ZIntegrating multi-omics data, such as DNA methylation, mRNA expression, and microRNA (miRNA) expression, offers a comprehensive view of the biological mechanisms underlying disease. However, the high dimensionality of multi-omics data, the heterogeneity across modalities, and the lack of reliable biological interaction networks make meaningful integration challenging. In addition, many existing models rely on handcrafted similarity graphs, are vulnerable to class imbalance, and often lack built-in interpretability, limiting their usefulness in biomedical applications. We propose Multi-Omics integration with Tree-generated Graph Neural Network (MOTGNN), a novel and interpretable framework for binary disease classification. MOTGNN employs eXtreme Gradient Boosting (XGBoost) for omics-specific supervised graph construction, followed by modality-specific Graph Neural Networks (GNNs) for hierarchical representation learning, and a deep feedforward network for cross-omics integration. Across three real-world disease datasets, MOTGNN outperforms state-of-the-art baselines by 5-10% in accuracy, ROC-AUC, and F1-score, and remains robust to severe class imbalance. The model maintains computational efficiency through the use of sparse graphs and provides built-in interpretability, revealing both top-ranked biomarkers and the relative contributions of each omics modality. 
These results highlight the potential of MOTGNN to improve both predictive accuracy and interpretability in multi-omics disease modeling.2025-08-10T19:35:53Z11 pages, 6 figures, 7 tablesTiantian YangZhiqian Chenhttp://arxiv.org/abs/2512.21320v2An Allele-Centric Pan-Graph-Matrix Representation for Scalable Pangenome Analysis2026-02-11T18:45:27ZPopulation-scale pangenome analysis increasingly requires representations that unify single-nucleotide and structural variation while remaining scalable across large cohorts. Existing formats are typically sequence-centric, path-centric, or sample-centric, and often obscure population structure or fail to exploit carrier sparsity. We introduce the H1 pan-graph-matrix, an allele-centric representation that encodes exact haplotype membership using adaptive per-allele compression. By treating alleles as first-class objects and selecting optimal encodings based on carrier distribution, H1 achieves near-optimal storage across both common and rare variants. We further introduce H2, a path-centric dual representation derived from the same underlying allele-haplotype incidence information that restores explicit haplotype ordering while remaining exactly equivalent in information content. Using real human genome data, we show that this representation yields substantial compression gains, particularly for structural variants, while remaining equivalent in information content to pangenome graphs. H1 provides a unified, population-aware foundation for scalable pangenome analysis and downstream applications such as rare-variant interpretation and drug discovery.2025-12-24T18:44:07Z11 Pages, 2 Figures, 1 TableRoberto Garronehttp://arxiv.org/abs/2602.09649v2Population-scale Ancestral Recombination Graphs with tskit 1.02026-02-11T10:00:02ZAncestral recombination graphs (ARGs) are an increasingly important component of population and statistical genetics. 
The tskit library has become key infrastructure for the field, providing an expressive and general representation of ARGs together with a suite of efficient fundamental operations. In this note, we announce tskit version 1.0, describe its underlying rationale, and document its stability guarantees. These guarantees provide a foundation for durable computational artefacts and support long-term reproducibility of code and analyses.2026-02-10T10:55:49ZBen JefferyYan WongKevin ThorntonGeorgia TsambosGertjan BisschopYun DengE. Castedo EllermanThomas B. ForestHalley FritzeDaniel GoldsteinGregor GorjancGraham GowerSimon GravelJeremy GuezBenjamin C. HallerAndrew D. KernLloyd KirkIvan KrukovHanbin LeeBrieuc LehmannHossameldin LoayMatthew M. OsmondDuncan S. PalmerNathaniel S. PopeAaron P. RagsdaleDuncan RobertsonMurillo F. RodriguesHugo van KemenadeClemens L. WeißAnthony Wilder WohnsShing H. ZhanBrian C. ZhangMarianne AspburyNikolas A. BayaSaurabh BelsareArjun BiddandaFrancisco Campuzano JiménezAriella GladsteinBing GuoSavita KarthikeyanWarren W. KretzschmarInés RebolloKumar SaunackRuhollah ShemiraniAlexis SimonChris SmithJeet SukumaranJonathan TerhorstPer UnnebergAo ZhangPeter RalphJerome Kelleherhttp://arxiv.org/abs/2510.01935v2scRNA-seq of preeclamptic trophoblasts identifies EBI3, COL17A1, miR-27a-5p, and miR-193b-5p as hypoxia markers: validation of neuradapt as a superior mimetic to cobalt chloride2026-02-10T09:16:47ZBackground. Preeclampsia (PE) complicates 2-8% of pregnancies and involves placental hypoxia and HIF-pathway activation, especially in early-onset PE (eoPE). Chemical mimetics like cobalt (II) chloride (CoCl2) and oxyquinoline derivatives model trophoblast hypoxia in vitro, yet their fidelity in recapitulating PE gene profiles remains unclear. Integrating patient tissue analyses with experimental models may reveal common markers and validate physiologically relevant paradigms.
Methods. We analyzed scRNA-seq data from 10 eoPE, 7 late-onset PE, and matched control placentas, identifying villous cytotrophoblast, syncytiotrophoblast, and extravillous trophoblast (EVT). BeWo b30 cells were treated for 24 h with CoCl2 (300 μM) or the oxyquinoline derivative neuradapt (5 μM) to induce hypoxia. RNA-seq with qPCR validation and small RNA-seq quantified mRNA and microRNA changes; PROGENy inferred pathway activities.
Results. scRNA-seq revealed highest hypoxia activation in eoPE, with EVT showing maximum activity. Nine genes were upregulated across all trophoblast types (EBI3, CST6, FN1, RFK, COL17A1, LDHA, PKP2, RPS4Y1, RPS26). In vitro, neuradapt induced more specific hypoxia responses than CoCl2 (1284 vs. 3032 differentially expressed genes). Critically, EBI3, FN1, and COL17A1 showed concordant upregulation in tissue and neuradapt-treated cells, whereas CoCl2 produced opposite patterns. MicroRNAs hsa-miR-27a-5p and hsa-miR-193b-5p were consistently elevated in both models; 3'-isoforms of hsa-miR-9-5p and hsa-miR-92b-3p were identified as hypoxia-associated.
Conclusions. EBI3, COL17A1, miR-27a-5p, and miR-193b-5p emerge as trophoblast hypoxia markers. Neuradapt (a selective HIF-prolyl hydroxylase inhibitor) provides a more physiologically relevant in vitro model than CoCl2, recapitulating transcriptomic signatures observed in PE placentas.2025-10-02T11:57:21Z31 pages, 5 figures, 1 tablePlacenta 176 (2026) 1-12Evgeny KnyazevFaculty of Biology and Biotechnology, HSE University, Moscow, RussiaLaboratory of Microfluidic Technologies for Biomedicine, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow, RussiaTimur KulaginFaculty of Biology and Biotechnology, HSE University, Moscow, RussiaIvan AntipenkoFaculty of Biology and Biotechnology, HSE University, Moscow, RussiaAlexander TonevitskyFaculty of Biology and Biotechnology, HSE University, Moscow, RussiaLaboratory of Microfluidic Technologies for Biomedicine, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow, Russia10.1016/j.placenta.2026.02.005http://arxiv.org/abs/2602.10156v1STRAND: Sequence-Conditioned Transport for Single-Cell Perturbations2026-02-10T00:57:38ZPredicting how genetic perturbations change cellular state is a core problem for building controllable models of gene regulation. Perturbations targeting the same gene can produce different transcriptional responses depending on their genomic locus, including different transcription start sites and regulatory elements. Gene-level perturbation models collapse these distinct interventions into the same representation.
We introduce STRAND, a generative model that predicts single-cell transcriptional responses by conditioning on regulatory DNA sequence. STRAND represents a perturbation by encoding the sequence at its genomic locus and uses this representation to parameterize a conditional transport process from control to perturbed cell states. Representing perturbations by sequence, rather than by a fixed set of gene identifiers, supports zero-shot inference at loci not seen during training and expands inference-time genomic coverage from ~1.5% for gene-level single-cell foundation models to ~95% of the genome.
We evaluate STRAND on CRISPR perturbation datasets in K562, Jurkat, and RPE1 cells. STRAND improves discrimination scores by up to 33% in low-sample regimes, achieves the best average rank on unseen gene perturbation benchmarks, and improves transfer to novel cell lines by up to 0.14 in Pearson correlation. Ablations isolate the gains to sequence conditioning and transport, and case studies show that STRAND resolves functionally alternative transcription start sites missed by gene-level models.2026-02-10T00:57:38Z8 pages for main draft, 6 main figuresBoyang FuGeorge DasoulasSameer GabbitaXiang LinShanghua GaoXiaorui SuSoumya GhoshMarinka Zitnikhttp://arxiv.org/abs/2602.09067v1AntigenLM: Structure-Aware DNA Language Modeling for Influenza2026-02-09T08:52:04ZLanguage models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.2026-02-09T08:52:04ZAccepted by ICLR 2026Yue PeiXuebin ChiYu Kang