https://arxiv.org/api/wsEUQRWsecbvQJOW8aKoU2pR48U 2026-03-18T10:16:19Z 3746 15 15 http://arxiv.org/abs/2603.10261v1 Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals 2026-03-10T22:44:12Z We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT, to our knowledge the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We show that scGPT internally encodes a compact hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel. To isolate this geometry, we introduce a general three-stage extraction method consisting of direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout, producing a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP, the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5x faster with approximately 1000x fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head without statistically significant loss, and further to a rank-64 surrogate. 
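As a hedged aside on the "BH-significantly better" comparison above: "BH" presumably refers to the Benjamini-Hochberg false-discovery-rate procedure, though that reading is an assumption. The sketch below shows the standard step-up rule on made-up p-values; the function name, the alpha level, and the numbers are hypothetical and are not the paper's results.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR alpha."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for eight endpoints (illustrative only).
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
example_rejected = benjamini_hochberg(example_pvalues)
```

The step-up rule rejects everything up to the largest passing rank, so a p-value can be rejected even if it fails its own threshold, provided a larger one passes.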
Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis. 2026-03-10T22:44:12Z Ihor Kendiukhov http://arxiv.org/abs/2603.08913v1 Quantifying Memorization and Privacy Risks in Genomic Language Models 2026-03-09T20:30:37Z Genomic language models (GLMs) have emerged as powerful tools for learning representations of DNA sequences, enabling advances in variant prediction, regulatory element identification, and cross-task transfer learning. However, as these models are increasingly trained or fine-tuned on sensitive genomic cohorts, they risk memorizing specific sequences from their training data, raising serious concerns around privacy, data leakage, and regulatory compliance. Despite growing awareness of memorization risks in general-purpose language models, little systematic evaluation exists for these risks in the genomic domain, where data exhibit unique properties such as a fixed nucleotide alphabet, strong biological structure, and individual identifiability. We present a comprehensive, multi-vector privacy evaluation framework designed to quantify memorization risks in GLMs. Our approach integrates three complementary risk assessment methodologies: perplexity-based detection, canary sequence extraction, and membership inference. These are combined into a unified evaluation pipeline that produces a worst-case memorization risk score. To enable controlled evaluation, we plant canary sequences at varying repetition rates into both synthetic and real genomic datasets, allowing precise quantification of how repetition and training dynamics influence memorization. 
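To illustrate the perplexity-based detection mentioned above: perplexity is the exponential of the mean negative log-likelihood per token, so a sequence the model has memorized tends to score lower than a comparable unseen one. The sketch below assumes per-token log-probabilities have already been obtained from some model; the function name and the example probabilities are hypothetical and not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities for two 8-token sequences.
uniform_guess = [math.log(0.25)] * 8   # model guesses uniformly over A/C/G/T
confident_hit = [math.log(0.9)] * 8    # model is confident, as for a memorized canary

ppl_uniform = perplexity(uniform_guess)
ppl_confident = perplexity(confident_hit)
```

With a four-letter nucleotide alphabet, uniform guessing gives a perplexity of about 4, which is a natural reference point when comparing canary and control sequences.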
We evaluate our framework across multiple GLM architectures, examining the relationship between sequence repetition, model capacity, and memorization risk. Our results establish that GLMs exhibit measurable memorization and that the degree of memorization varies across architectures and training regimes. These findings reveal that no single attack vector captures the full scope of memorization risk, underscoring the need for multi-vector privacy auditing as a standard practice for genomic AI systems. 2026-03-09T20:30:37Z 13 pages Alexander Nemecek Wenbiao Li Xiaoqian Jiang Jaideep Vaidya Erman Ayday http://arxiv.org/abs/2502.03569v3 Controllable Sequence Editing for Biological and Clinical Trajectories 2026-03-09T14:46:46Z Conditional generation models for longitudinal sequences can produce new or modified trajectories given a conditioning input. However, they often lack control over when the condition should take effect (timing) and which variables it should influence (scope). Most methods either operate only on univariate sequences or assume that the condition alters all variables and time steps. In scientific and clinical settings, interventions instead begin at a specific moment, such as the time of drug administration or surgery, and influence only a subset of measurements while the rest of the trajectory remains unchanged. CLEF learns temporal concepts that encode how and when a condition alters future sequence evolution. These concepts allow CLEF to apply targeted edits to the affected time steps and variables while preserving the rest of the sequence. We evaluate CLEF on 8 datasets spanning cellular reprogramming, patient health, and sales, comparing against 9 state-of-the-art baselines. CLEF improves immediate sequence editing accuracy by 16.28% (MAE) on average against their non-CLEF counterparts. 
Unlike prior models, CLEF enables one-step conditional generation at arbitrary future times, outperforming their non-CLEF counterparts in delayed sequence editing by 26.73% (MAE) on average. We test CLEF under counterfactual inference assumptions and show up to 62.84% (MAE) improvement on zero-shot conditional generation of counterfactual trajectories. In a case study of patients with type 1 diabetes mellitus, CLEF identifies clinical interventions that generate realistic counterfactual trajectories shifted toward healthier outcomes. 2025-02-05T19:33:12Z ICLR 2026 Michelle M. Li Kevin Li Yasha Ektefaie Ying Jin Yepeng Huang Shvat Messica Tianxi Cai Marinka Zitnik http://arxiv.org/abs/2512.05731v2 DeeDeeExperiment: Building an infrastructure for integrating and managing omics data analysis results in R/Bioconductor 2026-03-09T11:07:33Z Summary: Modern omics experiments now involve multiple conditions and complex designs, producing an increasingly large set of differential expression and functional enrichment analysis results. However, no standardized data structure exists to store and contextualize these results together with their metadata, leaving researchers with an unmanageable and potentially non-reproducible collection of results that are difficult to navigate and/or share. Here we introduce DeeDeeExperiment, a new S4 class for managing and storing omics data analysis results, implemented within the Bioconductor ecosystem, which promotes interoperability, reproducibility and good documentation. This class extends the widely used SingleCellExperiment object by introducing dedicated slots for Differential Expression (DEA) and Functional Enrichment Analysis (FEA) results, allowing users to organize, store, and retrieve information on multiple contrasts and associated metadata within a single data object, ultimately streamlining the management and interpretation of many omics datasets. 
Availability and implementation: DeeDeeExperiment is available on Bioconductor under the MIT license (https://bioconductor.org/packages/DeeDeeExperiment), with its development version also available on Github (https://github.com/imbeimainz/DeeDeeExperiment). 2025-12-05T14:11:28Z 1 figure Najla Abassi Lea Schwarz Edoardo Filippi Federico Marini http://arxiv.org/abs/2603.08062v1 Adversarial Domain Adaptation Enables Knowledge Transfer Across Heterogeneous RNA-Seq Datasets 2026-03-09T07:55:32Z Accurate phenotype prediction from RNA sequencing (RNA-seq) data is essential for diagnosis, biomarker discovery, and personalized medicine. Deep learning models have demonstrated strong potential to outperform classical machine learning approaches, but their performance relies on large, well-annotated datasets. In transcriptomics, such datasets are frequently limited, leading to over-fitting and poor generalization. Knowledge transfer from larger, more general datasets can alleviate this issue. However, transferring information across RNA-seq datasets remains challenging due to heterogeneous preprocessing pipelines and differences in target phenotypes. In this study, we propose a deep learning-based domain adaptation framework that enables effective knowledge transfer from a large general dataset to a smaller one for cancer type classification. The method learns a domain-invariant latent space by jointly optimizing classification and domain alignment objectives. To ensure stable training and robustness in data-scarce scenarios, the framework is trained with an adversarial approach with appropriate regularization. Both supervised and unsupervised approach variants are explored, leveraging labeled or unlabeled target samples. The framework is evaluated on three large-scale transcriptomic datasets (TCGA, ARCHS4, GTEx) to assess its ability to transfer knowledge across cohorts. 
Experimental results demonstrate consistent improvements in cancer and tissue type classification accuracy compared to non-adaptive baselines, particularly in low-data scenarios. Overall, this work highlights domain adaptation as a powerful strategy for data-efficient knowledge transfer in transcriptomics, enabling robust phenotype prediction under constrained data conditions. 2026-03-09T07:55:32Z 7 pages, 5 figures. Submitted to ECCB 2026 Kevin Dradjat Massinissa Hamidi Blaise Hanczar http://arxiv.org/abs/2511.08996v3 Partial domain adaptation enables cross domain cell type annotation between scRNA-seq and snRNA-seq 2026-03-08T17:33:31Z Accurate cell type annotation across datasets is a key challenge in single-cell analysis. snRNA-seq enables profiling of frozen or difficult-to-dissociate tissues, complementing scRNA-seq by capturing fragile or rare cell types. However, cross-annotation between these two data types remains largely unexplored, as existing methods treat them independently. We introduce ScNucAdapt, a method designed for cross-annotation between paired and unpaired scRNA-seq and snRNA-seq datasets. To address distributional and cell composition differences, ScNucAdapt employs partial domain adaptation. Experiments across both unpaired and paired scRNA-seq and snRNA-seq datasets show that ScNucAdapt achieves robust and accurate cell type annotation, outperforming existing approaches. ScNucAdapt therefore provides a practical framework for cross-domain cell type annotation between scRNA-seq and snRNA-seq data. 2025-11-12T05:37:40Z Xiran Chen Quan Zou Qinyu Cai Xiaofeng Chen Weikai Li Yansu Wang http://arxiv.org/abs/2603.06950v1 How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences 2026-03-06T23:52:26Z DNA foundation models have become transformative tools in bioinformatics and healthcare applications.
Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model's output for reconstructing the DNA sequence is a zero-shot embedding, which is then fed to a decoder. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be most vulnerable, especially for shorter sequences with reconstruction similarities > 90%, while DNABERT-2's BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy-aware design in genomic foundation models prior to their widespread deployment in EaaS settings. Training code, model weights and evaluation pipeline are released on: https://github.com/not-a-feature/DNA-Embedding-Inversion. 
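A toy sketch of why the pooling choice matters for invertibility. This is not the paper's attack: it uses one-hot vectors as a hypothetical stand-in for model embeddings, and all names are illustrative. Per-token vectors reveal the sequence exactly, while the mean-pooled vector keeps only nucleotide composition, so distinct sequences can collide.

```python
ALPHABET = "ACGT"


def per_token(seq):
    """One-hot vector per nucleotide (toy stand-in for per-token embeddings)."""
    return [[1.0 if base == a else 0.0 for a in ALPHABET] for base in seq]


def mean_pool(vectors):
    """Average the per-token vectors into a single fixed-size vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(ALPHABET))]


def invert_per_token(vectors):
    """Recover the sequence by taking the argmax of each token vector."""
    return "".join(ALPHABET[v.index(max(v))] for v in vectors)


seq_a, seq_b = "ACGT", "TGCA"
recovered = invert_per_token(per_token(seq_a))
pooled_a = mean_pool(per_token(seq_a))
pooled_b = mean_pool(per_token(seq_b))
```

Real foundation-model embeddings are contextual and carry more than composition, which is consistent with the abstract's report that mean-pooled reconstruction degrades with length yet stays above random baselines.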
2026-03-06T23:52:26Z Sofiane Ouaari Jules Kreuer Nico Pfeifer http://arxiv.org/abs/2603.06804v1 Identifying genes associated with phenotypes using machine and deep learning 2026-03-06T19:13:19Z Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first is classifying people into case/control based on the genotype data. The second is calculating feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets. 2026-03-06T19:13:19Z Muhammad Muneeb David B. 
Ascher YooChan Myung http://arxiv.org/abs/2603.06768v1 Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools 2026-03-06T16:36:42Z Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and in precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype is passed to PLINK for quality control, after which it is transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality control measures for the test data and the genome-wide association study summary statistics file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data was passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk score tools with 675 different clumping and pruning parameters. Machine learning outperformed polygenic risk score tools for 44 phenotypes, while polygenic risk score tools excelled for 36 phenotypes. The results give us valuable insights into which techniques tend to perform better for certain phenotypes compared to more traditional polygenic risk score tools. 2026-03-06T16:36:42Z Muhammad Muneeb David B. Ascher YooChan Myung Samuel F.
Feng Andreas Henschel http://arxiv.org/abs/2602.10152v2 Validating Interpretability in siRNA Efficacy Prediction: A Perturbation-Based, Dataset-Aware Protocol 2026-03-06T15:55:21Z Saliency maps are increasingly used as design guidance in siRNA efficacy prediction, yet attribution methods are rarely validated before motivating sequence edits. We introduce a pre-synthesis gate: a protocol for counterfactual sensitivity faithfulness that tests whether mutating high-saliency positions changes model output more than composition-matched controls. Cross-dataset transfer reveals two failure modes that would otherwise go undetected: faithful-but-wrong (saliency valid, predictions fail) and inverted saliency (top-saliency edits less impactful than random). Strikingly, models trained on mRNA-level assays collapse on a luciferase reporter dataset, demonstrating that protocol shifts can silently invalidate deployment. Across four benchmarks, 19/20 fold instances pass; the single failure shows inverted saliency. A biology-informed regularizer (BioPrior) strengthens saliency faithfulness with modest, dataset-dependent predictive trade-offs. Our results establish saliency validation as essential pre-deployment practice for explanation-guided therapeutic design. Code is available at https://github.com/shadi97kh/BioPrior. 2026-02-09T19:27:42Z Accepted at the Machine Learning for Genomics Explorations (MLGenX) Workshop at ICLR 2026 ICLR 2026 Workshop on Machine Learning for Genomics Explorations (MLGenX) Zahra Khodagholi Niloofar Yousefi http://arxiv.org/abs/2507.19229v2 TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling 2026-03-06T10:49:55Z The modeling of genomic sequences presents unique challenges due to their length and structural complexity. Traditional sequence models struggle to capture long-range dependencies and biological features inherent in DNA. 
In this work, we propose TrinityDNA, a novel DNA foundational model designed to address these challenges. The model integrates biologically informed components, including Groove Fusion for capturing DNA's structural features and Gated Reverse Complement (GRC) to handle the inherent symmetry of DNA sequences. Additionally, we introduce a multi-scale attention mechanism that allows the model to attend to varying levels of sequence dependencies, and an evolutionary training strategy that progressively adapts the model to both prokaryotic and eukaryotic genomes. TrinityDNA provides a more accurate and efficient approach to genomic sequence modeling, offering significant improvements in gene function prediction, regulatory mechanism discovery, and other genomics applications. Our model bridges the gap between machine learning techniques and biological insights, paving the way for more effective analysis of genomic data. Additionally, we introduced a new DNA long-sequence CDS annotation benchmark to make evaluations more comprehensive and oriented toward practical applications. 2025-07-25T12:55:30Z AAAI 2026 Qirong Yang Yucheng Guo Zicheng Liu Yujie Yang Qijin Yin Siyuan Li Shaomin Ji Linlin Chao Xiaoming Zhang Stan Z. Li http://arxiv.org/abs/2511.02263v4 LA-MARRVEL: A Knowledge-Grounded, Language-Aware LLM Framework for Clinically Robust Rare Disease Gene Prioritization 2026-03-05T22:52:04Z Rare disease diagnosis requires matching variant-bearing genes to complex patient phenotypes across large and heterogeneous evidence sources. This process remains time-intensive in current clinical interpretation pipelines. To overcome these limitations, we present LA-MARRVEL, a knowledge-grounded, language-aware LLM framework designed for clinical robustness and practical deployment. LA-MARRVEL delivers a 12-15 percentage-point absolute improvement in Recall@1 over established gene prioritization approaches, showing that architectural design can drive substantial accuracy gains.
We found that the central contributor is structured, phenotype-rich prompt construction that explicitly encodes patient and disease phenotypes, preserving clinically meaningful context more effectively than disease labels alone. Across three real-world cohorts, LA-MARRVEL consistently improves gene-ranking performance, including in challenging cases where the causal gene was initially ranked lower by first-stage prioritization. For each candidate gene, the system delivers clinically relevant, ACMG-aligned reasoning that integrates phenotype concordance, inheritance patterns, and variant-level evidence into auditable explanations, enabling streamlined clinical review. These findings suggest that a knowledge-grounded LLM layer can enhance existing rare-disease gene prioritization workflows without altering established diagnostic pipelines. 2025-11-04T05:17:41Z Jaeyeon Lee Lin Yao Hyun-Hwan Jeong Zhandong Liu http://arxiv.org/abs/2602.22289v2 What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses 2026-03-05T20:03:46Z When biological foundation models such as scGPT and Geneformer process single-cell gene expression, what geometric and topological structure forms in their internal representations? Is that structure biologically meaningful or a training artifact, and how confident should we be in such claims? We address these questions through autonomous large-scale hypothesis screening: an AI-driven executor-brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations, covering persistent homology, manifold distances, cross-model alignment, community structure, and directed topology, all with explicit null controls and disjoint gene-pool splits. Three principal findings emerge. First, the models learn genuine geometric structure.
Gene embedding neighborhoods exhibit non-trivial topology, with persistent homology significant in 11 of 12 transformer layers at p < 0.05 in the weakest domain and 12 of 12 in the other two. A multi-level distance hierarchy shows that manifold-aware metrics outperform Euclidean distance for identifying regulatory gene pairs, and graph community partitions track known transcription factor target relationships. Second, this structure is shared across independently trained models. CCA alignment between scGPT and Geneformer yields canonical correlation of 0.80 and gene retrieval accuracy of 72 percent, yet none of 19 tested methods reliably recover gene-level correspondences. The models agree on the global shape of gene space but not on precise gene placement. Third, the structure is more localized than it first appears. Under stringent null controls applied across all null families, robust signal concentrates in immune tissue, while lung and external lung signals weaken substantially. 2026-02-25T14:33:24Z Ihor Kendiukhov http://arxiv.org/abs/2603.05572v1 Machine Learning for analysis of Multiple Sclerosis cross-tissue bulk and single-cell transcriptomics data 2026-03-05T16:09:12Z Multiple Sclerosis (MS) is a chronic autoimmune disease of the central nervous system whose molecular mechanisms remain incompletely understood. In this study, we developed an end-to-end machine learning pipeline to analyze transcriptomic data from peripheral blood mononuclear cells and cerebrospinal fluid, integrating both bulk microarray and single-cell RNA sequencing datasets (concentrating on CD4+ and B-cells). After rigorous preprocessing, batch correction, and gene declustering, XGBoost classifiers were trained to distinguish MS patients from healthy controls. Explainable AI tools, namely SHapley Additive exPlanations (SHAP), were employed to identify key genes driving classification, and results were compared with Differential Expression Analysis (DEA). 
SHAP-prioritized genes were further investigated through interaction networks and pathway enrichment analyses. The models achieved strong performance, particularly in CSF B-cells (AUC=0.94) and microarray (AUC=0.86). SHAP gene selection proved to be complementary to classical DEA. Gene clusters identified across multiple datasets highlighted immune activation, non-canonical immune checkpoints (ITK, CLEC2D, KLRG1, CEACAM1), ribosomal and translational programs, ubiquitin-proteasome regulation, lipid trafficking, and Epstein-Barr virus-related pathways. Our integrative and explainable framework reveals complementary insights beyond conventional analysis and provides novel mechanistic hypotheses and potential biomarkers for MS pathogenesis. 2026-03-05T16:09:12Z Francesco Massafra Samuele Punzo Silvia Giulia Galfré Alessandro Maglione Simone Pernice Stefano Forti Simona Rolla Marco Beccuti Marinella Clerico Corrado Priami Alina Sîrbu http://arxiv.org/abs/2601.16151v2 In vitro binding energies capture Klf4 occupancy across the human genome 2026-03-04T09:32:15Z Transcription factors (TFs) regulate gene expression by binding to specific genomic loci determined by DNA sequence. Their sequence specificity is commonly summarized by a consensus binding motif. However, eukaryotic genomes contain billions of low-affinity DNA sequences to which TFs associate with a sequence-dependent binding energy. We currently lack insight into how the genomic sequence defines this spectrum of binding energies and the resulting pattern of TF localization. Here, we set out to obtain a quantitative understanding of sequence-dependent TF binding to both motif and non-motif sequences. We achieve this by first pursuing accurate measurements of physical binding energies of the human TF Klf4 to a library of short DNA sequences in a fluorescence-anisotropy-based bulk competitive binding assay. 
Second, we show that the highly non-linear sequence dependence of Klf4 binding energies can be captured by combining a linear model of binding energies with an Ising model of the coupled recognition of nucleotides by a TF. We find that this statistical mechanics model, parametrized by our in vitro measurements, captures Klf4 binding patterns on individual long DNA molecules stretched in an optical tweezer and is predictive of Klf4 occupancy across the entire human genome without additional fit parameters. 2026-01-22T17:49:53Z A.S., J.N., and Y.S. contributed equally to this work. Update 2025/03: correction of a few typos Anne Schwager Jonas Neipel Yahor Savich Douglas Diehl Frank Jülicher Anthony A. Hyman Stephan W. Grill
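For the Klf4 record above, a minimal sketch of how a sequence-dependent binding energy could translate into occupancy, assuming a textbook two-state (bound/unbound) equilibrium picture. The additive energy table, the chemical potential, the 3-bp site width, and all numbers are hypothetical placeholders; the Ising coupling term described in the abstract is deliberately omitted, so this shows only the linear part of such a model.

```python
import math

# Hypothetical per-position energy contributions (in units of kT) for a 3-bp site.
ENERGY = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 1.5},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 1.5},
]


def binding_energy(site):
    """Additive (linear) model: sum independent per-position contributions."""
    return sum(ENERGY[i][base] for i, base in enumerate(site))


def occupancy(energy, mu=1.0):
    """Two-state equilibrium occupancy; energy and mu are in units of kT."""
    return 1.0 / (1.0 + math.exp(energy - mu))


def occupancy_profile(sequence, mu=1.0):
    """Occupancy of every site-width window along a longer sequence."""
    width = len(ENERGY)
    return [
        occupancy(binding_energy(sequence[i:i + width]), mu)
        for i in range(len(sequence) - width + 1)
    ]


profile = occupancy_profile("TTACGTT")
```

In this toy table the window "ACG" has the lowest energy, so it dominates the profile, while the weaker windows still receive a nonzero occupancy, which is the qualitative point about low-affinity sequences.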