http://arxiv.org/abs/2603.10261v1
Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals
2026-03-10T22:44:12Z
We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT, to our knowledge the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We show that scGPT internally encodes a compact hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel. To isolate this geometry, we introduce a general three-stage extraction method consisting of direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout, producing a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP, the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5x faster with approximately 1000x fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head without statistically significant loss, and further to a rank-64 surrogate.
Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis.
2026-03-10T22:44:12Z
Ihor Kendiukhov

http://arxiv.org/abs/2603.08913v1
Quantifying Memorization and Privacy Risks in Genomic Language Models
2026-03-09T20:30:37Z
Genomic language models (GLMs) have emerged as powerful tools for learning representations of DNA sequences, enabling advances in variant prediction, regulatory element identification, and cross-task transfer learning. However, as these models are increasingly trained or fine-tuned on sensitive genomic cohorts, they risk memorizing specific sequences from their training data, raising serious concerns around privacy, data leakage, and regulatory compliance. Despite growing awareness of memorization risks in general-purpose language models, little systematic evaluation exists for these risks in the genomic domain, where data exhibit unique properties such as a fixed nucleotide alphabet, strong biological structure, and individual identifiability. We present a comprehensive, multi-vector privacy evaluation framework designed to quantify memorization risks in GLMs. Our approach integrates three complementary risk assessment methodologies: perplexity-based detection, canary sequence extraction, and membership inference. These are combined into a unified evaluation pipeline that produces a worst-case memorization risk score. To enable controlled evaluation, we plant canary sequences at varying repetition rates into both synthetic and real genomic datasets, allowing precise quantification of how repetition and training dynamics influence memorization.
We evaluate our framework across multiple GLM architectures, examining the relationship between sequence repetition, model capacity, and memorization risk. Our results establish that GLMs exhibit measurable memorization and that the degree of memorization varies across architectures and training regimes. These findings reveal that no single attack vector captures the full scope of memorization risk, underscoring the need for multi-vector privacy auditing as a standard practice for genomic AI systems.
2026-03-09T20:30:37Z
13 pages
Alexander Nemecek, Wenbiao Li, Xiaoqian Jiang, Jaideep Vaidya, Erman Ayday

http://arxiv.org/abs/2502.03569v3
Controllable Sequence Editing for Biological and Clinical Trajectories
2026-03-09T14:46:46Z
Conditional generation models for longitudinal sequences can produce new or modified trajectories given a conditioning input. However, they often lack control over when the condition should take effect (timing) and which variables it should influence (scope). Most methods either operate only on univariate sequences or assume that the condition alters all variables and time steps. In scientific and clinical settings, interventions instead begin at a specific moment, such as the time of drug administration or surgery, and influence only a subset of measurements while the rest of the trajectory remains unchanged. CLEF learns temporal concepts that encode how and when a condition alters future sequence evolution. These concepts allow CLEF to apply targeted edits to the affected time steps and variables while preserving the rest of the sequence. We evaluate CLEF on 8 datasets spanning cellular reprogramming, patient health, and sales, comparing against 9 state-of-the-art baselines. CLEF improves immediate sequence editing accuracy by 16.28% (MAE) on average against their non-CLEF counterparts.
Unlike prior models, CLEF enables one-step conditional generation at arbitrary future times, outperforming their non-CLEF counterparts in delayed sequence editing by 26.73% (MAE) on average. We test CLEF under counterfactual inference assumptions and show up to 62.84% (MAE) improvement on zero-shot conditional generation of counterfactual trajectories. In a case study of patients with type 1 diabetes mellitus, CLEF identifies clinical interventions that generate realistic counterfactual trajectories shifted toward healthier outcomes.
2025-02-05T19:33:12Z
ICLR 2026
Michelle M. Li, Kevin Li, Yasha Ektefaie, Ying Jin, Yepeng Huang, Shvat Messica, Tianxi Cai, Marinka Zitnik

http://arxiv.org/abs/2512.05731v2
DeeDeeExperiment: Building an infrastructure for integrating and managing omics data analysis results in R/Bioconductor
2026-03-09T11:07:33Z
Summary: Modern omics experiments now involve multiple conditions and complex designs, producing an increasingly large set of differential expression and functional enrichment analysis results. However, no standardized data structure exists to store and contextualize these results together with their metadata, leaving researchers with an unmanageable and potentially non-reproducible collection of results that are difficult to navigate and/or share. Here we introduce DeeDeeExperiment, a new S4 class for managing and storing omics data analysis results, implemented within the Bioconductor ecosystem, which promotes interoperability, reproducibility and good documentation. This class extends the widely used SingleCellExperiment object by introducing dedicated slots for Differential Expression (DEA) and Functional Enrichment Analysis (FEA) results, allowing users to organize, store, and retrieve information on multiple contrasts and associated metadata within a single data object, ultimately streamlining the management and interpretation of many omics datasets.
Availability and implementation: DeeDeeExperiment is available on Bioconductor under the MIT license (https://bioconductor.org/packages/DeeDeeExperiment), with its development version also available on GitHub (https://github.com/imbeimainz/DeeDeeExperiment).
2025-12-05T14:11:28Z
1 figure
Najla Abassi, Lea Schwarz, Edoardo Filippi, Federico Marini

http://arxiv.org/abs/2603.08062v1
Adversarial Domain Adaptation Enables Knowledge Transfer Across Heterogeneous RNA-Seq Datasets
2026-03-09T07:55:32Z
Accurate phenotype prediction from RNA sequencing (RNA-seq) data is essential for diagnosis, biomarker discovery, and personalized medicine. Deep learning models have demonstrated strong potential to outperform classical machine learning approaches, but their performance relies on large, well-annotated datasets. In transcriptomics, such datasets are frequently limited, leading to over-fitting and poor generalization. Knowledge transfer from larger, more general datasets can alleviate this issue. However, transferring information across RNA-seq datasets remains challenging due to heterogeneous preprocessing pipelines and differences in target phenotypes. In this study, we propose a deep learning-based domain adaptation framework that enables effective knowledge transfer from a large general dataset to a smaller one for cancer type classification. The method learns a domain-invariant latent space by jointly optimizing classification and domain alignment objectives. To ensure stable training and robustness in data-scarce scenarios, the framework is trained with an adversarial approach with appropriate regularization. Both supervised and unsupervised approach variants are explored, leveraging labeled or unlabeled target samples. The framework is evaluated on three large-scale transcriptomic datasets (TCGA, ARCHS4, GTEx) to assess its ability to transfer knowledge across cohorts.
Experimental results demonstrate consistent improvements in cancer and tissue type classification accuracy compared to non-adaptive baselines, particularly in low-data scenarios. Overall, this work highlights domain adaptation as a powerful strategy for data-efficient knowledge transfer in transcriptomics, enabling robust phenotype prediction under constrained data conditions.
2026-03-09T07:55:32Z
7 pages, 5 figures. Submitted to ECCB 2026
Kevin Dradjat, Massinissa Hamidi, Blaise Hanczar

http://arxiv.org/abs/2511.08996v3
Partial domain adaptation enables cross domain cell type annotation between scRNA-seq and snRNA-seq
2026-03-08T17:33:31Z
Accurate cell type annotation across datasets is a key challenge in single-cell analysis. snRNA-seq enables profiling of frozen or difficult-to-dissociate tissues, complementing scRNA-seq by capturing fragile or rare cell types. However, cross-annotation between these two datasets remains largely unexplored, as existing methods treat them independently. We introduce ScNucAdapt, a method designed for cross-annotation between paired and unpaired scRNA-seq and snRNA-seq datasets. To address distributional and cell composition differences, ScNucAdapt employs partial domain adaptation. Experiments across both unpaired and paired scRNA-seq and snRNA-seq show that ScNucAdapt achieves robust and accurate cell type annotation, outperforming existing approaches. Therefore, ScNucAdapt provides a practical framework for the cross-domain cell type annotation between scRNA-seq and snRNA-seq data.
2025-11-12T05:37:40Z
Xiran Chen, Quan Zou, Qinyu Cai, Xiaofeng Chen, Weikai Li, Yansu Wang

http://arxiv.org/abs/2603.06950v1
How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences
2026-03-06T23:52:26Z
DNA foundation models have become transformative tools in bioinformatics and healthcare applications.
Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model's output for reconstructing the DNA sequence is a zero-shot embedding, which is then fed to a decoder. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be most vulnerable, especially for shorter sequences with reconstruction similarities > 90%, while DNABERT-2's BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy-aware design in genomic foundation models prior to their widespread deployment in EaaS settings. 
Training code, model weights and evaluation pipeline are released on: https://github.com/not-a-feature/DNA-Embedding-Inversion.
2026-03-06T23:52:26Z
Sofiane Ouaari, Jules Kreuer, Nico Pfeifer

http://arxiv.org/abs/2603.06804v1
Identifying genes associated with phenotypes using machine and deep learning
2026-03-06T19:13:19Z
Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first is classifying people into case/control based on the genotype data. The second is calculating feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.
2026-03-06T19:13:19Z
Muhammad Muneeb, David B.
Ascher, YooChan Myung

http://arxiv.org/abs/2603.06768v1
Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools
2026-03-06T16:36:42Z
Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype is passed to PLINK for quality control, after which it is transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality control measures for the test data and the genome-wide association studies summary statistic file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data was passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk score tools with 675 different clumping and pruning parameters. Machine learning outperformed for 44 phenotypes, while polygenic risk score tools excelled for 36 phenotypes. The results give us valuable insights into which techniques tend to perform better for certain phenotypes compared to more traditional polygenic risk score tools.
2026-03-06T16:36:42Z
Muhammad Muneeb, David B. Ascher, YooChan Myung, Samuel F.
Feng, Andreas Henschel

http://arxiv.org/abs/2602.10152v2
Validating Interpretability in siRNA Efficacy Prediction: A Perturbation-Based, Dataset-Aware Protocol
2026-03-06T15:55:21Z
Saliency maps are increasingly used as design guidance in siRNA efficacy prediction, yet attribution methods are rarely validated before motivating sequence edits. We introduce a pre-synthesis gate: a protocol for counterfactual sensitivity faithfulness that tests whether mutating high-saliency positions changes model output more than composition-matched controls. Cross-dataset transfer reveals two failure modes that would otherwise go undetected: faithful-but-wrong (saliency valid, predictions fail) and inverted saliency (top-saliency edits less impactful than random). Strikingly, models trained on mRNA-level assays collapse on a luciferase reporter dataset, demonstrating that protocol shifts can silently invalidate deployment. Across four benchmarks, 19/20 fold instances pass; the single failure shows inverted saliency. A biology-informed regularizer (BioPrior) strengthens saliency faithfulness with modest, dataset-dependent predictive trade-offs. Our results establish saliency validation as essential pre-deployment practice for explanation-guided therapeutic design. Code is available at https://github.com/shadi97kh/BioPrior.
2026-02-09T19:27:42Z
Accepted at the Machine Learning for Genomics Explorations (MLGenX) Workshop at ICLR 2026
ICLR 2026 Workshop on Machine Learning for Genomics Explorations (MLGenX)
Zahra Khodagholi, Niloofar Yousefi

http://arxiv.org/abs/2507.19229v2
TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling
2026-03-06T10:49:55Z
The modeling of genomic sequences presents unique challenges due to their length and structural complexity. Traditional sequence models struggle to capture long-range dependencies and biological features inherent in DNA. In this work, we propose TrinityDNA, a novel DNA foundational model designed to address these challenges.
The model integrates biologically informed components, including Groove Fusion for capturing DNA's structural features and Gated Reverse Complement (GRC) to handle the inherent symmetry of DNA sequences. Additionally, we introduce a multi-scale attention mechanism that allows the model to attend to varying levels of sequence dependencies, and an evolutionary training strategy that progressively adapts the model to both prokaryotic and eukaryotic genomes. TrinityDNA provides a more accurate and efficient approach to genomic sequence modeling, offering significant improvements in gene function prediction, regulatory mechanism discovery, and other genomics applications. Our model bridges the gap between machine learning techniques and biological insights, paving the way for more effective analysis of genomic data. Additionally, we introduced a new DNA long-sequence CDS annotation benchmark to make evaluations more comprehensive and oriented toward practical applications.
2025-07-25T12:55:30Z
AAAI 2026
Qirong Yang, Yucheng Guo, Zicheng Liu, Yujie Yang, Qijin Yin, Siyuan Li, Shaomin Ji, Linlin Chao, Xiaoming Zhang, Stan Z. Li

http://arxiv.org/abs/2511.02263v4
LA-MARRVEL: A Knowledge-Grounded, Language-Aware LLM Framework for Clinically Robust Rare Disease Gene Prioritization
2026-03-05T22:52:04Z
Rare disease diagnosis requires matching variant-bearing genes to complex patient phenotypes across large and heterogeneous evidence sources. This process remains time-intensive in current clinical interpretation pipelines. To overcome these limitations, we present LA-MARRVEL, a knowledge-grounded, language-aware LLM framework designed for clinical robustness and practical deployment. LA-MARRVEL delivers a 12-15 percentage-point absolute improvement in Recall@1 over established gene prioritization approaches, showing that architectural design can drive substantial accuracy gains.
We found that the central contributor is structured, phenotype-rich prompt construction that explicitly encodes patient and disease phenotypes, preserving clinically meaningful context more effectively than disease labels alone. Across three real-world cohorts, LA-MARRVEL consistently improves gene-ranking performance, including in challenging cases where the causal gene was initially ranked lower by first-stage prioritization. For each candidate gene, the system delivers clinically relevant, ACMG-aligned reasoning that integrates phenotype concordance, inheritance patterns, and variant-level evidence into auditable explanations, enabling streamlined clinical review. These findings suggest that a knowledge-grounded LLM layer can enhance existing rare-disease gene prioritization workflows without altering established diagnostic pipelines.
2025-11-04T05:17:41Z
Jaeyeon Lee, Lin Yao, Hyun-Hwan Jeong, Zhandong Liu

http://arxiv.org/abs/2602.22289v2
What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses
2026-03-05T20:03:46Z
When biological foundation models such as scGPT and Geneformer process single-cell gene expression, what geometric and topological structure forms in their internal representations? Is that structure biologically meaningful or a training artifact, and how confident should we be in such claims? We address these questions through autonomous large-scale hypothesis screening: an AI-driven executor-brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations, covering persistent homology, manifold distances, cross-model alignment, community structure, and directed topology, all with explicit null controls and disjoint gene-pool splits.
Three principal findings emerge. First, the models learn genuine geometric structure. Gene embedding neighborhoods exhibit non-trivial topology, with persistent homology significant in 11 of 12 transformer layers at p < 0.05 in the weakest domain and 12 of 12 in the other two. A multi-level distance hierarchy shows that manifold-aware metrics outperform Euclidean distance for identifying regulatory gene pairs, and graph community partitions track known transcription factor target relationships. Second, this structure is shared across independently trained models. CCA alignment between scGPT and Geneformer yields canonical correlation of 0.80 and gene retrieval accuracy of 72 percent, yet none of 19 tested methods reliably recover gene-level correspondences. The models agree on the global shape of gene space but not on precise gene placement. Third, the structure is more localized than it first appears. Under stringent null controls applied across all null families, robust signal concentrates in immune tissue, while lung and external lung signals weaken substantially.
2026-02-25T14:33:24Z
Ihor Kendiukhov

http://arxiv.org/abs/2603.05572v1
Machine Learning for analysis of Multiple Sclerosis cross-tissue bulk and single-cell transcriptomics data
2026-03-05T16:09:12Z
Multiple Sclerosis (MS) is a chronic autoimmune disease of the central nervous system whose molecular mechanisms remain incompletely understood. In this study, we developed an end-to-end machine learning pipeline to analyze transcriptomic data from peripheral blood mononuclear cells and cerebrospinal fluid, integrating both bulk microarray and single-cell RNA sequencing datasets (concentrating on CD4+ and B-cells). After rigorous preprocessing, batch correction, and gene declustering, XGBoost classifiers were trained to distinguish MS patients from healthy controls.
Explainable AI tools, namely SHapley Additive exPlanations (SHAP), were employed to identify key genes driving classification, and results were compared with Differential Expression Analysis (DEA). SHAP-prioritized genes were further investigated through interaction networks and pathway enrichment analyses. The models achieved strong performance, particularly in CSF B-cells (AUC=0.94) and microarray (AUC=0.86). SHAP gene selection proved to be complementary to classical DEA. Gene clusters identified across multiple datasets highlighted immune activation, non-canonical immune checkpoints (ITK, CLEC2D, KLRG1, CEACAM1), ribosomal and translational programs, ubiquitin-proteasome regulation, lipid trafficking, and Epstein-Barr virus-related pathways. Our integrative and explainable framework reveals complementary insights beyond conventional analysis and provides novel mechanistic hypotheses and potential biomarkers for MS pathogenesis.
2026-03-05T16:09:12Z
Francesco Massafra, Samuele Punzo, Silvia Giulia Galfré, Alessandro Maglione, Simone Pernice, Stefano Forti, Simona Rolla, Marco Beccuti, Marinella Clerico, Corrado Priami, Alina Sîrbu

http://arxiv.org/abs/2601.16151v2
In vitro binding energies capture Klf4 occupancy across the human genome
2026-03-04T09:32:15Z
Transcription factors (TFs) regulate gene expression by binding to specific genomic loci determined by DNA sequence. Their sequence specificity is commonly summarized by a consensus binding motif. However, eukaryotic genomes contain billions of low-affinity DNA sequences to which TFs associate with a sequence-dependent binding energy. We currently lack insight into how the genomic sequence defines this spectrum of binding energies and the resulting pattern of TF localization. Here, we set out to obtain a quantitative understanding of sequence-dependent TF binding to both motif and non-motif sequences.
We achieve this by first pursuing accurate measurements of physical binding energies of the human TF Klf4 to a library of short DNA sequences in a fluorescence-anisotropy-based bulk competitive binding assay. Second, we show that the highly non-linear sequence dependence of Klf4 binding energies can be captured by combining a linear model of binding energies with an Ising model of the coupled recognition of nucleotides by a TF. We find that this statistical mechanics model parametrized by our in vitro measurements captures Klf4 binding patterns on individual long DNA molecules stretched in the optical tweezer, and is predictive for Klf4 occupancy across the entire human genome without additional fit parameters.
2026-01-22T17:49:53Z
A.S., J.N., and Y.S. contributed equally to this work. Update 2025/03: correction of a few typos
Anne Schwager, Jonas Neipel, Yahor Savich, Douglas Diehl, Frank Jülicher, Anthony A. Hyman, Stephan W. Grill
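
As a rough illustration of the idea in the Klf4 abstract above (turning sequence-dependent binding energies into an occupancy profile along DNA), a minimal Python sketch follows. It assumes a purely additive (linear) energy model and a two-state Boltzmann occupancy; the Ising coupling term the authors describe is omitted here. The 3-bp energy matrix, the chemical potential `mu`, and all function names are hypothetical placeholders, not values or code from the paper.

```python
import math

# Hypothetical additive energy matrix (units of kT) for a 3-bp site.
# ENERGY[i][base] is the penalty for `base` at position i; values are
# illustrative only and are not measurements from the paper.
ENERGY = [
    {"A": 2.0, "C": 0.0, "G": 1.5, "T": 2.5},
    {"A": 0.0, "C": 2.0, "G": 1.0, "T": 2.0},
    {"A": 2.5, "C": 1.5, "G": 0.0, "T": 2.0},
]


def site_energy(site):
    """Sum per-position contributions (linear model, no coupling term)."""
    return sum(ENERGY[i][base] for i, base in enumerate(site))


def occupancy(energy, mu=1.0):
    """Two-state Boltzmann occupancy; mu is an assumed chemical potential in kT."""
    return 1.0 / (1.0 + math.exp(energy - mu))


def occupancy_profile(seq, mu=1.0):
    """Slide a window of the site width along seq and score every position."""
    width = len(ENERGY)
    return [
        occupancy(site_energy(seq[i:i + width]), mu)
        for i in range(len(seq) - width + 1)
    ]


# The window "CAG" has zero penalty under this toy matrix, so it is the
# most strongly occupied position in the profile.
profile = occupancy_profile("TTCAGTT")
best_start = max(range(len(profile)), key=profile.__getitem__)
```

In this toy setup every window receives a non-zero occupancy, which mirrors the abstract's point that low-affinity, non-motif sequences still contribute; a coupling term between neighbouring positions would be the natural next extension.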