http://arxiv.org/abs/2510.27097v2
Hierarchical Bayesian Model for Gene Deconvolution and Functional Analysis in Human Endometrium Across the Menstrual Cycle
2025-12-12T16:09:58Z
Bulk tissue RNA sequencing of heterogeneous samples provides averaged gene expression profiles, obscuring cell type-specific dynamics. To address this, we present a probabilistic hierarchical Bayesian model that deconvolves bulk RNA-seq data into constituent cell-type expression profiles and proportions, leveraging a high-resolution single-cell reference. We apply our model to human endometrial tissue across the menstrual cycle, a context characterized by dramatic hormone-driven cellular composition changes. Our extended framework provides a principled inference of cell type proportions and cell-specific gene expression changes across cycle phases. We demonstrate the model's structure, priors, and inference strategy in detail, and we validate its performance with simulations and comparisons to existing methods. The results reveal dynamic shifts in epithelial, stromal, and immune cell fractions between menstrual phases, and identify cell-type-specific differential gene expression associated with endometrial function (e.g., decidualization markers in stromal cells during the secretory phase). We further conduct robustness tests and show that our Bayesian approach is resilient to reference mismatches and noise.
Finally, we discuss the biological significance of our findings, potential clinical implications for fertility and endometrial disorders, and future directions, including integration of spatial transcriptomics.
2025-10-31T01:48:25Z
This paper is withdrawn due to issues with attribution and citation accuracy
Crystal Su, Kuai Yu, Mingyuan Shao, Daniel Bauer

http://arxiv.org/abs/2410.04996v5
Assumption-Lean Post-Integrated Inference with Surrogate Control Outcomes
2025-12-12T16:09:21Z
Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variations, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased due to data-dependent processes. We introduce a robust post-integrated inference method that accounts for latent heterogeneity by utilizing control outcomes. Leveraging causal interpretations, we derive nonparametric identifiability of the direct effects using negative control outcomes. By utilizing surrogate control outcomes as an extension of negative control outcomes, we develop semiparametric inference on projected direct effect estimands, accounting for hidden mediators, confounders, and moderators. These estimands remain statistically meaningful under model misspecifications and with error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation with machine learning algorithms.
Our proposal is evaluated using random forests through simulations and analysis of single-cell CRISPR perturbed datasets, which may contain potential unmeasured confounders.
2024-10-07T12:52:38Z
22 pages for the main text, 27 pages for the appendix, 6 figures for the main text, 7 figures for the appendix
Jin-Hong Du, Kathryn Roeder, Larry Wasserman

http://arxiv.org/abs/2512.10147v1
Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences
2025-12-10T23:03:10Z
Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%.
This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.
2025-12-10T23:03:10Z
Sarwan Ali, Taslim Murad

http://arxiv.org/abs/2510.12617v2
Same model, better performance: the impact of shuffling on DNA Language Models benchmarking
2025-12-10T22:00:37Z
Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that intersects genomics' domain-specific challenges and machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters -- number of data loading workers and buffer sizes -- create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency.
This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
2025-10-14T15:16:56Z
Davide Greco, Konrad Rawlik

http://arxiv.org/abs/2512.09964v1
Development of an Agentic AI Model for NGS Downstream Analysis Targeting Researchers with Limited Biological Background
2025-12-10T03:43:46Z
Next-Generation Sequencing (NGS) has become a cornerstone of genomic research, yet the complexity of downstream analysis, ranging from differentially expressed gene (DEG) identification to biological interpretation, remains a significant barrier for researchers lacking specialized computational and biological expertise. While recent studies have introduced AI agents for RNA-seq analysis, most focus on general workflows without offering tailored interpretations or guidance for novices. To address this gap, we developed an Agentic AI model designed to automate NGS downstream analysis, provide literature-backed interpretations, and autonomously recommend advanced analytical methods. Built on the Llama 3 70B Large Language Model (LLM) and a Retrieval-Augmented Generation (RAG) framework, the model is deployed as an interactive Streamlit web application. The system integrates standard bioinformatics tools (Biopython, GSEApy, gProfiler) to execute core analyses, including DEG identification, clustering, and pathway enrichment. Uniquely, the agent utilizes RAG to query PubMed via Entrez, synthesizing biological insights and validating hypotheses with current literature. In a case study using a cancer-related dataset, the model successfully identified significant DEGs, visualized clinical correlations, and derived evidence-based insights (e.g., linking BRAF mutations to prognosis), subsequently executing advanced survival modeling upon user selection.
This framework democratizes bioinformatics by enabling researchers with limited backgrounds to seamlessly transition from basic data processing to advanced hypothesis testing and validation.
2025-12-10T03:43:46Z
Donghyeon Lee, Dongseok Kim, Seokhwan Ko, Seo-Young Park, Junghwan Cho

http://arxiv.org/abs/2512.09259v1
MoDaH achieves rate optimal batch correction
2025-12-10T02:31:16Z
Batch effects pose a significant challenge in the analysis of single-cell omics data, introducing technical artifacts that confound biological signals. While various computational methods have achieved empirical success in correcting these effects, they lack the formal theoretical guarantees required to assess their reliability and generalization. To bridge this gap, we introduce Mixture-Model-based Data Harmonization (MoDaH), a principled batch correction algorithm grounded in a rigorous statistical framework.
Under a new Gaussian-mixture-model with explicit parametrization of batch effects, we establish the minimax optimal error rates for batch correction and prove that MoDaH achieves this rate by leveraging the recent theoretical advances in clustering data from anisotropic Gaussian mixtures. This constitutes, to the best of our knowledge, the first theoretical guarantee for batch correction. Extensive experiments on diverse single-cell RNA-seq and spatial proteomics datasets demonstrate that MoDaH not only attains theoretical optimality but also achieves empirical performance comparable to or even surpassing that of state-of-the-art heuristics (e.g., Harmony, Seurat-V5, and LIGER), effectively balancing the removal of technical noise with the conservation of biological signal.
2025-12-10T02:31:16Z
Yang Cao, Zongming Ma

http://arxiv.org/abs/2509.13290v2
Uchimata: a toolkit for visualization of 3D genome structures on the web and in computational notebooks
2025-12-09T20:03:10Z
Summary: Uchimata is a toolkit for visualization of 3D structures of genomes. It consists of two packages: a JavaScript library facilitating the rendering of 3D models of genomes, and a Python widget for visualization in Jupyter Notebooks. Main features include an expressive way to specify visual encodings, and filtering of 3D genome structures based on genomic semantics and spatial aspects. Uchimata is designed to be highly integratable with biological tooling available in Python. Availability and Implementation: Uchimata is released under the MIT License. The JavaScript library is available on NPM, while the widget is available as a Python package hosted on PyPI. The source code for both is available publicly on GitHub (https://github.com/hms-dbmi/uchimata and https://github.com/hms-dbmi/uchimata-py) and Zenodo (https://doi.org/10.5281/zenodo.17831959 and https://doi.org/10.5281/zenodo.17832045).
The documentation with examples is hosted at https://hms-dbmi.github.io/uchimata/ Contact: david_kouril@hms.harvard.edu or nils@hms.harvard.edu.
2025-09-16T17:43:57Z
David Kouřil, Trevor Manz, Tereza Clarence, Nils Gehlenborg

http://arxiv.org/abs/2503.18472v2
A Graph-based Approach to Variant Extraction from Sequences
2025-12-09T09:20:58Z
Accurate variant descriptions are of paramount importance in the field of genomics. The domain is confronted with increasingly complex variants, e.g., combinations of multiple indels, making it challenging to generate proper variant descriptions directly from chromosomal sequences. We present a graph based on all minimal alignments that is a complete representation of a variant, which gives insight into the nature of a variant compared to a single variant description. We provide three complementary extraction methods to derive variant descriptions from this graph, including one that yields domain-specific constructs from the HGVS nomenclature. Our experiments show that our methods in comparison with dbSNP, the authoritative variant database from the NCBI, result in identical HGVS descriptions for simple variants and more meaningful descriptions for complex variants, in particular for repeat expansions and contractions.
2025-03-24T09:20:00Z
19 pages, 10 figures
NAR Genomics and Bioinformatics, Volume 7, Issue 4, December 2025
Mark A. Santcroos, Walter A. Kosters, Mihai Lefter, Jeroen F. J. Laros, Jonathan K. Vis
10.1093/nargab/lqaf173

http://arxiv.org/abs/2512.08226v1
ImmunoNX: a robust bioinformatics workflow to support personalized neoantigen vaccine trials
2025-12-09T04:08:05Z
Personalized neoantigen vaccines represent a promising immunotherapy approach that harnesses tumor-specific antigens to stimulate anti-tumor immune responses. However, the design of these vaccines requires sophisticated computational workflows to predict and prioritize neoantigen candidates from patient sequencing data, coupled with rigorous review to ensure candidate quality.
While numerous computational tools exist for neoantigen prediction, to our knowledge, there are no established protocols detailing the complete process from raw sequencing data through systematic candidate selection. Here, we present ImmunoNX (Immunogenomics Neoantigen eXplorer), an end-to-end protocol for neoantigen prediction and vaccine design that has supported over 185 patients across 11 clinical trials. The workflow integrates tumor DNA/RNA and matched normal DNA sequencing data through a computational pipeline built with Workflow Definition Language (WDL) and executed via Cromwell on Google Cloud Platform. ImmunoNX employs consensus-based variant calling, in-silico HLA typing, and pVACtools for neoantigen prediction. Additionally, we describe a two-stage immunogenomics review process with prioritization of neoantigen candidates, enabled by pVACview, followed by manual assessment of variants using the Integrative Genomics Viewer (IGV). This workflow enables vaccine design in under three months. We demonstrate the protocol using the HCC1395 breast cancer cell line dataset, identifying 78 high-confidence neoantigen candidates from 322 initial predictions. Although demonstrated here for vaccine development, this workflow can be adapted for diverse neoantigen therapies and experiments. Therefore, this protocol provides the research community with a reproducible, version-controlled framework for designing personalized neoantigen vaccines, supported by detailed documentation, example datasets, and open-source code.
2025-12-09T04:08:05Z
Supplementary tables and materials available at https://doi.org/10.5281/zenodo.17862140
Kartik Singhal, Evelyn Schmidt, Susanna Kiwala, S. Peter Goedegebuure, Christopher A. Miller, Huiming Xia, Kelsy C. Cotto, Jinglun Li, Jennie Yao, Luke Hendrickson, Miller M. Richters, My H. Hoang, Mariam Khanfar, Isabel Risch, Shelly O'Laughlin, Nancy Myers, Tammi Vickery, Sherri R. Davies, Feiyu Du, Thomas B. Mooney, Adam Coffman, Gue Su Chang, Jasreet Hundal, John E. Garza, Michael D. McLellan, Joshua F. McMichael, John Maruska, William Blake Inabinett, William A. Hoos, Rachel Karchin, Tanner M. Johanns, Gavin P. Dunn, Russel K. Pachynski, Todd A. Fehniger, Jeffrey P. Ward, Jennifer A. Foltz, William E. Gillanders, Obi L. Griffith, Malachi Griffith

http://arxiv.org/abs/2512.08175v1
needLR: Long-read structural variant annotation with population-scale frequency estimation
2025-12-09T02:08:42Z
Summary: We present needLR, a structural variant (SV) annotation tool that can be used for filtering and prioritization of candidate pathogenic SVs from long-read sequencing data using population allele frequencies, annotations for genomic context, and gene-phenotype associations. When using population data from 500 presumably healthy individuals to evaluate nine test cases with known pathogenic SVs, needLR assigned allele frequencies to over 97.5% of all detected SVs and reduced the average number of novel genic SVs to 121 per case while retaining all known pathogenic variants. Availability and Implementation: needLR is implemented in bash with dependencies including Truvari v4.2.2, BEDTools v2.31.1, and BCFtools v1.19. Source code, documentation, and pre-computed population allele frequency data are freely available at https://github.com/jgust1/needLR under an MIT license.
2025-12-09T02:08:42Z
Jonas A. Gustafson, Jiadong Lin, Evan E. Eichler, Danny E. Miller

http://arxiv.org/abs/2501.19030v2
A network-driven framework for enhancing gene-disease association studies in coronary artery disease
2025-12-08T14:01:19Z
Transcriptome-wide association studies (TWAS) link genetic variation to complex traits by leveraging expression quantitative trait loci (eQTL) data. However, most implementations are typically limited to local (cis-acting) effects and fail to account for long-range (trans) regulatory influences mediated through gene networks.
We introduce GRN-TWAS, a framework that reconstructs gene regulatory networks (GRNs) and integrates their topology into gene expression prediction models, thereby propagating distal (trans) regulatory effects through tissue-specific gene networks to trait- or disease-associated phenotypes. By incorporating network-derived trans-eQTLs, GRN-TWAS generates gene expression imputation models that capture both local and distal genetic components, enabling a more complete, systems-level view of genetic regulation consistent with the omnigenic model hypothesis. Using genotype and multi-tissue expression data from 600 coronary artery disease (CAD) cases in the STARNET study together with GWAS summary statistics, we show that GRN-TWAS improves gene-expression prediction and sharpens discovery of CAD-associated genes. Across seven tissues, the framework identified 5,779 transcriptome-wide significant genes, more than 50% of which appear to be previously unreported in the CAD literature. A knowledge-based gene-ranking engine then prioritized 882 genes as highly CAD-relevant, including 237 regulated exclusively through trans effects. Key-driver analysis highlighted 18 putative trans mediators with high network centrality and disease relevance, offering mechanistic hypotheses that complement association signals.
Collectively, these results demonstrate that embedding network topology into TWAS improves discovery and interpretability by exposing tissue-specific regulatory routes from genotype to phenotype and expanding the landscape of gene-disease associations.
2025-01-31T10:54:39Z
Revised version, 12 pages, 6 figures, 1 table; code available at https://github.com/guutama/GRN-TWAS; Tex Source includes a file appendix.pdf with supplementary materials (4 pages supplementary methods, 3 supplementary tables, 21 supplementary figures)
Gutama Ibrahim Mohammad, Johan LM Björkegren, Tom Michoel

http://arxiv.org/abs/2512.07113v1
PlantBiMoE: A Bidirectional Foundation Model with SparseMoE for Plant Genomes
2025-12-08T02:51:46Z
Understanding the underlying linguistic rules of plant genomes remains a fundamental challenge in computational biology. Recent advances, including AgroNT and PDLLMs, have made notable progress, although they suffer from excessive parameter size and limited ability to model the bidirectional nature of DNA strands, respectively. To address these limitations, we propose PlantBiMoE, a lightweight and expressive plant genome language model that integrates bidirectional Mamba and a Sparse Mixture-of-Experts (SparseMoE) framework. The bidirectional Mamba enables the model to effectively capture structural dependencies across both the forward and reverse DNA strands, while SparseMoE significantly reduces the number of active parameters, improving computational efficiency without sacrificing modeling capacity. We evaluated and tested our model on the Modified Plants Genome Benchmark (MPGB), an enhanced genomic benchmark, which consolidates 31 datasets across 11 representative tasks, with input sequence lengths ranging from 50 to 6,000 bp. Experimental results demonstrate that PlantBiMoE achieves the best performance on 20 out of 31 datasets and the best average performance when compared with existing models.
In summary, these results demonstrate that our model can effectively represent plant genomic sequences, serving as a robust computational tool for diverse genomic tasks, while making substantive contributions to plant genomics, gene editing, and synthetic biology. The code is available at: https://github.com/HUST-Keep-Lin/PlantBiMoE
2025-12-08T02:51:46Z
6 pages, 5 figures, accepted to BIBM
Kepeng Lin, Qizhe Zhang, Rui Wang, Xuehai Hu, Wei Xu

http://arxiv.org/abs/2512.05573v1
Refined HLA Linkage Disequilibrium Architectures of World Populations by a Novel Allelic Correlation Measure
2025-12-05T09:55:22Z
Numerous diseases, particularly autoimmune disorders, are associated with the human leukocyte antigen (HLA), a small genomic region located on human chromosome 6. Adequate characterization of linkage disequilibrium (LD) in the HLA across populations is crucial for identifying genetic markers associated with specific traits and phenotypes. However, current LD measures often fail to capture HLA's structural complexity due to methodological limitations and sensitivity to low-frequency variants, marginal allele frequencies, and haplotype composition. To address these challenges, we introduced the Conditional Informatics Correlation Coefficient (CICC), which integrates conditional probability, information content, and haplotype-aware XOR logic to quantify LD robustly. When applied to high-resolution haploid genomes from the Human Pangenome Reference Consortium (HPRC), CICC revealed 10 novel high-LD regions in HLA. Further analyses using the 1000 Genomes Project and Genome Asia datasets identified nine strongly linked regions shared across five global populations: five in Class I and four in Class II.
These results demonstrate CICC's ability to capture complex HLA LD structures across populations, highlighting its broad potential for disease gene mapping, population genomics, and guiding precision medicine.
2025-12-05T09:55:22Z
Fei Zhang, Weixiong Zhang

http://arxiv.org/abs/2512.03286v1
SpikGPT: A High-Accuracy and Interpretable Spiking Attention Framework for Single-Cell Annotation
2025-12-02T22:50:57Z
Accurate and scalable cell type annotation remains a challenge in single-cell transcriptomics, especially when datasets exhibit strong batch effects or contain previously unseen cell populations. Here we introduce SpikGPT, a hybrid deep learning framework that integrates scGPT-derived cell embeddings with a spiking Transformer architecture to achieve efficient and robust annotation. scGPT provides biologically informed dense representations of each cell, which are further processed by a multi-head Spiking Self-Attention mechanism for energy-efficient feature extraction. Across multiple benchmark datasets, SpikGPT consistently matches or exceeds the performance of leading annotation tools. Notably, SpikGPT uniquely identifies unseen cell types by assigning low-confidence predictions to an "Unknown" category, allowing accurate rejection of cell states absent from the training reference. Together, these results demonstrate that SpikGPT is a versatile and reliable annotation tool capable of generalizing across datasets, resolving complex cellular heterogeneity, and facilitating discovery of novel or disease-associated cell populations.
2025-12-02T22:50:57Z
Min Huang, Rishikesan Kamaleswaran

http://arxiv.org/abs/2512.03158v1
Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing
2025-12-02T19:04:05Z
Wastewater-based genomic surveillance has emerged as a powerful tool for population-level viral monitoring, offering comprehensive insights into circulating viral variants across entire communities.
However, this approach faces significant computational challenges stemming from high sequencing noise, low viral coverage, fragmented reads, and the complete absence of labeled variant annotations. Traditional reference-based variant calling pipelines struggle with novel mutations and require extensive computational resources. We present a comprehensive framework for unsupervised viral variant detection using Vector-Quantized Variational Autoencoders (VQ-VAE) that learns discrete codebooks of genomic patterns from k-mer tokenized sequences without requiring reference genomes or variant labels. Our approach extends the base VQ-VAE architecture with masked reconstruction pretraining for robustness to missing data and contrastive learning for highly discriminative embeddings. Evaluated on SARS-CoV-2 wastewater sequencing data comprising approximately 100,000 reads, our VQ-VAE achieves 99.52% mean token-level accuracy and 56.33% exact sequence match rate while maintaining 19.73% codebook utilization (101 of 512 codes active), demonstrating efficient discrete representation learning. Contrastive fine-tuning with different projection dimensions yields substantial clustering improvements: 64-dimensional embeddings achieve +35% Silhouette score improvement (0.31 to 0.42), while 128-dimensional embeddings achieve +42% improvement (0.31 to 0.44), clearly demonstrating the impact of embedding dimensionality on variant discrimination capability. Our reference-free framework provides a scalable, interpretable approach to genomic surveillance with direct applications to public health monitoring.
2025-12-02T19:04:05Z
13 pages, 4 figures
Adele Chinda, Richmond Azumah, Hemanth Demakethepalli Venkateswara
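
The last abstract above mentions k-mer tokenized sequences and reports several derived percentages. The Python sketch below illustrates what k-mer tokenization can look like and re-derives the reported figures from the raw numbers quoted in the abstract. The function names, the choice of k = 3, and the stride of 1 are assumptions made for illustration only; the abstract does not state the tokenizer settings, and this is not the authors' code.

```python
def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into k-mers (hypothetical helper; k and stride are assumed)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def percent(part: float, whole: float) -> float:
    """Share of `whole` taken by `part`, in percent."""
    return 100.0 * part / whole


def relative_gain(before: float, after: float) -> float:
    """Relative improvement from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


print(kmer_tokens("ACGTAC"))               # ['ACG', 'CGT', 'GTA', 'TAC']
print(round(percent(101, 512), 2))         # 19.73 (codebook utilization)
print(round(relative_gain(0.31, 0.42)))    # 35 (64-dimensional embeddings)
print(round(relative_gain(0.31, 0.44)))    # 42 (128-dimensional embeddings)
```

The arithmetic agrees with the abstract: 101 of 512 active codes is 19.73%, and the Silhouette score changes correspond to relative gains of roughly 35% and 42%.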