http://arxiv.org/abs/2510.27097v2 Hierarchical Bayesian Model for Gene Deconvolution and Functional Analysis in Human Endometrium Across the Menstrual Cycle 2025-12-12T16:09:58Z

Bulk tissue RNA sequencing of heterogeneous samples provides averaged gene expression profiles, obscuring cell type-specific dynamics. To address this, we present a probabilistic hierarchical Bayesian model that deconvolves bulk RNA-seq data into constituent cell-type expression profiles and proportions, leveraging a high-resolution single-cell reference. We apply our model to human endometrial tissue across the menstrual cycle, a context characterized by dramatic hormone-driven cellular composition changes. Our extended framework provides a principled inference of cell type proportions and cell-specific gene expression changes across cycle phases. We demonstrate the model's structure, priors, and inference strategy in detail, and we validate its performance with simulations and comparisons to existing methods. The results reveal dynamic shifts in epithelial, stromal, and immune cell fractions between menstrual phases, and identify cell-type-specific differential gene expression associated with endometrial function (e.g., decidualization markers in stromal cells during the secretory phase). We further conduct robustness tests and show that our Bayesian approach is resilient to reference mismatches and noise. Finally, we discuss the biological significance of our findings, potential clinical implications for fertility and endometrial disorders, and future directions, including integration of spatial transcriptomics.

2025-10-31T01:48:25Z This paper is withdrawn due to issues with attribution and citation accuracy Crystal Su Kuai Yu Mingyuan Shao Daniel Bauer http://arxiv.org/abs/2410.04996v5 Assumption-Lean Post-Integrated Inference with Surrogate Control Outcomes 2025-12-12T16:09:21Z

Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variations, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased due to data-dependent processes. We introduce a robust post-integrated inference method that accounts for latent heterogeneity by utilizing control outcomes. Leveraging causal interpretations, we derive nonparametric identifiability of the direct effects using negative control outcomes. By utilizing surrogate control outcomes as an extension of negative control outcomes, we develop semiparametric inference on projected direct effect estimands, accounting for hidden mediators, confounders, and moderators. These estimands remain statistically meaningful under model misspecifications and with error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation with machine learning algorithms. Our proposal is evaluated using random forests through simulations and analysis of single-cell CRISPR perturbed datasets, which may contain potential unmeasured confounders.

2024-10-07T12:52:38Z 22 pages for the main text, 27 pages for the appendix, 6 figures for the main text, 7 figures for the appendix Jin-Hong Du Kathryn Roeder Larry Wasserman http://arxiv.org/abs/2512.10147v1 Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences 2025-12-10T23:03:10Z

Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.
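The abstract does not spell out the hashing scheme, so the following is only a minimal sketch of the general idea of alignment-free, hashing-based sequence embeddings. It assumes overlapping k-mers hashed into a fixed number of buckets, and it uses `zlib.crc32` from the Python standard library as a stand-in hash rather than MurmurHash. The function name `kmer_hash_embedding` and the parameters `k` and `dim` are hypothetical and not taken from the paper.

```python
import zlib


def kmer_hash_embedding(seq: str, k: int = 3, dim: int = 64) -> list[float]:
    """Hash each overlapping k-mer into one of `dim` buckets; return normalized counts.

    Assumption: CRC32 is used here only because it ships with Python;
    it is not the hash function named in the paper's title.
    """
    vec = [0.0] * dim
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return vec  # sequence shorter than k: all-zero embedding
    for i in range(n_kmers):
        kmer = seq[i:i + k]
        bucket = zlib.crc32(kmer.encode("ascii")) % dim
        vec[bucket] += 1.0
    # Normalize so sequences of different lengths are comparable.
    return [count / n_kmers for count in vec]


# Hypothetical toy amino-acid fragments of different lengths (no alignment needed).
emb_a = kmer_hash_embedding("MFVFLVLLPLVSSQCVNLT")
emb_b = kmer_hash_embedding("MFVFLVLLPLVSSQ")
print(len(emb_a), len(emb_b))
```

Because the output dimension is fixed by `dim` rather than by sequence length, unaligned sequences of any length map to vectors that a downstream classifier can consume directly.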

2025-12-10T23:03:10Z Sarwan Ali Taslim Murad http://arxiv.org/abs/2510.12617v2 Same model, better performance: the impact of shuffling on DNA Language Models benchmarking 2025-12-10T22:00:37Z

Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that sits at the intersection of domain-specific genomics challenges and machine learning methodology, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters (the number of data-loading workers and buffer sizes) create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
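To illustrate why buffer size can matter, here is a toy sketch of a streaming shuffle buffer applied to records stored in sorted order, compared with the same records pre-shuffled before storage. It is not BEND's data loader. The function `buffer_shuffle`, the helper `leading_fraction`, the chromosome labels, and the buffer size of 10 are all hypothetical choices for illustration.

```python
import random


def buffer_shuffle(stream, buffer_size, rng):
    """Streaming shuffle: keep a small buffer and emit a random element from it."""
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) > buffer_size:
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    rng.shuffle(buf)  # drain whatever is left once the stream ends
    yield from buf


def leading_fraction(items, label, n):
    """Fraction of the first n items that carry the given label."""
    head = items[:n]
    return sum(1 for x in head if x == label) / len(head)


# Hypothetical toy data: records stored sorted by chromosome.
records = ["chr1"] * 500 + ["chr2"] * 500

# Case 1: small buffer over sorted storage.
streamed = list(buffer_shuffle(records, buffer_size=10, rng=random.Random(0)))

# Case 2: shuffle once before storage, then stream with the same small buffer.
pre_shuffled = records[:]
random.Random(0).shuffle(pre_shuffled)
pre_then_streamed = list(
    buffer_shuffle(pre_shuffled, buffer_size=10, rng=random.Random(1))
)

print(leading_fraction(streamed, "chr1", 100))
print(leading_fraction(pre_then_streamed, "chr1", 100))
```

In the sorted case the first 100 emitted records can only come from the first 110 stored records, so they are all `chr1` no matter which random seed is used. After pre-shuffling, the early records are a mix of both labels, and the buffer size no longer determines the composition of early batches.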

2025-10-14T15:16:56Z Davide Greco Konrad Rawlik http://arxiv.org/abs/2512.09964v1 Development of an Agentic AI Model for NGS Downstream Analysis Targeting Researchers with Limited Biological Background 2025-12-10T03:43:46Z

Next-Generation Sequencing (NGS) has become a cornerstone of genomic research, yet the complexity of downstream analysis, ranging from differentially expressed gene (DEG) identification to biological interpretation, remains a significant barrier for researchers lacking specialized computational and biological expertise. While recent studies have introduced AI agents for RNA-seq analysis, most focus on general workflows without offering tailored interpretations or guidance for novices. To address this gap, we developed an Agentic AI model designed to automate NGS downstream analysis, provide literature-backed interpretations, and autonomously recommend advanced analytical methods. Built on the Llama 3 70B Large Language Model (LLM) and a Retrieval-Augmented Generation (RAG) framework, the model is deployed as an interactive Streamlit web application. The system integrates standard bioinformatics tools (Biopython, GSEApy, gProfiler) to execute core analyses, including DEG identification, clustering, and pathway enrichment. Uniquely, the agent utilizes RAG to query PubMed via Entrez, synthesizing biological insights and validating hypotheses with current literature. In a case study using a cancer-related dataset, the model successfully identified significant DEGs, visualized clinical correlations, and derived evidence-based insights (e.g., linking BRAF mutations to prognosis), subsequently executing advanced survival modeling upon user selection. This framework democratizes bioinformatics by enabling researchers with limited backgrounds to seamlessly transition from basic data processing to advanced hypothesis testing and validation.

2025-12-10T03:43:46Z Donghyeon Lee Dongseok Kim Seokhwan Ko Seo-Young Park Junghwan Cho http://arxiv.org/abs/2512.09259v1 MoDaH achieves rate optimal batch correction 2025-12-10T02:31:16Z

Batch effects pose a significant challenge in the analysis of single-cell omics data, introducing technical artifacts that confound biological signals. While various computational methods have achieved empirical success in correcting these effects, they lack the formal theoretical guarantees required to assess their reliability and generalization. To bridge this gap, we introduce Mixture-Model-based Data Harmonization (MoDaH), a principled batch correction algorithm grounded in a rigorous statistical framework. Under a new Gaussian mixture model with explicit parametrization of batch effects, we establish the minimax optimal error rate for batch correction and prove that MoDaH achieves this rate by leveraging recent theoretical advances in clustering data from anisotropic Gaussian mixtures. This constitutes, to the best of our knowledge, the first theoretical guarantee for batch correction. Extensive experiments on diverse single-cell RNA-seq and spatial proteomics datasets demonstrate that MoDaH not only attains theoretical optimality but also achieves empirical performance comparable to or even surpassing that of state-of-the-art heuristics (e.g., Harmony, Seurat-V5, and LIGER), effectively balancing the removal of technical noise with the conservation of biological signal.

2025-12-10T02:31:16Z Yang Cao Zongming Ma http://arxiv.org/abs/2509.13290v2 Uchimata: a toolkit for visualization of 3D genome structures on the web and in computational notebooks 2025-12-09T20:03:10Z

Summary: Uchimata is a toolkit for visualization of 3D structures of genomes. It consists of two packages: a JavaScript library facilitating the rendering of 3D models of genomes, and a Python widget for visualization in Jupyter Notebooks. Main features include an expressive way to specify visual encodings, and filtering of 3D genome structures based on genomic semantics and spatial aspects. Uchimata is designed to integrate readily with biological tooling available in Python. Availability and Implementation: Uchimata is released under the MIT License. The JavaScript library is available on NPM, while the widget is available as a Python package hosted on PyPI. The source code for both is available publicly on GitHub (https://github.com/hms-dbmi/uchimata and https://github.com/hms-dbmi/uchimata-py) and Zenodo (https://doi.org/10.5281/zenodo.17831959 and https://doi.org/10.5281/zenodo.17832045). The documentation with examples is hosted at https://hms-dbmi.github.io/uchimata/ Contact: david_kouril@hms.harvard.edu or nils@hms.harvard.edu.

2025-09-16T17:43:57Z David Kouřil Trevor Manz Tereza Clarence Nils Gehlenborg http://arxiv.org/abs/2503.18472v2 A Graph-based Approach to Variant Extraction from Sequences 2025-12-09T09:20:58Z

Accurate variant descriptions are of paramount importance in the field of genomics. The domain is confronted with increasingly complex variants, e.g., combinations of multiple indels, making it challenging to generate proper variant descriptions directly from chromosomal sequences. We present a graph based on all minimal alignments that is a complete representation of a variant and gives more insight into the nature of a variant than a single variant description. We provide three complementary extraction methods to derive variant descriptions from this graph, including one that yields domain-specific constructs from the HGVS nomenclature. Our experiments show that our methods, in comparison with dbSNP, the authoritative variant database from the NCBI, result in identical HGVS descriptions for simple variants and more meaningful descriptions for complex variants, in particular for repeat expansions and contractions.

2025-03-24T09:20:00Z 19 pages, 10 figures NAR Genomics and Bioinformatics, Volume 7, Issue 4, December 2025 Mark A. Santcroos Walter A. Kosters Mihai Lefter Jeroen F. J. Laros Jonathan K. Vis 10.1093/nargab/lqaf173 http://arxiv.org/abs/2512.08226v1 ImmunoNX: a robust bioinformatics workflow to support personalized neoantigen vaccine trials 2025-12-09T04:08:05Z

Personalized neoantigen vaccines represent a promising immunotherapy approach that harnesses tumor-specific antigens to stimulate anti-tumor immune responses. However, the design of these vaccines requires sophisticated computational workflows to predict and prioritize neoantigen candidates from patient sequencing data, coupled with rigorous review to ensure candidate quality. While numerous computational tools exist for neoantigen prediction, to our knowledge, there are no established protocols detailing the complete process from raw sequencing data through systematic candidate selection. Here, we present ImmunoNX (Immunogenomics Neoantigen eXplorer), an end-to-end protocol for neoantigen prediction and vaccine design that has supported over 185 patients across 11 clinical trials. The workflow integrates tumor DNA/RNA and matched normal DNA sequencing data through a computational pipeline built with Workflow Definition Language (WDL) and executed via Cromwell on Google Cloud Platform. ImmunoNX employs consensus-based variant calling, in-silico HLA typing, and pVACtools for neoantigen prediction. Additionally, we describe a two-stage immunogenomics review process with prioritization of neoantigen candidates, enabled by pVACview, followed by manual assessment of variants using the Integrative Genomics Viewer (IGV). This workflow enables vaccine design in under three months. We demonstrate the protocol using the HCC1395 breast cancer cell line dataset, identifying 78 high-confidence neoantigen candidates from 322 initial predictions. Although demonstrated here for vaccine development, this workflow can be adapted for diverse neoantigen therapies and experiments. Therefore, this protocol provides the research community with a reproducible, version-controlled framework for designing personalized neoantigen vaccines, supported by detailed documentation, example datasets, and open-source code.

2025-12-09T04:08:05Z Supplementary tables and materials available at https://doi.org/10.5281/zenodo.17862140 Kartik Singhal Evelyn Schmidt Susanna Kiwala S. Peter Goedegebuure Christopher A. Miller Huiming Xia Kelsy C. Cotto Jinglun Li Jennie Yao Luke Hendrickson Miller M. Richters My H. Hoang Mariam Khanfar Isabel Risch Shelly O'Laughlin Nancy Myers Tammi Vickery Sherri R. Davies Feiyu Du Thomas B. Mooney Adam Coffman Gue Su Chang Jasreet Hundal John E. Garza Michael D. McLellan Joshua F. McMichael John Maruska William Blake Inabinett William A. Hoos Rachel Karchin Tanner M. Johanns Gavin P. Dunn Russel K. Pachynski Todd A. Fehniger Jeffrey P. Ward Jennifer A. Foltz William E. Gillanders Obi L. Griffith Malachi Griffith http://arxiv.org/abs/2512.08175v1 needLR: Long-read structural variant annotation with population-scale frequency estimation 2025-12-09T02:08:42Z

Summary: We present needLR, a structural variant (SV) annotation tool that can be used for filtering and prioritization of candidate pathogenic SVs from long-read sequencing data using population allele frequencies, annotations for genomic context, and gene-phenotype associations. When using population data from 500 presumably healthy individuals to evaluate nine test cases with known pathogenic SVs, needLR assigned allele frequencies to over 97.5% of all detected SVs and reduced the average number of novel genic SVs to 121 per case while retaining all known pathogenic variants. Availability and Implementation: needLR is implemented in bash with dependencies including Truvari v4.2.2, BEDTools v2.31.1, and BCFtools v1.19. Source code, documentation, and pre-computed population allele frequency data are freely available at https://github.com/jgust1/needLR under an MIT license.

2025-12-09T02:08:42Z Jonas A. Gustafson Jiadong Lin Evan E. Eichler Danny E. Miller http://arxiv.org/abs/2501.19030v2 A network-driven framework for enhancing gene-disease association studies in coronary artery disease 2025-12-08T14:01:19Z

Transcriptome-wide association studies (TWAS) link genetic variation to complex traits by leveraging expression quantitative trait loci (eQTL) data. However, most implementations are limited to local (cis-acting) effects and fail to account for long-range (trans) regulatory influences mediated through gene networks. We introduce GRN-TWAS, a framework that reconstructs gene regulatory networks (GRNs) and integrates their topology into gene expression prediction models, thereby propagating distal (trans) regulatory effects through tissue-specific gene networks to trait- or disease-associated phenotypes. By incorporating network-derived trans-eQTLs, GRN-TWAS generates gene expression imputation models that capture both local and distal genetic components, enabling a more complete, systems-level view of genetic regulation consistent with the omnigenic model hypothesis. Using genotype and multi-tissue expression data from 600 coronary artery disease (CAD) cases in the STARNET study together with GWAS summary statistics, we show that GRN-TWAS improves gene-expression prediction and sharpens discovery of CAD-associated genes. Across seven tissues, the framework identified 5,779 transcriptome-wide significant genes, more than 50% of which appear to be previously unreported in the CAD literature. A knowledge-based gene-ranking engine then prioritized 882 genes as highly CAD-relevant, including 237 regulated exclusively through trans effects. Key-driver analysis highlighted 18 putative trans mediators with high network centrality and disease relevance, offering mechanistic hypotheses that complement association signals. Collectively, these results demonstrate that embedding network topology into TWAS improves discovery and interpretability by exposing tissue-specific regulatory routes from genotype to phenotype and expanding the landscape of gene-disease associations.

2025-01-31T10:54:39Z Revised version, 12 pages, 6 figures, 1 table; code available at https://github.com/guutama/GRN-TWAS; Tex Source includes a file appendix.pdf with supplementary materials (4 pages supplementary methods, 3 supplementary tables, 21 supplementary figures) Gutama Ibrahim Mohammad Johan LM Björkegren Tom Michoel http://arxiv.org/abs/2512.07113v1 PlantBiMoE: A Bidirectional Foundation Model with SparseMoE for Plant Genomes 2025-12-08T02:51:46Z

Understanding the underlying linguistic rules of plant genomes remains a fundamental challenge in computational biology. Recent advances, including AgroNT and PDLLMs, have made notable progress, although they suffer from excessive parameter size and limited ability to model the bidirectional nature of DNA strands, respectively. To address these limitations, we propose PlantBiMoE, a lightweight and expressive plant genome language model that integrates bidirectional Mamba and a Sparse Mixture-of-Experts (SparseMoE) framework. The bidirectional Mamba enables the model to effectively capture structural dependencies across both the forward and reverse DNA strands, while SparseMoE significantly reduces the number of active parameters, improving computational efficiency without sacrificing modeling capacity. We evaluated our model on the Modified Plants Genome Benchmark (MPGB), an enhanced genomic benchmark that consolidates 31 datasets across 11 representative tasks, with input sequence lengths ranging from 50 to 6,000 bp. Experimental results demonstrate that PlantBiMoE achieves the best performance on 20 out of 31 datasets and the best average performance compared with existing models. In summary, these results demonstrate that our model can effectively represent plant genomic sequences, serving as a robust computational tool for diverse genomic tasks, while making substantive contributions to plant genomics, gene editing, and synthetic biology. The code is available at: https://github.com/HUST-Keep-Lin/PlantBiMoE

2025-12-08T02:51:46Z 6 pages, 5 figures, accepted to BIBM Kepeng Lin Qizhe Zhang Rui Wang Xuehai Hu Wei Xu http://arxiv.org/abs/2512.05573v1 Refined HLA Linkage Disequilibrium Architectures of World Populations by a Novel Allelic Correlation Measure 2025-12-05T09:55:22Z

Numerous diseases, particularly autoimmune disorders, are associated with the human leukocyte antigen (HLA), a small genomic region located on human chromosome 6. Adequate characterization of linkage disequilibrium (LD) in the HLA across populations is crucial for identifying genetic markers associated with specific traits and phenotypes. However, current LD measures often fail to capture HLA's structural complexity due to methodological limitations and sensitivity to low-frequency variants, marginal allele frequencies, and haplotype composition. To address these challenges, we introduced the Conditional Informatics Correlation Coefficient (CICC), which integrates conditional probability, information content, and haplotype-aware XOR logic to quantify LD robustly. When applied to high-resolution haploid genomes from the Human Pangenome Reference Consortium (HPRC), CICC revealed 10 novel high-LD regions in HLA. Further analyses using the 1000 Genomes Project and Genome Asia datasets identified nine strongly linked regions shared across five global populations: five in Class I and four in Class II. These results demonstrate CICC's ability to capture complex HLA LD structures across populations, highlighting its broad potential for disease gene mapping, population genomics, and guiding precision medicine.

2025-12-05T09:55:22Z Fei Zhang Weixiong Zhang http://arxiv.org/abs/2512.03286v1 SpikGPT: A High-Accuracy and Interpretable Spiking Attention Framework for Single-Cell Annotation 2025-12-02T22:50:57Z

Accurate and scalable cell type annotation remains a challenge in single-cell transcriptomics, especially when datasets exhibit strong batch effects or contain previously unseen cell populations. Here we introduce SpikGPT, a hybrid deep learning framework that integrates scGPT-derived cell embeddings with a spiking Transformer architecture to achieve efficient and robust annotation. scGPT provides biologically informed dense representations of each cell, which are further processed by a multi-head Spiking Self-Attention mechanism for energy-efficient feature extraction. Across multiple benchmark datasets, SpikGPT consistently matches or exceeds the performance of leading annotation tools. Notably, SpikGPT uniquely identifies unseen cell types by assigning low-confidence predictions to an "Unknown" category, allowing accurate rejection of cell states absent from the training reference. Together, these results demonstrate that SpikGPT is a versatile and reliable annotation tool capable of generalizing across datasets, resolving complex cellular heterogeneity, and facilitating discovery of novel or disease-associated cell populations.
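The rejection step described above, where low-confidence predictions are routed to an "Unknown" category, can be sketched generically as thresholding the top class probability. This is an illustrative assumption, not SpikGPT's actual implementation. The softmax confidence score, the function name `annotate`, the 0.5 threshold, and the example labels are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def annotate(logits, labels, threshold=0.5):
    """Return (label, confidence); fall back to "Unknown" below the threshold.

    Assumption: confidence is the maximum softmax probability, and the
    threshold value is an arbitrary illustrative choice.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "Unknown", probs[best]
    return labels[best], probs[best]


# Hypothetical reference cell types and classifier scores.
cell_types = ["T cell", "B cell", "Monocyte"]
print(annotate([5.0, 0.0, 0.0], cell_types))    # one class clearly dominates
print(annotate([0.1, 0.0, 0.05], cell_types))   # near-uniform scores
```

A confident score vector keeps its predicted label, while a near-uniform one falls below the threshold and is reported as "Unknown". In practice the threshold would need to be tuned on held-out data.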

2025-12-02T22:50:57Z Min Huang Rishikesan Kamaleswaran http://arxiv.org/abs/2512.03158v1 Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing 2025-12-02T19:04:05Z

Wastewater-based genomic surveillance has emerged as a powerful tool for population-level viral monitoring, offering comprehensive insights into circulating viral variants across entire communities. However, this approach faces significant computational challenges stemming from high sequencing noise, low viral coverage, fragmented reads, and the complete absence of labeled variant annotations. Traditional reference-based variant calling pipelines struggle with novel mutations and require extensive computational resources. We present a comprehensive framework for unsupervised viral variant detection using Vector-Quantized Variational Autoencoders (VQ-VAE) that learns discrete codebooks of genomic patterns from k-mer tokenized sequences without requiring reference genomes or variant labels. Our approach extends the base VQ-VAE architecture with masked reconstruction pretraining for robustness to missing data and contrastive learning for highly discriminative embeddings. Evaluated on SARS-CoV-2 wastewater sequencing data comprising approximately 100,000 reads, our VQ-VAE achieves 99.52% mean token-level accuracy and 56.33% exact sequence match rate while maintaining 19.73% codebook utilization (101 of 512 codes active), demonstrating efficient discrete representation learning. Contrastive fine-tuning with different projection dimensions yields substantial clustering improvements: 64-dimensional embeddings achieve +35% Silhouette score improvement (0.31 to 0.42), while 128-dimensional embeddings achieve +42% improvement (0.31 to 0.44), clearly demonstrating the impact of embedding dimensionality on variant discrimination capability. Our reference-free framework provides a scalable, interpretable approach to genomic surveillance with direct applications to public health monitoring.
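The reported percentages can be checked directly from the figures quoted in the abstract. The short sketch below recomputes the codebook utilization and the relative Silhouette improvements. The helper names are hypothetical, and the Silhouette scores are the rounded values given above, so the results agree with the abstract only up to rounding.

```python
def percent(part: float, whole: float) -> float:
    """Share of `whole` represented by `part`, in percent."""
    return 100.0 * part / whole


def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# 101 of 512 codebook entries are active.
codebook_utilization = percent(101, 512)

# Silhouette score before and after contrastive fine-tuning (rounded values).
gain_64 = relative_improvement(0.31, 0.42)
gain_128 = relative_improvement(0.31, 0.44)

print(f"codebook utilization: {codebook_utilization:.2f}%")
print(f"64-dim Silhouette gain: {gain_64:.0f}%")
print(f"128-dim Silhouette gain: {gain_128:.0f}%")
```

These reproduce the stated 19.73% utilization and the improvements of roughly 35% and 42%.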

2025-12-02T19:04:05Z 13 pages, 4 figures Adele Chinda Richmond Azumah Hemanth Demakethepalli Venkateswara