http://arxiv.org/abs/2506.19097v1 Quantum Gradient Optimized Drug Repurposing Prototype for Omics Data 2025-06-23T20:17:55Z

This paper presents a novel quantum-enhanced prototype for drug repurposing and addresses the challenge of managing massive genomics data in precision medicine.

2025-06-23T20:17:55Z Conference details can be found here: https://www.insticc.org/node/technicalprogram/DATA/2025 Don Roosan Saif Nirzhor Rubayat Khan Fahmida Hai http://arxiv.org/abs/2506.18940v1 eccDNAMamba: A Pre-Trained Model for Ultra-Long eccDNA Sequence Analysis 2025-06-22T17:50:57Z

Extrachromosomal circular DNA (eccDNA) plays key regulatory roles and contributes to oncogene overexpression in cancer through high-copy amplification and long-range interactions. Despite advances in modeling, no pre-trained models currently support full-length circular eccDNA for downstream analysis. Existing genomic models are either limited to single-nucleotide resolution or hindered by the inefficiency of the quadratic attention mechanism. Here, we introduce eccDNAMamba, the first bidirectional state-space encoder tailored for circular DNA sequences. It combines forward and reverse passes for full-context representation learning with linear-time complexity, and preserves circular structure through a novel augmentation strategy. Tested on two real-world datasets, eccDNAMamba achieves strong classification performance and scales to sequences up to 200 Kbp, offering a robust and efficient framework for modeling circular genomes. Our code is available at https://github.com/zzq1zh/GenAI-Lab.
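The abstract does not spell out the augmentation strategy, so the following is only a generic illustration of why circular sequences need special handling: a circular molecule has no natural start, and a linear encoder would otherwise never see the junction between its last and first bases. The helper names (`rotate`, `wrap_junction`, `reverse_complement`) and the `overlap` parameter are hypothetical and are not taken from eccDNAMamba.

```python
# Generic circular-sequence helpers (illustrative only, not eccDNAMamba's method).

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def rotate(seq: str, offset: int) -> str:
    """Return the same circular sequence read from a different start position."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def wrap_junction(seq: str, overlap: int) -> str:
    """Append the first `overlap` bases to the end so the junction is visible."""
    return seq + seq[:overlap]


def reverse_complement(seq: str) -> str:
    """Return the reverse strand, which a backward pass would read."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


if __name__ == "__main__":
    circle = "ACGTAC"
    print(rotate(circle, 2))
    print(wrap_junction(circle, 3))
    print(reverse_complement(circle))
```

Every rotation of `circle` describes the same molecule, so a model for circular DNA should ideally produce comparable representations for all of them.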

2025-06-22T17:50:57Z Accepted by ICML 2025 Generative AI and Biology (GenBio) Workshop Zhenke Liu Jien Li Ziqi Zhang http://arxiv.org/abs/2501.10004v3 Static Three-Dimensional Structures Determine Fast Dynamics Between Distal Loci Pairs in Interphase Chromosomes 2025-06-22T03:49:28Z

Live-cell imaging experiments have shown that the distal dynamics between enhancers and promoters are unexpectedly rapid and incompatible with standard polymer models. The discordance between the compact static chromatin organization and dynamics is a conundrum that violates the expected structure-function relationship. We developed a theory to predict chromatin dynamics by accurately determining three-dimensional (3D) structures from static Hi-C contact maps or fixed-cell imaging data. Using the calculated 3D coordinates, the theory accurately forecasts experimentally observed two-point chromatin dynamics. It predicts rapid enhancer-promoter interactions and uncovers a scaling relationship between two-point relaxation time and genomic separation, closely matching recent measurements. The theory predicts that cohesin depletion accelerates single-locus diffusion while significantly slowing relaxation dynamics within topologically associating domains (TADs). Our results demonstrate that chromatin dynamics can be reliably inferred from static structural data, reinforcing the notion that 3D chromatin structure governs dynamic behavior. This general framework offers powerful tools for exploring chromatin dynamics across diverse biological contexts.
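A scaling relationship of the form tau ~ s^gamma between relaxation time tau and genomic separation s is usually read off as the slope of a log-log fit. The sketch below shows that fitting step only; the function name and the input values are hypothetical, the data are synthetic, and the exponent reported in the paper is not reproduced here.

```python
from math import log


def fit_power_law_exponent(separations, relaxation_times):
    """Least-squares slope and intercept of log(tau) against log(s)."""
    xs = [log(s) for s in separations]
    ys = [log(t) for t in relaxation_times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic example: tau = s**2, chosen only to check the fit.
    separations = [1.0, 2.0, 4.0, 8.0]
    times = [s ** 2 for s in separations]
    gamma, _ = fit_power_law_exponent(separations, times)
    print(f"fitted exponent: {gamma:.3f}")
```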

2025-01-17T07:39:26Z Guang Shi Sucheol Shin D. Thirumalai http://arxiv.org/abs/2506.17766v1 Improving Genomic Models via Task-Specific Self-Pretraining 2025-06-21T17:19:21Z

Pretraining DNA language models (DNALMs) on the full human genome is resource-intensive, yet often considered necessary for strong downstream performance. Inspired by recent findings in NLP and long-context modeling, we explore an alternative: self-pretraining on task-specific, unlabeled data. Using the BEND benchmark, we show that DNALMs trained with self-pretraining match or exceed the performance of models trained from scratch under identical compute. While genome-scale pretraining may still offer higher absolute performance, task-specific self-pretraining provides a practical and compute-efficient strategy for building stronger supervised baselines.

2025-06-21T17:19:21Z 4 pages Sohan Mupparapu Parameswari Krishnamurthy Ratish Puduppully http://arxiv.org/abs/2506.15761v1 Advancing Digital Precision Medicine for Chronic Fatigue Syndrome through Longitudinal Large-Scale Multi-Modal Biological Omics Modeling with Machine Learning and Artificial Intelligence 2025-06-18T15:31:26Z

We studied a generalized question: chronic diseases like ME/CFS and long COVID exhibit high heterogeneity with multifactorial etiology and progression, complicating diagnosis and treatment. To address this, we developed BioMapAI, an explainable Deep Learning framework using the richest longitudinal multi-omics dataset for ME/CFS to date. This dataset includes gut metagenomics, plasma metabolome, immune profiling, blood labs, and clinical symptoms. By connecting multi-omics to a symptom matrix, BioMapAI identified both disease- and symptom-specific biomarkers, reconstructed symptoms, and achieved state-of-the-art precision in disease classification. We also created the first connectivity map of these omics in both healthy and disease states and revealed how microbiome-immune-metabolome crosstalk shifted from healthy to ME/CFS.

2025-06-18T15:31:26Z Ruoyun Xiong http://arxiv.org/abs/2506.15383v1 Global Ground Metric Learning with Applications to scRNA data 2025-06-18T11:53:13Z

Optimal transport provides a robust framework for comparing probability distributions. Its effectiveness is significantly influenced by the choice of the underlying ground metric. Traditionally, the ground metric has either been (i) predefined, e.g., as the Euclidean distance, or (ii) learned in a supervised way, by utilizing labeled data to learn a suitable ground metric for enhanced task-specific performance. Yet, predefined metrics typically cannot account for the inherent structure and varying importance of different features in the data, and existing supervised approaches to ground metric learning often do not generalize across multiple classes or are restricted to distributions with shared supports. To address these limitations, we propose a novel approach for learning metrics for arbitrary distributions over a shared metric space. Our method provides a distance between individual points like a global metric, but requires only class labels on a distribution-level for training. The learned global ground metric enables more accurate optimal transport distances, leading to improved performance in embedding, clustering and classification tasks. We demonstrate the effectiveness and interpretability of our approach using patient-level scRNA-seq data spanning multiple diseases.
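To make the role of the ground metric concrete, the sketch below computes an optimal transport distance between two small point clouds under a feature-weighted Euclidean metric. It is a toy under stated assumptions: both clouds are uniform and have the same number of points, so the optimal plan is a one-to-one matching that can be found by brute force. The weights are set by hand here, whereas the paper learns its metric from distribution-level labels; the function names are hypothetical.

```python
from itertools import permutations
from math import sqrt


def weighted_distance(x, y, weights):
    """Ground metric: Euclidean distance with a non-negative weight per feature."""
    return sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def ot_distance(source, target, weights):
    """Exact OT cost between two uniform, equal-size point clouds.

    Brute force over all matchings, so only suitable for a handful of points.
    """
    n = len(source)
    if n != len(target):
        raise ValueError("this sketch assumes equal-size point clouds")
    best_total = min(
        sum(weighted_distance(source[i], target[perm[i]], weights) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n


if __name__ == "__main__":
    source = [(0.0, 0.0), (1.0, 0.0)]
    target = [(0.0, 5.0), (1.0, 5.0)]
    print(ot_distance(source, target, weights=(1.0, 1.0)))  # plain Euclidean
    print(ot_distance(source, target, weights=(1.0, 0.0)))  # second feature ignored
```

Down-weighting the second feature collapses the distance between the two clouds, which is the sense in which the choice of ground metric determines what the transport distance considers similar.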

2025-06-18T11:53:13Z This method is provided as a Python package on PyPI, see https://github.com/DaminK/ggml-ot Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 2025, PMLR 258:3295-3303 Damin Kühn Michael T. Schaub http://arxiv.org/abs/2507.02877v1 AuraGenome: An LLM-Powered Framework for On-the-Fly Reusable and Scalable Circular Genome Visualizations 2025-06-18T03:29:30Z

Circular genome visualizations are essential for exploring structural variants and gene regulation. However, existing tools often require complex scripting and manual configuration, making the process time-consuming, error-prone, and difficult to learn. To address these challenges, we introduce AuraGenome, an LLM-powered framework for rapid, reusable, and scalable generation of multi-layered circular genome visualizations. AuraGenome combines a semantic-driven multi-agent workflow with an interactive visual analytics system. The workflow employs seven specialized LLM-driven agents, each assigned distinct roles such as intent recognition, layout planning, and code generation, to transform raw genomic data into tailored visualizations. The system supports multiple coordinated views tailored for genomic data, offering ring, radial, and chord-based layouts to represent multi-layered circular genome visualizations. In addition to enabling interactions and configuration reuse, the system supports real-time refinement and high-quality report export. We validate its effectiveness through two case studies and a comprehensive user study. AuraGenome is available at: https://github.com/Darius18/AuraGenome.

2025-06-18T03:29:30Z Chi Zhang Yu Dong Yang Wang Yuetong Han Guihua Shan Bixia Tang http://arxiv.org/abs/2412.05096v2 Approaches to studying virus pangenome variation graphs 2025-06-17T11:47:14Z

Pangenome variation graphs (PVGs) allow for the representation of genetic diversity in a more nuanced way than traditional reference-based approaches. Here we focus on how PVGs are a powerful tool for studying genetic variation in viruses, offering insights into the complexities of viral quasispecies, mutation rates, and population dynamics. PVGs originated in human genomics and hold great promise for viral genomics. Previous work has been constrained by small sample sizes and gene-centric methods, whereas PVGs enable a more comprehensive approach to studying viral diversity. Large viral genome collections should be used to make PVGs, which offer significant advantages: we outline accessible tools to achieve this. This spans PVG construction, PVG file formats, PVG manipulation and analysis, PVG visualisation, measuring PVG openness, and mapping reads to PVGs. Additionally, the development of PVG-specific formats for mutation representation and personalised PVGs that reflect specific research questions will further enhance PVG applications. Challenges remain, particularly in managing nested variants, optimising error detection, optimising k-mer/minimizer-based approaches for AT-rich genomes, incorporating long read sequencing data, and scalable visualisation approaches. Nevertheless, PVGs offer new opportunities for viral population genomics, and a testing ground for tool development prior to application to larger eukaryotic genomes. These advances will enable more accurate and comprehensive detection of viral mutations, contributing to a deeper understanding of viral evolution and genotype-phenotype associations.

2024-12-06T14:50:41Z 3 figures Tim Downing http://arxiv.org/abs/2506.13344v1 LapDDPM: A Conditional Graph Diffusion Model for scRNA-seq Generation with Spectral Adversarial Perturbations 2025-06-16T10:35:32Z

Generating high-fidelity and biologically plausible synthetic single-cell RNA sequencing (scRNA-seq) data, especially with conditional control, is challenging due to its high dimensionality, sparsity, and complex biological variations. Existing generative models often struggle to capture these unique characteristics and ensure robustness to structural noise in cellular networks. We introduce LapDDPM, a novel conditional Graph Diffusion Probabilistic Model for robust and high-fidelity scRNA-seq generation. LapDDPM uniquely integrates graph-based representations with a score-based diffusion model, enhanced by a novel spectral adversarial perturbation mechanism on graph edge weights. Our contributions are threefold: we leverage Laplacian Positional Encodings (LPEs) to enrich the latent space with crucial cellular relationship information; we develop a conditional score-based diffusion model for effective learning and generation from complex scRNA-seq distributions; and we employ a unique spectral adversarial training scheme on graph edge weights, boosting robustness against structural variations. Extensive experiments on diverse scRNA-seq datasets demonstrate LapDDPM's superior performance, achieving high fidelity and generating biologically-plausible, cell-type-specific samples. LapDDPM sets a new benchmark for conditional scRNA-seq data generation, offering a robust tool for various downstream biological applications.

2025-06-16T10:35:32Z LapDDPM is a novel conditional graph diffusion model for scRNA-seq generation. Leveraging spectral adversarial perturbations, it ensures robustness and yields high-fidelity, biologically plausible, and cell-type-specific samples for complex data. Proceedings of the ICML 2025 GenBio Workshop: The 2nd Workshop on Generative AI and Biology, Vancouver, Canada, 2025 Lorenzo Bini Stephane Marchand-Maillet http://arxiv.org/abs/2409.02143v3 MLOmics: Cancer Multi-Omics Database for Machine Learning 2025-06-16T10:34:25Z

Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. These successful machine learning models are powered by high-quality training datasets with sufficient data volume and adequate preprocessing. However, while there exist several public data portals, including The Cancer Genome Atlas (TCGA) multi-omics initiative or open-bases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. In this paper, we introduce MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.

2024-09-02T22:04:08Z This work has been published in Scientific Data Ziwei Yang Rikuto Kotoge Xihao Piao Zheng Chen Lingwei Zhu Peng Gao Yasuko Matsubara Yasushi Sakurai Jimeng Sun http://arxiv.org/abs/2506.13119v1 PhenoKG: Knowledge Graph-Driven Gene Discovery and Patient Insights from Phenotypes Alone 2025-06-16T05:54:12Z

Identifying causative genes from patient phenotypes remains a significant challenge in precision medicine, with important implications for the diagnosis and treatment of genetic disorders. We propose a novel graph-based approach for predicting causative genes from patient phenotypes, with or without an available list of candidate genes, by integrating a rare disease knowledge graph (KG). Our model, combining graph neural networks and transformers, achieves substantial improvements over the current state-of-the-art. On the real-world MyGene2 dataset, it attains a mean reciprocal rank (MRR) of 24.64% and nDCG@100 of 33.64%, surpassing the best baseline (SHEPHERD) at 19.02% MRR and 30.54% nDCG@100. We perform extensive ablation studies to validate the contribution of each model component. Notably, the approach generalizes to cases where only phenotypic data are available, addressing key challenges in clinical decision support when genomic information is incomplete.
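For readers unfamiliar with the two ranking metrics quoted above, the sketch below shows how MRR and nDCG@k are commonly computed when each patient has exactly one causative gene. That single-gene setting, the function names, and the gene labels are assumptions made for illustration; the paper's evaluation protocol may differ.

```python
from math import log2


def rank_of(ranking, truth):
    """1-based position of the true gene in a ranked list, or None if absent."""
    for position, gene in enumerate(ranking, start=1):
        if gene == truth:
            return position
    return None


def mean_reciprocal_rank(rankings, truths):
    """Average of 1/rank over patients; a missing gene contributes 0."""
    total = 0.0
    for ranking, truth in zip(rankings, truths):
        rank = rank_of(ranking, truth)
        if rank is not None:
            total += 1.0 / rank
    return total / len(rankings)


def mean_ndcg_at_k(rankings, truths, k):
    """nDCG@k with one relevant item per list, so the ideal DCG equals 1."""
    total = 0.0
    for ranking, truth in zip(rankings, truths):
        rank = rank_of(ranking[:k], truth)
        if rank is not None:
            total += 1.0 / log2(rank + 1)
    return total / len(rankings)


if __name__ == "__main__":
    # Hypothetical gene labels, one ranked list per patient.
    rankings = [["G1", "G2", "G3"], ["G2", "G1", "G3"], ["G3", "G2", "G1"]]
    truths = ["G1", "G1", "G1"]
    print(mean_reciprocal_rank(rankings, truths))
    print(mean_ndcg_at_k(rankings, truths, k=100))
```

MRR rewards only the position of the first correct gene, while nDCG@k applies a logarithmic discount, so a correct gene at rank 3 contributes 1/3 to MRR but 1/2 to nDCG.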

2025-06-16T05:54:12Z Kamilia Zaripova Ege Özsoy Nassir Navab Azade Farshad http://arxiv.org/abs/2506.11491v2 SemanticST: Spatially Informed Semantic Graph Learning for Clustering, Integration, and Scalable Analysis of Spatial Transcriptomics 2025-06-16T01:40:40Z

Spatial transcriptomics (ST) technologies enable gene expression profiling with spatial resolution, offering unprecedented insights into tissue organization and disease heterogeneity. However, current analysis methods often struggle with noisy data, limited scalability, and inadequate modelling of complex cellular relationships. We present SemanticST, a biologically informed, graph-based deep learning framework that models diverse cellular contexts through multi-semantic graph construction. SemanticST builds multiple context-specific graphs capturing spatial proximity, gene expression similarity, and tissue domain structure, and learns disentangled embeddings for each. These are fused using an attention-inspired strategy to yield a unified, biologically meaningful representation. A community-aware min-cut loss improves robustness over contrastive learning, particularly in sparse ST data. SemanticST supports mini-batch training, making it the first graph neural network scalable to large-scale datasets such as Xenium (500,000 cells). Benchmarking across four platforms (Visium, Slide-seq, Stereo-seq, Xenium) and multiple human and mouse tissues shows consistent 20% gains in ARI, NMI, and trajectory fidelity over DeepST, GraphST, and IRIS. In re-analysis of breast cancer Xenium data, SemanticST revealed rare and clinically significant niches, including triple receptor-positive clusters, spatially distinct DCIS-to-IDC transition zones, and FOXC2 tumour-associated myoepithelial cells, suggesting non-canonical EMT programs with stem-like features. SemanticST thus provides a scalable, interpretable, and biologically grounded framework for spatial transcriptomics analysis, enabling robust discovery across tissue types and diseases, and paving the way for spatially resolved tissue atlases and next-generation precision medicine.
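ARI, one of the clustering scores used in the benchmark above, compares predicted spatial domains with reference annotations while correcting for chance agreement. The sketch below is a plain implementation of the standard adjusted Rand index formula; the function name and the toy labels are illustrative and are not taken from the SemanticST code base.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # with fewer than two items there are no pairs to disagree on

    # Pairs of items grouped together in both, in the reference, and in the prediction.
    joint = Counter(zip(labels_true, labels_pred))
    index = sum(comb(count, 2) for count in joint.values())
    true_pairs = sum(comb(count, 2) for count in Counter(labels_true).values())
    pred_pairs = sum(comb(count, 2) for count in Counter(labels_pred).values())

    expected = true_pairs * pred_pairs / comb(n, 2)
    max_index = 0.5 * (true_pairs + pred_pairs)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (index - expected) / (max_index - expected)


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    print(adjusted_rand_index(reference, [1, 1, 0, 0]))  # same partition, renamed labels
    print(adjusted_rand_index(reference, [0, 1, 0, 1]))  # partition that splits every pair
```

The score depends only on which items are grouped together, so renaming cluster labels leaves it unchanged.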

2025-06-13T06:30:48Z 6 Figures Roxana Zahedi Ahmadreza Argha Nona Farbehi Ivan Bakhshayeshi Youqiong Ye Nigel H. Lovell Hamid Alinejad-Rokny http://arxiv.org/abs/2506.13817v1 DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models 2025-06-14T23:30:22Z

Generative AI foundation models offer transformative potential for processing structured biological data, particularly in single-cell RNA sequencing, where datasets are rapidly scaling toward billions of cells. We propose the use of agentic foundation models with real-time web search to automate the labeling of experimental data, achieving up to 82.5% accuracy. This addresses a key bottleneck in supervised learning for structured omics data by increasing annotation throughput without manual curation and human error. Our approach enables the development of virtual cell foundation models capable of downstream tasks such as cell-typing and perturbation prediction. As data volume grows, these models may surpass human performance in labeling, paving the way for reliable inference in large-scale perturbation screens. This application demonstrates domain-specific innovation in health monitoring and diagnostics, aligned with efforts like the Human Cell Atlas and Human Tumor Atlas Network.

2025-06-14T23:30:22Z 4 pages, 5 figures, Accepted by ICML 2025 FM4LS https://openreview.net/forum?id=zNjXOZxEYB . Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences (FM4LS), July 2025 International Conference on Machine Learning (ICML). Saleem A. Al Dajani Abel Sanchez John R. Williams http://arxiv.org/abs/2506.11942v1 Viral Dark Matter: Illuminating Protein Function, Ecology, and Biotechnological Promises 2025-06-13T16:50:20Z

Viruses are the most abundant biological entities on Earth and play central roles in shaping microbiomes and influencing ecosystem functions. Yet, most viral genes remain uncharacterized, comprising what is commonly referred to as "viral dark matter." Metagenomic studies across diverse environments consistently show that 40-90% of viral genes lack known homologs or annotated functions. This persistent knowledge gap limits our ability to interpret viral sequence data, understand virus-host interactions, and assess the ecological or applied significance of viral genes. Among the most intriguing components of viral dark matter are auxiliary viral genes (AVGs), including auxiliary metabolic genes (AMGs), regulatory genes (AReGs), and host physiology-modifying genes (APGs), which may alter host function during infection and contribute to microbial metabolism, stress tolerance, or resistance. In this review, we explore recent advances in the discovery and functional characterization of viral dark matter. We highlight representative examples of novel viral proteins across diverse ecosystems including human microbiomes, soil, oceans, and extreme environments, and discuss what is known, and still unknown, about their roles. We then examine the bioinformatic and experimental challenges that hinder functional characterization, and present emerging strategies to overcome these barriers. Finally, we highlight both the fundamental and applied benefits that multidisciplinary efforts to characterize viral proteins can bring. By integrating computational predictions with experimental validation, and fostering collaboration across disciplines, we emphasize that illuminating viral dark matter is both feasible and essential for advancing microbial ecology and unlocking new tools for biotechnology.

2025-06-13T16:50:20Z James C. Kosmopoulos Karthik Anantharaman http://arxiv.org/abs/2506.11896v1 GlobDB: A comprehensive species-dereplicated microbial genome resource 2025-06-13T15:43:15Z

Over the past years, substantial numbers of microbial species' genomes have been deposited outside of conventional INSDC databases. The GlobDB aggregates 14 independent genomic catalogues to provide a comprehensive database of species-dereplicated microbial genomes, with consistent taxonomy, annotations, and additional analysis resources. The GlobDB is available at https://globdb.org/.

2025-06-13T15:43:15Z 11 pages (including references), 1 table, 0 figures Daan R. Speth Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria Nick Pullen Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria Samuel T. N. Aroney Centre for Microbiome Research School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Australia Benjamin L. Coltman Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria Jay T. Osvatic Joint Microbiome Facility of the Medical University of Vienna and the University of Vienna, Vienna, Austria Department of Laboratory Medicine, Medical University of Vienna, Vienna, Austria Ben J. Woodcroft Centre for Microbiome Research School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Australia Thomas Rattei Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria Michael Wagner Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark