http://arxiv.org/abs/2506.19097v1
Quantum Gradient Optimized Drug Repurposing Prototype for Omics Data
2025-06-23T20:17:55Z
This paper presents a novel quantum-enhanced prototype for drug repurposing and addresses the challenge of managing massive genomics data in precision medicine.
2025-06-23T20:17:55Z
Conference details can be found here: https://www.insticc.org/node/technicalprogram/DATA/2025
Don Roosan, Saif Nirzhor, Rubayat Khan, Fahmida Hai

http://arxiv.org/abs/2506.18940v1
eccDNAMamba: A Pre-Trained Model for Ultra-Long eccDNA Sequence Analysis
2025-06-22T17:50:57Z
Extrachromosomal circular DNA (eccDNA) plays key regulatory roles and contributes to oncogene overexpression in cancer through high-copy amplification and long-range interactions. Despite advances in modeling, no pre-trained models currently support full-length circular eccDNA for downstream analysis. Existing genomic models are either limited to single-nucleotide resolution or hindered by the inefficiency of the quadratic attention mechanism. Here, we introduce eccDNAMamba, the first bidirectional state-space encoder tailored for circular DNA sequences. It combines forward and reverse passes for full-context representation learning with linear-time complexity, and preserves circular structure through a novel augmentation strategy. Tested on two real-world datasets, eccDNAMamba achieves strong classification performance and scales to sequences up to 200 Kbp, offering a robust and efficient framework for modeling circular genomes.
Our code is available at https://github.com/zzq1zh/GenAI-Lab.
2025-06-22T17:50:57Z
Accepted by ICML 2025 Generative AI and Biology (GenBio) Workshop
Zhenke Liu, Jien Li, Ziqi Zhang

http://arxiv.org/abs/2501.10004v3
Static Three-Dimensional Structures Determine Fast Dynamics Between Distal Loci Pairs in Interphase Chromosomes
2025-06-22T03:49:28Z
Live-cell imaging experiments have shown that the distal dynamics between enhancers and promoters are unexpectedly rapid and incompatible with standard polymer models. The discordance between the compact static chromatin organization and dynamics is a conundrum that violates the expected structure-function relationship. We developed a theory to predict chromatin dynamics by accurately determining three-dimensional (3D) structures from static Hi-C contact maps or fixed-cell imaging data. Using the calculated 3D coordinates, the theory accurately forecasts experimentally observed two-point chromatin dynamics. It predicts rapid enhancer-promoter interactions and uncovers a scaling relationship between two-point relaxation time and genomic separation, closely matching recent measurements. The theory predicts that cohesin depletion accelerates single-locus diffusion while significantly slowing relaxation dynamics within topologically associating domains (TADs). Our results demonstrate that chromatin dynamics can be reliably inferred from static structural data, reinforcing the notion that 3D chromatin structure governs dynamic behavior. This general framework offers powerful tools for exploring chromatin dynamics across diverse biological contexts.
2025-01-17T07:39:26Z
Guang Shi, Sucheol Shin, D. Thirumalai

http://arxiv.org/abs/2506.17766v1
Improving Genomic Models via Task-Specific Self-Pretraining
2025-06-21T17:19:21Z
Pretraining DNA language models (DNALMs) on the full human genome is resource-intensive, yet often considered necessary for strong downstream performance.
Inspired by recent findings in NLP and long-context modeling, we explore an alternative: self-pretraining on task-specific, unlabeled data. Using the BEND benchmark, we show that DNALMs trained with self-pretraining match or exceed the performance of models trained from scratch under identical compute. While genome-scale pretraining may still offer higher absolute performance, task-specific self-pretraining provides a practical and compute-efficient strategy for building stronger supervised baselines.
2025-06-21T17:19:21Z
4 pages
Sohan Mupparapu, Parameswari Krishnamurthy, Ratish Puduppully

http://arxiv.org/abs/2506.15761v1
Advancing Digital Precision Medicine for Chronic Fatigue Syndrome through Longitudinal Large-Scale Multi-Modal Biological Omics Modeling with Machine Learning and Artificial Intelligence
2025-06-18T15:31:26Z
We studied a generalized question: chronic diseases like ME/CFS and long COVID exhibit high heterogeneity with multifactorial etiology and progression, complicating diagnosis and treatment. To address this, we developed BioMapAI, an explainable Deep Learning framework using the richest longitudinal multi-omics dataset for ME/CFS to date. This dataset includes gut metagenomics, plasma metabolome, immune profiling, blood labs, and clinical symptoms. By connecting multi-omics to a symptom matrix, BioMapAI identified both disease- and symptom-specific biomarkers, reconstructed symptoms, and achieved state-of-the-art precision in disease classification. We also created the first connectivity map of these omics in both healthy and disease states and revealed how microbiome-immune-metabolome crosstalk shifted from healthy to ME/CFS.
2025-06-18T15:31:26Z
Ruoyun Xiong

http://arxiv.org/abs/2506.15383v1
Global Ground Metric Learning with Applications to scRNA data
2025-06-18T11:53:13Z
Optimal transport provides a robust framework for comparing probability distributions. Its effectiveness is significantly influenced by the choice of the underlying ground metric.
Traditionally, the ground metric has either been (i) predefined, e.g., as the Euclidean distance, or (ii) learned in a supervised way, by utilizing labeled data to learn a suitable ground metric for enhanced task-specific performance. Yet, predefined metrics typically cannot account for the inherent structure and varying importance of different features in the data, and existing supervised approaches to ground metric learning often do not generalize across multiple classes or are restricted to distributions with shared supports. To address these limitations, we propose a novel approach for learning metrics for arbitrary distributions over a shared metric space. Our method provides a distance between individual points like a global metric, but requires only class labels on a distribution-level for training. The learned global ground metric enables more accurate optimal transport distances, leading to improved performance in embedding, clustering and classification tasks. We demonstrate the effectiveness and interpretability of our approach using patient-level scRNA-seq data spanning multiple diseases.
2025-06-18T11:53:13Z
This method is provided as a Python package on PyPI, see https://github.com/DaminK/ggml-ot
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 2025, PMLR 258:3295-3303
Damin Kühn, Michael T. Schaub

http://arxiv.org/abs/2507.02877v1
AuraGenome: An LLM-Powered Framework for On-the-Fly Reusable and Scalable Circular Genome Visualizations
2025-06-18T03:29:30Z
Circular genome visualizations are essential for exploring structural variants and gene regulation. However, existing tools often require complex scripting and manual configuration, making the process time-consuming, error-prone, and difficult to learn. To address these challenges, we introduce AuraGenome, an LLM-powered framework for rapid, reusable, and scalable generation of multi-layered circular genome visualizations.
AuraGenome combines a semantic-driven multi-agent workflow with an interactive visual analytics system. The workflow employs seven specialized LLM-driven agents, each assigned distinct roles such as intent recognition, layout planning, and code generation, to transform raw genomic data into tailored visualizations. The system supports multiple coordinated views tailored for genomic data, offering ring, radial, and chord-based layouts to represent multi-layered circular genome visualizations. In addition to enabling interactions and configuration reuse, the system supports real-time refinement and high-quality report export. We validate its effectiveness through two case studies and a comprehensive user study. AuraGenome is available at: https://github.com/Darius18/AuraGenome.
2025-06-18T03:29:30Z
Chi Zhang, Yu Dong, Yang Wang, Yuetong Han, Guihua Shan, Bixia Tang

http://arxiv.org/abs/2412.05096v2
Approaches to studying virus pangenome variation graphs
2025-06-17T11:47:14Z
Pangenome variation graphs (PVGs) allow for the representation of genetic diversity in a more nuanced way than traditional reference-based approaches. Here we focus on how PVGs are a powerful tool for studying genetic variation in viruses, offering insights into the complexities of viral quasispecies, mutation rates, and population dynamics. PVGs originated in human genomics and hold great promise for viral genomics. Previous work has been constrained by small sample sizes and gene-centric methods; PVGs enable a more comprehensive approach to studying viral diversity. Large viral genome collections should be used to make PVGs, which offer significant advantages: we outline accessible tools to achieve this. This spans PVG construction, PVG file formats, PVG manipulation and analysis, PVG visualisation, measuring PVG openness, and mapping reads to PVGs.
Additionally, the development of PVG-specific formats for mutation representation and personalised PVGs that reflect specific research questions will further enhance PVG applications. Challenges remain, particularly in managing nested variants, optimising error detection, optimising k-mer/minimizer-based approaches for AT-rich genomes, incorporating long read sequencing data, and scalable visualisation approaches. Nevertheless, PVGs offer new opportunities for viral population genomics, and a testing ground for tool development prior to application to larger eukaryotic genomes. These advances will enable more accurate and comprehensive detection of viral mutations, contributing to a deeper understanding of viral evolution and genotype-phenotype associations.
2024-12-06T14:50:41Z
3 figures
Tim Downing

http://arxiv.org/abs/2506.13344v1
LapDDPM: A Conditional Graph Diffusion Model for scRNA-seq Generation with Spectral Adversarial Perturbations
2025-06-16T10:35:32Z
Generating high-fidelity and biologically plausible synthetic single-cell RNA sequencing (scRNA-seq) data, especially with conditional control, is challenging due to its high dimensionality, sparsity, and complex biological variations. Existing generative models often struggle to capture these unique characteristics and ensure robustness to structural noise in cellular networks. We introduce LapDDPM, a novel conditional Graph Diffusion Probabilistic Model for robust and high-fidelity scRNA-seq generation. LapDDPM uniquely integrates graph-based representations with a score-based diffusion model, enhanced by a novel spectral adversarial perturbation mechanism on graph edge weights.
Our contributions are threefold: we leverage Laplacian Positional Encodings (LPEs) to enrich the latent space with crucial cellular relationship information; we develop a conditional score-based diffusion model for effective learning and generation from complex scRNA-seq distributions; and we employ a unique spectral adversarial training scheme on graph edge weights, boosting robustness against structural variations. Extensive experiments on diverse scRNA-seq datasets demonstrate LapDDPM's superior performance, achieving high fidelity and generating biologically-plausible, cell-type-specific samples. LapDDPM sets a new benchmark for conditional scRNA-seq data generation, offering a robust tool for various downstream biological applications.
2025-06-16T10:35:32Z
LapDDPM is a novel conditional graph diffusion model for scRNA-seq generation. Leveraging spectral adversarial perturbations, it ensures robustness and yields high-fidelity, biologically plausible, and cell-type-specific samples for complex data.
Proceedings of the ICML 2025 GenBio Workshop: The 2nd Workshop on Generative AI and Biology, Vancouver, Canada, 2025
Lorenzo Bini, Stephane Marchand-Maillet

http://arxiv.org/abs/2409.02143v3
MLOmics: Cancer Multi-Omics Database for Machine Learning
2025-06-16T10:34:25Z
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Empowering these successful machine learning models are the high-quality training datasets with sufficient data volume and adequate preprocessing. However, while there exist several public data portals, including The Cancer Genome Atlas (TCGA) multi-omics initiative or open-bases such as the LinkedOmics, these databases are not off-the-shelf for existing machine learning models. In this paper, we introduce MLOmics, an open cancer multi-omics database aimed at better serving the development and evaluation of bioinformatics and machine learning models.
MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.
2024-09-02T22:04:08Z
This work has been published in Scientific Data
Ziwei Yang, Rikuto Kotoge, Xihao Piao, Zheng Chen, Lingwei Zhu, Peng Gao, Yasuko Matsubara, Yasushi Sakurai, Jimeng Sun

http://arxiv.org/abs/2506.13119v1
PhenoKG: Knowledge Graph-Driven Gene Discovery and Patient Insights from Phenotypes Alone
2025-06-16T05:54:12Z
Identifying causative genes from patient phenotypes remains a significant challenge in precision medicine, with important implications for the diagnosis and treatment of genetic disorders. We propose a novel graph-based approach for predicting causative genes from patient phenotypes, with or without an available list of candidate genes, by integrating a rare disease knowledge graph (KG). Our model, combining graph neural networks and transformers, achieves substantial improvements over the current state-of-the-art. On the real-world MyGene2 dataset, it attains a mean reciprocal rank (MRR) of 24.64% and nDCG@100 of 33.64%, surpassing the best baseline (SHEPHERD) at 19.02% MRR and 30.54% nDCG@100. We perform extensive ablation studies to validate the contribution of each model component. Notably, the approach generalizes to cases where only phenotypic data are available, addressing key challenges in clinical decision support when genomic information is incomplete.
2025-06-16T05:54:12Z
Kamilia Zaripova, Ege Özsoy, Nassir Navab, Azade Farshad

http://arxiv.org/abs/2506.11491v2
SemanticST: Spatially Informed Semantic Graph Learning for Clustering, Integration, and Scalable Analysis of Spatial Transcriptomics
2025-06-16T01:40:40Z
Spatial transcriptomics (ST) technologies enable gene expression profiling with spatial resolution, offering unprecedented insights into tissue organization and disease heterogeneity.
However, current analysis methods often struggle with noisy data, limited scalability, and inadequate modelling of complex cellular relationships. We present SemanticST, a biologically informed, graph-based deep learning framework that models diverse cellular contexts through multi-semantic graph construction. SemanticST builds multiple context-specific graphs capturing spatial proximity, gene expression similarity, and tissue domain structure, and learns disentangled embeddings for each. These are fused using an attention-inspired strategy to yield a unified, biologically meaningful representation. A community-aware min-cut loss improves robustness over contrastive learning, particularly in sparse ST data. SemanticST supports mini-batch training, making it the first graph neural network scalable to large-scale datasets such as Xenium (500,000 cells). Benchmarking across four platforms (Visium, Slide-seq, Stereo-seq, Xenium) and multiple human and mouse tissues shows consistent 20 percent gains in ARI, NMI, and trajectory fidelity over DeepST, GraphST, and IRIS. In re-analysis of breast cancer Xenium data, SemanticST revealed rare and clinically significant niches, including triple receptor-positive clusters, spatially distinct DCIS-to-IDC transition zones, and FOXC2 tumour-associated myoepithelial cells, suggesting non-canonical EMT programs with stem-like features. SemanticST thus provides a scalable, interpretable, and biologically grounded framework for spatial transcriptomics analysis, enabling robust discovery across tissue types and diseases, and paving the way for spatially resolved tissue atlases and next-generation precision medicine.
2025-06-13T06:30:48Z
6 Figures
Roxana Zahedi, Ahmadreza Argha, Nona Farbehi, Ivan Bakhshayeshi, Youqiong Ye, Nigel H.
Lovell, Hamid Alinejad-Rokny

http://arxiv.org/abs/2506.13817v1
DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models
2025-06-14T23:30:22Z
Generative AI foundation models offer transformative potential for processing structured biological data, particularly in single-cell RNA sequencing, where datasets are rapidly scaling toward billions of cells. We propose the use of agentic foundation models with real-time web search to automate the labeling of experimental data, achieving up to 82.5% accuracy. This addresses a key bottleneck in supervised learning for structured omics data by increasing annotation throughput without manual curation and human error. Our approach enables the development of virtual cell foundation models capable of downstream tasks such as cell-typing and perturbation prediction. As data volume grows, these models may surpass human performance in labeling, paving the way for reliable inference in large-scale perturbation screens. This application demonstrates domain-specific innovation in health monitoring and diagnostics, aligned with efforts like the Human Cell Atlas and Human Tumor Atlas Network.
2025-06-14T23:30:22Z
4 pages, 5 figures, Accepted by ICML 2025 FM4LS https://openreview.net/forum?id=zNjXOZxEYB . Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences (FM4LS), July 2025
International Conference on Machine Learning (ICML). Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences (FM4LS), July 2025
Saleem A. Al Dajani, Abel Sanchez, John R. Williams

http://arxiv.org/abs/2506.11942v1
Viral Dark Matter: Illuminating Protein Function, Ecology, and Biotechnological Promises
2025-06-13T16:50:20Z
Viruses are the most abundant biological entities on Earth and play central roles in shaping microbiomes and influencing ecosystem functions.
Yet, most viral genes remain uncharacterized, comprising what is commonly referred to as "viral dark matter." Metagenomic studies across diverse environments consistently show that 40-90% of viral genes lack known homologs or annotated functions. This persistent knowledge gap limits our ability to interpret viral sequence data, understand virus-host interactions, and assess the ecological or applied significance of viral genes. Among the most intriguing components of viral dark matter are auxiliary viral genes (AVGs), including auxiliary metabolic genes (AMGs), regulatory genes (AReGs), and host physiology-modifying genes (APGs), which may alter host function during infection and contribute to microbial metabolism, stress tolerance, or resistance. In this review, we explore recent advances in the discovery and functional characterization of viral dark matter. We highlight representative examples of novel viral proteins across diverse ecosystems including human microbiomes, soil, oceans, and extreme environments, and discuss what is known, and still unknown, about their roles. We then examine the bioinformatic and experimental challenges that hinder functional characterization, and present emerging strategies to overcome these barriers. Finally, we highlight both the fundamental and applied benefits that multidisciplinary efforts to characterize viral proteins can bring. By integrating computational predictions with experimental validation, and fostering collaboration across disciplines, we emphasize that illuminating viral dark matter is both feasible and essential for advancing microbial ecology and unlocking new tools for biotechnology.
2025-06-13T16:50:20Z
James C. Kosmopoulos, Karthik Anantharaman

http://arxiv.org/abs/2506.11896v1
GlobDB: A comprehensive species-dereplicated microbial genome resource
2025-06-13T15:43:15Z
Over the past years, substantial numbers of microbial species' genomes have been deposited outside of conventional INSDC databases.
The GlobDB aggregates 14 independent genomic catalogues to provide a comprehensive database of species-dereplicated microbial genomes, with consistent taxonomy, annotations, and additional analysis resources. The GlobDB is available at https://globdb.org/.
2025-06-13T15:43:15Z
11 pages (including references), 1 table, 0 figures
Daan R. Speth (Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria)
Nick Pullen (Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria)
Samuel T. N. Aroney (Centre for Microbiome Research School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Australia)
Benjamin L. Coltman (Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria)
Jay T. Osvatic (Joint Microbiome Facility of the Medical University of Vienna and the University of Vienna, Vienna, Austria; Department of Laboratory Medicine, Medical University of Vienna, Vienna, Austria)
Ben J. Woodcroft (Centre for Microbiome Research School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Australia)
Thomas Rattei (Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria)
Michael Wagner (Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria; Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark)