http://arxiv.org/abs/2509.00897v1
Network Community Detection and Novelty Scoring Reveal Underexplored Hub Genes in Rheumatoid Arthritis
2025-08-31T15:22:19Z
Understanding the modular structure and central elements of complex biological networks is critical for uncovering system-level mechanisms in disease. Here, we constructed weighted gene co-expression networks from bulk RNA-seq data of rheumatoid arthritis (RA) synovial tissue, using pairwise correlation and a percolation-guided thresholding strategy. Community detection with Louvain and Leiden algorithms revealed robust modules, and node-strength ranking identified the top 50 hub genes globally and within communities. To assess novelty, we integrated genome-wide association studies (GWAS) with literature-based evidence from PubMed, highlighting five high-centrality genes with little to no prior RA-specific association. Functional enrichment confirmed their roles in immune-related processes, including adaptive immune response and lymphocyte regulation. Notably, these hubs showed strong positive correlations with T- and B-cell markers and negative correlations with NK-cell markers, consistent with RA immunopathology. Overall, our framework demonstrates how correlation-based network construction, modularity-driven clustering, and centrality-guided novelty scoring can jointly reveal informative structure in omics-scale data. This generalizable approach offers a scalable path to gene prioritization in RA and other autoimmune conditions.
2025-08-31T15:22:19Z
12 pages, 5 figures, 2 tables. Submitted to COMPLEX NETWORKS 2025
Neda Amirirad, Hiroki Sayama

http://arxiv.org/abs/2509.00349v1
Computational approaches for virus host prediction: A review of methods and applications
2025-08-30T04:19:23Z
Accurate prediction of virus-host interactions is critical for understanding viral ecology and developing applications like phage therapy.
However, the growing number of computational tools has created a complex landscape, making direct performance comparison challenging due to inconsistent benchmarks and varying usability. Here, we provide a systematic review and a rigorous benchmark of 27 virus-host prediction tools. We formulate the host prediction task into two primary frameworks, link prediction and multi-class classification, and construct two benchmark datasets to evaluate tool performance in distinct scenarios: a database-centric dataset (RefSeq-VHDB) and a metagenomic discovery dataset (MetaHiC-VHDB). Our results reveal that no single tool is universally optimal. Performance is highly context-dependent, with tools like CHERRY and iPHoP demonstrating robust, broad applicability, while others, such as RaFAH and PHIST, excel in specific contexts. We further identify a critical trade-off between predictive accuracy, prediction rate, and computational cost. This work serves as a practical guide for researchers and establishes a standardized benchmark to drive future innovation in deciphering complex virus-host interactions.
2025-08-30T04:19:23Z
Jiayu Shang, Cheng Peng, Jiaojiao Guan, Dehan Cai, Donglin Wang, Yanni Sun

http://arxiv.org/abs/2508.01490v2
A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics
2025-08-27T08:13:02Z
Spatial transcriptomics enables simultaneous measurement of gene expression and tissue morphology, offering unprecedented insights into cellular organization and disease mechanisms. However, the field lacks comprehensive benchmarks for evaluating multimodal learning methods that leverage both histology images and gene expression data. Here, we present HESCAPE, a large-scale benchmark for cross-modal contrastive pretraining in spatial transcriptomics, built on a curated pan-organ dataset spanning 6 different gene panels and 54 donors.
We systematically evaluated state-of-the-art image and gene expression encoders across multiple pretraining strategies and assessed their effectiveness on two downstream tasks: gene mutation classification and gene expression prediction. Our benchmark demonstrates that gene expression encoders are the primary determinant of strong representational alignment, and that gene models pretrained on spatial transcriptomics data outperform both those trained without spatial data and simple baseline approaches. However, downstream task evaluation reveals a striking contradiction: while contrastive pretraining consistently improves gene mutation classification performance, it degrades direct gene expression prediction compared to baseline encoders trained without cross-modal objectives. We identify batch effects as a key factor that interferes with effective cross-modal alignment. Our findings highlight the critical need for batch-robust multimodal learning approaches in spatial transcriptomics. To accelerate progress in this direction, we release HESCAPE, providing standardized datasets, evaluation protocols, and benchmarking tools for the community.
2025-08-02T21:11:36Z
The code is accessible at: https://github.com/peng-lab/hescape
Rushin H. Gindra, Giovanni Palla, Mathias Nguyen, Sophia J. Wagner, Manuel Tran, Fabian J Theis, Dieter Saur, Lorin Crawford, Tingying Peng

http://arxiv.org/abs/2506.10271v3
Evaluating DNA function understanding in genomic language models using evolutionarily implausible sequences
2025-08-26T04:00:29Z
Genomic language models (gLMs) hold promise for generating novel, functional DNA sequences for synthetic biology. However, realizing this potential requires models to go beyond evolutionary plausibility and understand how DNA sequence encodes gene expression and regulation. We introduce a benchmark called Nullsettes, which assesses how well models can predict in silico loss-of-function (LOF) mutations, in synthetic expression cassettes with little evolutionary precedent.
Testing 12 state-of-the-art gLMs, we find that most fail to consistently detect these strong LOF mutations. All models show a sharp drop in predictive accuracy as the likelihood assigned to the original (nonmutant) sequence decreases, suggesting that gLMs rely heavily on pattern-matching to their evolutionary prior rather than on any mechanistic understanding of gene expression. Our findings highlight fundamental limitations in how gLMs generalize to engineered, non-natural sequences, and underscore the need for benchmarks and modeling strategies that prioritize functional understanding.
2025-06-12T01:28:04Z
19 pages, 5 figures
Shiyu Jiang, Xuyin Liu, Zitong Jerry Wang

http://arxiv.org/abs/2508.17345v1
ShortListing Model: A Streamlined SimplexDiffusion for Discrete Variable Generation
2025-08-24T13:03:02Z
Generative modeling of discrete variables is challenging yet crucial for applications in natural language processing and biological sequence design. We introduce the Shortlisting Model (SLM), a novel simplex-based diffusion model inspired by progressive candidate pruning. SLM operates on simplex centroids, reducing generation complexity and enhancing scalability. Additionally, SLM incorporates a flexible implementation of classifier-free guidance, enhancing unconditional generation performance. Extensive experiments on DNA promoter and enhancer design, protein design, character-level and large-vocabulary language modeling demonstrate the competitive performance and strong potential of SLM.
Our code can be found at https://github.com/GenSI-THUAIR/SLM
2025-08-24T13:03:02Z
Yuxuan Song, Zhe Zhang, Yu Pei, Jingjing Gong, Qiying Yu, Zheng Zhang, Mingxuan Wang, Hao Zhou, Jingjing Liu, Wei-Ying Ma

http://arxiv.org/abs/2504.00020v2
Celler: A Genomic Language Model for Long-Tailed Single-Cell Annotation
2025-08-24T10:48:05Z
Recent breakthroughs in single-cell technology have ushered in unparalleled opportunities to decode the molecular intricacy of complex biological systems, especially those linked to diseases unique to humans. However, these advances have also introduced novel obstacles, specifically the efficient annotation of extensive, long-tailed single-cell data pertaining to disease conditions. To effectively surmount this challenge, we introduce Celler, a state-of-the-art generative pre-training model crafted specifically for the annotation of single-cell data. Celler incorporates two groundbreaking elements. First, we introduce the Gaussian Inflation (GInf) Loss function. By dynamically adjusting sample weights, GInf Loss significantly enhances the model's ability to learn from rare categories while reducing the risk of overfitting for common categories. Second, we introduce an innovative Hard Data Mining (HDM) strategy into the training process, specifically targeting the challenging-to-learn minority data samples, which significantly improves the model's predictive accuracy. Additionally, to further advance research in this field, we have constructed a large-scale single-cell dataset, Celler-75, which encompasses 40 million cells distributed across 80 human tissues and 75 specific diseases. This dataset provides critical support for comprehensively exploring the potential of single-cell technology in disease research.
Our code is available at https://github.com/AI4science-ym/HiCeller.
2025-03-28T02:04:26Z
Huan Zhao, Yiming Liu, Jina Yao, Ling Xiong, Zexin Zhou, Zixing Zhang

http://arxiv.org/abs/2508.18304v1
scI2CL: Effectively Integrating Single-cell Multi-omics by Intra- and Inter-omics Contrastive Learning
2025-08-23T01:42:28Z
Single-cell multi-omics data contain a wealth of information about cellular states, and analyzing these data can reveal valuable insights into cellular heterogeneity, diseases, and biological processes. However, as cell differentiation and development is a continuous and dynamic process, it remains challenging to computationally model and infer cell interaction patterns based on single-cell multi-omics data. This paper presents scI2CL, a new single-cell multi-omics fusion framework based on intra- and inter-omics contrastive learning, to learn comprehensive and discriminative cellular representations from complementary multi-omics data for various downstream tasks. Extensive experiments on four downstream tasks validate the effectiveness of scI2CL and its superiority over existing peers. Concretely, in cell clustering, scI2CL surpasses eight state-of-the-art methods on four widely-used real-world datasets. In cell subtyping, scI2CL effectively distinguishes three latent monocyte cell subpopulations, which are not discovered by existing methods. Simultaneously, scI2CL is the only method that correctly constructs the cell developmental trajectory from hematopoietic stem and progenitor cells to Memory B cells. In addition, scI2CL resolves the misclassification of cell types between two subpopulations of CD4+ T cells, while existing methods fail to precisely distinguish the mixed cells.
In summary, scI2CL can accurately characterize cross-omics relationships among cells, and thus effectively fuses multi-omics data and learns discriminative cellular representations to support various downstream analysis tasks.
2025-08-23T01:42:28Z
22 pages, 6 figures
Wuchao Liu, Han Peng, Wengen Li, Yichao Zhang, Jihong Guan, Shuigeng Zhou

http://arxiv.org/abs/2508.14934v1
AGP: A Novel Arabidopsis thaliana Genomics-Phenomics Dataset and its HyperGraph Baseline Benchmarking
2025-08-19T21:21:23Z
Understanding which genes control which traits in an organism remains one of the central challenges in biology. Despite significant advances in data collection technology, our ability to map genes to traits is still limited. This genome-to-phenome (G2P) challenge spans several problem domains, including plant breeding, and requires models capable of reasoning over high-dimensional, heterogeneous, and biologically structured data. Currently, however, many datasets solely capture genetic information or solely capture phenotype information. Additionally, phenotype data is very heterogeneous, which many datasets do not fully capture. The critical drawback is that these datasets are not integrated, that is, they do not link with each other to describe the same biological specimens. This limits machine learning models' ability to be informed on the various aspects of these specimens, impacting the breadth of correlations learned, and therefore their ability to make more accurate predictions. To address this gap, we present the Arabidopsis Genomics-Phenomics (AGP) Dataset, a curated multi-modal dataset linking gene expression profiles with phenotypic trait measurements in Arabidopsis thaliana, a model organism in plant biology. AGP supports tasks such as phenotype prediction and interpretable graph learning. In addition, we benchmark conventional regression and explanatory baselines, including a biologically-informed hypergraph baseline, to validate gene-trait associations.
To the best of our knowledge, this is the first dataset that provides multi-modal gene information and heterogeneous trait or phenotype data for the same Arabidopsis thaliana specimens. With AGP, we aim to foster the research community towards accurately understanding the connection between genotypes and phenotypes using gene information, higher-order gene pairings, and trait data from several sources.
2025-08-19T21:21:23Z
Manuel Serna-Aguilera, Fiona L. Goggin, Aranyak Goswami, Alexander Bucksch, Suxing Liu, Khoa Luu

http://arxiv.org/abs/2508.14924v1
A U-Statistic-based random forest approach for genetic interaction study
2025-08-19T06:22:20Z
Variations in complex traits are influenced by multiple genetic variants, environmental risk factors, and their interactions. Though substantial progress has been made in identifying single genetic variants associated with complex traits, detecting the gene-gene and gene-environment interactions remains a great challenge. When a large number of genetic variants and environmental risk factors are involved, searching for interactions is limited to pair-wise interactions due to the exponentially increased feature space and computational intensity. Alternatively, recursive partitioning approaches, such as random forests, have gained popularity in high-dimensional genetic association studies. In this article, we propose a U-Statistic-based random forest approach, referred to as Forest U-Test, for genetic association studies with quantitative traits. Through simulation studies, we showed that the Forest U-Test outperformed existing methods. The proposed method was also applied to study Cannabis Dependence (CD), using three independent datasets from the Study of Addiction: Genetics and Environment. A significant joint association was detected with an empirical p-value less than 0.001.
The finding was also replicated in two independent datasets with p-values of 5.93e-19 and 4.70e-17, respectively.
2025-08-19T06:22:20Z
Ming Li, Ruo-Sin Peng, Changshuai Wei, Qing Lu
10.2741/e576

http://arxiv.org/abs/2508.13498v1
Improving the FAIRness and Sustainability of the NHGRI Resources Ecosystem
2025-08-19T04:17:24Z
In 2024, NHGRI-funded genomic resource projects completed a Self-Assessment Tool (SAT) and interviews to evaluate their application of FAIR (Findable, Accessible, Interoperable, Reusable) principles and sustainability. Key challenges were identified in metadata tools, data curation, variant identifiers, and data processing. Addressing these needs, we engaged the community through webinars and discussions, leading to a two-day workshop in March 2025. The workshop developed targeted recommendations, including improving transparency, standardizing identifiers, enhancing usability, implementing APIs, leveraging AI/ML for curation, and evaluating impact. These outcomes provide a framework for advancing FAIR practices, fostering collaboration, and strengthening the sustainability of NHGRI resources.
2025-08-19T04:17:24Z
35 pages, 3 appendices
Larry Babb, Carol Bult, Vincent J. Carey, Robert J. Carroll, Benjamin C. Hitz, Chris J. Mungall, Heidi L. Rehm, Michael C. Schatz, Alex Wagner, NHGRI Resource Workshop Community

http://arxiv.org/abs/2508.12731v1
Mechanism of Quercetin in Inhibiting Triple-Negative Breast Cancer by Regulating T Cell-Related Targets: An Analysis Based on Single-Cell Sequencing and Network Pharmacology
2025-08-18T08:54:05Z
Objective: To investigate the mechanism by which quercetin inhibits triple-negative breast cancer (TNBC) through regulating T-cell-related targets, providing a novel strategy for TNBC immunotherapy. Methods: Single-cell RNA sequencing (GSE161529 dataset) and network pharmacology were integrated. PCA and UMAP clustering identified T-cell subsets and differentially expressed genes in TNBC microenvironment.
TNBC-related targets were screened via CTD and OMIM databases, with functional pathways analyzed by GO/KEGG enrichment. Molecular docking and PPI networks validated interactions between quercetin and core targets. Results: Quercetin intersected with 79 TNBC targets, including AKT1, EGFR, and MMP9, enriched in EGFR inhibitor resistance and endocrine resistance pathways. Molecular docking revealed the highest affinity between quercetin and GSK3B (-13.2 kJ/mol). AKT1 and MMP9 expression correlated with patient survival. Conclusion: Quercetin may reverse TNBC immunosuppression by multi-target modulation of T-cell function, but clinical application requires solutions for its low bioavailability, such as delivery systems or combination therapies.
2025-08-18T08:54:05Z
Ruiqi Chen, Liang Hang, Fengyun Wang

http://arxiv.org/abs/2408.08867v3
Quantum Annealing for Enhanced Feature Selection in Single-Cell RNA Sequencing Data Analysis
2025-08-15T16:17:42Z
Feature selection is a machine learning technique for identifying relevant variables in classification and regression models. In single-cell RNA sequencing (scRNA-seq) data analysis, feature selection is used to identify relevant genes that are crucial for understanding cellular processes. Traditional feature selection methods often struggle with the complexity of scRNA-seq data and suffer from interpretation difficulties. Quantum annealing presents a promising alternative approach. In this study, we implement quantum annealing-empowered quadratic unconstrained binary optimization (QUBO) for feature selection in scRNA-seq data. Using data from a human cell differentiation system and an anticancer drug resistance study, we demonstrate that QUBO feature selection effectively identifies genes whose expression patterns reflect critical cell state transitions associated with differentiation and drug resistance development.
Our findings indicate that quantum annealing-powered QUBO reveals complex gene expression patterns potentially missed by traditional methods, thereby enhancing scRNA-seq data analysis and interpretation.
2024-08-16T17:51:07Z
Selim Romero, Shreyan Gupta, Victoria Gatlin, Robert S. Chapkin, James J. Cai
10.1007/s42484-025-00312-1

http://arxiv.org/abs/2508.13191v1
NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations
2025-08-15T12:34:51Z
Pre-training large language models on genomic sequences is a powerful approach for learning biologically meaningful representations. Masked language modeling (MLM) methods, such as DNABERT and Nucleotide Transformer (NT), achieve strong performance but suffer from partial token supervision, pre-training/fine-tuning mismatches, and high computational costs. We introduce NucEL, the first ELECTRA-style pre-training framework for genomic foundation models, addressing these limitations. Using a discriminator to identify tokens altered by a generator, NucEL provides comprehensive token-level supervision across all sequence positions, improving efficiency over the partial supervision of MLM. Incorporating ModernBERT's hybrid local-global attention and flash attention, NucEL offers an optimized BERT architecture for genomic modeling. Unlike 6-mer tokenization, NucEL uses single-nucleotide tokens for fine-grained resolution, boosting both efficiency and interpretability. Pre-trained on the human genome, NucEL achieves state-of-the-art results on diverse downstream tasks -- regulatory element identification (e.g., promoters, enhancers), transcription factor binding prediction, open chromatin classification, and histone modification profiling -- surpassing similarly sized MLM-based models and rivaling models 25x larger, such as NT. Ablation studies highlight optimal tokenization and masking strategies for ELECTRA-style DNA pre-training.
Attention analysis reveals NucEL's superior capture of biologically relevant motifs compared to NT, providing insights into hierarchical learning and regulatory element modeling. These findings demonstrate ELECTRA-style pre-training as an efficient, effective strategy for genomic representation learning with broad implications for genomic research.
2025-08-15T12:34:51Z
Ke Ding, Brian Parker, Jiayu Wen

http://arxiv.org/abs/2508.11190v1
Quantum-Boosted High-Fidelity Deep Learning
2025-08-15T03:51:20Z
A fundamental limitation of probabilistic deep learning is its predominant reliance on Gaussian priors. This simplistic assumption prevents models from accurately capturing the complex, non-Gaussian landscapes of natural data, particularly in demanding domains like complex biological data, severely hindering the fidelity of the model for scientific discovery. The physically-grounded Boltzmann distribution offers a more expressive alternative, but it is computationally intractable on classical computers. To date, quantum approaches have been hampered by the insufficient qubit scale and operational stability required for the iterative demands of deep learning. Here, we bridge this gap by introducing the Quantum Boltzmann Machine-Variational Autoencoder (QBM-VAE), a large-scale and long-time stable hybrid quantum-classical architecture. Our framework leverages a quantum processor for efficient sampling from the Boltzmann distribution, enabling its use as a powerful prior within a deep generative model. Applied to million-scale single-cell datasets from multiple sources, the QBM-VAE generates a latent space that better preserves complex biological structures, consistently outperforming conventional Gaussian-based deep learning models like VAE and SCVI in essential tasks such as omics data integration, cell-type classification, and trajectory inference.
It also provides a typical example of introducing a physics prior into deep learning, driving the model to acquire scientific discovery capabilities that break through data limitations. This work provides a demonstration of a practical quantum advantage in deep learning on a large-scale scientific problem and offers a transferable blueprint for developing hybrid quantum AI models.
2025-08-15T03:51:20Z
Feng-ao Wang, Shaobo Chen, Yao Xuan, Junwei Liu, Qi Gao, Hongdong Zhu, Junjie Hou, Lixin Yuan, Jinyu Cheng, Chenxin Yi, Hai Wei, Yin Ma, Tao Xu, Kai Wen, Yixue Li

http://arxiv.org/abs/2508.08331v2
miRKatAI: An Integrated Database and Multi-agent AI system for microRNA Research
2025-08-13T09:11:34Z
MicroRNAs (miRs) are robust regulators of gene expression, implicated in most biological processes. MicroRNAs predominantly downregulate the expression of genes post-transcriptionally, and each miR is predicted to target several hundred genes. The accurate identification and annotation of miR-mRNA target interactions is central to understanding miR function and therapeutic potential. However, computational target prediction is challenging due to imperfect complementarity of miRs with their targets, and the growing volume and heterogeneity of experimental data present challenges in accessing, integrating, and analysing miR-target interaction information across biological contexts. This creates a need for integrated resources and intelligent query tools.
We present the miRKat Suite, comprising miRKatDB, a comprehensive, curated database of predicted and validated miR-target interactions and associated annotations, and miRKatAI, a multi-agent system powered by large language models (LLMs) and LangGraph. miRKatDB integrates data from multiple publicly available sources, providing a comprehensive foundation for miR studies, including miR target genes and previously reported changes in tissue expression levels. miRKatAI offers a natural language interface for complex querying of miRKatDB, facilitates grounded information retrieval from established sources in the field, and supports basic data visualisation. The miRKat Suite aims to accelerate miR research by streamlining data access, enhancing exploratory analysis, and supporting hypothesis generation.
2025-08-10T11:24:40Z
10 pages, 1 figure, app note
Karen Guerrero-Vazquez, Jacopo Umberto Verga, Pilib O Broin, Eva Szegezdi, Katarzyna Goljanek-Whysall