https://arxiv.org/api/aVXnad+kzA2KpSgLf+HyBzYYGYo2026-06-14T16:29:18Z384840515http://arxiv.org/abs/2507.20440v1BioNeuralNet: A Graph Neural Network based Multi-Omics Network Data Analysis Tool2025-07-27T23:21:04ZMulti-omics data offer unprecedented insights into complex biological systems, yet their high dimensionality, sparsity, and intricate interactions pose significant analytical challenges. Network-based approaches have advanced multi-omics research by effectively capturing biologically relevant relationships among molecular entities. While these methods are powerful for representing molecular interactions, there remains a need for tools specifically designed to effectively utilize these network representations across diverse downstream analyses. To fulfill this need, we introduce BioNeuralNet, a flexible and modular Python framework tailored for end-to-end network-based multi-omics data analysis. BioNeuralNet leverages Graph Neural Networks (GNNs) to learn biologically meaningful low-dimensional representations from multi-omics networks, converting these complex molecular networks into versatile embeddings. BioNeuralNet supports all major stages of multi-omics network analysis, including several network construction techniques, generation of low-dimensional representations, and a broad range of downstream analytical tasks. Its extensive utilities, including diverse GNN architectures, and compatibility with established Python packages (e.g., scikit-learn, PyTorch, NetworkX), enhance usability and facilitate quick adoption. BioNeuralNet is an open-source, user-friendly, and extensively documented framework designed to support flexible and reproducible multi-omics network analysis in precision medicine.2025-07-27T23:21:04Z6 pages, 1 figure, 2 tables; Software available on PyPI as BioNeuralNet. For documentation, tutorials, and workflows see https://bioneuralnet.readthedocs.ioVicente RamosDepartment of Computer Science and Engineering, University of Colorado Denver, Denver, USASundous HusseinDepartment of Computer Science and Engineering, University of Colorado Denver, Denver, USAMohamed Abdel-HafizDepartment of Computer Science and Engineering, University of Colorado Denver, Denver, USAArunangshu SarkarDepartment of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, USAWeixuan LiuDepartment of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, USAKaterina J. KechrisDepartment of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, USARussell P. BowlerGenomic Medicine Institute, Cleveland Clinic, Cleveland, USALeslie LangeDivision of Biomedical Informatics and Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, USAFarnoush Banaei-KashaniDepartment of Computer Science and Engineering, University of Colorado Denver, Denver, USAhttp://arxiv.org/abs/2507.11588v2SToFM: a Multi-scale Foundation Model for Spatial Transcriptomics2025-07-23T15:22:26ZSpatial Transcriptomics (ST) technologies provide biologists with rich insights into single-cell biology by preserving spatial context of cells. Building foundational models for ST can significantly enhance the analysis of vast and complex data sources, unlocking new perspectives on the intricacies of biological tissues. However, modeling ST data is inherently challenging due to the need to extract multi-scale information from tissue slices containing vast numbers of cells. This process requires integrating macro-scale tissue morphology, micro-scale cellular microenvironment, and gene-scale gene expression profile. To address this challenge, we propose SToFM, a multi-scale Spatial Transcriptomics Foundation Model. SToFM first performs multi-scale information extraction on each ST slice, to construct a set of ST sub-slices that aggregate macro-, micro- and gene-scale information. Then an SE(2) Transformer is used to obtain high-quality cell representations from the sub-slices. Additionally, we construct \textbf{SToCorpus-88M}, the largest high-resolution spatial transcriptomics corpus for pretraining. SToFM achieves outstanding performance on a variety of downstream tasks, such as tissue region semantic segmentation and cell type annotation, demonstrating its comprehensive understanding of ST data through capturing and integrating multi-scale information.2025-07-15T14:47:01ZAccpeted by ICML 2025Suyuan ZhaoYizhen LuoGanbo YangYan ZhongHao ZhouZaiqing Niehttp://arxiv.org/abs/2507.16978v1Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN2025-07-22T19:28:54ZThe exponential growth of DNA sequencing data has outpaced traditional heuristic-based methods, which struggle to scale effectively. Efficient computational approaches are urgently needed to support large-scale similarity search, a foundational task in bioinformatics for detecting homology, functional similarity, and novelty among genomic and proteomic sequences. Although tools like BLAST have been widely used and remain effective in many scenarios, they suffer from limitations such as high computational cost and poor performance on divergent sequences.
In this work, we explore embedding-based similarity search methods that learn latent representations capturing deeper structural and functional patterns beyond raw sequence alignment. We systematically evaluate two state-of-the-art vector search libraries, FAISS and ScaNN, on biologically meaningful gene embeddings. Unlike prior studies, our analysis focuses on bioinformatics-specific embeddings and benchmarks their utility for detecting novel sequences, including those from uncharacterized taxa or genes lacking known homologs. Our results highlight both computational advantages (in memory and runtime efficiency) and improved retrieval quality, offering a promising alternative to traditional alignment-heavy tools.2025-07-22T19:28:54ZGenBio @ ICML 2025 Workshop, OpenReview, June 2025 GenBio @ ICML 2025 Workshop, OpenReview, June 2025Mohammad Saleh RefahiGavin HearneHarrison MullerKieran LynchBahrad A. SokhansanjJames R. BrownGail Rosenhttp://arxiv.org/abs/1910.00048v2Constructing Semi-Directed Level-1 Phylogenetic Networks from Quarnets2025-07-22T16:05:34ZSemi-directed networks provide a graphical structure for describing the evolutionary history of organisms in the presence of hybridization. We introduce two algorithms for reconstructing semi-directed level-1 phylogenetic networks from their complete set of 4-leaf subnetworks, known as quarnets. The sequential algorithm begins with a single quarnet and adds one leaf at a time until the entire network has been reconstructed. The cherry-blob algorithm is a novel approach inspired by cherry-picking techniques on trees2019-09-30T18:33:27Z21 pagesSophia HueblerRachel MorrisJoseph Rusinkohttp://arxiv.org/abs/2507.09754v2Explainable AI in Genomics: Transcription Factor Binding Site Prediction with Mixture of Experts2025-07-18T13:26:46ZTranscription Factor Binding Site (TFBS) prediction is crucial for understanding gene regulation and various biological processes. This study introduces a novel Mixture of Experts (MoE) approach for TFBS prediction, integrating multiple pre-trained Convolutional Neural Network (CNN) models, each specializing in different TFBS patterns. We evaluate the performance of our MoE model against individual expert models on both in-distribution and out-of-distribution (OOD) datasets, using six randomly selected transcription factors (TFs) for OOD testing. Our results demonstrate that the MoE model achieves competitive or superior performance across diverse TF binding sites, particularly excelling in OOD scenarios. The Analysis of Variance (ANOVA) statistical test confirms the significance of these performance differences. Additionally, we introduce ShiftSmooth, a novel attribution mapping technique that provides more robust model interpretability by considering small shifts in input sequences. Through comprehensive explainability analysis, we show that ShiftSmooth offers superior attribution for motif discovery and localization compared to traditional Vanilla Gradient methods. Our work presents an efficient, generalizable, and interpretable solution for TFBS prediction, potentially enabling new discoveries in genome biology and advancing our understanding of transcriptional regulation.2025-07-13T19:21:41ZAakash TripathiIan E. NielsenMuhammad UmerRavi P. RamachandranGhulam Rasoolhttp://arxiv.org/abs/2507.11950v1RNAMunin: A Deep Machine Learning Model for Non-coding RNA Discovery2025-07-16T06:33:50ZFunctional annotation of microbial genomes is often biased toward protein-coding genes, leaving a vast, unexplored landscape of non-coding RNAs (ncRNAs) that are critical for regulating bacterial and archaeal physiology, stress response and metabolism. Identifying ncRNAs directly from genomic sequence is a paramount challenge in bioinformatics and biology, essential for understanding the complete regulatory potential of an organism. This paper presents RNAMunin, a machine learning (ML) model that is capable of finding ncRNAs using genomic sequence alone. It is also computationally viable for large sequence datasets such as long read metagenomic assemblies with contigs totaling multiple Gbp. RNAMunin is trained on Rfam sequences extracted from approximately 60 Gbp of long read metagenomes from 16 San Francisco Estuary samples. We know of no other model that can detect ncRNAs based solely on genomic sequence at this scale. Since RNAMunin only requires genomic sequence as input, we do not need for an ncRNA to be transcribed to find it, i.e., we do not need transcriptomics data. We wrote this manuscript in a narrative style in order to best convey how RNAMunin was developed and how it works in detail. Unlike almost all current ML models, at approximately 1M parameters, RNAMunin is very small and very fast.2025-07-16T06:33:50ZLauren LuiTorben Nielsenhttp://arxiv.org/abs/2507.11313v1Tree inference with varifold distances2025-07-15T13:45:20ZIn this paper, we consider a tree inference problem motivated by the critical problem in single-cell genomics of reconstructing dynamic cellular processes from sequencing data. In particular, given a population of cells sampled from such a process, we are interested in the problem of ordering the cells according to their progression in the process. This is known as trajectory inference. If the process is differentiation, this amounts to reconstructing the corresponding differentiation tree. One way of doing this in practice is to estimate the shortest-path distance between nodes based on cell similarities observed in sequencing data. Recent sequencing techniques make it possible to measure two types of data: gene expression levels, and RNA velocity, a vector that predicts changes in gene expression. The data then consist of a discrete vector field on a (subset of a) Euclidean space of dimension equal to the number of genes under consideration. By integrating this velocity field, we trace the evolution of gene expression levels in each single cell from some initial stage to its current stage. Eventually, we assume that we have a faithful embedding of the differentiation tree in a Euclidean space, but which we only observe through the curves representing the paths from the root to the nodes. Using varifold distances between such curves, we define a similarity measure between nodes which we prove approximates the shortest-path distance in a tree that is isomorphic to the target tree.2025-07-15T13:45:20ZElodie MaignantTim ConradChristoph von Tycowiczhttp://arxiv.org/abs/2507.10039v1Towards Applying Large Language Models to Complement Single-Cell Foundation Models2025-07-14T08:16:58ZSingle-cell foundation models such as scGPT represent a significant advancement in single-cell omics, with an ability to achieve state-of-the-art performance on various downstream biological tasks. However, these models are inherently limited in that a vast amount of information in biology exists as text, which they are unable to leverage. There have therefore been several recent works that propose the use of LLMs as an alternative to single-cell foundation models, achieving competitive results. However, there is little understanding of what factors drive this performance, along with a strong focus on using LLMs as an alternative, rather than complementary approach to single-cell foundation models. In this study, we therefore investigate what biological insights contribute toward the performance of LLMs when applied to single-cell data, and introduce scMPT; a model which leverages synergies between scGPT, and single-cell representations from LLMs that capture these insights. scMPT demonstrates stronger, more consistent performance than either of its component models, which frequently have large performance gaps between each other across datasets. We also experiment with alternate fusion methods, demonstrating the potential of combining specialized reasoning models with scGPT to improve performance. This study ultimately showcases the potential for LLMs to complement single-cell foundation models and drive improvements in single-cell analysis.2025-07-14T08:16:58ZSteven PalayewBo WangGary Baderhttp://arxiv.org/abs/2507.09967v1SimOmics: A Simulation Toolkit for Multivariate and Multi-Omics Data2025-07-14T06:33:05ZSimOmics is an R package designed to generate realistic, multivariate, and multi-omics synthetic datasets. It is intended for use in benchmarking, method development, and reproducibility in bioinformatics, particularly in the context of omics integration tasks such as those encountered in transcriptomics, proteomics, and metabolomics. SimOmics supports latent factor simulation, sparsity structures, block-wise covariance modeling, and biologically inspired noise models and feature dimensions.2025-07-14T06:33:05Z4 pages, 1 figureKaitao Laihttp://arxiv.org/abs/2507.09890v1Soft Graph Clustering for single-cell RNA Sequencing Data2025-07-14T03:49:12ZClustering analysis is fundamental in single-cell RNA sequencing (scRNA-seq) data analysis for elucidating cellular heterogeneity and diversity. Recent graph-based scRNA-seq clustering methods, particularly graph neural networks (GNNs), have significantly improved in tackling the challenges of high-dimension, high-sparsity, and frequent dropout events that lead to ambiguous cell population boundaries. However, their reliance on hard graph constructions derived from thresholded similarity matrices presents challenges:(i) The simplification of intercellular relationships into binary edges (0 or 1) by applying thresholds, which restricts the capture of continuous similarity features among cells and leads to significant information loss.(ii) The presence of significant inter-cluster connections within hard graphs, which can confuse GNN methods that rely heavily on graph structures, potentially causing erroneous message propagation and biased clustering outcomes. To tackle these challenges, we introduce scSGC, a Soft Graph Clustering for single-cell RNA sequencing data, which aims to more accurately characterize continuous similarities among cells through non-binary edge weights, thereby mitigating the limitations of rigid data structures. The scSGC framework comprises three core components: (i) a zero-inflated negative binomial (ZINB)-based feature autoencoder; (ii) a dual-channel cut-informed soft graph embedding module; and (iii) an optimal transport-based clustering optimization module. Extensive experiments across ten datasets demonstrate that scSGC outperforms 13 state-of-the-art clustering models in clustering accuracy, cell type annotation, and computational efficiency. These results highlight its substantial potential to advance scRNA-seq data analysis and deepen our understanding of cellular heterogeneity.2025-07-14T03:49:12ZPing XuPengfei WangZhiyuan NingMeng XiaoMin WuYuanchun Zhouhttp://arxiv.org/abs/2507.08062v1AMRScan: A hybrid R and Nextflow toolkit for rapid antimicrobial resistance gene detection from sequencing data2025-07-10T15:58:26ZAMRScan is a hybrid bioinformatics toolkit implemented in both R and [Nextflow](https://www.nextflow.io/) for the rapid and reproducible detection of antimicrobial resistance (AMR) genes from next-generation sequencing (NGS) data. The toolkit enables users to identify AMR gene hits in sequencing reads by aligning them against reference databases such as CARD using BLAST.
The R implementation provides a concise, script-based approach suitable for single-sample analysis, teaching, and rapid prototyping. In contrast, the Nextflow implementation enables reproducible, scalable workflows for multi-sample batch processing in high-performance computing (HPC) and containerized environments. It leverages modular pipeline design with support for automated database setup, quality control, conversion, BLAST alignment, and results parsing.
AMRScan helps bridge the gap between lightweight exploratory analysis and production-ready surveillance pipelines, making it suitable for both research and public health genomics applications.2025-07-10T15:58:26Z3 pagesKaitao Laihttp://arxiv.org/abs/2507.08060v1MicroTrace: A Lightweight R Tool for SNP-Based Pathogen Clustering in Outbreak Detection2025-07-10T15:41:53ZMicroTrace is an open-source R tool that performs SNP-based hierarchical clustering to detect potential transmission clusters from pathogen whole-genome sequencing (WGS) data. Designed for epidemiologists, microbiologists, and genomic surveillance teams, it processes SNP distance matrices and outputs dendrograms and cluster tables with optional metadata integration. MicroTrace enables reproducible outbreak detection workflows with minimal setup.2025-07-10T15:41:53Z3 pagesKaitao Laihttp://arxiv.org/abs/2507.08058v1HybridQC: Machine Learning-Augmented Quality Control for Single-Cell RNA-seq Data2025-07-10T14:48:55ZHybridQC is an R package that streamlines quality control (QC) of single-cell RNA sequencing (scRNA-seq) data by combining traditional threshold-based filtering with machine learning-based outlier detection. It provides an efficient and adaptive framework to identify low-quality cells in noisy or shallow-depth datasets using techniques such as Isolation Forest, while remaining compatible with widely adopted formats such as Seurat objects.
The package is lightweight, easy to install, and suitable for small-to-medium scRNA-seq datasets in research settings. HybridQC is especially useful for projects involving non-model organisms, rare samples, or pilot studies, where automated and flexible QC is critical for reproducibility and downstream analysis.2025-07-10T14:48:55Z3 pages, 1 figureKaitao Laihttp://arxiv.org/abs/2507.07761v1Widespread remote introgression in the grass genomes2025-07-10T13:37:42ZGenetic transfers are pervasive across both prokaryotes and eukaryotes, encompassing canonical genomic introgression between species or genera and horizontal gene transfer (HGT) across kingdoms. However, DNA transfer between phylogenetically distant species, here defined as remote introgression (RI), has remained poorly explored in evolutionary genomics. In this study, we present RIFinder, a novel phylogeny-based method for RI event detection, and apply it to a comprehensive dataset of 122 grass genomes. Our analysis identifies 622 RI events originating from 543 distinct homologous genes, revealing distinct characteristics among grass subfamilies. Specifically, the subfamily Pooideae exhibits the highest number of introgressed genes while Bambusoideae contains the lowest. Comparisons among accepted genes, their donor copies and native homologs demonstrate that introgressed genes undergo post-transfer localized adaptation, with significant functional enrichment in stress-response pathways. Notably, we identify a large Triticeae-derived segment in a Chloridoideae species Cleistogenes songorica, which is potentially associated with its exceptional drought tolerance. Furthermore, we provide compelling evidence that RI has contributed to the origin and diversification of biosynthetic gene clusters of gramine, a defensive alkaloid chemical, across grass species. Collectively, our study establishes a robust method for RI detection and highlights its critical role in adaptive evolution.2025-07-10T13:37:42ZYujie HuangShiyu ZhangHanyang LinChenxu LiuZhefu LiKun YangYutong LiuLinfeng JinChuanlong LuYuan ChengChaoyi HuHuifang ZhaoGuoping ZhangQian QianLongjiang FanDongya Wuhttp://arxiv.org/abs/2412.16074v2Motif Caller: Sequence Reconstruction for Motif-Based DNA Storage2025-07-10T11:26:25ZDNA data storage is rapidly emerging as a promising solution for long-term data archiving, largely due to its exceptional durability. However, the synthesis of DNA strands remains a significant bottleneck in terms of cost and speed. To address this, new methods have been developed that encode information by concatenating long data-carrying DNA sequences from pre-synthesized DNA subsequences - known as motifs - from a library.
Reading back data from DNA storage relies on basecalling - the process of translating raw nanopore sequencing signals into DNA base sequences using machine learning models. These sequences are then decoded back into binary data. However, current basecalling approaches are not optimized for decoding motif-carrying DNA: they first predict individual bases from the raw signal and only afterward attempt to identify higher-level motifs. This two-step, motif-agnostic process is both imprecise and inefficient.
In this paper we introduce Motif Caller, a machine learning model designed to directly detect entire motifs from raw nanopore signals, bypassing the need for intermediate basecalling. By targeting motifs directly, Motif Caller leverages richer signal features associated with each motif, resulting in significantly improved accuracy. This direct approach also enhances the efficiency of data retrieval in motif-based DNA storage systems.2024-12-20T17:18:22ZParv AgarwalNimesh PinnamaneniThomas Heinis