http://arxiv.org/abs/2507.20440v1 BioNeuralNet: A Graph Neural Network based Multi-Omics Network Data Analysis Tool 2025-07-27T23:21:04Z

Multi-omics data offer unprecedented insights into complex biological systems, yet their high dimensionality, sparsity, and intricate interactions pose significant analytical challenges. Network-based approaches have advanced multi-omics research by effectively capturing biologically relevant relationships among molecular entities. While these methods are powerful for representing molecular interactions, there remains a need for tools specifically designed to effectively utilize these network representations across diverse downstream analyses. To fulfill this need, we introduce BioNeuralNet, a flexible and modular Python framework tailored for end-to-end network-based multi-omics data analysis. BioNeuralNet leverages Graph Neural Networks (GNNs) to learn biologically meaningful low-dimensional representations from multi-omics networks, converting these complex molecular networks into versatile embeddings. BioNeuralNet supports all major stages of multi-omics network analysis, including several network construction techniques, generation of low-dimensional representations, and a broad range of downstream analytical tasks. Its extensive utilities, including diverse GNN architectures, and compatibility with established Python packages (e.g., scikit-learn, PyTorch, NetworkX), enhance usability and facilitate quick adoption. BioNeuralNet is an open-source, user-friendly, and extensively documented framework designed to support flexible and reproducible multi-omics network analysis in precision medicine.

2025-07-27T23:21:04Z 6 pages, 1 figure, 2 tables; Software available on PyPI as BioNeuralNet. For documentation, tutorials, and workflows see https://bioneuralnet.readthedocs.io Vicente Ramos Department of Computer Science and Engineering, University of Colorado Denver, Denver, USA Sundous Hussein Department of Computer Science and Engineering, University of Colorado Denver, Denver, USA Mohamed Abdel-Hafiz Department of Computer Science and Engineering, University of Colorado Denver, Denver, USA Arunangshu Sarkar Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, USA Weixuan Liu Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, USA Katerina J. Kechris Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, USA Russell P. Bowler Genomic Medicine Institute, Cleveland Clinic, Cleveland, USA Leslie Lange Division of Biomedical Informatics and Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, USA Farnoush Banaei-Kashani Department of Computer Science and Engineering, University of Colorado Denver, Denver, USA http://arxiv.org/abs/2507.11588v2 SToFM: a Multi-scale Foundation Model for Spatial Transcriptomics 2025-07-23T15:22:26Z

Spatial Transcriptomics (ST) technologies provide biologists with rich insights into single-cell biology by preserving spatial context of cells. Building foundational models for ST can significantly enhance the analysis of vast and complex data sources, unlocking new perspectives on the intricacies of biological tissues. However, modeling ST data is inherently challenging due to the need to extract multi-scale information from tissue slices containing vast numbers of cells. This process requires integrating macro-scale tissue morphology, micro-scale cellular microenvironment, and gene-scale gene expression profile. To address this challenge, we propose SToFM, a multi-scale Spatial Transcriptomics Foundation Model. SToFM first performs multi-scale information extraction on each ST slice, to construct a set of ST sub-slices that aggregate macro-, micro- and gene-scale information. Then an SE(2) Transformer is used to obtain high-quality cell representations from the sub-slices. Additionally, we construct SToCorpus-88M, the largest high-resolution spatial transcriptomics corpus for pretraining. SToFM achieves outstanding performance on a variety of downstream tasks, such as tissue region semantic segmentation and cell type annotation, demonstrating its comprehensive understanding of ST data through capturing and integrating multi-scale information.

2025-07-15T14:47:01Z Accepted by ICML 2025 Suyuan Zhao Yizhen Luo Ganbo Yang Yan Zhong Hao Zhou Zaiqing Nie http://arxiv.org/abs/2507.16978v1 Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN 2025-07-22T19:28:54Z

The exponential growth of DNA sequencing data has outpaced traditional heuristic-based methods, which struggle to scale effectively. Efficient computational approaches are urgently needed to support large-scale similarity search, a foundational task in bioinformatics for detecting homology, functional similarity, and novelty among genomic and proteomic sequences. Although tools like BLAST have been widely used and remain effective in many scenarios, they suffer from limitations such as high computational cost and poor performance on divergent sequences. In this work, we explore embedding-based similarity search methods that learn latent representations capturing deeper structural and functional patterns beyond raw sequence alignment. We systematically evaluate two state-of-the-art vector search libraries, FAISS and ScaNN, on biologically meaningful gene embeddings. Unlike prior studies, our analysis focuses on bioinformatics-specific embeddings and benchmarks their utility for detecting novel sequences, including those from uncharacterized taxa or genes lacking known homologs. Our results highlight both computational advantages (in memory and runtime efficiency) and improved retrieval quality, offering a promising alternative to traditional alignment-heavy tools.
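The embedding-based search described above can be illustrated with a minimal sketch. It uses exact brute-force cosine similarity in pure Python as a stand-in for the approximate indexes that FAISS and ScaNN provide; it does not call either library. The gene names, the three-dimensional vectors, and the novelty threshold are all hypothetical values chosen for illustration.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_cosine(query, embeddings, k=2):
    """Return the k (similarity, name) pairs most similar to the query."""
    q = normalize(query)
    scored = []
    for name, vec in embeddings.items():
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), name))
    return heapq.nlargest(k, scored)


def is_novel(query, embeddings, threshold=0.8):
    """Flag a query as novel when its best match falls below the threshold.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    best_similarity, _ = top_k_cosine(query, embeddings, k=1)[0]
    return best_similarity < threshold


# Hypothetical toy embeddings; real gene embeddings have far more dimensions.
GENE_EMBEDDINGS = {
    "gene_a": [1.0, 0.0, 0.0],
    "gene_b": [0.9, 0.1, 0.0],
    "gene_c": [0.0, 1.0, 0.0],
    "gene_d": [0.0, 0.0, 1.0],
}

hits = top_k_cosine([1.0, 0.05, 0.0], GENE_EMBEDDINGS, k=2)
novel = is_novel([1.0, 1.0, 1.0], GENE_EMBEDDINGS)
```

An exact scan like this costs time linear in the number of stored embeddings per query, which is the cost that approximate nearest-neighbor libraries are designed to reduce.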

2025-07-22T19:28:54Z GenBio @ ICML 2025 Workshop, OpenReview, June 2025 Mohammad Saleh Refahi Gavin Hearne Harrison Muller Kieran Lynch Bahrad A. Sokhansanj James R. Brown Gail Rosen http://arxiv.org/abs/1910.00048v2 Constructing Semi-Directed Level-1 Phylogenetic Networks from Quarnets 2025-07-22T16:05:34Z

Semi-directed networks provide a graphical structure for describing the evolutionary history of organisms in the presence of hybridization. We introduce two algorithms for reconstructing semi-directed level-1 phylogenetic networks from their complete set of 4-leaf subnetworks, known as quarnets. The sequential algorithm begins with a single quarnet and adds one leaf at a time until the entire network has been reconstructed. The cherry-blob algorithm is a novel approach inspired by cherry-picking techniques on trees

2019-09-30T18:33:27Z 21 pages Sophia Huebler Rachel Morris Joseph Rusinko http://arxiv.org/abs/2507.09754v2 Explainable AI in Genomics: Transcription Factor Binding Site Prediction with Mixture of Experts 2025-07-18T13:26:46Z

Transcription Factor Binding Site (TFBS) prediction is crucial for understanding gene regulation and various biological processes. This study introduces a novel Mixture of Experts (MoE) approach for TFBS prediction, integrating multiple pre-trained Convolutional Neural Network (CNN) models, each specializing in different TFBS patterns. We evaluate the performance of our MoE model against individual expert models on both in-distribution and out-of-distribution (OOD) datasets, using six randomly selected transcription factors (TFs) for OOD testing. Our results demonstrate that the MoE model achieves competitive or superior performance across diverse TF binding sites, particularly excelling in OOD scenarios. The Analysis of Variance (ANOVA) statistical test confirms the significance of these performance differences. Additionally, we introduce ShiftSmooth, a novel attribution mapping technique that provides more robust model interpretability by considering small shifts in input sequences. Through comprehensive explainability analysis, we show that ShiftSmooth offers superior attribution for motif discovery and localization compared to traditional Vanilla Gradient methods. Our work presents an efficient, generalizable, and interpretable solution for TFBS prediction, potentially enabling new discoveries in genome biology and advancing our understanding of transcriptional regulation.

2025-07-13T19:21:41Z Aakash Tripathi Ian E. Nielsen Muhammad Umer Ravi P. Ramachandran Ghulam Rasool http://arxiv.org/abs/2507.11950v1 RNAMunin: A Deep Machine Learning Model for Non-coding RNA Discovery 2025-07-16T06:33:50Z

Functional annotation of microbial genomes is often biased toward protein-coding genes, leaving a vast, unexplored landscape of non-coding RNAs (ncRNAs) that are critical for regulating bacterial and archaeal physiology, stress response and metabolism. Identifying ncRNAs directly from genomic sequence is a paramount challenge in bioinformatics and biology, essential for understanding the complete regulatory potential of an organism. This paper presents RNAMunin, a machine learning (ML) model that is capable of finding ncRNAs using genomic sequence alone. It is also computationally viable for large sequence datasets such as long read metagenomic assemblies with contigs totaling multiple Gbp. RNAMunin is trained on Rfam sequences extracted from approximately 60 Gbp of long read metagenomes from 16 San Francisco Estuary samples. We know of no other model that can detect ncRNAs based solely on genomic sequence at this scale. Since RNAMunin only requires genomic sequence as input, we do not need an ncRNA to be transcribed to find it, i.e., we do not need transcriptomics data. We wrote this manuscript in a narrative style in order to best convey how RNAMunin was developed and how it works in detail. Unlike almost all current ML models, at approximately 1M parameters, RNAMunin is very small and very fast.

2025-07-16T06:33:50Z Lauren Lui Torben Nielsen http://arxiv.org/abs/2507.11313v1 Tree inference with varifold distances 2025-07-15T13:45:20Z

In this paper, we consider a tree inference problem motivated by the critical problem in single-cell genomics of reconstructing dynamic cellular processes from sequencing data. In particular, given a population of cells sampled from such a process, we are interested in the problem of ordering the cells according to their progression in the process. This is known as trajectory inference. If the process is differentiation, this amounts to reconstructing the corresponding differentiation tree. One way of doing this in practice is to estimate the shortest-path distance between nodes based on cell similarities observed in sequencing data. Recent sequencing techniques make it possible to measure two types of data: gene expression levels, and RNA velocity, a vector that predicts changes in gene expression. The data then consist of a discrete vector field on a (subset of a) Euclidean space of dimension equal to the number of genes under consideration. By integrating this velocity field, we trace the evolution of gene expression levels in each single cell from some initial stage to its current stage. Eventually, we assume that we have a faithful embedding of the differentiation tree in a Euclidean space, but which we only observe through the curves representing the paths from the root to the nodes. Using varifold distances between such curves, we define a similarity measure between nodes which we prove approximates the shortest-path distance in a tree that is isomorphic to the target tree.

2025-07-15T13:45:20Z Elodie Maignant Tim Conrad Christoph von Tycowicz http://arxiv.org/abs/2507.10039v1 Towards Applying Large Language Models to Complement Single-Cell Foundation Models 2025-07-14T08:16:58Z

Single-cell foundation models such as scGPT represent a significant advancement in single-cell omics, with an ability to achieve state-of-the-art performance on various downstream biological tasks. However, these models are inherently limited in that a vast amount of information in biology exists as text, which they are unable to leverage. There have therefore been several recent works that propose the use of LLMs as an alternative to single-cell foundation models, achieving competitive results. However, there is little understanding of what factors drive this performance, along with a strong focus on using LLMs as an alternative, rather than a complementary approach, to single-cell foundation models. In this study, we therefore investigate what biological insights contribute toward the performance of LLMs when applied to single-cell data, and introduce scMPT, a model that leverages synergies between scGPT and single-cell representations from LLMs that capture these insights. scMPT demonstrates stronger, more consistent performance than either of its component models, which frequently have large performance gaps between each other across datasets. We also experiment with alternate fusion methods, demonstrating the potential of combining specialized reasoning models with scGPT to improve performance. This study ultimately showcases the potential for LLMs to complement single-cell foundation models and drive improvements in single-cell analysis.

2025-07-14T08:16:58Z Steven Palayew Bo Wang Gary Bader http://arxiv.org/abs/2507.09967v1 SimOmics: A Simulation Toolkit for Multivariate and Multi-Omics Data 2025-07-14T06:33:05Z

SimOmics is an R package designed to generate realistic, multivariate, and multi-omics synthetic datasets. It is intended for use in benchmarking, method development, and reproducibility in bioinformatics, particularly in the context of omics integration tasks such as those encountered in transcriptomics, proteomics, and metabolomics. SimOmics supports latent factor simulation, sparsity structures, block-wise covariance modeling, and biologically inspired noise models and feature dimensions.

2025-07-14T06:33:05Z 4 pages, 1 figure Kaitao Lai http://arxiv.org/abs/2507.09890v1 Soft Graph Clustering for single-cell RNA Sequencing Data 2025-07-14T03:49:12Z

Clustering analysis is fundamental in single-cell RNA sequencing (scRNA-seq) data analysis for elucidating cellular heterogeneity and diversity. Recent graph-based scRNA-seq clustering methods, particularly graph neural networks (GNNs), have made significant progress in tackling the challenges of high dimensionality, high sparsity, and frequent dropout events that lead to ambiguous cell population boundaries. However, their reliance on hard graph constructions derived from thresholded similarity matrices presents challenges: (i) The simplification of intercellular relationships into binary edges (0 or 1) by applying thresholds, which restricts the capture of continuous similarity features among cells and leads to significant information loss. (ii) The presence of significant inter-cluster connections within hard graphs, which can confuse GNN methods that rely heavily on graph structures, potentially causing erroneous message propagation and biased clustering outcomes. To tackle these challenges, we introduce scSGC, a Soft Graph Clustering for single-cell RNA sequencing data, which aims to more accurately characterize continuous similarities among cells through non-binary edge weights, thereby mitigating the limitations of rigid data structures. The scSGC framework comprises three core components: (i) a zero-inflated negative binomial (ZINB)-based feature autoencoder; (ii) a dual-channel cut-informed soft graph embedding module; and (iii) an optimal transport-based clustering optimization module. Extensive experiments across ten datasets demonstrate that scSGC outperforms 13 state-of-the-art clustering models in clustering accuracy, cell type annotation, and computational efficiency. These results highlight its substantial potential to advance scRNA-seq data analysis and deepen our understanding of cellular heterogeneity.
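For reference, the zero-inflated negative binomial distribution named in component (i) has the standard form below. The symbols are conventional and are an assumption here, since the abstract does not give scSGC's own parameterization: \pi is the dropout (zero-inflation) probability, \mu the mean, and \theta the dispersion.

```latex
\mathrm{NB}(x \mid \mu, \theta)
  = \frac{\Gamma(x + \theta)}{\Gamma(\theta)\, x!}
    \left( \frac{\theta}{\theta + \mu} \right)^{\theta}
    \left( \frac{\mu}{\theta + \mu} \right)^{x},
\qquad
\mathrm{ZINB}(x \mid \pi, \mu, \theta)
  = \pi\, \mathbf{1}[x = 0] + (1 - \pi)\, \mathrm{NB}(x \mid \mu, \theta).
```

The point mass at zero models dropout events, while the negative binomial term accounts for overdispersed counts.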

2025-07-14T03:49:12Z Ping Xu Pengfei Wang Zhiyuan Ning Meng Xiao Min Wu Yuanchun Zhou http://arxiv.org/abs/2507.08062v1 AMRScan: A hybrid R and Nextflow toolkit for rapid antimicrobial resistance gene detection from sequencing data 2025-07-10T15:58:26Z

AMRScan is a hybrid bioinformatics toolkit implemented in both R and [Nextflow](https://www.nextflow.io/) for the rapid and reproducible detection of antimicrobial resistance (AMR) genes from next-generation sequencing (NGS) data. The toolkit enables users to identify AMR gene hits in sequencing reads by aligning them against reference databases such as CARD using BLAST. The R implementation provides a concise, script-based approach suitable for single-sample analysis, teaching, and rapid prototyping. In contrast, the Nextflow implementation enables reproducible, scalable workflows for multi-sample batch processing in high-performance computing (HPC) and containerized environments. It leverages modular pipeline design with support for automated database setup, quality control, conversion, BLAST alignment, and results parsing. AMRScan helps bridge the gap between lightweight exploratory analysis and production-ready surveillance pipelines, making it suitable for both research and public health genomics applications.

2025-07-10T15:58:26Z 3 pages Kaitao Lai http://arxiv.org/abs/2507.08060v1 MicroTrace: A Lightweight R Tool for SNP-Based Pathogen Clustering in Outbreak Detection 2025-07-10T15:41:53Z

MicroTrace is an open-source R tool that performs SNP-based hierarchical clustering to detect potential transmission clusters from pathogen whole-genome sequencing (WGS) data. Designed for epidemiologists, microbiologists, and genomic surveillance teams, it processes SNP distance matrices and outputs dendrograms and cluster tables with optional metadata integration. MicroTrace enables reproducible outbreak detection workflows with minimal setup.
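The clustering step described above can be sketched as follows. This is an illustration in Python, not MicroTrace's R code. The abstract does not state the linkage method or the cutoff, so single linkage and a threshold of 7 SNPs are assumptions, and the sample names and distances are hypothetical.

```python
def single_linkage_clusters(names, dist, threshold):
    """Group samples connected by chains of SNP distances within the threshold.

    Cutting a single-linkage dendrogram at `threshold` yields the connected
    components of the graph linking every pair with distance <= threshold.
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise SNP distance matrix for four isolates.
SAMPLES = ["s1", "s2", "s3", "s4"]
SNP_DIST = [
    [0, 3, 8, 120],
    [3, 0, 6, 118],
    [8, 6, 0, 115],
    [120, 118, 115, 0],
]

clusters = single_linkage_clusters(SAMPLES, SNP_DIST, threshold=7)
```

In this toy matrix, s1 and s3 differ by 8 SNPs yet land in the same cluster because both are within the threshold of s2. This chaining behavior is characteristic of single linkage, and it is one reason the choice of linkage method matters when defining transmission clusters.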

2025-07-10T15:41:53Z 3 pages Kaitao Lai http://arxiv.org/abs/2507.08058v1 HybridQC: Machine Learning-Augmented Quality Control for Single-Cell RNA-seq Data 2025-07-10T14:48:55Z

HybridQC is an R package that streamlines quality control (QC) of single-cell RNA sequencing (scRNA-seq) data by combining traditional threshold-based filtering with machine learning-based outlier detection. It provides an efficient and adaptive framework to identify low-quality cells in noisy or shallow-depth datasets using techniques such as Isolation Forest, while remaining compatible with widely adopted formats such as Seurat objects. The package is lightweight, easy to install, and suitable for small-to-medium scRNA-seq datasets in research settings. HybridQC is especially useful for projects involving non-model organisms, rare samples, or pilot studies, where automated and flexible QC is critical for reproducibility and downstream analysis.

2025-07-10T14:48:55Z 3 pages, 1 figure Kaitao Lai http://arxiv.org/abs/2507.07761v1 Widespread remote introgression in the grass genomes 2025-07-10T13:37:42Z

Genetic transfers are pervasive across both prokaryotes and eukaryotes, encompassing canonical genomic introgression between species or genera and horizontal gene transfer (HGT) across kingdoms. However, DNA transfer between phylogenetically distant species, here defined as remote introgression (RI), has remained poorly explored in evolutionary genomics. In this study, we present RIFinder, a novel phylogeny-based method for RI event detection, and apply it to a comprehensive dataset of 122 grass genomes. Our analysis identifies 622 RI events originating from 543 distinct homologous genes, revealing distinct characteristics among grass subfamilies. Specifically, the subfamily Pooideae exhibits the highest number of introgressed genes while Bambusoideae contains the lowest. Comparisons among accepted genes, their donor copies and native homologs demonstrate that introgressed genes undergo post-transfer localized adaptation, with significant functional enrichment in stress-response pathways. Notably, we identify a large Triticeae-derived segment in a Chloridoideae species Cleistogenes songorica, which is potentially associated with its exceptional drought tolerance. Furthermore, we provide compelling evidence that RI has contributed to the origin and diversification of biosynthetic gene clusters of gramine, a defensive alkaloid chemical, across grass species. Collectively, our study establishes a robust method for RI detection and highlights its critical role in adaptive evolution.

2025-07-10T13:37:42Z Yujie Huang Shiyu Zhang Hanyang Lin Chenxu Liu Zhefu Li Kun Yang Yutong Liu Linfeng Jin Chuanlong Lu Yuan Cheng Chaoyi Hu Huifang Zhao Guoping Zhang Qian Qian Longjiang Fan Dongya Wu http://arxiv.org/abs/2412.16074v2 Motif Caller: Sequence Reconstruction for Motif-Based DNA Storage 2025-07-10T11:26:25Z

DNA data storage is rapidly emerging as a promising solution for long-term data archiving, largely due to its exceptional durability. However, the synthesis of DNA strands remains a significant bottleneck in terms of cost and speed. To address this, new methods have been developed that encode information by concatenating long data-carrying DNA sequences from pre-synthesized DNA subsequences - known as motifs - from a library. Reading back data from DNA storage relies on basecalling - the process of translating raw nanopore sequencing signals into DNA base sequences using machine learning models. These sequences are then decoded back into binary data. However, current basecalling approaches are not optimized for decoding motif-carrying DNA: they first predict individual bases from the raw signal and only afterward attempt to identify higher-level motifs. This two-step, motif-agnostic process is both imprecise and inefficient. In this paper we introduce Motif Caller, a machine learning model designed to directly detect entire motifs from raw nanopore signals, bypassing the need for intermediate basecalling. By targeting motifs directly, Motif Caller leverages richer signal features associated with each motif, resulting in significantly improved accuracy. This direct approach also enhances the efficiency of data retrieval in motif-based DNA storage systems.

2024-12-20T17:18:22Z Parv Agarwal Nimesh Pinnamaneni Thomas Heinis