https://arxiv.org/api/jMMOJLxW0y7kg6nkpUmVJFpqn54 2026-06-14T13:36:53Z 3848 360 15 http://arxiv.org/abs/2509.15664v1 siDPT: siRNA Efficacy Prediction via Debiased Preference-Pair Transformer 2025-09-19T06:41:45Z

Small interfering RNA (siRNA) is a short double-stranded RNA molecule (about 21-23 nucleotides) with the potential to cure diseases by silencing the function of target genes. Due to its well-understood mechanism, many siRNA-based drugs have been evaluated in clinical trials. However, selecting effective binding regions and designing siRNA sequences requires extensive experimentation, making the process costly. As genomic resources and publicly available siRNA datasets continue to grow, data-driven models can be leveraged to better understand siRNA-mRNA interactions. To fully exploit such data, curating high-quality siRNA datasets is essential to minimize experimental errors and noise. We propose siDPT: siRNA efficacy Prediction via Debiased Preference-Pair Transformer, a framework that constructs a preference-pair dataset and designs an siRNA-mRNA interactive transformer with debiased ranking objectives to improve siRNA inhibition prediction and generalization. We evaluate our approach using two public datasets and one newly collected patent dataset. Our model demonstrates substantial improvement in Pearson correlation and strong performance across other metrics.

2025-09-19T06:41:45Z Honggen Zhang Xiangrui Gao Lipeng Lai http://arxiv.org/abs/2509.14037v1 PhenoGnet: A Graph-Based Contrastive Learning Framework for Disease Similarity Prediction 2025-09-17T14:38:52Z

Understanding disease similarity is critical for advancing diagnostics, drug discovery, and personalized treatment strategies. We present PhenoGnet, a novel graph-based contrastive learning framework designed to predict disease similarity by integrating gene functional interaction networks with the Human Phenotype Ontology (HPO). PhenoGnet comprises two key components: an intra-view model that separately encodes gene and phenotype graphs using Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), and a cross view model implemented as a shared weight multilayer perceptron (MLP) that aligns gene and phenotype embeddings through contrastive learning. The model is trained using known gene phenotype associations as positive pairs and randomly sampled unrelated pairs as negatives. Diseases are represented by the mean embeddings of their associated genes and/or phenotypes, and pairwise similarity is computed via cosine similarity. Evaluation on a curated benchmark of 1,100 similar and 866 dissimilar disease pairs demonstrates strong performance, with gene based embeddings achieving an AUCPR of 0.9012 and AUROC of 0.8764, outperforming existing state of the art methods. Notably, PhenoGnet captures latent biological relationships beyond direct overlap, offering a scalable and interpretable solution for disease similarity prediction. These results underscore its potential for enabling downstream applications in rare disease research and precision medicine.

2025-09-17T14:38:52Z Ranga Baminiwatte Kazi Jewel Rana Aaron J. Masino http://arxiv.org/abs/2509.13300v1 AmpliconHunter: A Scalable Tool for PCR Amplicon Prediction from Microbiome Samples 2025-09-16T17:55:19Z

Sequencing of PCR amplicons generated using degenerate primers (typically targeting a region of the 16S ribosomal gene) is widely used in metagenomics to profile the taxonomic composition of complex microbial samples. To reduce taxonomic biases in primer selection it is important to conduct in silico PCR analyses of the primers against large collections of up to millions of bacterial genomes. However, existing in silico PCR tools have impractical running time for analyses of this scale. In this paper we introduce AmpliconHunter, a highly scalable in silico PCR package distributed as open-source command-line tool and publicly available through a user-friendly web interface at https://ah1.engr.uconn.edu/. AmpliconHunter implements an accurate nearest-neighbor model for melting temperature calculations, allowing for primer-template hybridization with mismatches, along with three complementary methods for estimating off-target amplification. By taking advantage of multi-core parallelism and SIMD operations available on modern CPUs, the AmpliconHunter web server can complete in silico PCR analyses of commonly used degenerate primer pairs against the 2.4M genomes in the latest AllTheBacteria collection in as few as 6-7 hours.

2025-09-16T17:55:19Z 2025 ICCABS conference Rye Howard-Stone Ion Mandiou 10.1007/978-3-032-02489-3_25 http://arxiv.org/abs/2509.12428v1 MHASS: Microbiome HiFi Amplicon Sequencing Simulator 2025-09-15T20:28:52Z

Summary: Microbiome HiFi Amplicon Sequence Simulator (MHASS) creates realistic synthetic PacBio HiFi amplicon sequencing datasets for microbiome studies, by integrating genome-aware abundance modeling, realistic dual-barcoding strategies, and empirically derived pass-number distributions from actual sequencing runs. MHASS generates datasets tailored for rigorous benchmarking and validation of long-read microbiome analysis workflows, including ASV clustering and taxonomic assignment. Availability and Implementation: Implemented in Python with automated dependency management, the source code for MHASS is freely available at https://github.com/rhowardstone/MHASS along with installation instructions. Contact: rye.howard-stone@uconn.edu or ion.mandoiu@uconn.edu Supplementary information: Supplementary data are available online at https://github.com/rhowardstone/MHASS_evaluation.

2025-09-15T20:28:52Z Rye Howard-Stone Ion Mandoiu http://arxiv.org/abs/2509.13344v1 Benchmarking Dimensionality Reduction Techniques for Spatial Transcriptomics 2025-09-12T17:27:34Z

We introduce a unified framework for evaluating dimensionality reduction techniques in spatial transcriptomics beyond standard PCA approaches. We benchmark six methods PCA, NMF, autoencoder, VAE, and two hybrid embeddings on a cholangiocarcinoma Xenium dataset, systematically varying latent dimensions ($k$=5-40) and clustering resolutions ($ρ$=0.1-1.2). Each configuration is evaluated using complementary metrics including reconstruction error, explained variance, cluster cohesion, and two novel biologically-motivated measures: Cluster Marker Coherence (CMC) and Marker Exclusion Rate (MER). Our results demonstrate distinct performance profiles: PCA provides a fast baseline, NMF maximizes marker enrichment, VAE balances reconstruction and interpretability, while autoencoders occupy a middle ground. We provide systematic hyperparameter selection using Pareto optimal analysis and demonstrate how MER-guided reassignment improves biological fidelity across all methods, with CMC scores improving by up to 12\% on average. This framework enables principled selection of dimensionality reduction methods tailored to specific spatial transcriptomics analyses.

2025-09-12T17:27:34Z This paper is accepted to the 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2025), 10 page and have 4 figures Md Ishtyaq Mahmud Veena Kochat Suresh Satpati Jagan Mohan Reddy Dwarampudi Kunal Rai Tania Banerjee http://arxiv.org/abs/2509.10575v1 Gene-R1: Reasoning with Data-Augmented Lightweight LLMs for Gene Set Analysis 2025-09-11T17:14:08Z

The gene set analysis (GSA) is a foundational approach for uncovering the molecular functions associated with a group of genes. Recently, LLM-powered methods have emerged to annotate gene sets with biological functions together with coherent explanatory insights. However, existing studies primarily focus on proprietary models, which have been shown to outperform their open-source counterparts despite concerns over cost and data privacy. Furthermore, no research has investigated the application of advanced reasoning strategies to the GSA task. To address this gap, we introduce Gene-R1, a data-augmented learning framework that equips lightweight and open-source LLMs with step-by-step reasoning capabilities tailored to GSA. Experiments on 1,508 in-distribution gene sets demonstrate that Gene-R1 achieves substantial performance gains, matching commercial LLMs. On 106 out-of-distribution gene sets, Gene-R1 performs comparably to both commercial and large-scale LLMs, exhibiting robust generalizability across diverse gene sources.

2025-09-11T17:14:08Z 14 pages, 4 figures, 6 tables, 40 references Zhizheng Wang Yifan Yang Qiao Jin Zhiyong Lu http://arxiv.org/abs/2501.04718v2 Knowledge-Guided Biomarker Identification for Label-Free Single-Cell RNA-Seq Data: A Reinforcement Learning Perspective 2025-09-11T02:18:40Z

Gene panel selection aims to identify the most informative genomic biomarkers in label-free genomic datasets. Traditional approaches, which rely on domain expertise, embedded machine learning models, or heuristic-based iterative optimization, often introduce biases and inefficiencies, potentially obscuring critical biological signals. To address these challenges, we present an iterative gene panel selection strategy that harnesses ensemble knowledge from existing gene selection algorithms to establish preliminary boundaries or prior knowledge, which guide the initial search space. Subsequently, we incorporate reinforcement learning through a reward function shaped by expert behavior, enabling dynamic refinement and targeted selection of gene panels. This integration mitigates biases stemming from initial boundaries while capitalizing on RL's stochastic adaptability. Comprehensive comparative experiments, case studies, and downstream analyses demonstrate the effectiveness of our method, highlighting its improved precision and efficiency for label-free biomarker discovery. Our results underscore the potential of this approach to advance single-cell genomics data analysis.

2025-01-02T07:57:41Z 27 pages, 14 main doc, 13 supplementary doc. Accepted by IEEE TCBB. arXiv admin note: substantial text overlap with arXiv:2406.07418 Meng Xiao Weiliang Zhang Xiaohan Huang Hengshu Zhu Min Wu Xiaoli Li Yuanchun Zhou http://arxiv.org/abs/2506.09076v3 A Probabilistic Framework for Imputing Genetic Distances in Spatiotemporal Pathogen Models 2025-09-09T11:09:17Z

Pathogen genome data offers valuable structure for spatial models, but its utility is limited by incomplete sequencing coverage. We propose a probabilistic framework for inferring genetic distances between unsequenced cases and known sequences within defined transmission chains, using time-aware evolutionary distance modeling. The method estimates pairwise divergence from collection dates and observed genetic distances, enabling biologically plausible imputation grounded in observed divergence patterns, without requiring sequence alignment or known transmission chains. Applied to highly pathogenic avian influenza A/H5 cases in wild birds in the United States, this approach supports scalable, uncertainty-aware augmentation of genomic datasets and enhances the integration of evolutionary information into spatiotemporal modeling workflows.

2025-06-10T02:41:46Z 10 pages, 4 figures | Accepted as a full paper in SIGSPATIAL 2025 Haley Stone Jing Du Hao Xue Matthew Scotch David Heslop Andreas Züfle Chandini Raina MacIntyre Flora Salim 10.1145/3748636.3762779 http://arxiv.org/abs/2402.12391v3 Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data 2025-09-08T02:30:47Z

Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise for the data selection, processing, and analysis. To address this challenge, we introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM). These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes. Furthermore, we have curated a benchmark dataset to assess TAIS's effectiveness in gene identification, demonstrating our system's potential to significantly enhance the efficiency and scope of scientific exploration. Our findings represent a solid step towards automating scientific discovery through large language models.

2024-02-15T06:30:12Z Code for a more recent version of our system is available at \url{https://github.com/Liu-Hy/GenoMAS} Haoyang Liu Yijiang Li Jinglin Jian Yuxuan Cheng Jianrong Lu Shuyi Guo Jinglei Zhu Mianchen Zhang Miantong Zhang Haohan Wang http://arxiv.org/abs/2509.06234v1 Minimum-Cost Synthetic Genome Planning: An Algorithmic Framework 2025-09-07T22:46:57Z

As synthetic genomics scales toward the construction of increasingly larger genomes, computational strategies are needed to address technical feasibility. We introduce an algorithmic framework for the Minimum-Cost Synthetic Genome Planning problem, aiming to identify the most cost-effective strategy to assemble a target genome from a source genome through a combination of reuse, synthesis, and join operations. By comparing dynamic programming and greedy heuristic strategies under diverse cost regimes, we demonstrate how algorithmic choices influence the cost-efficiency of large-scale genome construction. In parallel, solving the Minimum-Cost Synthetic Genome Planning problem can help us better understand genome architecture and evolution. We applied our framework in case studies on viral genomes, including SARS-CoV-2, to examine how source-target genome similarity shapes construction costs. Our analyses revealed that conserved regions such as ORF1ab can be reconstructed cost-effectively from related templates, while highly variable regions such as the S (spike) gene are more reliant on DNA synthesis, highlighting the biological and economic trade-offs of genome design.

2025-09-07T22:46:57Z Michail Patsakis Ioannis Mouratidis Ilias Georgakopoulos-Soares http://arxiv.org/abs/2509.05539v1 Investigating DNA words and their distributions across the tree of life 2025-09-05T23:32:30Z

The frequency distributions of DNA k-mers are shaped by fundamental biological processes and offer a window into genome structure and evolution. Inspired by analogies to natural language, prior studies have attempted to model genomic k-mer usage using Zipf's law, a rank-frequency law originally formulated for words in human language. However, the extent to which this law accurately captures the distribution of k-mers across diverse species remains unclear. Here, we systematically analyze k-mer frequency spectra across more than 225,000 genome assemblies spanning all three domains of life and viruses. We demonstrate that Zipf's law consistently underperforms in modeling k-mer distributions. In contrast, we propose the truncated power law and Zipf-Mandelbrot distributions, which provide substantially improved fits across taxonomic groups. We show that genome size and GC content influence model performance, with larger and GC-content imbalanced genomes yielding better adherence. Additionally, we perform an extensive analysis on vocabulary expansion and exhaustion across the same organisms using Heaps' law. We apply our modeling framework to evaluate simulated genomes generated by k-let preserving shuffling and deep generative language models. Our results reveal substantial differences between organismal genomes and their synthetic or shuffled counterparts, offering a novel approach to benchmark the biological plausibility of artificial genomes. Collectively, this work establishes new standards for modeling genomic k-mer distributions and provides insights relevant to synthetic biology, and evolutionary sequence analysis.

2025-09-05T23:32:30Z Charalampos Koilakos Kimonas Provatas Michail Patsakis Aris Karatzikos Ilias Georgakopoulos-Soares http://arxiv.org/abs/2411.03871v4 Safe Sequences via Dominators in DAGs for Path-Covering Problems 2025-09-04T11:41:47Z

A path-covering problem on a directed acyclic graph (DAG) requires finding a set of source-to-sink paths that cover all the nodes, all the arcs, or subsets thereof, and additionally they are optimal with respect to some function. In this paper we study safe sequences of nodes or arcs, namely sequences that appear in some path of every path cover of a DAG. We show that safe sequences admit a simple characterization via cutnodes. Moreover, we establish a connection between maximal safe sequences and leaf-to-root paths in the source- and sink-dominator trees of the DAG, which may be of independent interest in the extensive literature on dominators. With dominator trees, safe sequences admit an O(n)-size representation and a linear-time output-sensitive enumeration algorithm running in time O(m + o), where n and m are the number of nodes and arcs, respectively, and o is the total length of the maximal safe sequences. We then apply maximal safe sequences to simplify Integer Linear Programs (ILPs) for two path-covering problems, LeastSquares and MinPathError, which are at the core of RNA transcript assembly problems from bioinformatics. On various datasets, maximal safe sequences can be computed in under 0.1 seconds per graph, on average, and ILP solvers whose search space is reduced in this manner exhibit significant speed-ups. For example on graphs with a large width, average speed-ups are in the range 50-250x for MinPathError and in the range 80-350x for LeastSquares. Optimizing ILPs using safe sequences can thus become a fast building block of practical RNA transcript assembly tools, and more generally, of path-covering problems.

2024-11-06T12:35:47Z Francisco Sena Romeo Rizzi Alexandru I. Tomescu http://arxiv.org/abs/2307.04479v2 A Linear Time Quantum Algorithm for Pairwise Sequence Alignment 2025-09-03T19:12:52Z

Sequence Alignment is the process of aligning biological sequences in order to identify similarities between multiple sequences. In this paper, a Quantum Algorithm for finding the optimal alignment between DNA sequences has been demonstrated which works by mapping the sequence alignment problem into a path-searching problem through a 2D graph. The transition, which converges to a fixed path on the graph, is based on a proposed oracle for profit calculation. By implementing Grover's search algorithm, our proposed approach is able to align a pair of sequences and figure out the optimal alignment within linear time, which hasn't been attained by any classical deterministic algorithm. In addition to that, the proposed algorithm is capable of quadratic speeding up to any unstructured search problem by finding out the optimal paths accurately in a deterministic manner, in contrast to existing randomized algorithms that frequently sort out the sub-optimal alignments, therefore, don't always guarantee of finding out the optimal solutions.

2023-07-10T11:01:41Z This paper was part of an undergraduate thesis project. Upon further evaluation, we acknowledge that the claimed runtime is not sufficiently proven. To maintain scholarly integrity, we are withdrawing this version Md. Rabiul Islam Khan Shadman Shahriar Shaikh Farhan Rafid http://arxiv.org/abs/2509.02648v1 Optimizing Prognostic Biomarker Discovery in Pancreatic Cancer Through Hybrid Ensemble Feature Selection and Multi-Omics Data 2025-09-02T11:09:24Z

Prediction of patient survival using high-dimensional multi-omics data requires systematic feature selection methods that ensure predictive performance, sparsity, and reliability for prognostic biomarker discovery. We developed a hybrid ensemble feature selection (hEFS) approach that combines data subsampling with multiple prognostic models, integrating both embedded and wrapper-based strategies for survival prediction. Omics features are ranked using a voting-theory-inspired aggregation mechanism across models and subsamples, while the optimal number of features is selected via a Pareto front, balancing predictive accuracy and model sparsity without any user-defined thresholds. When applied to multi-omics datasets from three pancreatic cancer cohorts, hEFS identifies significantly fewer and more stable biomarkers compared to the conventional, late-fusion CoxLasso models, while maintaining comparable discrimination performance. Implemented within the open-source mlr3fselect R package, hEFS offers a robust, interpretable, and clinically valuable tool for prognostic modelling and biomarker discovery in high-dimensional survival settings.

2025-09-02T11:09:24Z 52 pages, 5 figures, 9 Supplementary Figures, 1 Supplementary Table BioData Mining (2026) John Zobolas Anne-Marie George Alberto López Sebastian Fischer Marc Becker Tero Aittokallio 10.1186/s13040-026-00546-0 http://arxiv.org/abs/2509.02639v1 Enhanced Single-Cell RNA-seq Embedding through Gene Expression and Data-Driven Gene-Gene Interaction Integration 2025-09-01T21:19:27Z

Single-cell RNA sequencing (scRNA-seq) provides unprecedented insights into cellular heterogeneity, enabling detailed analysis of complex biological systems at single-cell resolution. However, the high dimensionality and technical noise inherent in scRNA-seq data pose significant analytical challenges. While current embedding methods focus primarily on gene expression levels, they often overlook crucial gene-gene interactions that govern cellular identity and function. To address this limitation, we present a novel embedding approach that integrates both gene expression profiles and data-driven gene-gene interactions. Our method first constructs a Cell-Leaf Graph (CLG) using random forest models to capture regulatory relationships between genes, while simultaneously building a K-Nearest Neighbor Graph (KNNG) to represent expression similarities between cells. These graphs are then combined into an Enriched Cell-Leaf Graph (ECLG), which serves as input for a graph neural network to compute cell embeddings. By incorporating both expression levels and gene-gene interactions, our approach provides a more comprehensive representation of cellular states. Extensive evaluation across multiple datasets demonstrates that our method enhances the detection of rare cell populations and improves downstream analyses such as visualization, clustering, and trajectory inference. This integrated approach represents a significant advance in single-cell data analysis, offering a more complete framework for understanding cellular diversity and dynamics.

2025-09-01T21:19:27Z 33 pages, 9 figures, article Computers in Biology and Medicine 188 (2025) 109880 Hojjat Torabi Goudarzi Maziyar Baran Pouyan 10.1016/j.compbiomed.2025.109880