http://arxiv.org/abs/2509.15664v1
siDPT: siRNA Efficacy Prediction via Debiased Preference-Pair Transformer
2025-09-19T06:41:45Z
Small interfering RNA (siRNA) is a short double-stranded RNA molecule (about 21-23 nucleotides) with the potential to cure diseases by silencing the function of target genes. Due to its well-understood mechanism, many siRNA-based drugs have been evaluated in clinical trials. However, selecting effective binding regions and designing siRNA sequences requires extensive experimentation, making the process costly. As genomic resources and publicly available siRNA datasets continue to grow, data-driven models can be leveraged to better understand siRNA-mRNA interactions. To fully exploit such data, curating high-quality siRNA datasets is essential to minimize experimental errors and noise. We propose siDPT: siRNA efficacy Prediction via Debiased Preference-Pair Transformer, a framework that constructs a preference-pair dataset and designs an siRNA-mRNA interactive transformer with debiased ranking objectives to improve siRNA inhibition prediction and generalization. We evaluate our approach using two public datasets and one newly collected patent dataset. Our model demonstrates substantial improvement in Pearson correlation and strong performance across other metrics.
2025-09-19T06:41:45Z
Honggen Zhang, Xiangrui Gao, Lipeng Lai

http://arxiv.org/abs/2509.14037v1
PhenoGnet: A Graph-Based Contrastive Learning Framework for Disease Similarity Prediction
2025-09-17T14:38:52Z
Understanding disease similarity is critical for advancing diagnostics, drug discovery, and personalized treatment strategies. We present PhenoGnet, a novel graph-based contrastive learning framework designed to predict disease similarity by integrating gene functional interaction networks with the Human Phenotype Ontology (HPO).
PhenoGnet comprises two key components: an intra-view model that separately encodes gene and phenotype graphs using Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), and a cross-view model implemented as a shared-weight multilayer perceptron (MLP) that aligns gene and phenotype embeddings through contrastive learning. The model is trained using known gene-phenotype associations as positive pairs and randomly sampled unrelated pairs as negatives. Diseases are represented by the mean embeddings of their associated genes and/or phenotypes, and pairwise similarity is computed via cosine similarity. Evaluation on a curated benchmark of 1,100 similar and 866 dissimilar disease pairs demonstrates strong performance, with gene-based embeddings achieving an AUCPR of 0.9012 and AUROC of 0.8764, outperforming existing state-of-the-art methods. Notably, PhenoGnet captures latent biological relationships beyond direct overlap, offering a scalable and interpretable solution for disease similarity prediction. These results underscore its potential for enabling downstream applications in rare disease research and precision medicine.
2025-09-17T14:38:52Z
Ranga Baminiwatte, Kazi Jewel Rana, Aaron J. Masino

http://arxiv.org/abs/2509.13300v1
AmpliconHunter: A Scalable Tool for PCR Amplicon Prediction from Microbiome Samples
2025-09-16T17:55:19Z
Sequencing of PCR amplicons generated using degenerate primers (typically targeting a region of the 16S ribosomal gene) is widely used in metagenomics to profile the taxonomic composition of complex microbial samples. To reduce taxonomic biases in primer selection, it is important to conduct in silico PCR analyses of the primers against large collections of up to millions of bacterial genomes. However, existing in silico PCR tools have impractical running time for analyses of this scale.
In this paper we introduce AmpliconHunter, a highly scalable in silico PCR package distributed as an open-source command-line tool and publicly available through a user-friendly web interface at https://ah1.engr.uconn.edu/. AmpliconHunter implements an accurate nearest-neighbor model for melting temperature calculations, allowing for primer-template hybridization with mismatches, along with three complementary methods for estimating off-target amplification. By taking advantage of multi-core parallelism and SIMD operations available on modern CPUs, the AmpliconHunter web server can complete in silico PCR analyses of commonly used degenerate primer pairs against the 2.4M genomes in the latest AllTheBacteria collection in as few as 6-7 hours.
2025-09-16T17:55:19Z
2025 ICCABS conference
Rye Howard-Stone, Ion Mandoiu
10.1007/978-3-032-02489-3_25

http://arxiv.org/abs/2509.12428v1
MHASS: Microbiome HiFi Amplicon Sequencing Simulator
2025-09-15T20:28:52Z
Summary: Microbiome HiFi Amplicon Sequence Simulator (MHASS) creates realistic synthetic PacBio HiFi amplicon sequencing datasets for microbiome studies, by integrating genome-aware abundance modeling, realistic dual-barcoding strategies, and empirically derived pass-number distributions from actual sequencing runs. MHASS generates datasets tailored for rigorous benchmarking and validation of long-read microbiome analysis workflows, including ASV clustering and taxonomic assignment.
Availability and Implementation: Implemented in Python with automated dependency management, the source code for MHASS is freely available at https://github.com/rhowardstone/MHASS along with installation instructions.
Contact: rye.howard-stone@uconn.edu or ion.mandoiu@uconn.edu
Supplementary information: Supplementary data are available online at https://github.com/rhowardstone/MHASS_evaluation.
2025-09-15T20:28:52Z
Rye Howard-Stone, Ion Mandoiu

http://arxiv.org/abs/2509.13344v1
Benchmarking Dimensionality Reduction Techniques for Spatial Transcriptomics
2025-09-12T17:27:34Z
We introduce a unified framework for evaluating dimensionality reduction techniques in spatial transcriptomics beyond standard PCA approaches. We benchmark six methods (PCA, NMF, autoencoder, VAE, and two hybrid embeddings) on a cholangiocarcinoma Xenium dataset, systematically varying latent dimensions ($k$=5-40) and clustering resolutions ($ρ$=0.1-1.2). Each configuration is evaluated using complementary metrics including reconstruction error, explained variance, cluster cohesion, and two novel biologically-motivated measures: Cluster Marker Coherence (CMC) and Marker Exclusion Rate (MER). Our results demonstrate distinct performance profiles: PCA provides a fast baseline, NMF maximizes marker enrichment, VAE balances reconstruction and interpretability, while autoencoders occupy a middle ground. We provide systematic hyperparameter selection using Pareto optimal analysis and demonstrate how MER-guided reassignment improves biological fidelity across all methods, with CMC scores improving by up to 12\% on average. This framework enables principled selection of dimensionality reduction methods tailored to specific spatial transcriptomics analyses.
2025-09-12T17:27:34Z
This paper has been accepted to the 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2025); 10 pages, 4 figures
Md Ishtyaq Mahmud, Veena Kochat, Suresh Satpati, Jagan Mohan Reddy Dwarampudi, Kunal Rai, Tania Banerjee

http://arxiv.org/abs/2509.10575v1
Gene-R1: Reasoning with Data-Augmented Lightweight LLMs for Gene Set Analysis
2025-09-11T17:14:08Z
Gene set analysis (GSA) is a foundational approach for uncovering the molecular functions associated with a group of genes.
Recently, LLM-powered methods have emerged to annotate gene sets with biological functions together with coherent explanatory insights. However, existing studies primarily focus on proprietary models, which have been shown to outperform their open-source counterparts despite concerns over cost and data privacy. Furthermore, no research has investigated the application of advanced reasoning strategies to the GSA task. To address this gap, we introduce Gene-R1, a data-augmented learning framework that equips lightweight and open-source LLMs with step-by-step reasoning capabilities tailored to GSA. Experiments on 1,508 in-distribution gene sets demonstrate that Gene-R1 achieves substantial performance gains, matching commercial LLMs. On 106 out-of-distribution gene sets, Gene-R1 performs comparably to both commercial and large-scale LLMs, exhibiting robust generalizability across diverse gene sources.
2025-09-11T17:14:08Z
14 pages, 4 figures, 6 tables, 40 references
Zhizheng Wang, Yifan Yang, Qiao Jin, Zhiyong Lu

http://arxiv.org/abs/2501.04718v2
Knowledge-Guided Biomarker Identification for Label-Free Single-Cell RNA-Seq Data: A Reinforcement Learning Perspective
2025-09-11T02:18:40Z
Gene panel selection aims to identify the most informative genomic biomarkers in label-free genomic datasets. Traditional approaches, which rely on domain expertise, embedded machine learning models, or heuristic-based iterative optimization, often introduce biases and inefficiencies, potentially obscuring critical biological signals. To address these challenges, we present an iterative gene panel selection strategy that harnesses ensemble knowledge from existing gene selection algorithms to establish preliminary boundaries or prior knowledge, which guide the initial search space. Subsequently, we incorporate reinforcement learning through a reward function shaped by expert behavior, enabling dynamic refinement and targeted selection of gene panels.
This integration mitigates biases stemming from initial boundaries while capitalizing on RL's stochastic adaptability. Comprehensive comparative experiments, case studies, and downstream analyses demonstrate the effectiveness of our method, highlighting its improved precision and efficiency for label-free biomarker discovery. Our results underscore the potential of this approach to advance single-cell genomics data analysis.
2025-01-02T07:57:41Z
27 pages, 14 main doc, 13 supplementary doc. Accepted by IEEE TCBB. arXiv admin note: substantial text overlap with arXiv:2406.07418
Meng Xiao, Weiliang Zhang, Xiaohan Huang, Hengshu Zhu, Min Wu, Xiaoli Li, Yuanchun Zhou

http://arxiv.org/abs/2506.09076v3
A Probabilistic Framework for Imputing Genetic Distances in Spatiotemporal Pathogen Models
2025-09-09T11:09:17Z
Pathogen genome data offers valuable structure for spatial models, but its utility is limited by incomplete sequencing coverage. We propose a probabilistic framework for inferring genetic distances between unsequenced cases and known sequences within defined transmission chains, using time-aware evolutionary distance modeling. The method estimates pairwise divergence from collection dates and observed genetic distances, enabling biologically plausible imputation grounded in observed divergence patterns, without requiring sequence alignment or known transmission chains.
Applied to highly pathogenic avian influenza A/H5 cases in wild birds in the United States, this approach supports scalable, uncertainty-aware augmentation of genomic datasets and enhances the integration of evolutionary information into spatiotemporal modeling workflows.
2025-06-10T02:41:46Z
10 pages, 4 figures | Accepted as a full paper in SIGSPATIAL 2025
Haley Stone, Jing Du, Hao Xue, Matthew Scotch, David Heslop, Andreas Züfle, Chandini Raina MacIntyre, Flora Salim
10.1145/3748636.3762779

http://arxiv.org/abs/2402.12391v3
Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data
2025-09-08T02:30:47Z
Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise for the data selection, processing, and analysis. To address this challenge, we introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM). These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes. Furthermore, we have curated a benchmark dataset to assess TAIS's effectiveness in gene identification, demonstrating our system's potential to significantly enhance the efficiency and scope of scientific exploration.
Our findings represent a solid step towards automating scientific discovery through large language models.
2024-02-15T06:30:12Z
Code for a more recent version of our system is available at https://github.com/Liu-Hy/GenoMAS
Haoyang Liu, Yijiang Li, Jinglin Jian, Yuxuan Cheng, Jianrong Lu, Shuyi Guo, Jinglei Zhu, Mianchen Zhang, Miantong Zhang, Haohan Wang

http://arxiv.org/abs/2509.06234v1
Minimum-Cost Synthetic Genome Planning: An Algorithmic Framework
2025-09-07T22:46:57Z
As synthetic genomics scales toward the construction of increasingly large genomes, computational strategies are needed to address technical feasibility. We introduce an algorithmic framework for the Minimum-Cost Synthetic Genome Planning problem, aiming to identify the most cost-effective strategy to assemble a target genome from a source genome through a combination of reuse, synthesis, and join operations. By comparing dynamic programming and greedy heuristic strategies under diverse cost regimes, we demonstrate how algorithmic choices influence the cost-efficiency of large-scale genome construction. In parallel, solving the Minimum-Cost Synthetic Genome Planning problem can help us better understand genome architecture and evolution. We applied our framework in case studies on viral genomes, including SARS-CoV-2, to examine how source-target genome similarity shapes construction costs. Our analyses revealed that conserved regions such as ORF1ab can be reconstructed cost-effectively from related templates, while highly variable regions such as the S (spike) gene are more reliant on DNA synthesis, highlighting the biological and economic trade-offs of genome design.
2025-09-07T22:46:57Z
Michail Patsakis, Ioannis Mouratidis, Ilias Georgakopoulos-Soares

http://arxiv.org/abs/2509.05539v1
Investigating DNA words and their distributions across the tree of life
2025-09-05T23:32:30Z
The frequency distributions of DNA k-mers are shaped by fundamental biological processes and offer a window into genome structure and evolution.
Inspired by analogies to natural language, prior studies have attempted to model genomic k-mer usage using Zipf's law, a rank-frequency law originally formulated for words in human language. However, the extent to which this law accurately captures the distribution of k-mers across diverse species remains unclear. Here, we systematically analyze k-mer frequency spectra across more than 225,000 genome assemblies spanning all three domains of life and viruses. We demonstrate that Zipf's law consistently underperforms in modeling k-mer distributions. In contrast, we propose the truncated power law and Zipf-Mandelbrot distributions, which provide substantially improved fits across taxonomic groups. We show that genome size and GC content influence model performance, with larger and GC-content imbalanced genomes yielding better adherence. Additionally, we perform an extensive analysis on vocabulary expansion and exhaustion across the same organisms using Heaps' law. We apply our modeling framework to evaluate simulated genomes generated by k-let preserving shuffling and deep generative language models. Our results reveal substantial differences between organismal genomes and their synthetic or shuffled counterparts, offering a novel approach to benchmark the biological plausibility of artificial genomes. Collectively, this work establishes new standards for modeling genomic k-mer distributions and provides insights relevant to synthetic biology and evolutionary sequence analysis.
2025-09-05T23:32:30Z
Charalampos Koilakos, Kimonas Provatas, Michail Patsakis, Aris Karatzikos, Ilias Georgakopoulos-Soares

http://arxiv.org/abs/2411.03871v4
Safe Sequences via Dominators in DAGs for Path-Covering Problems
2025-09-04T11:41:47Z
A path-covering problem on a directed acyclic graph (DAG) requires finding a set of source-to-sink paths that cover all the nodes, all the arcs, or subsets thereof, and that are additionally optimal with respect to some function.
In this paper we study safe sequences of nodes or arcs, namely sequences that appear in some path of every path cover of a DAG.
We show that safe sequences admit a simple characterization via cutnodes. Moreover, we establish a connection between maximal safe sequences and leaf-to-root paths in the source- and sink-dominator trees of the DAG, which may be of independent interest in the extensive literature on dominators. With dominator trees, safe sequences admit an O(n)-size representation and a linear-time output-sensitive enumeration algorithm running in time O(m + o), where n and m are the number of nodes and arcs, respectively, and o is the total length of the maximal safe sequences.
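As a rough illustration of the dominator trees mentioned above, the sketch below computes immediate dominators from a single source in a DAG and walks one leaf-to-root path. It is a generic textbook-style construction, not the authors' algorithm: it processes nodes in topological order and takes the lowest common ancestor of each node's predecessors, so it is not linear-time. The function names are hypothetical, and the code assumes every node is reachable from the one source. How such paths are combined into maximal safe sequences is described in the paper and is not reproduced here.

```python
from collections import deque

def topological_order(n, arcs):
    """Kahn's algorithm on nodes 0..n-1."""
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in arcs:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in range(n) if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def source_dominators(n, arcs, source):
    """Immediate dominators w.r.t. `source` (assumption: all nodes reachable from it)."""
    preds = [[] for _ in range(n)]
    for u, v in arcs:
        preds[v].append(u)
    idom = [None] * n
    depth = [0] * n
    idom[source] = source  # the root of the dominator tree points to itself

    def lca(a, b):
        # Walk the deeper node upward until both meet in the dominator tree.
        while a != b:
            if depth[a] < depth[b]:
                b = idom[b]
            else:
                a = idom[a]
        return a

    for v in topological_order(n, arcs):
        if v == source:
            continue
        d = None
        for p in preds[v]:  # predecessors are already processed in topological order
            d = p if d is None else lca(d, p)
        idom[v] = d
        depth[v] = depth[d] + 1
    return idom

def dominator_chain(idom, v):
    """Path from node v up to the root of the dominator tree."""
    chain = [v]
    while idom[v] != v:
        v = idom[v]
        chain.append(v)
    return chain

# Toy DAG: a diamond 0 -> {1, 2} -> 3, followed by 3 -> 4.
arcs = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
idom = source_dominators(5, arcs, source=0)
chain = dominator_chain(idom, 4)
```

In the toy graph, nodes 1 and 2 are bypassed by the chain from node 4 because neither lies on every source-to-4 path. The sink-dominator tree could be obtained the same way on the reversed graph.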
We then apply maximal safe sequences to simplify Integer Linear Programs (ILPs) for two path-covering problems, LeastSquares and MinPathError, which are at the core of RNA transcript assembly problems from bioinformatics. On various datasets, maximal safe sequences can be computed in under 0.1 seconds per graph, on average, and ILP solvers whose search space is reduced in this manner exhibit significant speed-ups. For example, on graphs with a large width, average speed-ups are in the range 50-250x for MinPathError and in the range 80-350x for LeastSquares. Optimizing ILPs using safe sequences can thus become a fast building block of practical RNA transcript assembly tools, and more generally, of path-covering problems.
2024-11-06T12:35:47Z
Francisco Sena, Romeo Rizzi, Alexandru I. Tomescu

http://arxiv.org/abs/2307.04479v2
A Linear Time Quantum Algorithm for Pairwise Sequence Alignment
2025-09-03T19:12:52Z
Sequence Alignment is the process of aligning biological sequences in order to identify similarities between multiple sequences. In this paper, a Quantum Algorithm for finding the optimal alignment between DNA sequences has been demonstrated which works by mapping the sequence alignment problem into a path-searching problem through a 2D graph. The transition, which converges to a fixed path on the graph, is based on a proposed oracle for profit calculation. By implementing Grover's search algorithm, our proposed approach is able to align a pair of sequences and figure out the optimal alignment within linear time, which hasn't been attained by any classical deterministic algorithm.
In addition, the proposed algorithm is capable of a quadratic speed-up on any unstructured search problem by finding the optimal paths accurately in a deterministic manner, in contrast to existing randomized algorithms, which frequently return sub-optimal alignments and therefore do not always guarantee finding the optimal solution.
2023-07-10T11:01:41Z
This paper was part of an undergraduate thesis project. Upon further evaluation, we acknowledge that the claimed runtime is not sufficiently proven. To maintain scholarly integrity, we are withdrawing this version
Md. Rabiul Islam Khan, Shadman Shahriar, Shaikh Farhan Rafid

http://arxiv.org/abs/2509.02648v1
Optimizing Prognostic Biomarker Discovery in Pancreatic Cancer Through Hybrid Ensemble Feature Selection and Multi-Omics Data
2025-09-02T11:09:24Z
Prediction of patient survival using high-dimensional multi-omics data requires systematic feature selection methods that ensure predictive performance, sparsity, and reliability for prognostic biomarker discovery. We developed a hybrid ensemble feature selection (hEFS) approach that combines data subsampling with multiple prognostic models, integrating both embedded and wrapper-based strategies for survival prediction. Omics features are ranked using a voting-theory-inspired aggregation mechanism across models and subsamples, while the optimal number of features is selected via a Pareto front, balancing predictive accuracy and model sparsity without any user-defined thresholds. When applied to multi-omics datasets from three pancreatic cancer cohorts, hEFS identifies significantly fewer and more stable biomarkers compared to the conventional, late-fusion CoxLasso models, while maintaining comparable discrimination performance.
Implemented within the open-source mlr3fselect R package, hEFS offers a robust, interpretable, and clinically valuable tool for prognostic modelling and biomarker discovery in high-dimensional survival settings.
2025-09-02T11:09:24Z
52 pages, 5 figures, 9 Supplementary Figures, 1 Supplementary Table
BioData Mining (2026)
John Zobolas, Anne-Marie George, Alberto López, Sebastian Fischer, Marc Becker, Tero Aittokallio
10.1186/s13040-026-00546-0

http://arxiv.org/abs/2509.02639v1
Enhanced Single-Cell RNA-seq Embedding through Gene Expression and Data-Driven Gene-Gene Interaction Integration
2025-09-01T21:19:27Z
Single-cell RNA sequencing (scRNA-seq) provides unprecedented insights into cellular heterogeneity, enabling detailed analysis of complex biological systems at single-cell resolution. However, the high dimensionality and technical noise inherent in scRNA-seq data pose significant analytical challenges. While current embedding methods focus primarily on gene expression levels, they often overlook crucial gene-gene interactions that govern cellular identity and function. To address this limitation, we present a novel embedding approach that integrates both gene expression profiles and data-driven gene-gene interactions. Our method first constructs a Cell-Leaf Graph (CLG) using random forest models to capture regulatory relationships between genes, while simultaneously building a K-Nearest Neighbor Graph (KNNG) to represent expression similarities between cells. These graphs are then combined into an Enriched Cell-Leaf Graph (ECLG), which serves as input for a graph neural network to compute cell embeddings. By incorporating both expression levels and gene-gene interactions, our approach provides a more comprehensive representation of cellular states. Extensive evaluation across multiple datasets demonstrates that our method enhances the detection of rare cell populations and improves downstream analyses such as visualization, clustering, and trajectory inference.
This integrated approach represents a significant advance in single-cell data analysis, offering a more complete framework for understanding cellular diversity and dynamics.
2025-09-01T21:19:27Z
33 pages, 9 figures, article
Computers in Biology and Medicine 188 (2025) 109880
Hojjat Torabi Goudarzi, Maziyar Baran Pouyan
10.1016/j.compbiomed.2025.109880
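
The K-Nearest Neighbor Graph step in the entry above can be pictured with a small sketch. This is only an illustration under assumptions the abstract does not confirm: Euclidean distance as the similarity measure, a brute-force search, the hypothetical function name `knn_graph`, and toy expression values. The Cell-Leaf Graph and the graph neural network are not covered.

```python
import math

def knn_graph(expression, k):
    """Map each cell index to its k nearest cells by Euclidean distance.

    `expression` is a list of per-cell expression vectors (one value per gene).
    Brute force, O(n^2) distance evaluations; fine for a toy example only.
    """
    n = len(expression)
    graph = {}
    for i in range(n):
        distances = sorted(
            (math.dist(expression[i], expression[j]), j)
            for j in range(n)
            if j != i
        )
        graph[i] = [j for _, j in distances[:k]]
    return graph

# Four toy cells measured on three genes: cells 0-1 and cells 2-3 look alike.
cells = [
    [5.0, 0.1, 0.0],
    [4.8, 0.0, 0.2],
    [0.1, 3.9, 4.1],
    [0.0, 4.2, 4.0],
]
graph = knn_graph(cells, k=1)
```

With `k=1`, each toy cell is linked to the other cell in its pair. Real scRNA-seq pipelines usually run the neighbor search on a reduced representation with an approximate index rather than on raw vectors.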