http://arxiv.org/abs/2505.11385v1 Compendium Manager: a tool for coordination of workflow management instances for bulk data processing in Python 2025-05-16T15:49:40Z

Compendium Manager is a command-line tool written in Python to automate the provisioning, launch, and evaluation of bioinformatics pipelines. Although workflow management tools such as Snakemake and Nextflow enable users to automate the processing of samples within a single sequencing project, integrating many datasets in bulk requires launching and monitoring hundreds or thousands of pipelines. We present the Compendium Manager, a lightweight command-line tool to enable launching and monitoring analysis pipelines at scale. The tool can gauge progress through a list of projects, load results into a shared database, and record detailed processing metrics for later evaluation and reproducibility.

2025-05-16T15:49:40Z Richard J. Abdill Ran Blekhman http://arxiv.org/abs/2505.11041v1 In silico tool for identification of colorectal cancer from cell-free DNA biomarkers 2025-05-16T09:35:11Z

Colorectal cancer (CRC) remains a major global health concern, with early detection being pivotal for improving patient outcomes. In this study, we leveraged high-throughput methylation profiling of cell-free DNA (cfDNA) to identify and validate diagnostic biomarkers for CRC. The GSE124600 dataset was downloaded from the Gene Expression Omnibus as the discovery cohort, comprising 142 CRC and 132 normal cfDNA methylation profiles obtained via MCTA-Seq. After preprocessing and filtering, 97,863 CpG sites were retained for further analysis. Differential methylation analysis using statistical tests identified 30,791 CpG sites as significantly altered in CRC samples (p < 0.05). Univariate scoring enabled the selection of top-ranking features, which were further refined using multiple feature selection algorithms, including Recursive Feature Elimination (RFE), Sequential Feature Selection, and SVC L1. Various machine learning (ML) models, such as Logistic Regression, Support Vector Machines, Random Forest, and Multilayer Perceptron (MLP), were trained and tested using independent validation datasets. The best performance was achieved with an MLP model trained on 25 features selected by RFE, reaching an AUROC of 0.89 and an MCC of 0.78 on validation data. Additionally, a deep learning (DL) based convolutional neural network achieved an AUROC of 0.78. Functional annotation of the most predictive CpG sites identified several genes involved in key cellular processes, some of which were validated for differential expression in CRC using the GEPIA2 platform. Our study highlights the potential of cfDNA methylation markers combined with ML and DL models for noninvasive and accurate CRC detection, paving the way for clinically relevant diagnostic tools.
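The two metrics reported above, MCC and AUROC, can be computed directly from labels and classifier scores. The sketch below is only an illustration of the standard definitions, not the authors' evaluation code; the function names and the label convention (1 = CRC, 0 = normal) are assumptions made for this example.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (assumed 1 = CRC, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC above is quadratic in the number of samples, which is fine for a validation cohort of a few hundred profiles; a rank-based formulation would be preferable for much larger sets.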

2025-05-16T09:35:11Z Kartavya Mathur (Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi; School of Biotechnology, Gautam Buddha University, Uttar Pradesh) Shipra Jain (Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi) Nisha Bajiya (Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi) Nishant Kumar (Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi) Gajendra P. S. Raghava (Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi) http://arxiv.org/abs/2505.13503v1 Ancestry-Adjusted Polygenic Risk Scores for Predicting Obesity Risk in the Indonesian Population 2025-05-16T09:06:17Z

Obesity prevalence in Indonesian adults increased from 10.5% in 2007 to 23.4% in 2023. Studies have shown that genetic predisposition significantly influences obesity susceptibility. Polygenic risk scores (PRS) aggregate the effects of numerous genetic variants to assess this genetic risk. However, 91% of genome-wide association studies (GWAS) involve European populations, limiting their applicability to Indonesians due to genetic differences between populations. This study aims to develop and validate an ancestry-adjusted PRS for obesity in the Indonesian population using principal component analysis (PCA) of data from the 1000 Genomes Project and our own genomic data from approximately 2,800 Indonesians. We calculate the PRS for obesity across all populations, then determine the first four principal components using ancestry-informative SNPs and fit a linear regression model to predict the PRS from these principal components. The raw PRS is adjusted by subtracting the predicted score to obtain an ancestry-adjusted PRS for the Indonesian population. Our results indicate that the ancestry-adjusted PRS improves obesity risk prediction. Compared to the unadjusted PRS, the adjusted score improved classification performance with a 5% increase in area under the ROC curve (AUC). This approach underscores the importance of population-specific adjustments in genetic risk assessments to enable more effective personalized healthcare and targeted intervention strategies for diverse populations.
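The adjustment step described above (regress the raw PRS on the leading principal components, then subtract the predicted score) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the principal components are assumed to be precomputed, and the regression is solved with plain-Python normal equations so the example has no dependencies. In practice a numerical library would normally be used instead.

```python
def fit_ols(pcs, y):
    """Ordinary least squares with an intercept, solved via the normal equations.

    pcs: one row per sample, holding that sample's principal-component values.
    Returns [intercept, coef_pc1, coef_pc2, ...].
    """
    rows = [[1.0] + list(r) for r in pcs]
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        if abs(a[c][c]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * z for x, z in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]


def ancestry_adjust(raw_prs, pcs):
    """Return raw PRS minus the PRS predicted from the principal components."""
    beta = fit_ols(pcs, raw_prs)
    predicted = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in pcs]
    return [raw - pred for raw, pred in zip(raw_prs, predicted)]
```

Because the model includes an intercept, the adjusted scores are centered at zero and carry only the part of the PRS that the ancestry components do not explain.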

2025-05-16T09:06:17Z 7 pages, 8 figures Jocelyn Verna Siswanto Belinda Mutiara Felicia Austin Jonathan Susanto Cathelyn Theophila Tan Restu Unggul Kresnadi Kezia Irene http://arxiv.org/abs/2505.09883v1 DeepPlantCRE: A Transformer-CNN Hybrid Framework for Plant Gene Expression Modeling and Cross-Species Generalization 2025-05-15T00:59:29Z

The investigation of plant transcriptional regulation constitutes a fundamental basis for crop breeding, where cis-regulatory elements (CREs), as the key factor determining gene expression, have become the focus of crop genetic improvement research. Deep learning techniques, leveraging their exceptional capacity for high-dimensional feature extraction and nonlinear regulatory relationship modeling, have been extensively employed in this field. However, current methodologies present notable limitations: purely CNN-based architectures struggle to capture long-range regulatory interactions, while existing CNN-Transformer hybrid models are prone to overfitting and generalize poorly in cross-species prediction. To address these challenges, this study proposes DeepPlantCRE, a deep learning framework for plant gene expression prediction and CRE extraction. The model employs a Transformer-CNN hybrid architecture that achieves higher accuracy, AUC-ROC, and F1-score than existing baselines (DeepCRE and PhytoExpr), with improved generalization and reduced overfitting. Cross-species validation experiments conducted on gene expression datasets from Gossypium, Arabidopsis thaliana, Solanum lycopersicum, Sorghum bicolor, and Arabidopsis thaliana reveal that the model achieves a peak prediction accuracy of 92.3%, particularly excelling in complex genomic data analysis. Furthermore, interpretability investigations using DeepLIFT and Transcription Factor Motif Discovery from Importance Scores (TF-MoDISco) demonstrate that the motifs derived from our model exhibit high concordance with known transcription factor binding sites (TFBSs), such as MYR2 and TSO1 in the JASPAR plant database, substantiating the potential of DeepPlantCRE for biological interpretability and practical agricultural application.

2025-05-15T00:59:29Z Yingjun Wu Jingyun Huang Liang Ming Pengcheng Deng Maojun Wang Zeyu Zhang http://arxiv.org/abs/2505.09873v1 Deep Learning and Explainable AI: New Pathways to Genetic Insights 2025-05-15T00:37:03Z

Deep learning-based AI models have been extensively applied in genomics, achieving remarkable success across diverse applications. As these models gain prominence, there exists an urgent need for interpretability methods to establish trustworthiness in model-driven decisions. For genetic researchers, interpretable insights derived from these models hold significant value in providing novel perspectives for understanding biological processes. Current interpretability analyses in genomics predominantly rely on intuition and experience rather than rigorous theoretical foundations. In this review, we systematically categorize interpretability methods into input-based and model-based approaches, while critically evaluating their limitations through concrete biological application scenarios. Furthermore, we establish theoretical underpinnings to elucidate the origins of these constraints through formal mathematical demonstrations, aiming to assist genetic researchers in better understanding and designing models in the future. Finally, we provide feasible suggestions for future research on interpretability in the field of genetics.

2025-05-15T00:37:03Z Chenyu Wang Chaoying Zuo Zihan Su Yuhang Xing Lu Li Maojun Wang Zeyu Zhang http://arxiv.org/abs/2505.08918v1 When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes 2025-05-13T19:27:58Z

The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte Pair Encoding (BPE) to nine T2T primate genomes including three human assemblies by training independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while nearly 991,854 tokens are unique to a single genome, indicating a rapid decline in shared vocabulary with increasing assembly comparisons. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy attributed to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizing the need for domain-specific adaptations in the development of large-scale genomic language models. The dnaBPE tool used in this study is open-source and available at https://github.com/aglabx/dnaBPE.
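To make the tokenization procedure concrete, the sketch below trains a toy Byte Pair Encoding vocabulary on a DNA string by repeatedly merging the most frequent adjacent token pair. It is an illustration of the general BPE idea only and does not reproduce dnaBPE; the function name, the stopping rule (stop when no pair occurs at least twice), and the toy sequences are assumptions made for this example.

```python
from collections import Counter


def train_bpe(sequence, num_merges):
    """Learn up to num_merges BPE merges on a DNA string.

    Starts from single-nucleotide tokens and returns (merges, tokens), where
    merges is the list of learned multi-nucleotide tokens in order and tokens
    is the final tokenization of the input.
    """
    tokens = list(sequence)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # assumed stopping rule: nothing left worth merging
            break
        merges.append(left + right)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


# Vocabulary overlap between two independently trained tokenizers.
merges_a, _ = train_bpe("ACACACACGTGT", 3)
merges_b, _ = train_bpe("ACACTTTTTTTT", 3)
shared = set(merges_a) & set(merges_b)
```

Even in this toy setting, a sequence dominated by one repeat spends its merges on that repeat, which is the effect the abstract attributes to species-specific high-copy elements.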

2025-05-13T19:27:58Z ICLR 2025 Workshop on Machine Learning for Genomics Explorations Marina Popova Iaroslav Chelombitko Aleksey Komissarov http://arxiv.org/abs/2505.08764v1 The Environment-Dependent Regulatory Landscape of the E. coli Genome 2025-05-13T17:33:25Z

All cells respond to changes in both their internal milieu and the environment around them through the regulation of their genes. Despite decades of effort, there remain huge gaps in our knowledge of both the function of many genes (the so-called y-ome) and how they adapt to changing environments via regulation. Here we describe a joint experimental and theoretical dissection of the regulation of a broad array of over 100 biologically interesting genes in E. coli across 39 diverse environments, permitting us to discover the binding sites and transcription factors that mediate regulatory control. Using a combination of mutagenesis, massively parallel reporter assays, mass spectrometry and tools from information theory and statistical physics, we go from complete ignorance of a promoter's environment-dependent regulatory architecture to predictive models of its behavior. As a proof of principle of the biological insights to be gained from such a study, we chose a combination of genes from the y-ome, toxin-antitoxin pairs, and genes hypothesized to be part of regulatory modules; in all cases, we discovered a host of new insights into their underlying regulatory landscape and resulting biological function.

2025-05-13T17:33:25Z Tom Röschinger Heun Jin Lee Rosalind Wenshan Pan Grace Solini Kian Faizi Baiyi Quan Tsui Fen Chou Madhav Mani Stephen Quake Rob Phillips http://arxiv.org/abs/2505.08844v1 CellTypeAgent: Trustworthy cell type annotation with Large Language Models 2025-05-13T14:34:11Z

Cell type annotation is a critical yet laborious step in single-cell RNA sequencing analysis. We present a trustworthy large language model (LLM)-agent, CellTypeAgent, which integrates LLMs with verification from relevant databases. CellTypeAgent achieves higher accuracy than existing methods while mitigating hallucinations. We evaluated CellTypeAgent across nine real datasets involving 303 cell types from 36 tissues. This combined approach holds promise for more efficient and reliable cell type annotation.

2025-05-13T14:34:11Z Jiawen Chen Jianghao Zhang Huaxiu Yao Yun Li http://arxiv.org/abs/2505.08071v1 NMP-PaK: Near-Memory Processing Acceleration of Scalable De Novo Genome Assembly 2025-05-12T21:17:20Z

De novo assembly enables investigations of unknown genomes, paving the way for personalized medicine and disease management. However, it faces immense computational challenges arising from the excessive data volumes and algorithmic complexity. While state-of-the-art de novo assemblers utilize distributed systems for extreme-scale genome assembly, they demand substantial computational and memory resources. They also fail to address the inherent challenges of de novo assembly, including a large memory footprint, memory-bound behavior, and irregular data patterns stemming from complex, interdependent data structures. Given these challenges, de novo assembly merits a custom hardware solution, though existing approaches have not fully addressed the limitations. We propose NMP-PaK, a hardware-software co-design that accelerates scalable de novo genome assembly through near-memory processing (NMP). Our channel-level NMP architecture addresses memory bottlenecks while providing sufficient scratchpad space for processing elements. Customized processing elements maximize parallelism while efficiently handling large data structures that are both dynamic and interdependent. Software optimizations include customized batch processing to reduce the memory footprint and hybrid CPU-NMP processing to address hardware underutilization caused by irregular data patterns. NMP-PaK conducts the same genome assembly while incurring a 14X smaller memory footprint compared to the state-of-the-art de novo assembly. Moreover, NMP-PaK delivers a 16X performance improvement over the CPU baseline, with a 2.4X reduction in memory operations. Consequently, NMP-PaK achieves 8.3X greater throughput than state-of-the-art de novo assembly under the same resource constraints, showcasing its superior computational efficiency.

2025-05-12T21:17:20Z To be published in ISCA 2025 Heewoo Kim Sanjay Sri Vallabh Singapuram Haojie Ye Joseph Izraelevitz Trevor Mudge Ronald Dreslinski Nishil Talati 10.1145/3695053.3731056 http://arxiv.org/abs/2505.07740v1 Pan-genome Analysis of Plastomes from Lamiales using PGR-TK 2025-05-12T16:49:38Z

Chloroplast sequences from the Lamiales order were analyzed using the Pangenome Research Toolkit (PGR-TK). Overall, most genera and families exhibited a high degree of sequence uniformity. However, at the genus level, Utricularia, Incarvillea, and Orobanche stood out as particularly divergent. At the family level, Orobanchaceae, Bignoniaceae and Lentibulariaceae displayed notably complex patterns in the generated plots. The PGR-TK algorithm successfully distinguished most genera within their respective families and often recognized misclassified plants.

2025-05-12T16:49:38Z Aadhavan Veerendra Manoj Samanta http://arxiv.org/abs/2505.07919v1 Revolutionising Bacterial Genomics: Graph-Based Strategies for Improved Variant Identification 2025-05-12T15:34:24Z

A significant advancement in bioinformatics is the use of genome graph techniques to improve variant discovery across organisms. Traditional approaches, such as bwa mem, rely on linear reference genomes for genomic analyses but may introduce biases when applied to highly diverse bacterial genomes of the same species. Pangenome graphs provide an alternative paradigm for evaluating structural and small variants, including insertions, deletions, and single nucleotide polymorphisms, within a graph framework. By representing the full genetic diversity of a species, pangenome graphs enhance the detection and interpretation of complex genetic variants. In this study, we present a bioinformatics pipeline utilising the PanGenome Graph Builder (PGGB) and the Variation Graph toolkit (vg giraffe) to construct pangenomes from assembled genomes, align whole-genome sequencing data, and call variants against a graph reference. Our results demonstrate that leveraging pangenome graphs over a single linear reference genome significantly improves mapping rates and variant calling accuracy for both simulated and real bacterial pathogen datasets.

2025-05-12T15:34:24Z Fathima Nuzla Ismail Abira Sengupta http://arxiv.org/abs/2505.07896v1 Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability 2025-05-12T03:39:33Z

Understanding cell identity and function through single-cell level sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their NCBI Gene descriptions, and transform these descriptions into vector embedding representations using large language models (LLMs). The models used include OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as domain-specific models BioBERT and SciBERT. Embeddings are computed via an expression-weighted average across the top N most highly expressed genes in each cell, providing a compact, semantically rich representation. This multimodal strategy bridges structured biological data with state-of-the-art language modeling, enabling more interpretable downstream applications such as cell-type clustering, cell vulnerability dissection, and trajectory inference.
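The expression-weighted averaging step can be sketched as follows. This is a hedged illustration, not the authors' code: the function name is hypothetical, the gene vectors are tiny stand-ins for LLM embeddings of NCBI Gene descriptions (no embedding API is called), and skipping genes that have no embedding or zero expression is an assumption of this example.

```python
def cell_embedding(expression, gene_vectors, top_n):
    """Expression-weighted mean of the embeddings of the top_n most expressed genes.

    expression:   dict mapping gene symbol -> expression level in one cell
    gene_vectors: dict mapping gene symbol -> embedding (list of floats),
                  standing in for embeddings of the gene's text description
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    # Assumption: genes without an embedding or with zero expression are skipped.
    top = [g for g in ranked if g in gene_vectors and expression[g] > 0][:top_n]
    if not top:
        raise ValueError("no expressed genes with an available embedding")
    total = sum(expression[g] for g in top)
    dim = len(gene_vectors[top[0]])
    return [
        sum(expression[g] * gene_vectors[g][d] for g in top) / total
        for d in range(dim)
    ]


# Toy example with made-up expression values and 2-dimensional vectors.
expression = {"SOD1": 3.0, "TARDBP": 1.0, "GAPDH": 0.5}
gene_vectors = {"SOD1": [1.0, 0.0], "TARDBP": [0.0, 1.0], "GAPDH": [1.0, 1.0]}
embedding = cell_embedding(expression, gene_vectors, top_n=2)
```

With `top_n=2`, only the two most highly expressed genes contribute, and their weights are their expression levels divided by the sum over the selected genes.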

2025-05-12T03:39:33Z Douglas Jiang Zilin Dai Luxuan Zhang Qiyi Yu Haoqi Sun Feng Tian http://arxiv.org/abs/2504.06304v2 Leveraging State Space Models in Long Range Genomics 2025-05-11T20:33:43Z

Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module's quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions parallel to a 50M parameter transformer baseline. We discover that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing for modeling entire genomic regions at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic analysis.

2025-04-07T18:34:06Z Accepted at ICLR 2025 (Spotlight @ LMRL) - Project page: https://anirudharamesh.github.io/iclr-long-range-genomics/ Matvei Popov Aymen Kallala Anirudha Ramesh Narimane Hennouni Shivesh Khaitan Rick Gentry Alain-Sam Cohen http://arxiv.org/abs/2505.06127v1 FastDup: a scalable duplicate marking tool using speculation-and-test mechanism 2025-05-09T15:33:36Z

Duplicate marking is a critical preprocessing step in gene sequence analysis to flag redundant reads arising from polymerase chain reaction (PCR) amplification and sequencing artifacts. Although Picard MarkDuplicates is widely recognized as the gold-standard tool, its single-threaded implementation and reliance on global sorting result in significant computational and resource overhead, limiting its efficiency on large-scale datasets. Here, we introduce FastDup, a high-performance, scalable solution based on a speculation-and-test mechanism. FastDup achieves up to 20x throughput speedup and guarantees 100% identical output compared to Picard MarkDuplicates. FastDup is a C++ program available from GitHub (https://github.com/zzhofict/FastDup.git) under the MIT license.

2025-05-09T15:33:36Z 4 pages, 1 figure Zhonghai Zhang Yewen Li Ke Meng Chunming Zhang Guangming Tan http://arxiv.org/abs/2505.04431v1 An Asynchronous Distributed-Memory Parallel Algorithm for k-mer Counting 2025-05-07T14:00:03Z

This paper describes a new asynchronous algorithm and implementation for the problem of k-mer counting (KC), which concerns quantifying the frequency of length k substrings in a DNA sequence. This operation is common to many computational biology workloads and can take up to 77% of the total runtime of de novo genome assembly. The performance and scalability of the current state-of-the-art distributed-memory KC algorithm are hampered by multiple rounds of Many-To-Many collectives. Therefore, we develop an asynchronous algorithm (DAKC) that uses fine-grained, asynchronous messages to obviate most of this global communication while utilizing network bandwidth efficiently via custom message aggregation protocols. DAKC can perform strong scaling up to 256 nodes (512 sockets / 6K cores) and can count k-mers up to 9x faster than the state-of-the-art distributed-memory algorithm, and up to 100x faster than the shared-memory alternative. We also provide an analytical model to understand the hardware resource utilization of our asynchronous KC algorithm and provide insights on the performance.
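For readers unfamiliar with the underlying problem, the sketch below shows k-mer counting in its simplest serial form, plus a chunked variant whose partial counts are merged. It is not DAKC and does not model its asynchronous messaging or aggregation; the function names and the chunking scheme are assumptions used only to show how partial counts over pieces of a sequence combine into the same result.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every length-k substring of sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def count_kmers_chunked(sequence, k, chunk_size):
    """Count k-mers chunk by chunk and merge the partial counts.

    Each chunk is extended by k - 1 characters so that k-mers spanning a
    chunk boundary are counted exactly once, by the chunk they start in.
    """
    total = Counter()
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size + k - 1]
        total.update(count_kmers(chunk, k))
    return total


counts = count_kmers("ACGTACGT", 3)
# counts: ACG -> 2, CGT -> 2, GTA -> 1, TAC -> 1
```

In a distributed setting, the merge step is where communication cost arises, since counts for the same k-mer may be produced on different nodes.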

2025-05-07T14:00:03Z Accepted at IEEE IPDPS 2025 Conference Souvadra Hati Akihiro Hayashi Richard Vuduc