https://arxiv.org/api/pYsnty6WWhuyBYihO+caas398Mc 2026-06-15T01:33:24Z 3849 525 15 http://arxiv.org/abs/2504.03550v1 Dimensionality reduction for k-means clustering of large-scale influenza mutation datasets 2025-04-04T15:57:48Z

Viral mutations pose significant threats to public health by increasing infectivity, strengthening vaccine resistance, and altering disease severity. To track these evolving patterns, agencies like the CDC annually evaluate thousands of virus strains, underscoring the urgent need to understand viral mutagenesis and evolution in depth. In this study, we integrate genomic analysis, clustering, and three leading dimensionality reduction approaches, namely principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP), to investigate the effects of COVID-19 on influenza virus propagation. By applying these methods to extensive pre- and post-pandemic influenza datasets, we reveal how selective pressures during the pandemic have influenced the diversity of influenza genetics. Our findings indicate that combining robust dimension reduction with clustering yields critical insights into the complex dynamics of viral mutation, informing both future research directions and strategies for public health intervention.
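
As a rough illustration of the clustering step mentioned above, here is a minimal k-means (Lloyd's algorithm) sketch in plain Python. It is not the authors' pipeline: the dimensionality-reduction step (PCA, t-SNE, or UMAP) is omitted, the input points are assumed to be already-reduced coordinates, and the function names and the deterministic "first k points" initialization are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, iterations: int = 100) -> List[int]:
    """Assign each point to one of k clusters using Lloyd's algorithm.

    Assumption for this sketch: centroids start at the first k points,
    which keeps the example deterministic but is not a recommended
    initialization for real data.
    """
    centroids = list(points[:k])
    labels: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical 2-D embedding coordinates forming two well-separated groups.
embedded = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
cluster_labels = kmeans(embedded, k=2)
```

In practice the input would be the low-dimensional embedding of the mutation data, and a library implementation with a better initialization would likely be preferred.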

2025-04-04T15:57:48Z Emilee Walden Jiahui Chen Guo-Wei Wei http://arxiv.org/abs/2504.02081v1 Addressing missing context in regulatory variation across primate evolution 2025-04-02T19:40:10Z

In primates, loci associated with adaptive trait variation often fall in non-coding regions. Understanding the mechanisms linking these regulatory variants to fitness-relevant phenotypes remains challenging, but can be addressed using functional genomic data. However, such data are rarely generated at scale in non-human primates. When they are, only select tissues, cell types, developmental stages, and cellular environments are typically considered, despite appreciation that adaptive variants often exhibit context-dependent effects. In this review, we 1) discuss why context-dependent regulatory loci might be especially evolutionarily relevant in primates, 2) explore challenges and emerging solutions for mapping such context-dependent variation, and 3) discuss the scientific questions these data could address. We argue that filling this gap will provide critical insights into evolutionary processes, human disease, and regulatory adaptation.

2025-04-02T19:40:10Z 20 pages, 2 figures Genevieve Housman Audrey Arner Amy Longtin Christian Gagnon Arun Durvasula Amanda Lea http://arxiv.org/abs/2504.01270v1 Defining the relationship between cathepsin B and esophageal adenocarcinoma: conjoint analysis of Mendelian randomization, transcriptome-wide association studies, and single-cell RNA sequencing data 2025-04-02T00:41:37Z

Background: Esophageal cancer poses a significant global health challenge, with the incidence of esophageal adenocarcinoma (EAC), a predominant subtype, increasing notably in Western countries. Cathepsins, a family of lysosomal proteolytic enzymes, have been implicated in the progression of various tumors. However, the causal relationship between the cathepsin family and EAC remains unresolved. Methods: To evaluate these potential causal associations, integrative analyses were conducted, combining Mendelian randomization (MR), transcriptome-wide association study (TWAS), single-cell RNA sequencing (scRNA-seq), and single-cell expression quantitative trait locus (sc-eQTL) analyses. Results: Univariable and multivariable MR analyses demonstrated that elevated levels of cathepsin B (CTSB) were associated with a reduced risk of EAC. The TWAS analysis identified a negative association between CTSB expression in esophageal tissue and EAC, consistent with experimental validation using immunohistochemistry. The scRNA-seq data analysis indicated that CTSB expression was predominantly localized in macrophages infiltrating EAC. Colocalization analysis incorporating sc-eQTL data specific to macrophages confirmed a shared causal variant between CTSB and macrophages. Additionally, MR analysis of CTSB and macrophage scavenger receptor (MSR) types I and II established their interrelationship, suggesting that CTSB may influence the proinflammatory phenotype of macrophages, ultimately affecting EAC risk. Conclusions: This integrative analysis, utilizing MR, TWAS, scRNA-seq, and sc-eQTL data, identified a significant causal association between CTSB and EAC, potentially mediated through macrophage MSR regulation. These findings suggest that targeting cathepsin B could represent a novel strategy for the diagnosis and treatment of EAC.

2025-04-02T00:41:37Z Jialin Li Shaokang Yang Xinliang Gao Mingbo Tang Xiaobo Ma Suyan Tian Wei Liu http://arxiv.org/abs/2504.03733v1 Artificial Intelligence and Deep Learning Algorithms for Epigenetic Sequence Analysis: A Review for Epigeneticists and AI Experts 2025-04-01T01:02:34Z

Epigenetics encompasses mechanisms that can alter the expression of genes without changing the underlying genetic sequence. The epigenetic regulation of gene expression is initiated and sustained by several mechanisms such as DNA methylation, histone modifications, chromatin conformation, and non-coding RNA. The changes in gene regulation and expression can manifest in the form of various diseases and disorders such as cancer and congenital deformities. Over the last few decades, high-throughput experimental approaches have been used to identify and understand epigenetic changes, but these laboratory experimental approaches and biochemical processes are time-consuming and expensive. To overcome these challenges, machine learning and artificial intelligence (AI) approaches have been extensively used for mapping epigenetic modifications to their phenotypic manifestations. In this paper we provide a narrative review of published research on AI models trained on epigenomic data to address a variety of problems such as prediction of disease markers, gene expression, enhancer-promoter interaction, and chromatin states. The purpose of this review is twofold, as it is addressed to both AI experts and epigeneticists. For AI researchers, we provide a taxonomy of epigenetics research problems that can benefit from an AI-based approach. For epigeneticists, given each of the above problems, we provide a list of candidate AI solutions in the literature. We have also identified several gaps in the literature, research challenges, and recommendations to address these challenges.

2025-04-01T01:02:34Z Computers in Biology and Medicine, vol. 183, 109302 (2024), Elsevier. Muhammad Tahir Mahboobeh Norouzi Shehroz S. Khan James R. Davie Soichiro Yamanaka Ahmed Ashraf 10.1016/j.compbiomed.2024.109302 http://arxiv.org/abs/2504.00306v1 LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions 2025-04-01T00:20:15Z

In mammalian and vertebrate genomes, the promoter region of a gene and its distal enhancers may be located millions of base pairs from each other, while a promoter may not interact with the closest enhancer. Since base-pair proximity is not a good indicator of these interactions, there is considerable work toward developing methods for predicting Enhancer-Promoter Interactions (EPI). Several machine learning methods have reported increasingly higher accuracies for predicting EPI. Typically, these approaches randomly split the dataset of Enhancer-Promoter (EP) pairs into training and testing subsets, followed by model training. However, this random splitting causes information leakage by assigning EP pairs from the same genomic region to both testing and training sets, leading to performance overestimation. In this paper we propose a more thorough training and testing paradigm, i.e., Leave-one-chromosome-out (LOCO) cross-validation, for EPI prediction. We demonstrate that a deep learning algorithm that gives higher accuracies when trained and tested in the random-splitting setting drops drastically in performance under the LOCO setting, confirming overestimation of performance. We further propose a novel hybrid deep neural network for EPI prediction that fuses k-mer features of the nucleotide sequence. We show that the hybrid architecture performs significantly better in the LOCO setting, demonstrating that it can learn more generalizable aspects of EP interactions. With this paper we are also releasing the LOCO splitting-based EPI dataset. Research data is available in this public repository: https://github.com/malikmtahir/EPI
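
The leave-one-chromosome-out idea described above can be sketched in a few lines of Python. This is only an illustration of the splitting scheme, not the authors' released code: the record layout (chromosome, enhancer ID, promoter ID, label), the function name `loco_splits`, and the example pairs are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical record layout: (chromosome, enhancer_id, promoter_id, label)
Pair = Tuple[str, str, str, int]


def loco_splits(pairs: List[Pair]) -> Iterator[Tuple[str, List[Pair], List[Pair]]]:
    """Yield (held_out_chromosome, train_pairs, test_pairs) for each chromosome.

    Every pair from the held-out chromosome goes to the test set, so no
    genomic region contributes to both training and testing in one fold.
    """
    by_chrom: Dict[str, List[Pair]] = defaultdict(list)
    for pair in pairs:
        by_chrom[pair[0]].append(pair)

    for held_out in sorted(by_chrom):
        test = by_chrom[held_out]
        train = [
            pair
            for chrom, group in by_chrom.items()
            if chrom != held_out
            for pair in group
        ]
        yield held_out, train, test


# Made-up example pairs on three chromosomes.
example_pairs: List[Pair] = [
    ("chr1", "enh_a", "prom_a", 1),
    ("chr1", "enh_b", "prom_a", 0),
    ("chr2", "enh_c", "prom_b", 1),
    ("chr3", "enh_d", "prom_c", 0),
]
folds = list(loco_splits(example_pairs))
```

A random split, by contrast, could place `enh_a`/`prom_a` in training and `enh_b`/`prom_a` in testing, which is the kind of leakage the abstract describes.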

2025-04-01T00:20:15Z Applied Intelligence, vol. 55, no. 1, pp. 1-16 (2025), Springer. Muhammad Tahir Shehroz S. Khan James Davie Soichiro Yamanaka Ahmed Ashraf 10.1007/s10489-024-05848-6 http://arxiv.org/abs/2503.23691v1 A Conceptual Framework for Human-AI Collaborative Genome Annotation 2025-03-31T03:44:00Z

Genome annotation is essential for understanding the functional elements within genomes. While automated methods are indispensable for processing large-scale genomic data, they often face challenges in accurately predicting gene structures and functions. Consequently, manual curation by domain experts remains crucial for validating and refining these predictions. These combined outcomes from automated tools and manual curation highlight the importance of integrating human expertise with AI capabilities to improve both the accuracy and efficiency of genome annotation. However, the manual curation process is inherently labor-intensive and time-consuming, making it difficult to scale for large datasets. To address these challenges, we propose a conceptual framework, Human-AI Collaborative Genome Annotation (HAICoGA), which leverages the synergistic partnership between humans and artificial intelligence to enhance human capabilities and accelerate the genome annotation process. Additionally, we explore the potential of integrating Large Language Models (LLMs) into this framework to support and augment specific tasks. Finally, we discuss emerging challenges and outline open research questions to guide further exploration in this area.

2025-03-31T03:44:00Z 17 pages, 3 figures Xiaomei Li Alex Whan Meredith McNeil David Starns Jessica Irons Samuel C. Andrew Rad Suchecki http://arxiv.org/abs/2502.11982v2 Single-Cell Proteomics Using Mass Spectrometry 2025-03-29T19:11:59Z

Single-cell proteomics (SCP) is transforming our understanding of biological complexity by shifting from bulk proteomics, where signals are averaged over thousands of cells, to the proteome analysis of individual cells. This granular perspective reveals distinct cell states, population heterogeneity, and the underpinnings of disease pathogenesis that bulk approaches may obscure. However, SCP demands exceptional sensitivity, precise cell handling, and robust data processing to overcome the inherent challenges of analyzing picogram-level protein samples without amplification. Recent innovations in sample preparation, separations, data acquisition strategies, and specialized mass spectrometry instrumentation have substantially improved proteome coverage and throughput. Approaches that integrate complementary omics, streamline multi-step sample processing, and automate workflows through microfluidics and specialized platforms promise to further push SCP boundaries. Advances in computational methods, especially for data normalization and imputation, address the pervasive issue of missing values, enabling more reliable downstream biological interpretations. Despite these strides, higher throughput, reproducibility, and consensus best practices remain pressing needs in the field. This mini review summarizes the latest progress in SCP technology and software solutions, highlighting how closer integration of analytical, computational, and experimental strategies will facilitate deeper and broader coverage of single-cell proteomes.

2025-02-17T16:22:55Z Amanda Momenzadeh Jesse G. Meyer http://arxiv.org/abs/2503.21546v1 consexpressionR: an R package for consensus differential gene expression analysis 2025-03-27T14:35:17Z

Motivation: Bulk RNA-Seq is a widely used method for studying gene expression across a variety of contexts. The significance of RNA-Seq studies has grown with the advent of high-throughput sequencing technologies. Computational methods have been developed for each stage of the identification of differentially expressed genes. Nevertheless, few studies have explored the combination of different types of methods. In this study, we evaluated the impact of combining methodologies on the results of differential expression analysis. Using two data sets with qPCR data (as a gold-standard reference), seven methods implemented in R packages (EBSeq, edgeR, DESeq2, limma, SAMseq, NOISeq, and Knowseq) were run and assessed separately and in combination. The results were evaluated against the adopted qPCR data. Results: Here, we introduce consexpressionR, an R package that automates differential expression analysis using a consensus of at least seven methodologies, producing more assertive results with a significant reduction in false positives. Availability: consexpressionR is an R package; source code and support are available on GitHub (https://github.com/costasilvati/consexpressionR).
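
The consensus idea can be illustrated with a simple vote count: a gene is reported only if enough methods call it differentially expressed. The sketch below is in Python and is not the consexpressionR API (which is an R package); the function name `consensus_genes`, the vote threshold, and the gene names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def consensus_genes(calls: Dict[str, Set[str]], min_votes: int) -> Set[str]:
    """Return genes flagged as differentially expressed by at least min_votes methods.

    `calls` maps a method name to the set of genes that method reported.
    """
    votes = Counter(gene for genes in calls.values() for gene in genes)
    return {gene for gene, count in votes.items() if count >= min_votes}


# Made-up per-method results; method names are taken from the abstract,
# gene names are placeholders.
method_calls = {
    "edgeR": {"geneA", "geneB", "geneC"},
    "DESeq2": {"geneA", "geneB"},
    "limma": {"geneA", "geneD"},
}
agreed = consensus_genes(method_calls, min_votes=2)
```

Raising `min_votes` trades sensitivity for fewer false positives, which is the trade-off the abstract points to.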

2025-03-27T14:35:17Z Juliana Costa-Silva David Menotti Fabricio M. Lopes http://arxiv.org/abs/2503.20451v1 Agptools: a utility suite for editing genome assemblies 2025-03-26T11:26:21Z

The AGP format is a tab-separated table format describing how components of a genome assembly fit together. A standard submission format for genome assemblies is a fasta file giving the sequence of contigs along with an AGP file showing how these components are assembled into larger pieces like scaffolds or chromosomes. For this reason, many scaffolding software pipelines output assemblies in this format. However, although many programs for assembling and scaffolding genomes read and write this format, there is currently no published software for making edits to AGP files when performing assembly curation. We present agptools, a suite of command-line programs that can perform common operations on AGP files, such as breaking and joining sequences, inverting pieces of assembly components, assembling contigs into larger sequences based on an AGP file, and transforming between coordinate systems of different assembly layouts. Additionally, agptools includes an API that writers of other software packages can use to read, write, and manipulate AGP files within their own programs. Agptools gives bioinformaticians a simple, robust, and reproducible way to edit genome assemblies that avoids the shortfalls of other methods for editing AGP files.
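
To make the table layout concrete, the following Python sketch parses a small AGP fragment. It is not the agptools API; the class and function names are invented for illustration, and the column layout follows the public AGP specification as understood here (five shared columns, then either component or gap columns depending on the component type). The scaffold and contig names are made up.

```python
from dataclasses import dataclass
from typing import List, Optional

GAP_TYPES = {"N", "U"}  # component types that mark gap rows


@dataclass
class AgpRow:
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    # Filled in for sequence rows only.
    component_id: Optional[str] = None
    component_beg: Optional[int] = None
    component_end: Optional[int] = None
    orientation: Optional[str] = None
    # Filled in for gap rows only.
    gap_length: Optional[int] = None
    gap_type: Optional[str] = None

    @property
    def is_gap(self) -> bool:
        return self.component_type in GAP_TYPES


def parse_agp(text: str) -> List[AgpRow]:
    """Parse tab-separated AGP text, skipping blank lines and '#' comments."""
    rows: List[AgpRow] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        row = AgpRow(fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4])
        if row.is_gap:
            row.gap_length = int(fields[5])
            row.gap_type = fields[6]
        else:
            row.component_id = fields[5]
            row.component_beg = int(fields[6])
            row.component_end = int(fields[7])
            row.orientation = fields[8]
        rows.append(row)
    return rows


# Hypothetical scaffold built from two contigs joined by a 100 bp gap.
EXAMPLE_AGP = (
    "# example only\n"
    "scaffold_1\t1\t100\t1\tW\tcontig_1\t1\t100\t+\n"
    "scaffold_1\t101\t200\t2\tN\t100\tscaffold\tyes\tpaired-ends\n"
    "scaffold_1\t201\t350\t3\tW\tcontig_2\t1\t150\t-\n"
)
agp_rows = parse_agp(EXAMPLE_AGP)
```

Operations such as breaking, joining, or inverting components amount to rewriting rows like these while keeping the object coordinates consistent.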

2025-03-26T11:26:21Z Edward S. Ricemeyer Rachel A. Carroll Wesley C. Warren http://arxiv.org/abs/2504.06282v1 ProHap Explorer: Visualizing Haplotypes in Proteogenomic Datasets 2025-03-25T14:48:20Z

In mass spectrometry-based proteomics, experts usually project data onto a single set of reference sequences, overlooking the influence of common haplotypes (combinations of genetic variants inherited together from a parent). We recently introduced ProHap, a tool for generating customized protein haplotype databases. Here, we present ProHap Explorer, a visualization interface designed to investigate the influence of common haplotypes on the human proteome. It enables users to explore haplotypes, their effects on protein sequences, and the identification of non-canonical peptides in public mass spectrometry datasets. The design builds on well-established representations in biological sequence analysis, ensuring familiarity for domain experts while integrating novel interactive elements tailored to proteogenomic data exploration. User interviews with proteomics experts confirmed the tool's utility, highlighting its ability to reveal whether haplotypes affect proteins of interest. By facilitating the intuitive exploration of proteogenomic variation, ProHap Explorer supports research in personalized medicine and the development of targeted therapies.

2025-03-25T14:48:20Z Jakub Vašíček Dafni Skiadopoulou Ksenia G. Kuznetsova Lukas Käll Marc Vaudel Stefan Bruckner http://arxiv.org/abs/2411.02796v2 Specialized Foundation Models Struggle to Beat Supervised Baselines 2025-03-21T03:59:29Z

Following its success for vision and text, the "foundation model" (FM) paradigm -- pretraining large models on massive data, then fine-tuning on target tasks -- has rapidly expanded to domains in the sciences, engineering, healthcare, and beyond. Has this achieved what the original FMs accomplished, i.e. the supplanting of traditional supervised learning in their domains? To answer we look at three modalities -- genomics, satellite imaging, and time series -- with multiple recent FMs and compare them to a standard supervised learning workflow: model development, hyperparameter tuning, and training, all using only data from the target task. Across these three specialized domains, we find that it is consistently possible to train simple supervised models -- no more complicated than a lightly modified wide ResNet or UNet -- that match or even outperform the latest foundation models. Our work demonstrates that the benefits of large-scale pretraining have yet to be realized in many specialized areas, reinforces the need to compare new FMs to strong, well-tuned baselines, and introduces two new, easy-to-use, open-source, and automated workflows for doing so.

2024-11-05T04:10:59Z The first two authors contributed equally. The order was determined by coin flip Zongzhe Xu Ritvik Gupta Wenduo Cheng Alexander Shen Junhong Shen Ameet Talwalkar Mikhail Khodak http://arxiv.org/abs/2503.17418v1 Application of Single-cell Deep Learning in Elucidating the Mapping Relationship Between Visceral and Body Surface Inflammatory Patterns 2025-03-21T03:53:25Z

As a system of integrated homeostasis, life is susceptible to disruptions by visceral inflammation, which can disturb the equilibrium of the internal environment. The role of body-spread subcutaneous fascia (scFascia) in this process is poorly understood. In the rat model of Salmonella-induced dysentery, scRNA-seq of scFascia and deep-learning analysis revealed Warburg-like metabolic reprogramming in macrophages (MPs) with reduced citrate cycle activity. Cd34+/Pdgfra+ telocytes (CPTCs) regulated MP differentiation and proliferation via Wnt/Fgf signaling, suggesting a pathological crosstalk pattern in the scFascia, herein termed the fascia-visceral inflammatory crosstalk pattern (FVICP). PySCENIC analysis indicated increased activity of the transcription factors Fosl1, Nfkb2, and Atf4, modulated by CPTC signaling to MPs, downregulating aerobic respiration and upregulating cell cycle, DNA replication, and transcription. This study highlights scFascia's role in immunomodulation and metabolic reprogramming during visceral inflammation, underscoring its function in systemic homeostasis.

2025-03-21T03:53:25Z 25 pages, 7 figures, under review Haixiang Huang Bingbing Shen Zhenwei Zhang Jianming Yue Lu Mei Qiusheng Chen http://arxiv.org/abs/2503.16351v1 Lyra: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences 2025-03-20T17:09:18Z

Deep learning architectures such as convolutional neural networks and Transformers have revolutionized biological sequence modeling, with recent advances driven by scaling up foundation and task-specific models. The computational resources and large datasets required, however, limit their applicability in biological contexts. We introduce Lyra, a subquadratic architecture for sequence modeling, grounded in the biological framework of epistasis for understanding sequence-to-function relationships. Mathematically, we demonstrate that state space models efficiently capture global epistatic interactions and combine them with projected gated convolutions for modeling local relationships. We demonstrate that Lyra is performant across over 100 wide-ranging biological tasks, achieving state-of-the-art (SOTA) performance in many key areas, including protein fitness landscape prediction, biophysical property prediction (e.g. disordered protein region functions), peptide engineering applications (e.g. antibody binding, cell-penetrating peptide prediction), RNA structure analysis, RNA function prediction, and CRISPR guide design. It achieves this with orders-of-magnitude improvements in inference speed and reduction in parameters (up to 120,000-fold in our tests) compared to recent biology foundation models. Using Lyra, we were able to train and run every task in this study on two or fewer GPUs in under two hours, democratizing access to biological sequence modeling at SOTA performance, with potential applications to many fields.

2025-03-20T17:09:18Z 53 pages, 5 figures Krithik Ramesh Broad Institute of MIT and Harvard Massachusetts Institute of Technology Sameed M. Siddiqui Broad Institute of MIT and Harvard Computational and Systems Biology Program, Massachusetts Institute of Technology Albert Gu Machine Learning Department, Carnegie Mellon University Michael D. Mitzenmacher Broad Institute of MIT and Harvard School of Engineering and Applied Sciences, Harvard University Pardis C. Sabeti Broad Institute of MIT and Harvard Department of Organismic and Evolutionary Biology, Harvard University Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University Howard Hughes Medical Institute http://arxiv.org/abs/2503.16582v1 Machine Learning-Based Genomic Linguistic Analysis (Gene Sequence Feature Learning): A Case Study on Predicting Heavy Metal Response Genes in Rice 2025-03-20T13:41:31Z

This study explores the application of machine learning-based genetic linguistics for identifying heavy metal response genes in rice (Oryza sativa). By integrating convolutional neural networks and random forest algorithms, we developed a hybrid model capable of extracting and learning meaningful features from gene sequences, such as k-mer frequencies and physicochemical properties. The model was trained and tested on datasets of genes, achieving high predictive performance (precision: 0.89, F1-score: 0.82). RNA-seq and qRT-PCR experiments conducted on rice leaves exposed to Hg0 revealed differential expression of genes associated with heavy metal responses, which validated the model's predictions. Co-expression network analysis identified 103 related genes, and a literature review indicated that these genes are highly likely to be involved in heavy metal-related biological processes. By integrating and comparing the analysis results with those of differentially expressed genes (DEGs), the validity of the new machine learning method was further demonstrated. This study highlights the efficacy of combining machine learning with genetic linguistics for large-scale gene prediction. It demonstrates a cost-effective and efficient approach for uncovering molecular mechanisms underlying heavy metal responses, with potential applications in developing stress-tolerant crop varieties.
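
The k-mer frequency features mentioned above can be computed as follows. This is a minimal sketch of one common definition (normalized counts of overlapping k-mers over the alphabet A, C, G, T), not the authors' feature extraction code; the function name, the default k, and the choice to skip windows containing other characters are assumptions.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_frequencies(sequence: str, k: int = 3) -> List[float]:
    """Return a fixed-length vector of k-mer frequencies for a DNA sequence.

    The vector has 4**k entries ordered lexicographically (AAA, AAC, ...).
    Windows containing characters other than A, C, G, T are ignored
    (an assumption of this sketch).
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(letters) for letters in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# Toy sequence: the overlapping 2-mers are AC, CG, and GT.
features = kmer_frequencies("ACGT", k=2)
```

A vector like this has the same length for every gene, so it can be fed directly to a classifier such as a random forest or used as one input branch of a neural network.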

2025-03-20T13:41:31Z Ruiqi Yang Jianxu Wang Wei Yuan Xun Wang Mei Li http://arxiv.org/abs/2503.16565v1 Gene42: Long-Range Genomic Foundation Model With Dense Attention 2025-03-20T07:10:04Z

We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.

2025-03-20T07:10:04Z Kirill Vishniakov Boulbaba Ben Amor Engin Tekin Nancy A. ElNaker Karthik Viswanathan Aleksandr Medvedev Aahan Singh Maryam Nadeem Mohammad Amaan Sayeed Praveenkumar Kanithi Tiago Magalhaes Natalia Vassilieva Dwarikanath Mahapatra Marco Pimentel and Shadab Khan