https://arxiv.org/api/d3WYpL8XZ6HFBl2dELvM4bnWbxY 2026-06-14T19:48:40Z 3848 450 15 http://arxiv.org/abs/2506.10886v2 S3Mirror: Making Genomic Data Transfers Fast, Reliable, and Observable with DBOS 2025-06-13T01:44:10Z

To meet the needs of a large pharmaceutical organization, we set out to create S3Mirror - an application for transferring large genomic sequencing datasets between S3 buckets quickly, reliably, and observably. We used the DBOS Transact durable execution framework to achieve these goals and benchmarked the performance and cost of the application. S3Mirror is an open source DBOS Python application that can run in a variety of environments, including DBOS Cloud Pro, where it runs as much as 40x faster than AWS DataSync at a fraction of the cost. Moreover, S3Mirror is resilient to failures and allows for real-time filewise observability of ongoing and past transfers.

2025-06-12T16:50:04Z Steven Vasquez-Grinnell Alex Poliakov http://arxiv.org/abs/2506.11182v1 Multimodal Modeling of CRISPR-Cas12 Activity Using Foundation Models and Chromatin Accessibility Data 2025-06-12T16:15:14Z

Predicting guide RNA (gRNA) activity is critical for effective CRISPR-Cas12 genome editing but remains challenging due to limited data, variation across protospacer adjacent motifs (PAMs, the short sequence requirements for Cas binding), and reliance on large-scale training. We investigate whether a pre-trained biological foundation model originally trained on transcriptomic data can improve gRNA activity estimation even without domain-specific pre-training. Using embeddings from an existing RNA foundation model as input to a lightweight regressor, we show substantial gains over traditional baselines. We also integrate chromatin accessibility data to capture regulatory context, improving performance further. Our results highlight the effectiveness of pre-trained foundation models and chromatin accessibility data for gRNA activity prediction.

2025-06-12T16:15:14Z This manuscript has been accepted by ICML workshop 2025 Azim Dehghani Amirabad Yanfei Zhang Artem Moskalev Sowmya Rajesh Tommaso Mansi Shuwei Li Mangal Prakash Rui Liao http://arxiv.org/abs/2501.02028v2 Selecting ChIP-seq Normalization Methods from the Perspective of their Technical Conditions 2025-06-11T23:02:07Z

Chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq) provides insights into both the genomic location occupied by the protein of interest and the difference in DNA occupancy between experimental states. Given that ChIP-seq data is collected experimentally, an important step for determining regions with differential DNA occupancy between states is between-sample normalization. While between-sample normalization is crucial for downstream differential binding analysis, the technical conditions underlying between-sample normalization methods have yet to be examined for ChIP-seq. We identify three important technical conditions underlying ChIP-seq between-sample normalization methods: balanced differential DNA occupancy, equal total DNA occupancy, and equal background binding across states. To illustrate the importance of satisfying the selected normalization method's technical conditions for downstream differential binding analysis, we simulate ChIP-seq read count data where different combinations of the technical conditions are violated. We then externally verify our simulation results using experimental data. Based on our findings, we suggest that researchers use their understanding of the ChIP-seq experiment at hand to guide their choice of between-sample normalization method. Alternatively, researchers can use a high-confidence peakset, which is the intersection of the differentially bound peaksets obtained from using different between-sample normalization methods. In our two experimental analyses, roughly half of the called peaks were called as differentially bound for every normalization method. High-confidence peaks are less sensitive to the choice of between-sample normalization method and could be a more robust basis for identifying genomic regions with differential DNA occupancy between experimental states when there is uncertainty about which technical conditions are satisfied.
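
The high-confidence peakset described above is a plain set intersection, which can be sketched in a few lines. This is only an illustration: the method names and peak IDs are hypothetical, and it assumes the peaks from every normalization method have already been mapped to a shared set of peak identifiers (real peaks are genomic intervals and may need to be matched by overlap first).

```python
from functools import reduce


def high_confidence_peakset(peaksets_by_method):
    """Return the peaks called as differentially bound under every method.

    `peaksets_by_method` maps a normalization-method name to an iterable of
    peak identifiers called as differentially bound with that method.
    """
    peaksets = [set(peaks) for peaks in peaksets_by_method.values()]
    if not peaksets:
        return set()
    # Intersect all per-method peaksets pairwise.
    return reduce(set.intersection, peaksets)


# Hypothetical differentially bound peak IDs per normalization method.
calls = {
    "method_a": {"peak1", "peak2", "peak3", "peak5"},
    "method_b": {"peak2", "peak3", "peak4", "peak5"},
    "method_c": {"peak2", "peak3", "peak5", "peak6"},
}
high_conf = high_confidence_peakset(calls)
```

In this toy input, only the peaks shared by all three methods survive, so the result does not depend on which method was chosen.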

2025-01-03T06:24:06Z Sara Colando Danae Schulz Johanna Hardin http://arxiv.org/abs/2506.11158v1 Brain-wide interpolation and conditioning of gene expression in the human brain using Implicit Neural Representations 2025-06-11T17:03:13Z

In this paper, we study the efficacy and utility of recent advances in non-local, non-linear image interpolation and extrapolation algorithms, specifically, ideas based on Implicit Neural Representations (INR), as a tool for analysis of spatial transcriptomics data. We seek to utilize the microarray gene expression data sparsely sampled in the healthy human brain, and produce fully resolved spatial maps of any given gene across the whole brain at a voxel-level resolution. To do so, we first obtained the top 100 Alzheimer's disease (AD) risk genes, whose baseline spatial transcriptional profiles were obtained from the Allen Human Brain Atlas (AHBA). We adapted Implicit Neural Representation models so that the pipeline can produce robust voxel-resolution quantitative maps of all genes. We present a variety of experiments using interpolations obtained from Abagen as a baseline/reference.

2025-06-11T17:03:13Z Xizheng Yu Justin Torok Sneha Pandya Sourav Pal Vikas Singh Ashish Raj http://arxiv.org/abs/2506.05443v1 UniPTMs: The First Unified Multi-type PTM Site Prediction Model via Master-Slave Architecture-Based Multi-Stage Fusion Strategy and Hierarchical Contrastive Loss 2025-06-05T13:02:43Z

As a core mechanism of epigenetic regulation in eukaryotes, protein post-translational modifications (PTMs) require precise prediction to decipher dynamic life activity networks. To address the limitations of existing deep learning models in cross-modal feature fusion, domain generalization, and architectural optimization, this study proposes UniPTMs: the first unified framework for multi-type PTM prediction. The framework innovatively establishes a "Master-Slave" dual-path collaborative architecture: The master path dynamically integrates high-dimensional representations of protein sequences, structures, and evolutionary information through a Bidirectional Gated Cross-Attention (BGCA) module, while the slave path optimizes feature discrepancies and recalibration between structural and traditional features using a Low-Dimensional Fusion Network (LDFN). Complemented by a Multi-scale Adaptive convolutional Pyramid (MACP) for capturing local feature patterns and a Bidirectional Hierarchical Gated Fusion Network (BHGFN) enabling multi-level feature integration across paths, the framework employs a Hierarchical Dynamic Weighting Fusion (HDWF) mechanism to intelligently aggregate multimodal features. Enhanced by a novel Hierarchical Contrastive loss function for feature consistency optimization, UniPTMs demonstrates significant performance improvements (3.2%-11.4% MCC and 4.2%-14.3% AP increases) over state-of-the-art models across five modification types and transcends the Single-Type Prediction Paradigm. To strike a balance between model complexity and performance, we have also developed a lightweight variant named UniPTMs-mini.
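
The MCC figures quoted above refer to the Matthews correlation coefficient. As a reminder of what that metric measures, here is a minimal sketch that computes it from the four cells of a binary confusion matrix. The function name and the convention of returning 0.0 when the denominator is zero are choices made for this illustration, not details taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns a value in [-1, 1]; 0.0 is returned when any marginal total is
    zero and the coefficient is otherwise undefined (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0
```

Unlike accuracy, the coefficient uses all four cells, which is why it is often preferred for imbalanced tasks such as modification-site prediction.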

2025-06-05T13:02:43Z Yiyu Lin Yan Wang You Zhou Xinye Ni Jiahui Wu Sen Yang http://arxiv.org/abs/2506.04303v1 Knowledge-guided Contextual Gene Set Analysis Using Large Language Models 2025-06-04T15:56:57Z

Gene set analysis (GSA) is a foundational approach for interpreting genomic data of diseases by linking genes to biological processes. However, conventional GSA methods overlook the clinical context of the analyses, often generating long lists of enriched pathways with redundant, nonspecific, or irrelevant results. Interpreting these requires extensive, ad-hoc manual effort, reducing both reliability and reproducibility. To address this limitation, we introduce cGSA, a novel AI-driven framework that enhances GSA by incorporating context-aware pathway prioritization. cGSA integrates gene cluster detection, enrichment analysis, and large language models to identify pathways that are not only statistically significant but also biologically meaningful. Benchmarking on 102 manually curated gene sets across 19 diseases and ten disease-related biological mechanisms shows that cGSA outperforms baseline methods by over 30%, with expert validation confirming its increased precision and interpretability. Two independent case studies in melanoma and breast cancer further demonstrate its potential to uncover context-specific insights and support targeted hypothesis generation.
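
For readers unfamiliar with the enrichment-analysis step, the sketch below shows the classic over-representation test based on the hypergeometric distribution. The abstract does not say which enrichment test cGSA uses, so this is an assumed stand-in for illustration; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(overlap, gene_set_size, pathway_size, background_size):
    """One-sided hypergeometric p-value: P(overlap >= observed).

    Models drawing `gene_set_size` genes from a background of
    `background_size` genes, of which `pathway_size` belong to the pathway.
    """
    max_overlap = min(gene_set_size, pathway_size)
    total = comb(background_size, gene_set_size)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(
        comb(pathway_size, k)
        * comb(background_size - pathway_size, gene_set_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total
```

A conventional pipeline would compute this for every pathway and correct for multiple testing, which is where the long, redundant result lists mentioned above tend to come from.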

2025-06-04T15:56:57Z 56 pages, 9 figures, 1 table Zhizheng Wang Chi-Ping Day Chih-Hsuan Wei Qiao Jin Robert Leaman Yifan Yang Shubo Tian Aodong Qiu Yin Fang Qingqing Zhu Xinghua Lu Zhiyong Lu http://arxiv.org/abs/2506.02213v1 Quantum Ensembling Methods for Healthcare and Life Science 2025-06-02T19:54:51Z

Learning on small data is a challenge frequently encountered in many real-world applications. In this work we study how effective quantum ensemble models are when trained on small-data problems in healthcare and life sciences. We constructed multiple types of quantum ensembles for binary classification using up to 26 qubits in simulation and 56 qubits on quantum hardware. Our ensemble designs use minimal trainable parameters but require long-range connections between qubits. We tested these quantum ensembles on synthetic datasets and gene expression data from renal cell carcinoma patients with the task of predicting patient response to immunotherapy. From the performance observed in simulation and initial hardware experiments, we demonstrate how quantum embedding structure affects performance and discuss how to extract informative features and build models that can learn and generalize effectively. We present these exploratory results in order to assist other researchers in the design of effective learning on small data using ensembles. Incorporating quantum computing in these data-constrained problems offers hope for a wide range of studies in healthcare and life sciences where biological samples are relatively scarce given the feature space to be explored.

2025-06-02T19:54:51Z Kahn Rhrissorrakrai Kathleen E. Hamilton Prerana Bangalore Parthsarathy Aldo Guzman-Saenz Tyler Alban Filippo Utro Laxmi Parida http://arxiv.org/abs/2506.02212v1 Leveraging Natural Language Processing to Unravel the Mystery of Life: A Review of NLP Approaches in Genomics, Transcriptomics, and Proteomics 2025-06-02T19:54:03Z

Natural Language Processing (NLP) has transformed various fields beyond linguistics by applying techniques originally developed for human language to the analysis of biological sequences. This review explores the application of NLP methods to biological sequence data, focusing on genomics, transcriptomics, and proteomics. We examine how various NLP methods, from classic approaches like word2vec to advanced models employing transformers and hyena operators, are being adapted to analyze DNA, RNA, protein sequences, and entire genomes. The review also examines tokenization strategies and model architectures, evaluating their strengths, limitations, and suitability for different biological tasks. We further cover recent advances in NLP applications for biological data, such as structure prediction, gene expression, and evolutionary analysis, highlighting the potential of these methods for extracting meaningful insights from large-scale genomic data. As language models continue to advance, their integration into bioinformatics holds immense promise for advancing our understanding of biological processes in all domains of life.
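
To make the idea of a tokenization strategy concrete, here is a minimal sketch of k-mer tokenization, one common way of turning a DNA sequence into word-like tokens. The abstract does not list the specific strategies the review covers, so this is offered as an assumed example; the function name and defaults are hypothetical.

```python
def kmer_tokenize(sequence, k=3, stride=1):
    """Split a nucleotide sequence into k-mers.

    stride=1 yields overlapping k-mers; stride=k yields non-overlapping ones.
    Sequences shorter than k produce an empty list.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]
```

Overlapping k-mers preserve positional context at the cost of longer token sequences, while non-overlapping k-mers are shorter but sensitive to the reading offset.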

2025-06-02T19:54:03Z Ella Rannon David Burstein http://arxiv.org/abs/2506.01833v1 SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model 2025-06-02T16:23:05Z

Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our $\textbf{S}$pecies-$\textbf{P}$rofile $\textbf{A}$daptive $\textbf{C}$ollaborative $\textbf{E}$xperts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. The code is available at https://github.com/ZhuJiwei111/SPACE.

2025-06-02T16:23:05Z Accepted to ICML 2025 Zhao Yang Jiwei Zhu Bing Su http://arxiv.org/abs/2506.01560v1 SPAC: A Python Package for Spatial Single-Cell Analysis of Multiplex Imaging 2025-06-02T11:36:32Z

Multiplexed immunofluorescence microscopy captures detailed measurements of spatially resolved, multiple biomarkers simultaneously, revealing tissue composition and cellular interactions in situ among single cells. The growing scale and dimensional complexity of these datasets demand reproducible, comprehensive and user-friendly computational tools. To address this need, we developed SPAC (SPAtial single-Cell analysis), a Python-based package and a corresponding shiny application within an integrated, modular SPAC ecosystem (Liu et al., 2025) designed specifically for biologists without extensive coding expertise. Following image segmentation and extraction of spatially resolved single-cell data, SPAC streamlines downstream phenotyping and spatial analysis, facilitating characterization of cellular heterogeneity and spatial organization within tissues. Through scalable performance, specialized spatial statistics, highly customizable visualizations, and seamless workflows from dataset to insights, SPAC significantly lowers barriers to sophisticated spatial analyses.

2025-06-02T11:36:32Z 7 pages, 1 figure; pre-print submitted to the *Journal of Open Source Software (JOSS)* Fang Liu Rui He Andrei Bombin Ahmad B. Abdallah Omar Eldaghar Tommy R. Sheeley Sam E. Ying George Zaki http://arxiv.org/abs/2506.01456v1 GenDMR: A dynamic multimodal role-swapping network for identifying risk gene phenotypes 2025-06-02T09:12:53Z

Recent studies have shown that integrating multimodal data fusion techniques for imaging and genetic features is beneficial for the etiological analysis and predictive diagnosis of Alzheimer's disease (AD). However, there are several critical flaws in current deep learning methods. Firstly, there has been insufficient discussion and exploration regarding the selection and encoding of genetic information. Secondly, due to the significantly superior classification value of AD imaging features compared to genetic features, many studies in multimodal fusion emphasize the strengths of imaging features, actively mitigating the influence of weaker features, thereby diminishing the learning of the unique value of genetic features. To address these issues, this study proposes the dynamic multimodal role-swapping network (GenDMR). In GenDMR, we develop a novel approach to encode the spatial organization of single nucleotide polymorphisms (SNPs), enhancing the representation of their genomic context. Additionally, to adaptively quantify the disease risk of SNPs and brain regions, we propose a multi-instance attention module to enhance model interpretability. Furthermore, we introduce a dominant modality selection module and a contrastive self-distillation module, combining them to achieve a dynamic teacher-student role exchange mechanism based on dominant and auxiliary modalities for bidirectional co-updating of different modal data. Finally, GenDMR achieves state-of-the-art performance on the ADNI public dataset and visualizes attention to different SNPs, focusing on confirming 12 potential high-risk genes related to AD, including the most classic APOE and recently highlighted significant risk genes. This demonstrates GenDMR's interpretable analytical capability in exploring AD genetic features, providing new insights and perspectives for the development of multimodal data fusion techniques.

2025-06-02T09:12:53Z 31 pages, 9 figures Lina Qin Cheng Zhu Chuqi Zhou Yukun Huang Jiayi Zhu Ping Liang Jinju Wang Yixing Huang Cheng Luo Dezhong Yao Ying Tan http://arxiv.org/abs/2506.00673v1 DuAL-Net: A Hybrid Framework for Alzheimer's Disease Prediction from Whole-Genome Sequencing via Local SNP Windows and Global Annotations 2025-05-31T18:53:19Z

Alzheimer's disease (AD) dementia is the most common form of dementia. With the emergence of disease-modifying therapies, predicting disease risk before symptom onset has become critical. We introduce DuAL-Net, a hybrid deep learning framework for AD dementia prediction using whole genome sequencing (WGS) data. DuAL-Net integrates two components: local probability modeling, which segments the genome into non-overlapping windows, and global annotation-based modeling, which annotates SNPs and reorganizes WGS input to capture long-range functional relationships. Both employ out-of-fold stacking with TabNet and Random Forest classifiers. Final predictions combine local and global probabilities using an optimized weighting parameter alpha. We analyzed WGS data from 1,050 individuals (443 cognitively normal, 607 AD dementia) using five-fold cross-validation. DuAL-Net achieved an AUC of 0.671 using top-ranked SNPs, representing 35.0% and 20.3% higher performance than bottom-ranked and randomly selected SNPs, respectively. ROC analysis demonstrated strong positive correlation between SNP prioritization rank and predictive power. The model identified known AD-associated SNPs as top contributors alongside potentially novel variants. DuAL-Net presents a promising framework improving both predictive accuracy and biological interpretability. The framework and web implementation offer an accessible platform for broader research applications.
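
Two of the mechanical steps described above, segmenting SNPs into non-overlapping windows and blending local and global probabilities with a weight alpha, can be sketched as follows. The abstract does not give the exact combination rule, so the convex combination here is an assumption, and the function names are hypothetical.

```python
def non_overlapping_windows(snps, window_size):
    """Split an ordered list of SNPs into consecutive, non-overlapping windows.

    The final window may be shorter than `window_size`.
    """
    if window_size <= 0:
        raise ValueError("window_size must be a positive integer")
    return [snps[i:i + window_size] for i in range(0, len(snps), window_size)]


def combine_probabilities(p_local, p_global, alpha):
    """Blend local and global probabilities (assumed convex combination).

    alpha=1.0 uses only the local model; alpha=0.0 uses only the global one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_local + (1.0 - alpha) * p_global
```

Under this assumption, optimizing alpha amounts to a one-dimensional search over [0, 1] on held-out folds.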

2025-05-31T18:53:19Z Eun Hye Lee Taeho Jo http://arxiv.org/abs/2506.00662v1 Uncertainty-Aware Genomic Classification of Alzheimer's Disease: A Transformer-Based Ensemble Approach with Monte Carlo Dropout 2025-05-31T18:20:49Z

INTRODUCTION: Alzheimer's disease (AD) is genetically complex, complicating robust classification from genomic data. METHODS: We developed a transformer-based ensemble model (TrUE-Net) using Monte Carlo Dropout for uncertainty estimation in AD classification from whole-genome sequencing (WGS). We combined a transformer that preserves single-nucleotide polymorphism (SNP) sequence structure with a concurrent random forest using flattened genotypes. An uncertainty threshold separated samples into an uncertain (high-variance) group and a more certain (low-variance) group. RESULTS: We analyzed 1050 individuals, holding out half for testing. Overall accuracy and area under the receiver operating characteristic (ROC) curve (AUC) were 0.6514 and 0.6636, respectively. Excluding the uncertain group improved accuracy from 0.6263 to 0.7287 (an increase of 10.24 percentage points) and F1 from 0.5843 to 0.8205 (an increase of 23.62 percentage points). DISCUSSION: Monte Carlo Dropout-driven uncertainty helps identify ambiguous cases that may require further clinical evaluation, thus improving reliability in AD genomic classification.
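
The uncertainty split described in METHODS can be illustrated with a small sketch. It assumes each sample already has several predicted probabilities from repeated stochastic (dropout-enabled) forward passes, and it uses the population variance of those probabilities as the uncertainty score. The sample IDs, probabilities, and threshold are hypothetical; the paper's actual variance definition and threshold are not stated in the abstract.

```python
from statistics import mean, pvariance


def split_by_uncertainty(mc_predictions, variance_threshold):
    """Partition samples into (certain, uncertain) by prediction variance.

    `mc_predictions` maps a sample ID to the probabilities obtained from
    repeated dropout-enabled forward passes for that sample.
    Each record is (sample_id, mean_probability, variance).
    """
    certain, uncertain = [], []
    for sample_id, probs in mc_predictions.items():
        record = (sample_id, mean(probs), pvariance(probs))
        if record[2] > variance_threshold:
            uncertain.append(record)
        else:
            certain.append(record)
    return certain, uncertain


# Hypothetical Monte Carlo Dropout outputs for two samples.
mc_predictions = {
    "s1": [0.90, 0.88, 0.92],  # passes agree: low variance
    "s2": [0.20, 0.70, 0.50],  # passes disagree: high variance
}
certain, uncertain = split_by_uncertainty(mc_predictions, variance_threshold=0.01)
```

Metrics can then be reported on the certain group alone, while the uncertain group is flagged for further review.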

2025-05-31T18:20:49Z Taeho Jo Eun Hye Lee Alzheimer's Disease Sequencing Project http://arxiv.org/abs/2506.00410v1 JojoSCL: Shrinkage Contrastive Learning for single-cell RNA sequence Clustering 2025-05-31T05:59:56Z

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular processes by enabling gene expression analysis at the individual cell level. Clustering allows for the identification of cell types and the further discovery of intrinsic patterns in single-cell data. However, the high dimensionality and sparsity of scRNA-seq data continue to challenge existing clustering models. In this paper, we introduce JojoSCL, a novel self-supervised contrastive learning framework for scRNA-seq clustering. By incorporating a shrinkage estimator based on hierarchical Bayesian estimation, which adjusts gene expression estimates towards more reliable cluster centroids to reduce intra-cluster dispersion, and optimized using Stein's Unbiased Risk Estimate (SURE), JojoSCL refines both instance-level and cluster-level contrastive learning. Experiments on ten scRNA-seq datasets substantiate that JojoSCL consistently outperforms prevalent clustering methods, with further validation of its practicality through robustness analysis and ablation studies. JojoSCL's code is available at: https://github.com/ziwenwang28/JojoSCL.
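
The core idea of shrinking expression estimates toward a cluster centroid can be sketched in a few lines. This is a deliberately simplified illustration: it uses a fixed, user-supplied shrinkage weight, whereas the abstract describes a hierarchical Bayesian estimator tuned with SURE. The function name and the toy data are hypothetical.

```python
def shrink_toward_centroid(cells, shrinkage):
    """Move every cell's expression vector toward the cluster centroid.

    shrinkage=0.0 leaves the cells unchanged; shrinkage=1.0 collapses every
    cell onto the centroid. `cells` is a list of equal-length vectors that
    all belong to one cluster.
    """
    if not 0.0 <= shrinkage <= 1.0:
        raise ValueError("shrinkage must lie in [0, 1]")
    if not cells:
        return []
    n = len(cells)
    # Per-gene mean across the cells in the cluster.
    centroid = [sum(column) / n for column in zip(*cells)]
    return [
        [c + (1.0 - shrinkage) * (x - c) for x, c in zip(cell, centroid)]
        for cell in cells
    ]
```

Larger shrinkage values reduce intra-cluster dispersion, which is the effect the abstract attributes to its estimator.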

2025-05-31T05:59:56Z Ziwen Wang http://arxiv.org/abs/2506.00146v1 A DNA Methylation Classification Model Predicts Organ and Disease Site 2025-05-30T18:38:26Z

Cell-free DNA (cfDNA) analysis is a powerful, minimally invasive tool for monitoring disease progression, treatment response, and early detection. A major challenge, however, is accurately determining the tissue of origin, especially in complex or heterogeneous disease contexts. To address this, we developed a machine learning framework that leverages tissue-specific DNA methylation signatures to classify both tissue and disease origin from cfDNA data. Our model integrates methylation datasets across diverse epigenomic platforms, including Whole Genome Bisulfite Sequencing (WGBS), Illumina Infinium Bead Arrays, and Enzymatic Methyl-seq (EM-seq). To account for platform variability and data sparsity, we applied imputation strategies and harmonized CpG features to enable cross-platform learning. Dimensionality reduction revealed clear tissue-specific clustering of methylation profiles. A random forest classifier trained on these features achieved consistent classification performance (accuracy 0.75-0.8 across test sets and platforms). Notably, our model distinguished clinically relevant tissues such as inflamed synovium and peripheral blood mononuclear cells (PBMCs) in arthritis patients and deconvoluted synthetic cfDNA mixtures mimicking real-world liquid biopsy samples. The predicted tissue proportions closely matched the true values, demonstrating the model's potential for both classification and quantitative inference. These results support the feasibility of using cross-platform methylation data and machine learning for scalable, generalizable cfDNA diagnostics and lay the groundwork for future integration of disease-specific epigenetic features to guide clinical decision-making in precision medicine.

2025-05-30T18:38:26Z Keng-Jung Lee Dharanya Sampath Konstantinos Mavrommatis