http://arxiv.org/abs/2507.07454v1 Mix-Geneformer: Unified Representation Learning for Human and Mouse scRNA-seq Data 2025-07-10T06:15:17Z

Single-cell RNA sequencing (scRNA-seq) enables transcriptomic profiling of individual cells, revealing cellular heterogeneity and rare populations. Recent deep learning models such as Geneformer and Mouse-Geneformer perform well on tasks such as cell-type classification and in silico perturbation. However, their species-specific design limits cross-species generalization, which is crucial for translational research and drug discovery. We present Mix-Geneformer, a novel Transformer-based model that integrates human and mouse scRNA-seq data into a unified representation via a hybrid self-supervised approach combining Masked Language Modeling (MLM) and a SimCSE-based contrastive loss to capture both shared and species-specific gene patterns. A rank-value encoding scheme further emphasizes high-variance gene signals during training. Trained on about 50 million cells from diverse human and mouse organs, Mix-Geneformer matched or outperformed state-of-the-art baselines in cell-type classification and in silico perturbation tasks, achieving 95.8% accuracy on mouse kidney data versus 94.9% for the best existing model. It also successfully identified key regulatory genes validated by in vivo studies. By enabling scalable cross-species transcriptomic modeling, Mix-Geneformer offers a powerful tool for comparative transcriptomics and translational applications. While our results demonstrate strong performance, we also acknowledge limitations, such as computational cost and variability in zero-shot transfer.
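To make the idea of rank-value encoding concrete, the following is a minimal sketch and not the Mix-Geneformer implementation. It assumes the scheme described for the original Geneformer: each gene's count is scaled by the cell's total count and by a per-gene normalization factor (for example, that gene's nonzero median across the training corpus), and the expressed genes are then ordered by the resulting value. The function name, gene names, counts, and medians are all hypothetical.

```python
def rank_value_encode(counts, gene_medians, max_len=2048):
    """Return gene names ordered by normalized expression (highest first).

    counts:       dict gene -> raw count in one cell (hypothetical data)
    gene_medians: dict gene -> per-gene normalization factor, assumed here
                  to be a corpus-wide nonzero median
    max_len:      maximum number of tokens kept for the model input
    """
    total = sum(counts.values())
    if total == 0:
        return []
    # Scale by sequencing depth, then by the per-gene factor, so that
    # ubiquitously high genes are pushed down the ranking.
    scores = {
        gene: (count / total) / gene_medians[gene]
        for gene, count in counts.items()
        if count > 0
    }
    # Sort by descending score; the gene name breaks ties deterministically.
    ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return ranked[:max_len]


cell = {"GeneA": 100, "GeneB": 10, "GeneC": 0, "GeneD": 8}
medians = {"GeneA": 50.0, "GeneB": 1.0, "GeneC": 2.0, "GeneD": 0.5}
tokens = rank_value_encode(cell, medians)
```

In this toy cell, GeneA has the largest raw count but ranks last among the expressed genes because its normalization factor is large, while unexpressed GeneC is dropped entirely.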

2025-07-10T06:15:17Z Yuki Nishio Takayoshi Yamashita Keita Ito Tsubasa Hirakawa Hironobu Fujiyoshi

http://arxiv.org/abs/2506.01714v2 The optimization of crop response to climatic stress through modulation of plant stress response mechanisms. Opportunities for biostimulants and plant hormones to meet climate challenges 2025-07-09T07:10:58Z

Climate change is a major threat to crop potential and is characterized by both long-term shifts in temperature and precipitation patterns and an increased occurrence of extreme weather events. These extreme weather events are the most immediate and intractable threat to agriculture. Crop resilience in the face of stress depends upon the speed and effectiveness with which plants and cropping systems sense and respond to that stress. A variety of agronomic practices, including breeding, exogenous inputs (nutrients, water, biostimulants and others) and shifts in cultivation practice, have been used to influence plant stress response to achieve the goal of increased plant and cropping system resilience. Traditional breeding is a powerful tool that has resulted in stable and long-term cultivar improvements but is often too slow and complex to meet the diverse and unpredictable challenges of climate-induced stresses. Increased inputs (water, nutrients, pesticides etc.) and management strategies (cropping system choice, soil management etc.) can alleviate stress but are often constrained by cost and availability of inputs. Exogenous biostimulants, microbials and plant hormones have shown great promise as mechanisms to optimize natural plant resilience, resulting in immediate but non-permanent improvements in plant responses to climate-induced stresses. The failure to modernize regulatory frameworks for the use of biostimulants in agriculture will constrain the development of safe, effective tools and deprive growers of means to respond to the vagaries of climate change. Here we discuss the scientific rationale for eliminating the regulatory barriers that constrain the potential for biostimulants or products that modulate plant regulatory networks to address climate change challenges, and we propose a framework for enabling legislation to strengthen cropping system resilience.

2025-06-02T14:22:14Z Jing Li Giulia Forghieri Danny Geelen Patrick du Jardin Patrick H. Brown

http://arxiv.org/abs/2503.09312v2 Terrier: A Deep Learning Repeat Classifier 2025-07-09T02:48:47Z

Repetitive DNA sequences underpin genome architecture and evolutionary processes, yet they remain challenging to classify accurately. Terrier is a deep learning model designed to overcome these challenges by classifying repetitive DNA sequences; it is trained on a publicly available, curated repeat sequence library under the RepeatMasker schema. Poor representation of taxa within repeat databases often limits the classification accuracy and reproducibility of current repeat annotation methods, which in turn restricts our understanding of repeat evolution and function. Terrier addresses this by leveraging deep learning for improved accuracy. Trained on Repbase, which includes over 100,000 repeat families -- four times more than Dfam -- Terrier maps 97.1% of Repbase sequences to RepeatMasker categories, offering the most comprehensive classification system available. When benchmarked against DeepTE, TERL, and TEclass2 in model organisms (rice, fruit flies, humans, and mice), Terrier achieved superior accuracy while classifying a broader range of sequences. Further validation in non-model amphibian, flatworm and Northern krill genomes highlights its effectiveness in improving classification in non-model species, facilitating research on repeat-driven evolution, genomic instability, and phenotypic variation.

2025-03-12T12:03:26Z 14 pages, 9 figures Robert Turnbull Neil D. Young Edoardo Tescari Lee F. Skerratt Tiffany A. Kosch

http://arxiv.org/abs/2507.04125v1 Graph Neural Networks as a Substitute for Transformers in Single-Cell Transcriptomics 2025-07-05T18:37:16Z

Graph Neural Networks (GNNs) and Transformers share significant similarities in their encoding strategies for interacting with features from nodes of interest, where Transformers use query-key scores and GNNs use edges. Compared to GNNs, which are unable to encode relative positions, Transformers leverage dynamic attention capabilities to better represent relative relationships, thereby becoming the standard backbones in large-scale sequential pre-training. However, this subtle difference prompts us to consider: if positions are no longer crucial, could we substitute Transformers with Graph Neural Networks in some fields such as single-cell transcriptomics? In this paper, we first explore the similarities and differences between GNNs and Transformers, specifically in terms of relative positions. Additionally, we design a synthetic example to illustrate their equivalence when there are no relative positions between tokens in the sample. Finally, we conduct extensive experiments on a large-scale position-agnostic dataset (single-cell transcriptomics), finding that GNNs achieve competitive performance compared to Transformers while consuming fewer computational resources. These findings provide novel insights for researchers in the field of single-cell transcriptomics, challenging the prevailing notion that the Transformer is always the optimal choice.
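One way to see why the two architectures can coincide when positions carry no information: without positional encodings, self-attention lets every token aggregate features from every other token with data-dependent weights, which is the same computation as attention-weighted message passing on a fully connected graph. The sketch below is an illustration under simplifying assumptions (a single head, identity query/key/value projections, toy inputs) and is not the synthetic example from the paper. It checks that reordering the tokens only reorders the outputs.

```python
import math


def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections and no
    positional encoding. Equivalently: each node of a complete graph
    aggregates all node features using softmax-normalized edge scores."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score between this token and every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        # Weighted sum of all token features (the "messages").
        out.append([sum(w / z * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]

out = self_attention(tokens)
out_perm = self_attention([tokens[i] for i in perm])

# Permuting the input permutes the output in the same way.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_p, i in zip(out_perm, perm)
    for a, b in zip(row_p, out[i])
)
```

Because the output depends only on the set of tokens, any ordering information has to be injected separately, which is what positional encodings do in sequence models.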

2025-07-05T18:37:16Z 9 pages, 5 figures Jiaxin Qi Yan Cui Jinli Ou Jianqiang Huang Gaogang Xie

http://arxiv.org/abs/2507.02818v1 Genetic Features for Drug Responses in Cancer -- Investigating an Ensemble-Feature-Selection Approach 2025-07-03T17:33:12Z

Predicting drug responses using genetic and transcriptomic features is crucial for enhancing personalized medicine. In this study, we implemented an ensemble of machine learning algorithms to analyze the correlation between genetic and transcriptomic features of cancer cell lines and IC50 values, a reliable metric for drug efficacy. Our analysis involved a reduction of the feature set from an original pool of 38,977 features, demonstrating a strong linear relationship between genetic features and drug responses across various algorithms, including SVR, Linear Regression, and Ridge Regression. Notably, copy number variations (CNVs) emerged as more predictive than mutations, suggesting a significant reevaluation of biomarkers for drug response prediction. Through rigorous statistical methods, we identified a highly reduced set of 421 critical features. This set offers a novel perspective that contrasts with traditional cancer driver genes, underscoring the potential for these biomarkers in designing targeted therapies. Furthermore, our findings advocate for IC50 values as a predictable measurement of drug responses and underscore the need for more data that can represent the dimensionality of genomic data in drug response prediction. Future work will aim to expand the dataset and refine feature selection to enhance the generalizability of the predictive model in clinical settings.

2025-07-03T17:33:12Z 14 pages, 8 figures Johannes Schlüter Alexander Schönhuth

http://arxiv.org/abs/2506.10931v2 MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem 2025-07-03T13:48:25Z

Raw signal genome analysis (RSGA) has emerged as a promising approach to enable real-time genome analysis by directly analyzing raw electrical signals. However, rapid advancements in sequencing technologies make it increasingly difficult for software-based RSGA to match the throughput of raw signal generation. This paper demonstrates that while hardware acceleration techniques can significantly accelerate RSGA, the high volume of genomic data shifts the performance and energy bottleneck from computation to I/O data movement. As sequencing throughput increases, I/O overhead becomes the main contributor to both runtime and energy consumption. Therefore, there is a need to design a high-performance, energy-efficient system for RSGA that can both alleviate the data movement bottleneck and provide large acceleration capabilities. We propose MARS, a storage-centric system that leverages the heterogeneous resources within modern storage systems (e.g., storage-internal DRAM, storage controller, flash chips) alongside their large storage capacity to tackle both data movement and computational overheads of RSGA in an area-efficient and low-cost manner. MARS accelerates RSGA through a novel hardware/software co-design approach. First, MARS modifies the RSGA pipeline via two filtering mechanisms and a quantization scheme, reducing hardware demands and optimizing for in-storage execution. Second, MARS accelerates the RSGA steps directly within the storage by leveraging both Processing-Near-Memory and Processing-Using-Memory paradigms. Third, MARS orchestrates the execution of all steps to fully exploit in-storage parallelism and minimize data movement. Our evaluation shows that MARS outperforms basecalling-based software and hardware-accelerated state-of-the-art read mapping pipelines by 93x and 40x, on average across different datasets, while reducing their energy consumption by 427x and 72x.

2025-06-12T17:38:12Z Melina Soysal Konstantina Koliogeorgi Can Firtina Nika Mansouri Ghiasi Rakesh Nadig Haiyu Mao Geraldo F. Oliveira Yu Liang Klea Zambaku Mohammad Sadrosadati Onur Mutlu 10.1145/3721145.3730428

http://arxiv.org/abs/2507.02980v1 Modeling Gene Expression Distributional Shifts for Unseen Genetic Perturbations 2025-07-01T06:04:28Z

We train a neural network to predict distributional responses in gene expression following genetic perturbations. This is an essential task in early-stage drug discovery, where such responses can offer insights into gene function and inform target identification. Existing methods only predict changes in the mean expression, overlooking stochasticity inherent in single-cell data. In contrast, we offer a more realistic view of cellular responses by modeling expression distributions. Our model predicts gene-level histograms conditioned on perturbations and outperforms baselines in capturing higher-order statistics, such as variance, skewness, and kurtosis, at a fraction of the training cost. To generalize to unseen perturbations, we incorporate prior knowledge via gene embeddings from large language models (LLMs). While modeling a richer output space, the method remains competitive in predicting mean expression changes. This work offers a practical step towards more expressive and biologically informative models of perturbation effects.
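To illustrate what a histogram output carries beyond the mean, the sketch below computes mean, variance, skewness, and excess kurtosis from a set of bin centers and bin probabilities. It is a generic moment calculation, not the model or evaluation code from the paper; the function name and the toy histogram are hypothetical.

```python
import math


def histogram_moments(bin_centers, probs):
    """Return (mean, variance, skewness, excess kurtosis) of a histogram.

    bin_centers: representative expression value for each bin
    probs:       non-negative bin masses; normalized here to sum to 1
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("histogram has no mass")
    p = [w / total for w in probs]
    mean = sum(pi * x for pi, x in zip(p, bin_centers))
    var = sum(pi * (x - mean) ** 2 for pi, x in zip(p, bin_centers))
    if var == 0:
        # All mass in one bin: higher standardized moments are undefined,
        # reported as 0.0 in this sketch.
        return mean, 0.0, 0.0, 0.0
    std = math.sqrt(var)
    skew = sum(pi * ((x - mean) / std) ** 3 for pi, x in zip(p, bin_centers))
    kurt = sum(pi * ((x - mean) / std) ** 4 for pi, x in zip(p, bin_centers)) - 3.0
    return mean, var, skew, kurt


# A symmetric three-bin histogram for one gene under one perturbation.
mean, var, skew, kurt = histogram_moments([0.0, 1.0, 2.0], [0.25, 0.5, 0.25])
```

Two predicted histograms can share the same mean yet differ in variance or skewness, which is the information a mean-only predictor cannot express.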

2025-07-01T06:04:28Z Kalyan Ramakrishnan Jonathan G. Hedley Sisi Qu Puneet K. Dokania Philip H. S. Torr Cesar A. Prada-Medina Julien Fauqueur Kaspar Martens

http://arxiv.org/abs/2506.24013v1 CoMMiT: Co-informed inference of microbiome-metabolome interactions via transfer learning 2025-06-30T16:18:06Z

Recent multi-omic microbiome studies enable integrative analysis of microbes and metabolites, uncovering their associations with various host conditions. Such analyses require multivariate models capable of accounting for the complex correlation structures between microbes and metabolites. However, existing multivariate models often suffer from low statistical power for detecting microbiome-metabolome interactions due to small sample sizes and weak biological signals. To address these challenges, we introduce CoMMiT, Co-informed inference of Microbiome-Metabolome Interactions via novel Transfer learning models. Unlike conventional transfer-learning methods that borrow information from external datasets, CoMMiT leverages similarities across metabolites within a single cohort, reducing the risk of negative transfer often caused by differences in sequencing platforms and bioinformatic pipelines across studies. CoMMiT operates under the flexible assumption that auxiliary metabolites are collectively informative for the target metabolite, without requiring individual auxiliary metabolites to be informative. CoMMiT uses a novel data-driven approach to selecting the optimal set of auxiliary metabolites. Using this optimal set, CoMMiT employs a de-biasing framework to enable efficient calculation of p-values, facilitating the identification of statistically significant microbiome-metabolome interactions. Applying CoMMiT to a feeding study reveals biologically meaningful microbiome-metabolome interactions under a low glycemic load diet, demonstrating the diet-host link through gut metabolism.

2025-06-30T16:18:06Z 38 pages, 5 figures Leiyue Li Chenglong Ye Tim Randolph Meredith Hullar Johanna Lampe Marian Neuhouser Daniel Raftery Yue Wang

http://arxiv.org/abs/2506.22963v1 CN-SBM: Categorical Block Modelling For Primary and Residual Copy Number Variation 2025-06-28T17:45:45Z

Cancer is a genetic disorder whose clonal evolution can be monitored by tracking noisy genome-wide copy number variants. We introduce the Copy Number Stochastic Block Model (CN-SBM), a probabilistic framework that jointly clusters samples and genomic regions based on discrete copy number states using a bipartite categorical block model. Unlike models relying on Gaussian or Poisson assumptions, CN-SBM respects the discrete nature of CNV calls and captures subpopulation-specific patterns through block-wise structure. Using a two-stage approach, CN-SBM decomposes CNV data into primary and residual components, enabling detection of both large-scale chromosomal alterations and finer aberrations. We derive a scalable variational inference algorithm for application to large cohorts and high-resolution data. Benchmarks on simulated and real datasets show improved model fit over existing methods. Applied to TCGA low-grade glioma data, CN-SBM reveals clinically relevant subtypes and structured residual variation, aiding patient stratification in survival analysis. These results establish CN-SBM as an interpretable, scalable framework for CNV analysis with direct relevance for tumor heterogeneity and prognosis.

2025-06-28T17:45:45Z 8 pages, 4 figures Kevin Lam William Daniels J Maxwell Douglas Daniel Lai Samuel Aparicio Benjamin Bloem-Reddy Yongjin Park

http://arxiv.org/abs/2506.22641v1 Diversity by Design: Addressing Mode Collapse Improves scRNA-seq Perturbation Modeling on Well-Calibrated Metrics 2025-06-27T21:12:46Z

Recent benchmarks reveal that models for single-cell perturbation response are often outperformed by simply predicting the dataset mean. We trace this anomaly to a metric artifact: control-referenced deltas and unweighted error metrics reward mode collapse whenever the control is biased or the biological signal is sparse. Large-scale in silico simulations and analysis of two real-world perturbation datasets confirm that shared reference shifts, not genuine biological change, drive high performance in these evaluations. We introduce differentially expressed gene (DEG)-aware metrics, weighted mean-squared error (WMSE) and weighted delta $R^{2}$ ($R^{2}_{w}(\Delta)$) with respect to all perturbations, that measure error in niche signals with high sensitivity. We further introduce negative and positive performance baselines to calibrate these metrics. With these improvements, the mean baseline sinks to null performance while genuine predictors are correctly rewarded. Finally, we show that using WMSE as a loss function reduces mode collapse and improves model performance.
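The effect described above can be reproduced with a toy calculation. The sketch below implements a generic weighted mean-squared error and compares a collapsed prediction with one that recovers the signal on a single differentially expressed gene. The per-gene weights are hypothetical placeholders; the abstract does not specify how the paper derives its DEG-aware weights, so this should be read as an illustration of the principle rather than the paper's metric.

```python
def weighted_mse(pred, target, weights):
    """Weighted mean-squared error, normalized by the total weight."""
    if not (len(pred) == len(target) == len(weights)):
        raise ValueError("pred, target and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(
        w * (p - t) ** 2 for p, t, w in zip(pred, target, weights)
    ) / total_weight


# True expression change for four genes; only the last one responds.
target = [0.0, 0.0, 0.0, 1.0]
uniform = [1.0, 1.0, 1.0, 1.0]
deg_weights = [0.1, 0.1, 0.1, 1.0]  # hypothetical: up-weight the DE gene

collapsed = [0.0, 0.0, 0.0, 0.0]      # predicts "no change" everywhere
informative = [0.6, -0.6, 0.6, 1.0]   # noisy elsewhere, correct on the DE gene

mse_collapsed = weighted_mse(collapsed, target, uniform)
mse_informative = weighted_mse(informative, target, uniform)
wmse_collapsed = weighted_mse(collapsed, target, deg_weights)
wmse_informative = weighted_mse(informative, target, deg_weights)
```

With uniform weights the collapsed prediction scores slightly better, because its only error is on the one gene that matters. With the DEG-style weights the ranking reverses and the informative prediction is preferred.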

2025-06-27T21:12:46Z Gabriel M. Mejia Henry E. Miller Francis J. A. Leblanc Bo Wang Brendan Swain Lucas Paulo de Lima Camillo

http://arxiv.org/abs/2507.00154v1 Five-Gene Expression Formula Accurately Detects Hepatocellular Carcinoma Tumors 2025-06-27T17:08:24Z

Hepatocellular carcinoma (HCC) is one of the leading causes of cancer-related deaths worldwide. Several diagnostic methods, such as imaging modalities and serum alpha-fetoprotein (AFP) testing, have been used for HCC detection; however, their effectiveness is limited to later stages of the disease. In contrast, transcriptomic analysis of biopsy samples has shown promise for early detection. While machine learning techniques have been applied to transcriptomic data for cancer detection, their clinical adoption remains limited due to challenges such as poor generalizability across different datasets, lack of interpretability, and high computational complexity. To address these limitations, we developed a novel predictive formula for HCC detection using the Kolmogorov-Arnold Network (KAN). This formula is based on the expression levels of five genes: VIPR1, CYP1A2, FCN3, ECM1, and LIFR. Derived from the GSE25097 dataset, the formula offers a simple, interpretable, efficient, and accessible approach for HCC identification. It achieves 99% accuracy on the GSE25097 test set and demonstrates robust performance on six additional independent datasets, achieving accuracies above 90% in all cases. These findings highlight the critical role of these five genes as biomarkers for HCC detection, offering a foundation for future research and clinical applications to improve HCC diagnostic approaches.

2025-06-27T17:08:24Z It has been accepted for publication in Biotechnology Journal Aram Ansary Ogholbake Qiang Cheng

http://arxiv.org/abs/2506.22228v1 Uncovering smooth structures in single-cell data with PCS-guided neighbor embeddings 2025-06-27T13:45:55Z

Single-cell sequencing is revolutionizing biology by enabling detailed investigations of cell-state transitions. Many biological processes unfold along continuous trajectories, yet it remains challenging to extract smooth, low-dimensional representations from inherently noisy, high-dimensional single-cell data. Neighbor embedding (NE) algorithms, such as t-SNE and UMAP, are widely used to embed high-dimensional single-cell data into low dimensions. However, they often introduce undesirable distortions, resulting in misleading interpretations. Existing evaluation methods for NE algorithms primarily focus on separating discrete cell types rather than capturing continuous cell-state transitions, while dynamic modeling approaches rely on strong assumptions about cellular processes and specialized data. To address these challenges, we build on the Predictability-Computability-Stability (PCS) framework for reliable and reproducible data-driven discoveries. First, we systematically evaluate popular NE algorithms through empirical analysis, simulation, and theory, and reveal their key shortcomings, such as artifacts and instability. We then introduce NESS, a principled and interpretable machine learning approach to improve NE representations by leveraging algorithmic stability and to enable robust inference of smooth biological structures. NESS offers useful concepts, quantitative stability metrics, and efficient computational workflows to uncover developmental trajectories and cell-state transitions in single-cell data. Finally, we apply NESS to six single-cell datasets, spanning pluripotent stem cell differentiation, organoid development, and multiple tissue-specific lineage trajectories. Across these diverse contexts, NESS consistently yields useful biological insights, such as identification of transitional and stable cell states and quantification of transcriptional dynamics during development.

2025-06-27T13:45:55Z Rong Ma Xi Li Jingyuan Hu Bin Yu

http://arxiv.org/abs/2507.05265v1 BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects 2025-06-26T13:56:32Z

Large language models (LLMs) trained on text have demonstrated remarkable results on natural language processing (NLP) tasks. These models have been adapted to decipher the language of DNA, where sequences of nucleotides act as "words" that encode genomic functions. However, the genome differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. Although DNA language models (DNALMs) such as DNABERT and GENA-LM have achieved a high level of performance on genome-related biological tasks, these models do not encode biological functions in the presence of sequence variations. To address this problem, we pre-train foundation models that effectively integrate sequence variations, in particular Single Nucleotide Polymorphisms (SNPs), as they underlie important biological functions. Specifically, we use ModernBERT to pre-train two different Biomedical Foundation Models (BMFM): BMFM-DNA-REF, in which the model is trained with sequences of varying lengths along with their reverse complements derived from the reference genome, and BMFM-DNA-SNP, in which the model is trained with sequences created using a novel representation scheme that encodes sequence variations. Our findings indicate that integrating sequence variations into DNALMs helps capture the biological functions, as seen in improvements on all fine-tuning tasks. To explore the model's practical utility, we experimented with various strategies for SNP imputation on the promoter detection task introduced in DNABERT-2. However, we acknowledge that the current benchmarks are limited in their ability to fully evaluate these models. To enable more comprehensive assessment in the future and encourage community contributions, we release our models through HuggingFace and the code to reproduce the results at https://github.com/BiomedSciAI/biomed-multi-omic.
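The reverse-complement augmentation mentioned for BMFM-DNA-REF can be sketched in a few lines. This is a generic illustration, not code from the released repository; the function names are hypothetical, and the SNP-aware representation scheme is not sketched because the abstract does not describe it.

```python
# Translation table mapping each base to its complement; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_reverse_complement(seq):
    """Return the forward strand together with its reverse complement,
    e.g. to present both strands of the same locus during training."""
    return [seq, reverse_complement(seq)]


pair = with_reverse_complement("ATGCCN")
```

Applying the operation twice returns the original sequence, which is a convenient sanity check when adding it to a data pipeline.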

2025-06-26T13:56:32Z Hongyang Li Sanjoy Dey Bum Chul Kwon Michael Danziger Michal Rosen-Tzvi Jianying Hu James Kozloski Ching-Huei Tsou Bharath Dandala Pablo Meyer

http://arxiv.org/abs/2506.20769v1 inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences 2025-06-25T19:03:53Z

The accurate development, assessment, interpretation, and benchmarking of bioinformatics frameworks for analyzing transcriptional regulatory grammars rely on controlled simulations to validate the underlying methods. However, existing simulators often lack end-to-end flexibility or ease of integration, which limits their practical use. We present inMOTIFin, a lightweight, modular, and user-friendly Python-based software that addresses these gaps by providing versatile and efficient simulation and modification of DNA regulatory sequences. inMOTIFin supports customizable generation of motifs and insertion of motif instances with precise control over their positions, co-occurrences, and spacing, as well as direct modification of real sequences, facilitating a comprehensive evaluation of motif-based methods and interpretation tools. We demonstrate inMOTIFin applications for the assessment of de novo motif discovery prediction, the analysis of transcription factor cooperativity, and the support of explainability analyses for deep learning models. inMOTIFin ensures robust and reproducible analyses for studying transcriptional regulatory grammars. inMOTIFin is available at PyPI https://pypi.org/project/inMOTIFin/ and Docker Hub https://hub.docker.com/r/cbgr/inmotifin. Detailed documentation is available at https://inmotifin.readthedocs.io/en/latest/. The code for use case analyses is available at https://bitbucket.org/CBGR/inmotifin_evaluation/src/main/.
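For readers unfamiliar with this kind of simulation, the following standard-library sketch shows the basic operation: generate a random background sequence and place two motif instances at controlled positions with a fixed spacing. It deliberately does not use the inMOTIFin API, which is documented at the links above; every function name, motif string, and parameter here is a hypothetical stand-in.

```python
import random


def simulate_background(length, rng, alphabet="ACGT"):
    """Draw a random background sequence with uniform base frequencies."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def insert_motif(sequence, motif, position):
    """Overwrite the background with a motif instance at a fixed position,
    keeping the total sequence length unchanged."""
    if position < 0 or position + len(motif) > len(sequence):
        raise ValueError("motif does not fit at the requested position")
    return sequence[:position] + motif + sequence[position + len(motif):]


rng = random.Random(0)  # fixed seed so the simulation is reproducible
background = simulate_background(30, rng)

first_motif, second_motif = "TGACTCA", "GATAAG"  # illustrative instances
spacing = 4                                      # bases between the two motifs
pos1 = 5
pos2 = pos1 + len(first_motif) + spacing

simulated = insert_motif(
    insert_motif(background, first_motif, pos1), second_motif, pos2
)
```

Because the positions and spacing are known by construction, a motif discovery or attribution method can be scored against this ground truth.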

2025-06-25T19:03:53Z Katalin Ferenc Lorenzo Martini Ieva Rauluseviciute Geir Kjetil Sandve Anthony Mathelier

http://arxiv.org/abs/2412.01561v3 Harnessing the Potential of Spatial Statistics for Spatial Omics Data with pasta 2025-06-25T12:10:57Z

Spatial omics assays allow for the molecular characterisation of cells in their spatial context. Notably, the two main technological streams, imaging-based and high-throughput sequencing-based, can give rise to very different data modalities. The characteristics of the two data types are well known in adjacent fields such as spatial statistics, where they are referred to as point patterns and lattice data, and a wide range of tools is available. This paper discusses the application of spatial statistics to spatially resolved omics data and, in particular, its advantages, challenges, and nuances. This work is accompanied by a vignette, pasta, that showcases the usefulness of spatial statistics in biology using several R packages.

2024-12-02T14:50:13Z Martin Emons Samuel Gunz Helena L. Crowell Izaskun Mallona Reinhard Furrer Mark D. Robinson