https://arxiv.org/api/CGIb0mCis+1AhcwMtbfQdr6hziM
2026-06-14T17:39:31Z

http://arxiv.org/abs/2507.07454v1
Mix-Geneformer: Unified Representation Learning for Human and Mouse scRNA-seq Data
2025-07-10T06:15:17Z
Single-cell RNA sequencing (scRNA-seq) enables single-cell transcriptomic profiling, revealing cellular heterogeneity and rare populations. Recent deep learning models like Geneformer and Mouse-Geneformer perform well on tasks such as cell-type classification and in silico perturbation. However, their species-specific design limits cross-species generalization and translational applications, which are crucial for advancing translational research and drug discovery. We present Mix-Geneformer, a novel Transformer-based model that integrates human and mouse scRNA-seq data into a unified representation via a hybrid self-supervised approach combining Masked Language Modeling (MLM) and SimCSE-based contrastive loss to capture both shared and species-specific gene patterns. A rank-value encoding scheme further emphasizes high-variance gene signals during training. Trained on about 50 million cells from diverse human and mouse organs, Mix-Geneformer matched or outperformed state-of-the-art baselines in cell-type classification and in silico perturbation tasks, achieving 95.8% accuracy on mouse kidney data versus 94.9% from the best existing model. It also successfully identified key regulatory genes validated by in vivo studies. By enabling scalable cross-species transcriptomic modeling, Mix-Geneformer offers a powerful tool for comparative transcriptomics and translational applications.
While our results demonstrate strong performance, we also acknowledge limitations, such as the computational cost and variability in zero-shot transfer.
2025-07-10T06:15:17Z
Yuki Nishio, Takayoshi Yamashita, Keita Ito, Tsubasa Hirakawa, Hironobu Fujiyoshi

http://arxiv.org/abs/2506.01714v2
The optimization of crop response to climatic stress through modulation of plant stress response mechanisms. Opportunities for biostimulants and plant hormones to meet climate challenges
2025-07-09T07:10:58Z
Climate change is a major threat to crop potential and is characterized by both long-term shifts in temperature and precipitation patterns as well as increased occurrence of extreme weather events; these extreme weather events are the most immediate and intractable threat to agriculture. Crop resilience in the face of stress depends upon the speed and effectiveness with which plants and cropping systems sense and respond to that stress. A variety of agronomic practices including breeding, exogenous inputs (nutrients, water, biostimulants and others) and shifts in cultivation practice have been used to influence plant stress response to achieve the goal of increased plant and cropping system resilience. Traditional breeding is a powerful tool that has resulted in stable and long-term cultivar improvements but is often too slow and complex to meet the diverse, complex and unpredictable challenges of climate-induced stresses. Increased inputs (water, nutrients, pesticides etc.) and management strategies (cropping system choice, soil management etc.) can alleviate stress but are often constrained by cost and availability of inputs. Exogenous biostimulants, microbials and plant hormones have shown great promise as mechanisms to optimize natural plant resilience resulting in immediate but non-permanent improvements in plant responses to climate-induced stresses.
The failure to modernize regulatory frameworks for the use of biostimulants in agriculture will constrain the development of safe, effective tools and deprive growers of means to respond to the vagaries of climate change. Here we discuss the scientific rationale for eliminating the regulatory barriers that constrain the potential for biostimulants or products that modulate plant regulatory networks to address climate change challenges and propose a framework for enabling legislation to strengthen cropping system resilience.
2025-06-02T14:22:14Z
Jing Li, Giulia Forghieri, Danny Geelen, Patrick du Jardin, Patrick H. Brown

http://arxiv.org/abs/2503.09312v2
Terrier: A Deep Learning Repeat Classifier
2025-07-09T02:48:47Z
Repetitive DNA sequences underpin genome architecture and evolutionary processes, yet they remain challenging to classify accurately. Terrier is a deep learning model designed to overcome these challenges by classifying repetitive DNA sequences using a publicly available, curated repeat sequence library trained under the RepeatMasker schema. Poor representation of taxa within repeat databases often limits the classification accuracy and reproducibility of current repeat annotation methods, limiting our understanding of repeat evolution and function. Terrier overcomes these challenges by leveraging deep learning for improved accuracy. Trained on Repbase, which includes over 100,000 repeat families -- four times more than Dfam -- Terrier maps 97.1% of Repbase sequences to RepeatMasker categories, offering the most comprehensive classification system available. When benchmarked against DeepTE, TERL, and TEclass2 in model organisms (rice, fruit flies, humans, and mice), Terrier achieved superior accuracy while classifying a broader range of sequences.
Further validation in non-model amphibian, flatworm and Northern krill genomes highlights its effectiveness in improving classification in non-model species, facilitating research on repeat-driven evolution, genomic instability, and phenotypic variation.
2025-03-12T12:03:26Z
14 pages, 9 figures
Robert Turnbull, Neil D. Young, Edoardo Tescari, Lee F. Skerratt, Tiffany A. Kosch

http://arxiv.org/abs/2507.04125v1
Graph Neural Networks as a Substitute for Transformers in Single-Cell Transcriptomics
2025-07-05T18:37:16Z
Graph Neural Networks (GNNs) and Transformers share significant similarities in their encoding strategies for interacting with features from nodes of interest, where Transformers use query-key scores and GNNs use edges. Compared to GNNs, which are unable to encode relative positions, Transformers leverage dynamic attention capabilities to better represent relative relationships, thereby becoming the standard backbones in large-scale sequential pre-training. However, the subtle difference prompts us to consider: if positions are no longer crucial, could we substitute Transformers with Graph Neural Networks in some fields such as Single-Cell Transcriptomics? In this paper, we first explore the similarities and differences between GNNs and Transformers, specifically in terms of relative positions. Additionally, we design a synthetic example to illustrate their equivalence where there are no relative positions between tokens in the sample. Finally, we conduct extensive experiments on a large-scale position-agnostic dataset (single-cell transcriptomics), finding that GNNs achieve competitive performance compared to Transformers while consuming fewer computation resources.
These findings provide novel insights for researchers in the field of single-cell transcriptomics, challenging the prevailing notion that the Transformer is always the optimum choice.
2025-07-05T18:37:16Z
9 pages, 5 figures
Jiaxin Qi, Yan Cui, Jinli Ou, Jianqiang Huang, Gaogang Xie

http://arxiv.org/abs/2507.02818v1
Genetic Features for Drug Responses in Cancer -- Investigating an Ensemble-Feature-Selection Approach
2025-07-03T17:33:12Z
Predicting drug responses using genetic and transcriptomic features is crucial for enhancing personalized medicine. In this study, we implemented an ensemble of machine learning algorithms to analyze the correlation between genetic and transcriptomic features of cancer cell lines and IC50 values, a reliable metric for drug efficacy. Our analysis involved a reduction of the feature set from an original pool of 38,977 features, demonstrating a strong linear relationship between genetic features and drug responses across various algorithms, including SVR, Linear Regression, and Ridge Regression. Notably, copy number variations (CNVs) emerged as more predictive than mutations, suggesting a significant reevaluation of biomarkers for drug response prediction. Through rigorous statistical methods, we identified a highly reduced set of 421 critical features. This set offers a novel perspective that contrasts with traditional cancer driver genes, underscoring the potential for these biomarkers in designing targeted therapies. Furthermore, our findings advocate for IC50 values as a predictable measurement of drug responses and underscore the need for more data that can represent the dimensionality of genomic data in drug response prediction.
Future work will aim to expand the dataset and refine feature selection to enhance the generalizability of the predictive model in clinical settings.
2025-07-03T17:33:12Z
14 pages, 8 figures
Johannes Schlüter, Alexander Schönhuth

http://arxiv.org/abs/2506.10931v2
MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem
2025-07-03T13:48:25Z
Raw signal genome analysis (RSGA) has emerged as a promising approach to enable real-time genome analysis by directly analyzing raw electrical signals. However, rapid advancements in sequencing technologies make it increasingly difficult for software-based RSGA to match the throughput of raw signal generation. This paper demonstrates that while hardware acceleration techniques can significantly accelerate RSGA, the high volume of genomic data shifts the performance and energy bottleneck from computation to I/O data movement. As sequencing throughput increases, I/O overhead becomes the main contributor to both runtime and energy consumption. Therefore, there is a need to design a high-performance, energy-efficient system for RSGA that can both alleviate the data movement bottleneck and provide large acceleration capabilities. We propose MARS, a storage-centric system that leverages the heterogeneous resources within modern storage systems (e.g., storage-internal DRAM, storage controller, flash chips) alongside their large storage capacity to tackle both data movement and computational overheads of RSGA in an area-efficient and low-cost manner. MARS accelerates RSGA through a novel hardware/software co-design approach. First, MARS modifies the RSGA pipeline via two filtering mechanisms and a quantization scheme, reducing hardware demands and optimizing for in-storage execution. Second, MARS accelerates the RSGA steps directly within the storage by leveraging both Processing-Near-Memory and Processing-Using-Memory paradigms.
Third, MARS orchestrates the execution of all steps to fully exploit in-storage parallelism and minimize data movement. Our evaluation shows that MARS outperforms basecalling-based software and hardware-accelerated state-of-the-art read mapping pipelines by 93x and 40x, on average across different datasets, while reducing their energy consumption by 427x and 72x.
2025-06-12T17:38:12Z
Melina Soysal, Konstantina Koliogeorgi, Can Firtina, Nika Mansouri Ghiasi, Rakesh Nadig, Haiyu Mao, Geraldo F. Oliveira, Yu Liang, Klea Zambaku, Mohammad Sadrosadati, Onur Mutlu
10.1145/3721145.3730428

http://arxiv.org/abs/2507.02980v1
Modeling Gene Expression Distributional Shifts for Unseen Genetic Perturbations
2025-07-01T06:04:28Z
We train a neural network to predict distributional responses in gene expression following genetic perturbations. This is an essential task in early-stage drug discovery, where such responses can offer insights into gene function and inform target identification. Existing methods only predict changes in the mean expression, overlooking stochasticity inherent in single-cell data. In contrast, we offer a more realistic view of cellular responses by modeling expression distributions. Our model predicts gene-level histograms conditioned on perturbations and outperforms baselines in capturing higher-order statistics, such as variance, skewness, and kurtosis, at a fraction of the training cost. To generalize to unseen perturbations, we incorporate prior knowledge via gene embeddings from large language models (LLMs). While modeling a richer output space, the method remains competitive in predicting mean expression changes. This work offers a practical step towards more expressive and biologically informative models of perturbation effects.
2025-07-01T06:04:28Z
Kalyan Ramakrishnan, Jonathan G. Hedley, Sisi Qu, Puneet K. Dokania, Philip H. S. Torr, Cesar A.
Prada-Medina, Julien Fauqueur, Kaspar Martens

http://arxiv.org/abs/2506.24013v1
CoMMiT: Co-informed inference of microbiome-metabolome interactions via transfer learning
2025-06-30T16:18:06Z
Recent multi-omic microbiome studies enable integrative analysis of microbes and metabolites, uncovering their associations with various host conditions. Such analyses require multivariate models capable of accounting for the complex correlation structures between microbes and metabolites. However, existing multivariate models often suffer from low statistical power for detecting microbiome-metabolome interactions due to small sample sizes and weak biological signals. To address these challenges, we introduce CoMMiT, Co-informed inference of Microbiome-Metabolome Interactions via novel Transfer learning models. Unlike conventional transfer-learning methods that borrow information from external datasets, CoMMiT leverages similarities across metabolites within a single cohort, reducing the risk of negative transfer often caused by differences in sequencing platforms and bioinformatic pipelines across studies. CoMMiT operates under the flexible assumption that auxiliary metabolites are collectively informative for the target metabolite, without requiring individual auxiliary metabolites to be informative. CoMMiT uses a novel data-driven approach to selecting the optimal set of auxiliary metabolites. Using this optimal set, CoMMiT employs a de-biasing framework to enable efficient calculation of p-values, facilitating the identification of statistically significant microbiome-metabolome interactions.
Applying CoMMiT to a feeding study reveals biologically meaningful microbiome-metabolome interactions under a low glycemic load diet, demonstrating the diet-host link through gut metabolism.
2025-06-30T16:18:06Z
38 pages, 5 figures
Leiyue Li, Chenglong Ye, Tim Randolph, Meredith Hullar, Johanna Lampe, Marian Neuhouser, Daniel Raftery, Yue Wang

http://arxiv.org/abs/2506.22963v1
CN-SBM: Categorical Block Modelling For Primary and Residual Copy Number Variation
2025-06-28T17:45:45Z
Cancer is a genetic disorder whose clonal evolution can be monitored by tracking noisy genome-wide copy number variants. We introduce the Copy Number Stochastic Block Model (CN-SBM), a probabilistic framework that jointly clusters samples and genomic regions based on discrete copy number states using a bipartite categorical block model. Unlike models relying on Gaussian or Poisson assumptions, CN-SBM respects the discrete nature of CNV calls and captures subpopulation-specific patterns through block-wise structure. Using a two-stage approach, CN-SBM decomposes CNV data into primary and residual components, enabling detection of both large-scale chromosomal alterations and finer aberrations. We derive a scalable variational inference algorithm for application to large cohorts and high-resolution data. Benchmarks on simulated and real datasets show improved model fit over existing methods. Applied to TCGA low-grade glioma data, CN-SBM reveals clinically relevant subtypes and structured residual variation, aiding patient stratification in survival analysis.
These results establish CN-SBM as an interpretable, scalable framework for CNV analysis with direct relevance for tumor heterogeneity and prognosis.
2025-06-28T17:45:45Z
8 pages, 4 figures
Kevin Lam, William Daniels, J Maxwell Douglas, Daniel Lai, Samuel Aparicio, Benjamin Bloem-Reddy, Yongjin Park

http://arxiv.org/abs/2506.22641v1
Diversity by Design: Addressing Mode Collapse Improves scRNA-seq Perturbation Modeling on Well-Calibrated Metrics
2025-06-27T21:12:46Z
Recent benchmarks reveal that models for single-cell perturbation response are often outperformed by simply predicting the dataset mean. We trace this anomaly to a metric artifact: control-referenced deltas and unweighted error metrics reward mode collapse whenever the control is biased or the biological signal is sparse. Large-scale \textit{in silico} simulations and analysis of two real-world perturbation datasets confirm that shared reference shifts, not genuine biological change, drive high performance in these evaluations. We introduce differentially expressed gene (DEG)-aware metrics, weighted mean-squared error (WMSE) and weighted delta $R^{2}$ ($R^{2}_{w}(Δ)$) with respect to all perturbations, that measure error in niche signals with high sensitivity. We further introduce negative and positive performance baselines to calibrate these metrics. With these improvements, the mean baseline sinks to null performance while genuine predictors are correctly rewarded. Finally, we show that using WMSE as a loss function reduces mode collapse and improves model performance.
2025-06-27T21:12:46Z
Gabriel M. Mejia, Henry E. Miller, Francis J. A. Leblanc, Bo Wang, Brendan Swain, Lucas Paulo de Lima Camillo

http://arxiv.org/abs/2507.00154v1
Five-Gene Expression Formula Accurately Detects Hepatocellular Carcinoma Tumors
2025-06-27T17:08:24Z
Hepatocellular carcinoma (HCC) is one of the leading causes of cancer-related deaths worldwide.
Several diagnostic methods, such as imaging modalities and Serum Alpha-Fetoprotein (AFP) testing, have been used for HCC detection; however, their effectiveness is limited to later stages of the disease. In contrast, transcriptomic analysis of biopsy samples has shown promise for early detection. While machine learning techniques have been applied to transcriptomic data for cancer detection, their clinical adoption remains limited due to challenges such as poor generalizability across different datasets, lack of interpretability, and high computational complexity. To address these limitations, we developed a novel predictive formula for HCC detection using the Kolmogorov-Arnold Network (KAN). This formula is based on the expression levels of five genes: VIPR1, CYP1A2, FCN3, ECM1, and LIFR. Derived from the GSE25097 dataset, the formula offers a simple, interpretable, efficient, and accessible approach for HCC identification. It achieves 99% accuracy on the GSE25097 test set and demonstrates robust performance on six additional independent datasets, achieving accuracies above 90% in all cases. These findings highlight the critical role of these five genes as biomarkers for HCC detection, offering a foundation for future research and clinical applications to improve HCC diagnostic approaches.
2025-06-27T17:08:24Z
It has been accepted for publication in Biotechnology Journal
Aram Ansary Ogholbake, Qiang Cheng

http://arxiv.org/abs/2506.22228v1
Uncovering smooth structures in single-cell data with PCS-guided neighbor embeddings
2025-06-27T13:45:55Z
Single-cell sequencing is revolutionizing biology by enabling detailed investigations of cell-state transitions. Many biological processes unfold along continuous trajectories, yet it remains challenging to extract smooth, low-dimensional representations from inherently noisy, high-dimensional single-cell data.
Neighbor embedding (NE) algorithms, such as t-SNE and UMAP, are widely used to embed high-dimensional single-cell data into low dimensions. But they often introduce undesirable distortions, resulting in misleading interpretations. Existing evaluation methods for NE algorithms primarily focus on separating discrete cell types rather than capturing continuous cell-state transitions, while dynamic modeling approaches rely on strong assumptions about cellular processes and specialized data. To address these challenges, we build on the Predictability-Computability-Stability (PCS) framework for reliable and reproducible data-driven discoveries. First, we systematically evaluate popular NE algorithms through empirical analysis, simulation, and theory, and reveal their key shortcomings, such as artifacts and instability. We then introduce NESS, a principled and interpretable machine learning approach to improve NE representations by leveraging algorithmic stability and to enable robust inference of smooth biological structures. NESS offers useful concepts, quantitative stability metrics, and efficient computational workflows to uncover developmental trajectories and cell-state transitions in single-cell data. Finally, we apply NESS to six single-cell datasets, spanning pluripotent stem cell differentiation, organoid development, and multiple tissue-specific lineage trajectories. Across these diverse contexts, NESS consistently yields useful biological insights, such as identification of transitional and stable cell states and quantification of transcriptional dynamics during development.
2025-06-27T13:45:55Z
Rong Ma, Xi Li, Jingyuan Hu, Bin Yu

http://arxiv.org/abs/2507.05265v1
BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects
2025-06-26T13:56:32Z
Large language models (LLMs) trained on text demonstrated remarkable results on natural language processing (NLP) tasks.
These models have been adapted to decipher the language of DNA, where sequences of nucleotides act as "words" that encode genomic functions. However, the genome differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. Although DNA language models (DNALMs) such as DNABERT and GENA-LM have achieved a high level of performance on genome-related biological tasks, these models do not encode biological functions in the presence of sequence variations. To address this problem, we pre-train foundation models that effectively integrate sequence variations, in particular Single Nucleotide Polymorphisms (SNPs), as they underlie important biological functions. Specifically, we use ModernBERT to pre-train two different Biomedical Foundation Models (BMFM), namely, BMFM-DNA-REF, in which the model is trained with sequences of varying lengths along with their reverse complements derived from the reference genome, and BMFM-DNA-SNP, in which the model is trained with sequences created using a novel representation scheme that encodes sequence variations. Our findings indicate that integrating sequence variations into DNALMs helps capture the biological functions as seen in improvements on all fine-tuning tasks. To explore the model's practical utility, we experimented with various strategies for SNP imputation on the promoter detection task introduced in DNABERT-2. However, we acknowledge that the current benchmarks are limited in their ability to fully evaluate these models.
To enable more comprehensive assessment in the future and encourage community contributions, we release our models through HuggingFace and the code to reproduce the results at https://github.com/BiomedSciAI/biomed-multi-omic
2025-06-26T13:56:32Z
Hongyang Li, Sanjoy Dey, Bum Chul Kwon, Michael Danziger, Michal Rosen-Tzvi, Jianying Hu, James Kozloski, Ching-Huei Tsou, Bharath Dandala, Pablo Meyer

http://arxiv.org/abs/2506.20769v1
inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences
2025-06-25T19:03:53Z
The accurate development, assessment, interpretation, and benchmarking of bioinformatics frameworks for analyzing transcriptional regulatory grammars rely on controlled simulations to validate the underlying methods. However, existing simulators often lack end-to-end flexibility or ease of integration, which limits their practical use. We present inMOTIFin, a lightweight, modular, and user-friendly Python-based software that addresses these gaps by providing versatile and efficient simulation and modification of DNA regulatory sequences. inMOTIFin enables users to simulate or modify regulatory sequences efficiently for the customizable generation of motifs and insertion of motif instances with precise control over their positions, co-occurrences, and spacing, as well as direct modification of real sequences, facilitating a comprehensive evaluation of motif-based methods and interpretation tools. We demonstrate inMOTIFin applications for the assessment of de novo motif discovery prediction, the analysis of transcription factor cooperativity, and the support of explainability analyses for deep learning models. inMOTIFin ensures robust and reproducible analyses for studying transcriptional regulatory grammars.
inMOTIFin is available at PyPI https://pypi.org/project/inMOTIFin/ and Docker Hub https://hub.docker.com/r/cbgr/inmotifin. Detailed documentation is available at https://inmotifin.readthedocs.io/en/latest/. The code for use case analyses is available at https://bitbucket.org/CBGR/inmotifin_evaluation/src/main/.
2025-06-25T19:03:53Z
Katalin Ferenc, Lorenzo Martini, Ieva Rauluseviciute, Geir Kjetil Sandve, Anthony Mathelier

http://arxiv.org/abs/2412.01561v3
Harnessing the Potential of Spatial Statistics for Spatial Omics Data with pasta
2025-06-25T12:10:57Z
Spatial omics assays allow for the molecular characterisation of cells in their spatial context. Notably, the two main technological streams, imaging-based and high-throughput sequencing-based, can give rise to very different data modalities. The characteristics of the two data types are well known in adjacent fields such as spatial statistics as point patterns and lattice data, and there is a wide range of tools available. This paper discusses the application of spatial statistics to spatially-resolved omics data and in particular, discusses various advantages, challenges, and nuances. This work is accompanied by a vignette, pasta, that showcases the usefulness of spatial statistics in biology using several R packages.
2024-12-02T14:50:13Z
Martin Emons, Samuel Gunz, Helena L. Crowell, Izaskun Mallona, Reinhard Furrer, Mark D. Robinson