https://arxiv.org/api/GcJKCGqCkJLMR1EjuF5LSoTLY6c2026-06-14T07:21:36Z384827015http://arxiv.org/abs/2512.03111v1PanFoMa: A Lightweight Foundation Model and Benchmark for Pan-Cancer2025-12-02T08:31:31ZSingle-cell RNA sequencing (scRNA-seq) is essential for decoding tumor heterogeneity. However, pan-cancer research still faces two key challenges: learning discriminative and efficient single-cell representations, and establishing a comprehensive evaluation benchmark. In this paper, we introduce PanFoMa, a lightweight hybrid neural network that combines the strengths of Transformers and state-space models to achieve a balance between performance and efficiency. PanFoMa consists of a front-end local-context encoder with shared self-attention layers to capture complex, order-independent gene interactions; and a back-end global sequential feature decoder that efficiently integrates global context using a linear-time state-space model. This modular design preserves the expressive power of Transformers while leveraging the scalability of Mamba to enable transcriptome modeling, effectively capturing both local and global regulatory signals. To enable robust evaluation, we also construct a large-scale pan-cancer single-cell benchmark, PanFoMaBench, containing over 3.5 million high-quality cells across 33 cancer subtypes, curated through a rigorous preprocessing pipeline. Experimental results show that PanFoMa outperforms state-of-the-art models on our pan-cancer benchmark (+4.0\%) and across multiple public tasks, including cell type annotation (+7.4\%), batch integration (+4.0\%) and multi-omics integration (+3.1\%). 
The code is available at https://github.com/Xiaoshui-Huang/PanFoMa.2025-12-02T08:31:31ZAccepted by AAAI 2026Xiaoshui HuangTianlin ZhuYifan ZuoXue XiaZonghan WuJiebin YanDingli HuaZongyi XuYuming FangJian Zhanghttp://arxiv.org/abs/2512.02471v1scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing2025-12-02T07:04:38ZCell clustering is crucial for uncovering cellular heterogeneity in single-cell RNA sequencing (scRNA-seq) data by identifying cell types and marker genes. Despite its importance, benchmarks for scRNA-seq clustering methods remain fragmented, often lacking standardized protocols and failing to incorporate recent advances in artificial intelligence. To fill these gaps, we present scCluBench, a comprehensive benchmark of clustering algorithms for scRNA-seq data. First, scCluBench provides 36 scRNA-seq datasets collected from diverse public sources, covering multiple tissues, which are uniformly processed and standardized to ensure consistency for systematic evaluation and downstream analyses. To evaluate performance, we collect and reproduce a range of scRNA-seq clustering methods, including traditional, deep learning-based, graph-based, and biological foundation models. We comprehensively evaluate each method both quantitatively and qualitatively, using core performance metrics as well as visualization analyses. Furthermore, we construct representative downstream biological tasks, such as marker gene identification and cell type annotation, to further assess the practical utility. scCluBench then investigates the performance differences and applicability boundaries of various clustering models across diverse analytical tasks, systematically assessing their robustness and scalability in real-world scenarios. 
Overall, scCluBench offers a standardized and user-friendly benchmark for scRNA-seq clustering, with curated datasets, unified evaluation protocols, and transparent analyses, facilitating informed method selection and providing valuable insights into model generalizability and application scope.2025-12-02T07:04:38ZPing XuZaitian WangZhirui WangPengjiang LiJiajia WangRan ZhangPengfei WangYuanchun Zhouhttp://arxiv.org/abs/2510.24888v2Gosling Designer: a Platform to Democratize Construction and Sharing of Genomics Data Visualization Tools2025-12-01T16:41:17ZAnalysis of genomics data is central to nearly all areas of modern biology. Despite significant progress in artificial intelligence (AI) and computational methods, these technologies require significant human oversight to generate novel and reliable biological insights. Consequently, the genomics community has developed a substantial number of diverse visualization approaches and a proliferation of tools that biologists rely on in their data analysis workflows. While there are a few commonly used visualization tools for genomics data, many tools target specific use cases for genomics data interpretation and offer only a limited, predefined set of visualization types. Moreover, static visualizations often fail to support exploratory analysis. Developing interactive visualizations and tools typically requires significant time and technical expertise, even when supported by modern LLM-powered coding assistants, and the resulting visualizations can be difficult to share among collaborators. We developed Gosling Designer, an all-in-one platform for editing, exploring, and sharing visualizations of genomics data. 
Gosling Designer addresses four key challenges observed in existing genomics visualization tools: (1) limited versatility, (2) difficulty of visualization authoring, (3) complexity of data management, and (4) barriers to sharing and collaboration.2025-10-28T18:48:02ZSehi L'YiJohn ConroyPriya MisnerDavid KouřilAstrid van den BrandtLisa ChoyNezar AbdennurNils Gehlenborghttp://arxiv.org/abs/2502.07299v3Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification2025-11-29T20:25:30ZThe interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. Although modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains underexplored. This paper follows the guidance of the central dogma to redesign both the data and model pipeline and offers a comprehensive framework, Life-Code, that spans different biological functions. As for data flow, we propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. As for the model, we design a codon tokenizer and a hybrid long-sequence architecture to encode the interactions between coding and non-coding regions through masked modeling pre-training. To model the translation and folding process with coding sequences, Life-Code learns protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. Such designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics with the central dogma. 
Extensive experiments show that Life-Code achieves state-of-the-art results on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.2025-02-11T06:53:59ZPreprint V3 (10 pages main text)Zicheng LiuSiyuan LiZhiyuan ChenChang YuQirong YangYucheng GuoYujie YangXiaoming ZhangStan Z. Lihttp://arxiv.org/abs/2511.22821v1deepFEPS: Deep Learning-Oriented Feature Extraction for Biological Sequences2025-11-28T00:55:43ZMachine- and deep-learning approaches for biological sequences depend critically on transforming raw DNA, RNA, and protein FASTA files into informative numerical representations. However, this process is often fragmented across multiple libraries and preprocessing steps, which creates a barrier for researchers without extensive computational expertise. To address this gap, we developed deepFEPS, an open-source toolkit that unifies state-of-the-art feature extraction methods for sequence data within a single, reproducible workflow. deepFEPS integrates five families of modern feature extractors - k-mer embeddings (Word2Vec, FastText), document-level embeddings (Doc2Vec), transformer-based encoders (DNABERT, ProtBERT, and ESM2), autoencoder-derived latent features, and graph-based embeddings - into one consistent platform. The system accepts FASTA input via a web interface or command-line tool, exposes key model parameters, and outputs analysis-ready feature matrices (CSV). Each run is accompanied by an automatic quality-control report including sequence counts, dimensionality, sparsity, variance distributions, class balance, and diagnostic visualizations. By consolidating advanced sequence embeddings into one environment, deepFEPS reduces preprocessing overhead, improves reproducibility, and shortens the path from raw sequences to downstream machine- and deep-learning applications. 
deepFEPS lowers the practical barrier to modern representation learning for bioinformatics, enabling both novice and expert users to generate advanced embeddings for classification, clustering, and predictive modeling. Its unified framework supports exploratory analyses, high-throughput studies, and integration into institutional workflows, while remaining extensible to emerging models and methods. The webserver is accessible at https://hdismail.com/deepfeps2/.2025-11-28T00:55:43Z16 pages, 6 figures, bioinformatics tool for genomics analysisHamid IsmailMarwan Bikdashhttp://arxiv.org/abs/2506.00096v3PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset2025-11-27T01:52:51ZAccurately predicting gene mutations, mutation subtypes and their exons in lung cancer is critical for personalized treatment planning and prognostic assessment. Faced with regional disparities in medical resources and the high cost of genomic assays, using artificial intelligence to infer these mutations and exon variants from routine histopathology images could greatly facilitate precision therapy. Although some prior studies have shown that deep learning can accelerate the prediction of key gene mutations from lung cancer pathology slides, their performance remains suboptimal and has so far been limited mainly to early screening tasks. To address these limitations, we have assembled PathGene, which comprises histopathology images paired with next-generation sequencing reports from 1,576 patients at the Second Xiangya Hospital, Central South University, and 448 TCGA-LUAD patients. This multi-center dataset links whole-slide images to driver gene mutation status, mutation subtypes, exon, and tumor mutational burden (TMB) status, with the goal of leveraging pathology images to predict mutations, subtypes, exon locations, and TMB for early genetic screening and to advance precision oncology. 
Unlike existing datasets, we provide molecular-level information related to histopathology images in PathGene to facilitate the development of biomarker prediction models. We benchmarked 11 multiple-instance learning methods on PathGene for mutation, subtype, exon, and TMB prediction tasks. These experimental methods provide valuable alternatives for early genetic screening of lung cancer patients and assisting clinicians to quickly develop personalized precision targeted treatment plans for patients. Code and data are available at https://github.com/panliangrui/NIPS2025/.2025-05-30T11:51:11ZWithdrawn due to issues related to data permissions/ethicsLiangrui PanQingchun LiangShen ZhaoSongqing FanShaoliang Penghttp://arxiv.org/abs/2511.20382v2MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers2025-11-26T09:57:09ZRepresentation learning on multi-omics data is challenging due to extreme dimensionality, modality heterogeneity, and cohort-specific batch effects. While pre-trained transformer backbones have shown broad generalization capabilities in biological sequence modeling, their application to multi-omics integration remains underexplored. We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Unlike purely generative approaches, MoRE employs a parameter-efficient fine-tuning (PEFT) strategy, prioritizing cross-sample and cross-modality alignment over simple sequence reconstruction. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. It optimizes a masked modeling objective jointly with supervised contrastive and batch-invariant alignment losses, yielding structure-preserving embeddings that generalize across unseen cell types and platforms. 
We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with Scrublet, evaluating integration fidelity, rare population detection, and modality transfer. Our results demonstrate that MoRE achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. This work positions MoRE as a practical step toward general-purpose omics foundation models.2025-11-25T15:04:06ZAudrey Pei-Hsuan Chenhttp://arxiv.org/abs/2511.20727v1SeqManager: A Web-Based Tool for Efficient Sequencing Data Storage Management and Duplicate Detection2025-11-25T10:21:08ZMotivation: Modern genomics laboratories generate massive volumes of sequencing data, often resulting in significant storage costs. Genomics storage consists of duplicate files, temporary processing files, and redundant intermediate data. Results: We developed SeqManager, a web-based application that provides automated identification, classification, and management of sequencing data files with intelligent duplicate detection. It also detects intermediate sequencing files that can safely be removed. Evaluation across four genomics laboratory settings demonstrates that our tool is fast and has a very low memory footprint.2025-11-25T10:21:08ZBioinformatics Advances, 2025Margot CelerieIGHAndrew OldfieldIGHWilliam RitchieIGHhttp://arxiv.org/abs/2511.19153v1Fast and Flexible Flow Decompositions in General Graphs via Dominators2025-11-24T14:18:51ZMulti-assembly methods rely at their core on a flow decomposition problem, namely, decomposing a weighted graph into weighted paths or walks. However, most results over the past decade have focused on decompositions over directed acyclic graphs (DAGs). This limitation has led either to purely heuristic methods or, in applications, to transforming a graph with cycles into a DAG via preprocessing heuristics. 
In this paper, we show that flow decomposition problems can also be solved in practice on general graphs with cycles, via a framework that yields fast and flexible Mixed Integer Linear Programming (MILP) formulations.
Our key technique relies on the graph-theoretic notion of a dominator tree, which we use to find all safe sequences of edges, that is, sequences guaranteed to appear in some walk of any flow decomposition solution. We generalize previous results from DAGs to cyclic graphs by showing that maximal safe sequences correspond to extensions of common leaves of two dominator trees, and that we can find all of them in time linear in their size. Using these, we can accelerate MILPs for any flow decomposition into walks in general graphs, by setting to (at least) 1 suitable variables encoding solution walks, and by setting to 0 the variables of other walks that are not reachable to or from safe sequences. This reduces model size and eliminates costly linearizations of MILP variable products.
We experiment with three decomposition models (Minimum Flow Decomposition, Least Absolute Errors and Minimum Path Error) on four bacterial datasets. Our pre-processing enables speedups of up to a thousandfold and solves, in under 30 seconds, many instances that otherwise time out. We thus hope that our dominator-based MILP simplification framework and the accompanying software library can become building blocks in multi-assembly applications.2025-11-24T14:18:51ZFrancisco SenaAlexandru I. Tomescuhttp://arxiv.org/abs/2511.19068v1The TAG array of a multiple sequence alignment2025-11-24T13:05:25ZModern genomic analyses increasingly rely on pangenomes, that is, representations of the genome of entire populations. The simplest representation of a pangenome is a set of individual genome sequences. Compared to e.g. sequence graphs, this has the advantage that efficient exact search via indexes based on the Burrows-Wheeler Transform (BWT) is possible, that no chimeric sequences are created, and that the results are not influenced by heuristics. However, such an index may report a match in thousands of positions even if these all correspond to the same locus, making downstream analysis unnecessarily expensive. For sufficiently similar sequences (e.g. human chromosomes), a multiple sequence alignment (MSA) can be computed. Since an MSA tends to group similar strings in the same columns, it is likely that a string occurring thousands of times in the pangenome can be described by very few columns in the MSA. We describe a method to tag entries in the BWT with the corresponding column in the MSA and develop an index that can map matches in the BWT to columns in the MSA in time proportional to the output. 
As a by-product, we can efficiently project a match to a designated reference genome, a capability that current pangenome aligners based on the BWT lack.2025-11-24T13:05:25ZJannik OlbrichEnno Ohlebuschhttp://arxiv.org/abs/2511.18336v1Auxiliary Gene Learning: Spatial Gene Expression Estimation by Auxiliary Gene Selection2025-11-23T08:22:20ZSpatial transcriptomics (ST) is a novel technology that enables the observation of gene expression at the resolution of individual spots within pathological tissues. ST quantifies the expression of tens of thousands of genes in a tissue section; however, heavy observational noise is often introduced during measurement. In prior studies, to ensure meaningful assessment, both training and evaluation have been restricted to only a small subset of highly variable genes, and genes outside this subset have also been excluded from the training process. However, since there are likely co-expression relationships between genes, low-expression genes may still contribute to the estimation of the evaluation target. In this paper, we propose $Auxiliary \ Gene \ Learning$ (AGL) that utilizes the benefit of the ignored genes by reformulating their expression estimation as auxiliary tasks and training them jointly with the primary tasks. To effectively leverage auxiliary genes, we must select a subset of auxiliary genes that positively influence the prediction of the target genes. However, this is a challenging optimization problem due to the vast number of possible combinations. To overcome this challenge, we propose Prior-Knowledge-Based Differentiable Top-$k$ Gene Selection via Bi-level Optimization (DkGSB), a method that ranks genes by leveraging prior knowledge and relaxes the combinatorial selection problem into a differentiable top-$k$ selection problem. 
The experiments confirm the effectiveness of incorporating auxiliary genes and show that the proposed method outperforms conventional auxiliary task learning approaches.2025-11-23T08:22:20ZAccepted to Association for the Advancement of Artificial Intelligence (AAAI) 2026Kaito ShikuKazuya NishimuraShinnosuke MatsuoYasuhiro KojimaRyoma Bisehttp://arxiv.org/abs/2508.08312v3CFM-GP: Unified Conditional Flow Matching to Learn Gene Perturbation Across Cell Types2025-11-23T04:40:23ZUnderstanding gene perturbation effects across diverse cellular contexts is a central challenge in functional genomics, with important implications for therapeutic discovery and precision medicine. Single-cell technologies enable high-resolution measurement of transcriptional responses, but collecting such data is costly and time-consuming, especially when repeated for each cell type. Existing computational methods often require separate models per cell type, limiting scalability and generalization. We present CFM-GP, a method for cell type-agnostic gene perturbation prediction. CFM-GP learns a continuous, time-dependent transformation between unperturbed and perturbed gene expression distributions, conditioned on cell type, allowing a single model to predict across all cell types. Unlike prior approaches that use discrete modeling, CFM-GP employs a flow matching objective to capture perturbation dynamics in a scalable manner. We evaluate on five datasets: SARS-CoV-2 infection, IFN-beta stimulated PBMCs, glioblastoma treated with Panobinostat, lupus under IFN-beta stimulation, and Statefate progenitor fate mapping. CFM-GP consistently outperforms state-of-the-art baselines in R-squared and Spearman correlation, and pathway enrichment analysis confirms recovery of key biological pathways. These results demonstrate the robustness and biological fidelity of CFM-GP as a scalable solution for cross-cell type gene perturbation prediction.2025-08-09T00:00:17Z28 Pages, 19 Tables, 8 Figures. 
The first two authors contributed equallyAbrar Rahman AbirSajib Acharjee DipLiqing Zhanghttp://arxiv.org/abs/2411.11158v2Leveraging genomic deep learning models for the prediction of non-coding variant effects2025-11-22T18:24:01ZCharacterizing non-coding variant function remains an important challenge in human genetics. Genomic deep learning models have emerged as a promising approach to enable in silico prediction of variant effects. These include supervised sequence-to-activity models, which predict molecular phenotypes such as genome-wide chromatin states or gene expression levels directly from DNA sequence, and self-supervised genomic language models. Here, we review progress in leveraging these models for non-coding variant effect prediction. We describe practical considerations for making such predictions and categorize the types of ground truth data used to evaluate variant effect predictions, providing insight into the settings in which current models are most useful. Our Review highlights key considerations for practitioners and opportunities for improvement in model development and evaluation.2024-11-17T19:15:42ZPooja KathailAyesha BajwaNilah M. Ioannidishttp://arxiv.org/abs/2507.04981v4Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning2025-11-22T08:09:08ZT cell receptor (TCR) repertoires encode critical immunological signatures for autoimmune diseases, yet their clinical application remains limited by sequence sparsity and low witness rates. We developed EAMil, a multi-instance deep learning framework that leverages TCR sequencing data to diagnose systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) with exceptional accuracy. By integrating PrimeSeq feature extraction with ESMonehot encoding and enhanced gate attention mechanisms, our model achieved state-of-the-art performance with AUCs of 98.95% for SLE and 97.76% for RA. 
EAMil successfully identified disease-associated genes with over 90% concordance with established differential analyses and effectively distinguished disease-specific TCR genes. The model demonstrated robustness in classifying multiple disease categories, utilizing the SLEDAI score to stratify SLE patients by disease severity as well as to diagnose the site of damage in SLE patients, and effectively controlling for confounding factors such as age and gender. This interpretable framework for immune receptor analysis provides new insights for autoimmune disease detection and classification with broad potential clinical applications across immune-mediated conditions.2025-07-07T13:24:41Z4 figures, 3 tables, 8 pagesRuihao ZhangMao chenFei YeDandan MengYixuan HuangXiao Liuhttp://arxiv.org/abs/2401.11858v3Approximating a gene regulatory network from non-sequential data2025-11-21T17:40:50ZGiven non-sequential snapshots from instances of a dynamical system, we design a compressed sensing based algorithm that reconstructs the dynamical system. On the theoretical side, we show that: (1) successful reconstruction is possible under the assumption that we can construct an approximate clock from a subset of the coordinates of the underlying system, and (2) computing the minimal Lyapunov exponent of the dynamical system, where the minimum is taken over all subsets of coordinates of the dynamical system, equates to computing a min-max equilibrium. We design an efficient randomized algorithm for computing the above equilibrium.
As an application of our theoretical results, we reconstruct the underlying dynamical system from publicly available RNA-seq data to: (1) predict the underlying gene regulatory networks (as opposed to individual genes) that may help differentiate between metastatic vs non-metastatic breast cancer (and also colorectal cancer), and (2) identify candidate genes that could be used as target biomarkers for basket trials. In particular, our in silico analysis suggests that RORC agonists, which are already used in colorectal cancer therapies, may be worth investigating for breast cancers.2024-01-22T11:27:42ZCliff SteinPratik Worah