http://arxiv.org/abs/2508.08014v1
ShortCake: An integrated platform for efficient and reproducible single-cell analysis
Updated: 2025-08-11T14:21:11Z
Motivation: Recent advances in single-cell analysis have introduced new computational challenges. Researchers often need to use multiple analysis tools written in different programming languages while managing version conflicts between related packages within a single workflow. For the research community, minimizing the time spent on environment setup and installation issues is essential. Results: We present ShortCake, a containerized platform that integrates a suite of single-cell analysis tools written in R and Python. ShortCake isolates competing Python tools into separate virtual environments that can be easily accessed within a Jupyter notebook. This enables users to effortlessly transition between various environments, including R, even within a single notebook. Additionally, ShortCake offers multiple ``flavors,'' enabling users to select container images tailored to their specific needs. ShortCake provides a unified environment with fixed versions of various tools, thus streamlining workflows, reducing setup time, and improving reproducibility. Availability and implementation: The ShortCake image is available on DockerHub (https://hub.docker.com/r/rnakato/shortcake). The source code is available on GitHub (https://github.com/rnakato/ShortCake).
Published: 2025-08-11T14:21:11Z
Ryuichiro Nakato, Luis Augusto Eijy Nagai

http://arxiv.org/abs/2508.09212v1
Deep Generative Models for Discrete Genotype Simulation
Updated: 2025-08-11T11:56:03Z
Deep generative models open new avenues for simulating realistic genomic data while preserving privacy and addressing data accessibility constraints.
While previous studies have primarily focused on generating gene expression or haplotype data, this study explores generating genotype data in both unconditioned and phenotype-conditioned settings, which is inherently more challenging due to the discrete nature of genotype data. In this work, we developed and evaluated commonly used generative models, including Variational Autoencoders (VAEs), Diffusion Models, and Generative Adversarial Networks (GANs), and proposed adaptations tailored to discrete genotype data. We conducted extensive experiments on large-scale datasets, including all chromosomes from cow and multiple chromosomes from human. Model performance was assessed using a well-established set of metrics drawn from both deep learning and quantitative genetics literature. Our results show that these models can effectively capture genetic patterns and preserve genotype-phenotype association. Our findings provide a comprehensive comparison of these models and offer practical guidelines for future research in genotype simulation. We have made our code publicly available at https://github.com/SihanXXX/DiscreteGenoGen.
Published: 2025-08-11T11:56:03Z
Sihan Xie (GABI), Thierry Tribout (GABI), Didier Boichard (GABI), Blaise Hanczar (IBISC), Julien Chiquet (MIA Paris-Saclay), Eric Barrey (GABI)

http://arxiv.org/abs/2406.13839v4
RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design
Updated: 2025-08-11T08:09:33Z
We introduce RNA-FrameFlow, the first generative model for 3D RNA backbone design. We build upon SE(3) flow matching for protein backbone generation and establish protocols for data preparation and evaluation to address unique challenges posed by RNA modeling. We formulate RNA structures as a set of rigid-body frames and associated loss functions which account for larger, more conformationally flexible RNA backbones (13 atoms per nucleotide) vs. proteins (4 atoms per residue).
Toward tackling the lack of diversity in 3D RNA datasets, we explore training with structural clustering and cropping augmentations. Additionally, we define a suite of evaluation metrics to measure whether the generated RNA structures are globally self-consistent (via inverse folding followed by forward folding) and locally recover RNA-specific structural descriptors. The most performant version of RNA-FrameFlow generates locally realistic RNA backbones of 40-150 nucleotides, over 40% of which pass our validity criteria as measured by a self-consistency TM-score >= 0.45, at which two RNAs have the same global fold. Open-source code: https://github.com/rish-16/rna-backbone-design
Published: 2024-06-19T21:06:44Z
Published in Transactions on Machine Learning Research (https://openreview.net/forum?id=wOc1Yx5s09). Also presented as an Oral at Machine Learning in Computational Biology 2024, ICML 2024 Structured Probabilistic Inference & Generative Modeling Workshop, and a Spotlight at ICML 2024 AI4Science Workshop
Rishabh Anand, Chaitanya K. Joshi, Alex Morehead, Arian R. Jamasb, Charles Harris, Simon V. Mathis, Kieran Didi, Rex Ying, Bryan Hooi, Pietro Liò

http://arxiv.org/abs/2508.07127v1
How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?
Updated: 2025-08-10T00:19:29Z
Cardiovascular disease (CVD) prediction remains a tremendous challenge due to its multifactorial etiology and global burden of morbidity and mortality. Despite the growing availability of genomic and electrophysiological data, extracting biologically meaningful insights from such high-dimensional, noisy, and sparsely annotated datasets remains a non-trivial task. Recently, LLMs have been applied effectively to predict structural variations in biological sequences. In this work, we explore the potential of fine-tuned LLMs to predict cardiac diseases and SNPs potentially leading to CVD risk using genetic markers derived from high-throughput genomic profiling.
We investigate the effect of genetic patterns associated with cardiac conditions and evaluate how LLMs can learn latent biological relationships from structured and semi-structured genomic data obtained by mapping genetic aspects that are inherited from the family tree. By framing the problem as a Chain of Thought (CoT) reasoning task, the models are prompted to generate disease labels and articulate informed clinical deductions across diverse patient profiles and phenotypes. The findings highlight the promise of LLMs in contributing to early detection, risk assessment, and ultimately, the advancement of personalized medicine in cardiac care.
Published: 2025-08-10T00:19:29Z
Niranjana Arun Menon, Iqra Farooq, Yulong Li, Sara Ahmed, Yutong Xie, Muhammad Awais, Imran Razzak

http://arxiv.org/abs/2507.03718v2
Finding easy regions for short-read variant calling from pangenome data
Updated: 2025-08-08T03:59:01Z
Background: While benchmarks on short-read variant calling suggest a low error rate below 0.5%, they are only applicable to predefined confident regions. For a human sample without such regions, the error rate could be 10 times higher. Although multiple sets of easy regions have been identified to alleviate the issue, they fail to consider non-reference samples or are biased towards existing short-read data or aligners.
Results: Here, using hundreds of high-quality human assemblies, we derived a set of sample-agnostic easy regions where short-read variant calling reaches high accuracy. These regions cover 88.2% of GRCh38, 92.2% of coding regions and 96.3% of ClinVar pathogenic variants. They achieve a good balance between coverage and easiness and can be generated for other human assemblies or species with multiple well assembled genomes.
Conclusion: This resource provides a convenient and powerful way to filter spurious variant calls for clinical or research human samples.
Published: 2025-07-04T17:11:15Z
Heng Li
DOI: 10.1093/gigascience/giaf103

http://arxiv.org/abs/2506.15671v2
Quantum-inspired algorithm for simulating viral response
Updated: 2025-08-06T18:20:12Z
Understanding the properties of biological systems is an exciting avenue for applying advanced approaches to solving corresponding computational tasks. A specific class of problems that arises in the resolution of biological challenges is optimization. In this work, we present the results of a proof-of-concept study that applies a quantum-inspired optimization algorithm to simulate a viral response. We formulate an Ising-type model to describe the patterns of gene activity in host responses. Reducing the problem to the Ising form allows the use of available quantum and quantum-inspired optimization tools. We demonstrate the application of a quantum-inspired optimization algorithm to this problem. Our study paves the way for exploring the full potential of quantum and quantum-inspired optimization tools in biological applications.
Published: 2025-06-18T17:51:21Z
35 pages, 6 figures
Daria O. Konina, Dmitry I. Korbashov, Ilya V. Kovalchuk, Aygul A. Nizamieva, Dmitry A. Chermoshentsev, Aleksey K. Fedorov

http://arxiv.org/abs/2508.04757v1
Embedding Is (Almost) All You Need: Retrieval-Augmented Inference for Generalizable Genomic Prediction Tasks
Updated: 2025-08-06T14:15:48Z
Large pre-trained DNA language models such as DNABERT-2, Nucleotide Transformer, and HyenaDNA have demonstrated strong performance on various genomic benchmarks. However, most applications rely on expensive fine-tuning, which works best when the training and test data share a similar distribution. In this work, we investigate whether task-specific fine-tuning is always necessary.
We show that simple embedding-based pipelines that extract fixed representations from these models and feed them into lightweight classifiers can achieve competitive performance. In evaluation settings with different data distributions, embedding-based methods often outperform fine-tuning while reducing inference time by 10x to 20x. Our results suggest that embedding extraction is not only a strong baseline but also a more generalizable and efficient alternative to fine-tuning, especially for deployment in diverse or unseen genomic contexts. For example, in enhancer classification, HyenaDNA embeddings combined with zCurve achieve 0.68 accuracy (vs. 0.58 for fine-tuning), with an 88% reduction in inference time and over 8x lower carbon emissions (0.02 kg vs. 0.17 kg CO2). In non-TATA promoter classification, DNABERT-2 embeddings with zCurve or GC content reach 0.85 accuracy (vs. 0.89 with fine-tuning) with a 22x lower carbon footprint (0.02 kg vs. 0.44 kg CO2). These results show that embedding-based pipelines offer over 10x better carbon efficiency while maintaining strong predictive performance. The code is available here: https://github.com/NIRJHOR-DATTA/EMBEDDING-IS-ALMOST-ALL-YOU-NEED.
Published: 2025-08-06T14:15:48Z
Nirjhor Datta, Swakkhar Shatabda, M Sohel Rahman

http://arxiv.org/abs/2508.05692v1
SiCmiR Atlas: Single-Cell miRNA Landscapes Reveals Hub-miRNA and Network Signatures in Human Cancers
Updated: 2025-08-06T12:23:08Z
microRNAs are pivotal post-transcriptional regulators whose single-cell behavior has remained largely inaccessible owing to technical barriers in single-cell small-RNA profiling. We present SiCmiR, a two-layer neural network that predicts miRNA expression profiles from only 977 LINCS L1000 landmark genes, reducing sensitivity to dropout in single-cell RNA-seq data.
Proof-of-concept analyses illustrate how SiCmiR can uncover candidate hub-miRNAs in bulk-seq cell lines and hepatocellular carcinoma, scRNA-seq pancreatic ductal carcinoma and ACTH-secreting pituitary adenoma, and extracellular-vesicle-mediated crosstalk in glioblastoma. Trained on 6462 TCGA paired miRNA-mRNA samples, SiCmiR attains state-of-the-art accuracy on held-out cancers and generalizes to unseen cancer types, drug perturbations and scRNA-seq. We next constructed SiCmiR-Atlas, containing 632 public datasets, 9.36 million cells, and 726 cell types, which is the first dedicated database of single-cell mature miRNA expression, providing interactive visualization, biomarker identification and cell-type-resolved miRNA-target networks. SiCmiR transforms bulk-derived statistical power into a single-cell view of miRNA biology and provides a community resource, SiCmiR Atlas, for biomarker discovery. SiCmiR Atlas is available at https://awi.cuhk.edu.cn/~SiCmiR/.
Published: 2025-08-06T12:23:08Z
Xiao-Xuan Cai, Jing-Shan Liao, Jia-Jun Ma, Yu-Xuan Pang, Yi-Gang Chen, Yang-Chi-Dung Lin, Yi-Dan Chen, Xin Cao, Yi-Cheng Zhang, Tao-Sheng Xu, Tzong-Yi Lee, Hsi-Yuan Huang, Hsien-Da Huang
DOI: 10.1002/advs.202514446

http://arxiv.org/abs/2508.04743v1
Alz-QNet: A Quantum Regression Network for Studying Alzheimer's Gene Interactions
Updated: 2025-08-06T04:31:49Z
Understanding the molecular-level mechanisms underpinning Alzheimer's disease (AD) by studying crucial genes associated with the disease remains a challenge. Alzheimer's, being a multifactorial disease, requires understanding the gene-gene interactions underlying it for theranostics and progress. In this article, a novel attempt has been made using a quantum regression to decode how some crucial genes in AD (Amyloid Beta Precursor Protein ($APP$), Sterol regulatory element binding transcription factor 14 ($FGF14$), Yin Yang 1 ($YY1$), and Phospholipase D Family Member 3 ($PLD3$), etc.)
become influenced by other prominent switching genes during disease progression, which may help in gene expression-based therapy for AD. Our proposed Quantum Regression Network (Alz-QNet) introduces a pioneering approach with insights from the state-of-the-art Quantum Gene Regulatory Networks (QGRN) to unravel the gene interactions involved in AD pathology, particularly within the Entorhinal Cortex (EC), where early pathological changes occur. Using the proposed Alz-QNet framework, we explore the interactions between key genes ($APP$, $FGF14$, $YY1$, $EGR1$, $GAS7$, $AKT3$, $SREBF2$, and $PLD3$), all of which are believed to play a crucial role in the progression of AD, within the EC microenvironment of AD patients, using genetic samples from the database $GSE138852$. Our investigation uncovers intricate gene-gene interactions, shedding light on the potential regulatory mechanisms that underlie the pathogenesis of AD, which helps us find potential gene inhibitors or regulators for theranostics.
Published: 2025-08-06T04:31:49Z
Computers in Biology and Medicine, Volume 196, Part C, September 2025, 110837
Debanjan Konar, Neerav Sreekumar, Richard Jiang, Vaneet Aggarwal
DOI: 10.1016/j.compbiomed.2025.110837

http://arxiv.org/abs/2508.04742v1
Discovery of Disease Relationships via Transcriptomic Signature Analysis Powered by Agentic AI
Updated: 2025-08-06T04:25:40Z
Modern disease classification often overlooks molecular commonalities hidden beneath divergent clinical presentations. This study introduces a transcriptomics-driven framework for discovering disease relationships by analyzing over 1300 disease-condition pairs using GenoMAS, a fully automated agentic AI system. Beyond identifying robust gene-level overlaps, we develop a novel pathway-based similarity framework that integrates multi-database enrichment analysis to quantify functional convergence across diseases. The resulting disease similarity network reveals both known comorbidities and previously undocumented cross-category links.
By examining shared biological pathways, we explore potential molecular mechanisms underlying these connections, offering functional hypotheses that go beyond symptom-based taxonomies. We further show how background conditions such as obesity and hypertension modulate transcriptomic similarity, and identify therapeutic repurposing opportunities for rare diseases like autism spectrum disorder based on their molecular proximity to better-characterized conditions. In addition, this work demonstrates how biologically grounded agentic AI can scale transcriptomic analysis while enabling mechanistic interpretation across complex disease landscapes. All results are publicly accessible at github.com/KeeeeChen/Pathway_Similarity_Network.
Published: 2025-08-06T04:25:40Z
Ke Chen, Haohan Wang

http://arxiv.org/abs/2508.04739v1
CodonMoE: DNA Language Models for mRNA Analyses
Updated: 2025-08-06T01:40:12Z
Genomic language models (gLMs) face a fundamental efficiency challenge: either maintain separate specialized models for each biological modality (DNA and RNA) or develop large multi-modal architectures. Both approaches impose significant computational burdens - modality-specific models require redundant infrastructure despite inherent biological connections, while multi-modal architectures demand massive parameter counts and extensive cross-modality pretraining. To address this limitation, we introduce CodonMoE (Adaptive Mixture of Codon Reformative Experts), a lightweight adapter that transforms DNA language models into effective RNA analyzers without RNA-specific pretraining. Our theoretical analysis establishes CodonMoE as a universal approximator at the codon level, capable of mapping arbitrary functions from codon sequences to RNA properties given sufficient expert capacity.
Across four RNA prediction tasks spanning stability, expression, and regulation, DNA models augmented with CodonMoE significantly outperform their unmodified counterparts, with HyenaDNA+CodonMoE series achieving state-of-the-art results using 80% fewer parameters than specialized RNA models. By maintaining sub-quadratic complexity while achieving superior performance, our approach provides a principled path toward unifying genomic language modeling, leveraging more abundant DNA data and reducing computational overhead while preserving modality-specific performance advantages.
Published: 2025-08-06T01:40:12Z
Shiyi Du, Litian Liang, Jiayi Li, Carl Kingsford

http://arxiv.org/abs/2508.02743v1
A Novel cVAE-Augmented Deep Learning Framework for Pan-Cancer RNA-Seq Classification
Updated: 2025-08-02T16:57:31Z
Pan-cancer classification using transcriptomic (RNA-Seq) data can inform tumor subtyping and therapy selection, but is challenging due to extremely high dimensionality and limited sample sizes. In this study, we propose a novel deep learning framework that uses a class-conditional variational autoencoder (cVAE) to augment training data for pan-cancer gene expression classification. Using 801 tumor RNA-Seq samples spanning 5 cancer types from The Cancer Genome Atlas (TCGA), we first perform feature selection to reduce 20,531 gene expression features to the 500 most variably expressed genes. A cVAE is then trained on this data to learn a latent representation of gene expression conditioned on cancer type, enabling the generation of synthetic gene expression samples for each tumor class. We augment the training set with these cVAE-generated samples (doubling the dataset size) to mitigate overfitting and class imbalance. A two-layer multilayer perceptron (MLP) classifier is subsequently trained on the augmented dataset to predict tumor type.
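The feature-selection step in the abstract above (reducing 20,531 gene expression features to the 500 most variably expressed genes) might be sketched as follows. This is a minimal illustration, not the authors' code: it assumes per-gene population variance across samples as the variability measure, the helper name `top_variable_genes` is hypothetical, and the toy values are made up rather than taken from TCGA.

```python
from statistics import pvariance


def top_variable_genes(expr, gene_names, k):
    """Return the k gene names with the highest variance across samples.

    expr: list of samples, each a list of expression values (one per gene).
    Assumption: "most variably expressed" is read here as highest variance.
    """
    if k > len(gene_names):
        raise ValueError("k exceeds the number of genes")
    # Variance of each gene (column) across all samples (rows).
    variances = [pvariance(column) for column in zip(*expr)]
    # Rank gene indices by descending variance; ties keep original order.
    ranked = sorted(range(len(gene_names)), key=lambda i: -variances[i])
    return [gene_names[i] for i in ranked[:k]]


# Toy data: 4 samples x 3 genes (hypothetical values, not from TCGA).
toy_expr = [
    [1.0, 5.0, 2.0],
    [1.1, 9.0, 2.0],
    [0.9, 1.0, 2.0],
    [1.0, 7.0, 2.0],
]
selected = top_variable_genes(toy_expr, ["geneA", "geneB", "geneC"], k=2)
print(selected)
```

With the abstract's numbers, the same call would use k=500 on an 801 x 20,531 matrix; a vectorized array library would be the more practical choice at that scale.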
The augmented framework achieves high classification accuracy (~98%) on a held-out test set, substantially outperforming a classifier trained on the original data alone. We present detailed experimental results, including VAE training curves, classifier performance metrics (ROC curves and confusion matrix), and architecture diagrams to illustrate the approach. The results demonstrate that cVAE-based synthetic augmentation can significantly improve pan-cancer prediction performance, especially for underrepresented cancer classes.
Published: 2025-08-02T16:57:31Z
Vinil Polepalli

http://arxiv.org/abs/2504.19034v3
On learning functions over biological sequence space: relating Gaussian process priors, regularization, and gauge fixing
Updated: 2025-08-01T20:40:09Z
Mappings from biological sequences (DNA, RNA, protein) to quantitative measures of sequence functionality play an important role in contemporary biology. We are interested in the related tasks of (i) inferring predictive sequence-to-function maps and (ii) decomposing sequence-function maps to elucidate the contributions of individual subsequences. Because each sequence-function map can be written as a weighted sum over subsequences in multiple ways, meaningfully interpreting these weights requires ``gauge-fixing,'' i.e., defining a unique representation for each map. Recent work has established that most existing gauge-fixed representations arise as the unique solutions to $L_2$-regularized regression in an overparameterized ``weight space'' where the choice of regularizer defines the gauge. Here, we establish the relationship between regularized regression in overparameterized weight space and Gaussian process approaches that operate in ``function space,'' i.e., the space of all real-valued functions on a finite set of sequences. We disentangle how weight space regularizers both impose an implicit prior on the learned function and restrict the optimal weights to a particular gauge.
We show how to construct regularizers that correspond to arbitrary explicit Gaussian process priors combined with a wide variety of gauges and characterize the implicit function space priors associated with the most common weight space regularizers. Finally, we derive the posterior distribution of a broad class of sequence-to-function statistics, including gauge-fixed weights and multiple systems for expressing higher-order epistatic coefficients. We show that such distributions can be efficiently computed for product-kernel priors using a kernel trick.
Published: 2025-04-26T22:00:42Z
Samantha Petti, Carlos Martí-Gómez, Justin B. Kinney, Juannan Zhou, David M. McCandlish

http://arxiv.org/abs/2507.23048v1
CARTEpigenoQC: A Quality Control Toolkit for CAR-T Single-Cell Epigenomic Data
Updated: 2025-07-30T19:22:23Z
CARTEpigenoQC is an R-based toolkit designed to streamline quality control (QC) for single-cell epigenomic datasets involving Chimeric Antigen Receptor (CAR)-engineered T cells. With the growing application of scATAC-seq, scCUT&Tag, and scBS-seq to characterize CAR-T cell states, it has become critical to perform customized QC that not only addresses standard metrics like FRiP (Fraction of Reads in Peaks) and TSS enrichment, but also directly detects signal from CAR vector insertion sites. CARTEpigenoQC supports both 10x Genomics and non-10x data formats and produces HTML and PNG summary outputs suited for exploratory analysis and regulatory-grade preclinical reporting. It is intended to assist researchers, core facilities, and translational immunologists in ensuring the validity of single-cell epigenomic profiling of engineered T cells.
Published: 2025-07-30T19:22:23Z
5 pages, 2 figures, 2 tables
Kaitao Lai

http://arxiv.org/abs/2507.21706v1
EnTao-GPM: DNA Foundation Model for Predicting the Germline Pathogenic Mutations
Updated: 2025-07-29T11:34:41Z
Distinguishing pathogenic mutations from benign polymorphisms remains a critical challenge in precision medicine.
EnTao-GPM, developed by Fudan University and BioMap, addresses this through three innovations: (1) Cross-species targeted pre-training on disease-relevant mammalian genomes (human, pig, mouse), leveraging evolutionary conservation to enhance interpretation of pathogenic motifs, particularly in non-coding regions; (2) Germline mutation specialization via fine-tuning on ClinVar and HGMD, improving accuracy for both SNVs and non-SNVs; (3) Interpretable clinical framework integrating DNA sequence embeddings with LLM-based statistical explanations to provide actionable insights. Validated against ClinVar, EnTao-GPM demonstrates superior accuracy in mutation classification. It revolutionizes genetic testing by enabling faster, more accurate, and accessible interpretation for clinical diagnostics (e.g., variant assessment, risk identification, personalized treatment) and research, advancing personalized medicine.
Published: 2025-07-29T11:34:41Z
Zekai Lin, Haoran Sun, Yucheng Guo, Yujie Yang, Yanwen Wang, Bozhen Hu, Chonghang Ye, Qirong Yang, Fan Zhong, Xiaoming Zhang, Lei Liu