https://arxiv.org/api/uF0tTOH+gaHPKABhHLy+cCY74wo
2026-06-13T16:18:26Z
38486015

http://arxiv.org/abs/2605.22838v1
Detecting and Correcting Sample-by-Sample Scale Distortion in RNA Sequencing Data
2026-05-10T21:00:07Z
RNA sequencing (RNA-seq) is the conventional genome-scale approach used to capture the expression levels of all detectable genes in a biological sample. This is now regularly used for population-based studies designed to identify genetic determinants of various diseases. Naturally, the accuracy of these tests should be verified and improved if possible. In this study, we aimed to detect and correct for expression-level-dependent errors that vary from sample to sample and are not corrected by conventional normalization techniques. We examined several RNA-seq datasets from the Cancer Genome Atlas (TCGA), Stand Up 2 Cancer (SU2C), and GTEx databases with various types of preprocessing. By applying local averaging, we found sample-by-sample, expression-level-dependent biases in all datasets studied. Using simulations, we show that these biases corrupt gene-gene correlation estimations and $t$ tests between subpopulations. To mitigate these biases, we introduce two different nonlinear transforms based on statistical considerations that correct these observed biases. We demonstrate that these transforms effectively remove the observed per-sample biases, reduce sample-to-sample variance, and improve the characteristics of gene-gene correlation distributions. Using a novel simulation methodology that creates controlled differences between subpopulations, we show that these transforms reduce variability and increase the sensitivity of two-population tests. The improvements in sensitivity and specificity were of the order of 3-5\% in most instances after the data was corrected for bias.
Altogether, these results improve our capacity to understand gene-gene relationships, and may lead to novel ways to utilize the information derived from clinical tests.
2026-05-10T21:00:07Z
25 pages, 17 figures
BMC bioinformatics 26.1 (2025): 32
Christopher Thron, Farhad Jafari
10.1186/s12859-025-06041

http://arxiv.org/abs/2605.11022v1
SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification
2026-05-10T16:52:39Z
Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC-AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task.
Nonetheless, our Siamese MLP achieves a ROC-AUC of 0.71, competitive with state-of-the-art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.
2026-05-10T16:52:39Z
Akarsh Gupta, Kenneth Rodrigues, Sagnik Chatterjee

http://arxiv.org/abs/2605.06226v2
A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization
2026-05-09T22:50:37Z
Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians with an improvement from 12%-60% and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases.
Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.
2026-05-07T13:19:42Z
32 pages, 6 figures
Tianyu Liu, Wangjie Zheng, Rui Yang, Benny Kai Guo Loo, Hui Zhang, Jeffries Lauran, Jianlei Gu, Botao Yu, Weihao Xuan, Kexin Huang, Nan Liu, James Zou, Yonghui Jiang, Hua Xu, Hongyu Zhao

http://arxiv.org/abs/2605.08815v1
MicroFuse: Protein-to-Genome Expert Fusion for Microbial Operon Reasoning
2026-05-09T09:07:11Z
Predicting microbial operon co-membership requires integrating two complementary biological signals: protein-scale molecular identity and genome-context organization. While recent biological foundation models provide powerful representations of each view independently, naive concatenation of these modalities ignores a key biological property -- protein identity and genomic context may agree when adjacent genes form a coherent functional module, or conflict when sequence similarity is misleading but genomic layout indicates independent regulation. We present MicroFuse, a protein-to-genome expert fusion framework that integrates structure-aware protein representations from ProstT5 with genome-context representations from Bacformer through a four-expert Mixture-of-Experts module (protein, genome-context, agreement, and conflict experts) with a learned soft router. Training combines binary cross-entropy with symmetric cross-modal InfoNCE alignment and disagreement-weighted supervised contrastive shaping. We further construct OG-Operon100K, a 100,000-pair scaffold-level benchmark from the OMG metagenomic corpus with biologically grounded positive and negative criteria. On OG-Operon100K, MicroFuse achieves the strongest AUROC, AUPRC, mAP, and mAR among ProstT5-only, Bacformer-only, and Concat MLP baselines.
Ablations identify cross-modal contrastive alignment as the dominant component, and a hard sequence-conflict subset reveals MicroFuse's largest gains precisely in biologically ambiguous cases where protein identity alone is misleading.
2026-05-09T09:07:11Z
Seungik Cho

http://arxiv.org/abs/2602.12286v2
Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer
2026-05-08T11:20:50Z
A central challenge in developing Multimodal Large Language Models (MLLMs) is effectively integrating heterogeneous inputs into a cohesive reasoning engine. Current paradigms predominantly rely on modular architectures that introduce modality-specific encoders and cross-modal fusion mechanisms. However, these designs are fundamentally bottlenecked by a geometric modality gap, forcing the LLM to expend significant computational capacity on geometric reconciliation rather than deep cross-modal reasoning. In this work, we formally characterize this modality gap and theoretically demonstrate that native architectures, specifically those employing a unified vocabulary, intrinsically maintain a zero-gap state across all hidden layers. Guided by these theoretical findings, we propose \textit{One Tokenizer}, a native architecture that maps all modalities directly into a shared token space. We empirically validate this framework on a DNA--text multimodal testbed.
Our extensive evaluations reveal that by achieving seamless integration within the LLM's native latent space, One Tokenizer consistently outperforms encoder-based modular counterparts, providing a fundamentally superior framework for deep biological reasoning.
2026-01-21T07:46:36Z
Under review at NeurIPS 2026
Yanan Li, Christina Yi Jin, Yuan Jin, Manli Luo, Tie Xu, Shuai Jiao, Wei He, Qing Zhang

http://arxiv.org/abs/2605.06762v1
A Linear-Transformer Hybrid for SNP-Based Genotype-to-Phenotype Prediction in Grapevine
2026-05-07T17:32:48Z
Robust genotype-to-phenotype (G2P) prediction is essential for accelerating breeding decisions and genetic gain. However, it remains challenging to measure complex traits under variable field conditions and across years. In this study, we propose a linear-Transformer approach, LiT-G2P (Linear-Transformer Genotype-to-Phenotype), an automated predictive framework that integrates additive genetic variance effects with Transformer-based nonlinear interactions using genome-wide single-nucleotide polymorphism (SNP) data. We evaluated LiT-G2P on a panel of diverse grape accessions, genotyped with SNP markers and measured for phenotypes across two consecutive years. Target phenotypic traits include leaf hair density and trichome density of grapevines. Across both single-year and cross-year testing scenarios, LiT-G2P consistently improves prediction performance compared with baseline models. For hair density, LiT-G2P achieves the lowest error in both single-year and cross-year evaluations, with RMSEs of 0.469 and 0.454, respectively, while maintaining strong tolerance accuracies of 79.2% and 74.6%, respectively. For trichome density, LiT-G2P also presents the best overall G2P performance. In addition, we extract model-prioritized SNPs from attention weights and apply genotype-stratified analysis to provide interpretable candidate markers for downstream validation.
These results demonstrate that integrating stable additive effects with learned interaction patterns can enhance cross-year robustness and support practical SNP-based predictive modeling for genomic selection.
2026-05-07T17:32:48Z
15 pages, 4 Figures
Yibin Wang, Murukarthick Jayakodi, Silvas Kirubakaran, Ambika Chandra, Azlan Zahid

http://arxiv.org/abs/2605.06562v1
Feature Dimensionality Outweighs Model Complexity in Breast Cancer Subtype Classification Using TCGA-BRCA Gene Expression Data
2026-05-07T16:55:46Z
Accurate classification of breast cancer subtypes from gene expression data is critical for diagnosis and treatment selection. However, such datasets are characterized by high dimensionality and limited sample size, posing challenges for machine learning models.
In this study, we evaluate the impact of model complexity and feature selection on subtype classification performance using TCGA-BRCA gene expression data. Logistic regression, random forest, and support vector machine (SVM) models were trained using varying numbers of highly variable genes (50 to 20,518). Performance was evaluated using stratified 5-fold cross-validation and assessed with accuracy and macro F1 score. While all models achieved high accuracy, macro F1 analysis revealed substantial differences in subtype-level performance. Logistic regression demonstrated the most stable and balanced performance across subtypes, including improved detection of rare classes. Random forest underperformed on minority subtypes despite strong overall accuracy, while SVM showed sensitivity to feature dimensionality. These findings highlight the importance of model simplicity, evaluation metrics, and feature selection in high-dimensional biological classification tasks.
2026-05-07T16:55:46Z
8 pages, 4 figures, 3 tables. Independent research study using TCGA-BRCA RNA-seq data
Meena Al Hasani

http://arxiv.org/abs/2602.01839v2
DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis
2026-05-07T14:37:44Z
Recently, data-centric AI methodology has been a dominant paradigm in single-cell transcriptomics analysis, which treats data representation rather than model complexity as the fundamental bottleneck. In the review of current studies, earlier sequence methods treat cells as independent entities and adapt prevalent ML models to analyze their directly inherited sequence data. Despite their simplicity and intuition, these methods overlook the latent intercellular relationships driven by the functional mechanisms of biological systems and the inherent quality issues of the raw sequencing data. Therefore, a series of structured methods has emerged.
Although they employ various heuristic rules to capture intricate intercellular relationships and enhance the raw sequencing data, these methods often neglect biological prior knowledge. This omission incurs substantial overhead and yields suboptimal graph representations, hindering the utility of ML models.
To address these issues, we propose DOGMA, a data-centric framework designed for the structural reshaping and semantic enhancement of raw data through multi-level biological prior knowledge. Transcending reliance on purely data-driven heuristics, DOGMA provides a prior-guided graph construction pipeline that integrates statistical alignment with Cell Ontology and phylogenetic structure for biologically grounded cell-graph construction and robust cross-species alignment. Furthermore, Gene Ontology is utilized to bridge the feature-level semantic gap by incorporating functional priors. In complex multi-species and multi-organ benchmarks, DOGMA exhibits strong robustness in strict zero-shot cell-type evaluation and sample efficiency while using substantially lower GPU memory and inference time in downstream evaluation.
2026-02-02T09:10:09Z
34 pages, 4 figures
Ru Zhang, Xunkai Li, Yaxin Deng, Sicheng Liu, Daohan Su, Qiangqiang Dai, Hongchao Qin, Rong-Hua Li, Guoren Wang, Jia Li

http://arxiv.org/abs/2605.06728v1
OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning
2026-05-07T11:27:11Z
Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks. OmicsLM represents each transcriptomic profile as a compact continuous representation within the LLM context. This interface preserves quantitative expression signal while allowing natural-language instructions, explicit gene mentions, and multiple interleaved biological samples to be processed together in one model context.
We train OmicsLM on more than 5.5 million instruction-following examples spanning over 70 task types, combining continuous transcriptomic inputs, experimental data rendered through diverse language templates, and free-text biological knowledge and question-answering data. This mixture covers cell type annotation, perturbation prediction, clinical prediction, pathway reasoning, and open-ended biological question answering. Existing benchmarks evaluate either profile-level prediction or text-only biological QA, leaving language-guided, multi-sample reasoning over real expression profiles unmeasured. To close this gap, we introduce GEO-OmicsQA, a benchmark for multi-sample biological question answering built from real Gene Expression Omnibus (GEO) studies. We demonstrate that OmicsLM can use expression profiles directly and perform comparably to specialized omics models on profile-level tasks, while outperforming both omics-specialized models and general LLMs on language-guided biological reasoning over expression data.
2026-05-07T11:27:11Z
13 pages (main text), 14 pages (appendix), 1 figure, 10 tables
Maciej Sypetkowski, Joanna Krawczyk, Łukasz Smoliński, Remigiusz Kinas, Przemysław Pietrzak, Tomasz Jetka, Rafał Powalski

http://arxiv.org/abs/2601.22866v2
Classification of SARS-CoV-2 Variants through The Epistatical Circos Plots with Convolutional Neural Networks
2026-05-06T17:25:58Z
The COVID-19 pandemic has profoundly affected global health, driven by the remarkable transmissibility and mutational adaptability of the SARS-CoV-2 virus. Although five variants of concern, Alpha, Beta, Gamma, Delta, and Omicron, have been identified, the classification task in this study is formulated using four classes: Alpha, Delta, Omicron, and Else, reflecting the sequence availability and temporal coverage of the dataset.
Here, we develop an integrative framework that combines direct coupling analysis (DCA), Circos-based visualization, and convolutional neural networks (CNNs) to characterize lineage-specific epistatic signatures from large-scale SARS-CoV-2 genomic sequences. DCA-inferred pairwise mutational couplings were transformed into Circos images, which were then used as inputs for CNN-based classification models. The proposed framework achieved robust variant classification, with the best-performing model reaching a weighted-average F1-score of $98.68\pm 0.75\%$ and an AUC close to 1.
2026-01-30T11:44:53Z
Bo Jing, Kai-Rui Zhang, Hong-Li Zeng, Erik Aurell

http://arxiv.org/abs/2605.04930v1
When Does Gene Regulatory Network Inference Break? A Controlled Diagnostic Study of Causal and Correlational Methods on Single-Cell Data
2026-05-06T13:55:59Z
Despite theoretical advantages, causal methods for Gene Regulatory Network (GRN) inference from single-cell RNA-seq data consistently fail to match or outperform correlation-based baselines in many realistic benchmarks, a persistent puzzle which casts doubt on the value of causality for this task. We argue that existing benchmarks are insufficiently controlled to answer this question because they evaluate on real or semi-real data where multiple pathologies co-occur, confounding failure modes, and obscuring the specific conditions under which different inference methods excel or fail. To address this gap, we introduce a controlled diagnostic framework that isolates seven biologically motivated pathologies (dropout, latent confounders, cell-type mixing, feedback loops, network density, sample size, and pseudotime drift) and measure how six representative methods spanning three inference paradigms degrade as each pathology intensifies.
Across 6,120 controlled experiments, we find that causal methods genuinely dominate in clean and structurally favorable regimes, but specific pathologies (notably dropout and latent confounders) selectively neutralize their advantages. We further introduce an error-type decomposition that reveals methods with similar aggregate accuracy commit qualitatively different errors. To probe whether single-pathology effects persist when multiple stressors co-occur, we perform an interaction sweep over the three most impactful pathologies and find that their joint effects are sub-additive, while also exposing density-conditional cross-overs invisible to single-dial analysis. Our findings offer a nuanced understanding of when and why different methods succeed or fail for GRN inference, providing actionable insights for method development and practical guidance for practitioners.
2026-05-06T13:55:59Z
19 pages, 10 figures
Miguel Fernandez-de-Retana, Ruben Sanchez-Corcuera, Unai Zulaika, Aritz Bilbao-Jayo, Aitor Almeida

http://arxiv.org/abs/2603.25762v2
QHap: Quantum-Inspired Haplotype Phasing
2026-05-06T07:06:26Z
Haplotype phasing, the process of resolving parental allele inheritance patterns in diploid genomes, is critical for precision medicine and population genetics, yet the underlying optimization is NP-hard, posing a scalability challenge. To address this, we introduce QHap, a haplotype phasing algorithm that leverages quantum-annealing-inspired optimization. By reformulating haplotype phasing as a Max-Cut problem and deploying a GPU-accelerated ballistic simulated bifurcation solver, QHap accelerates phasing while maintaining accuracy comparable to established phasing tools. On the highly polymorphic human major histocompatibility complex region, QHap demonstrates 4- to 20-fold acceleration over HapCUT2 and WhatsHap with zero switch error across multiple long-read sequencing platforms.
The framework implements two strategies: a read-based method for regional phasing, and a single nucleotide polymorphism-based method that, through quality-weighted probabilistic edge construction, efficiently scales to chromosome-scale tasks. Integration of Pore-C chromatin conformation capture data increases the haplotype N50 by up to 15-fold, enabling near-chromosome-scale haplotype reconstruction. QHap demonstrates that quantum-inspired algorithms operating on classical hardware offer a promising approach to addressing the growing computational demands of sequencing data, establishing a new paradigm for applying physics-inspired optimization to fundamental challenges in computational genomics.
2026-03-26T03:06:08Z
15 pages, 6 figures
Rui Zhang, Xian-Zhe Tao, Yibo Chen, Jiawei Zhang, Lei He, Dongming Fang, Lin Yang, Yuhui Sun, Qinyuan Zheng, Xinmeng Shi, Yang Zhou, Wanyi Chen, Chentao Yang, Man-Hong Yung, Jun-Han Huang

http://arxiv.org/abs/2605.02248v1
Statistics of a multi-factor function from its Fourier transform
2026-05-04T05:44:30Z
For a phenomenon $\boldsymbol{f}$ that is a function of $n$ factors, defined on a finite abelian group $G$, we derive its population statistics solely from its Fourier transform $\hat{\boldsymbol{f}}$. Our main result is an $m$-Coefficient/Index Annihilation Theorem: the $m$th moment of $\boldsymbol{f}$ becomes a series of terms, each with precisely $m$ Fourier coefficients --- and surprisingly, the coefficient indices in each term sum to zero under group addition. This condition acts like a filter, limiting which terms appear in the Fourier domain, and can reveal deeper relationships between the variables driving $\boldsymbol{f}$. These techniques can also be used as an analytical/design tool, or as a feasibility constraint in search algorithms. For functions defined on $\mathbb{Z}_2^n$, we show how the skew, kurtosis, etc. of a binomial distribution can be derived from the Fourier domain.
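The index-sum-to-zero condition stated in this abstract can be checked numerically on a small case. The following is a minimal sketch, not the authors' code: it assumes $G = \mathbb{Z}_2^n$ and the normalization $\hat{f}(s) = 2^{-n}\sum_x f(x)(-1)^{s\cdot x}$ (the paper may use a different convention), and all function names are hypothetical. Under that assumption, the $m$th moment of $f$ over a uniform population equals the sum of products of $m$ Fourier coefficients whose indices XOR to zero.

```python
import itertools


def walsh_hadamard(values, n):
    """Fourier coefficients on Z_2^n (assumed normalization):
    f_hat[s] = 2**-n * sum_x f(x) * (-1)**popcount(s & x)."""
    size = 1 << n
    return [
        sum(values[x] * (-1) ** bin(s & x).count("1") for x in range(size)) / size
        for s in range(size)
    ]


def moment_direct(values, m):
    """m-th raw moment of f over the uniform population on Z_2^n."""
    return sum(v ** m for v in values) / len(values)


def moment_fourier(f_hat, m):
    """m-th raw moment computed only from Fourier coefficients.

    The first m-1 indices are free; the last index is forced to be their
    XOR, so the m indices sum to zero under addition in Z_2^n.
    """
    size = len(f_hat)
    total = 0.0
    for free_indices in itertools.product(range(size), repeat=m - 1):
        last = 0
        product = 1.0
        for s in free_indices:
            last ^= s
            product *= f_hat[s]
        total += product * f_hat[last]
    return total


if __name__ == "__main__":
    n = 3
    # Arbitrary illustrative values of f on the 8 points of Z_2^3.
    f_values = [0.5, 1.0, -2.0, 3.0, 0.0, 1.5, -1.0, 2.5]
    f_hat = walsh_hadamard(f_values, n)
    for m in range(1, 5):
        print(m, moment_direct(f_values, m), moment_fourier(f_hat, m))
```

For $m = 2$ the constraint forces both indices to be equal, which recovers Parseval's identity; higher moments only add more constrained index tuples.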
Several other examples are presented.
2026-05-04T05:44:30Z
Submitted to the Journal of Fourier Analysis and Applications. 42 pages, 6 figures
Matthew A. Herman, Stephen Doro

http://arxiv.org/abs/2605.02142v1
ORBIT: Learning Gene Program Co-Activation Structure for Cell-Type-Stratified Pathway Rewiring Analysis in Single-Cell Transcriptomics
2026-05-04T01:50:51Z
Gene programs co-activate within cells, but existing single-cell methods either treat programs independently or require experimental perturbation data to model their interactions. We introduce ORBIT, a self-supervised transformer that learns asymmetric dependencies among gene programs from observational single-cell RNA-sequencing data alone, quantifying how strongly each program influences every other program. The key mechanism is an intervention-consistent training objective: the model learns each program's directional influence on every other program by predicting how the others change when that program is removed, yielding attention weights that reflect asymmetric influence rather than symmetric co-occurrence. Applied to 191,890 prefrontal cortex nuclei across three pathway vocabularies, ORBIT recovers co-activation structure consistent with established Alzheimer's disease vulnerability signatures, identifies cell-type-specific rewiring invisible to differential expression, and achieves 0.984 macro F1 on cell-type classification from 220 pathway scores, which is within 0.3 points of a state-of-the-art classifier using all 22,088 genes.
2026-05-04T01:50:51Z
20 pages, 7 figures
Yuechen Wang, Lina Jia, Qinglong Wang, Feng Tian

http://arxiv.org/abs/2605.02954v1
EFGPP: Exploratory framework for genotype-phenotype prediction
2026-05-02T12:42:40Z
Predicting complex human traits from genetic data is challenging because different genetic, clinical, and molecular data sources often contain different parts of the signal.
Here, we present EFGPP, a reproducible framework for generating, ranking, and combining multiple types of data for genotype-to-phenotype prediction. We applied EFGPP to migraine prediction using UK Biobank data from 733 individuals. The framework combined genotype-derived features, principal components, clinical and metabolomic covariates, and polygenic risk scores generated from migraine and depression GWAS using PLINK, PRSice-2, AnnoPred, and LDAK-GWAS. The best single data type achieved a test AUC of 0.644, while combining multiple data types improved performance to 0.688 using migraine-focused inputs and 0.663 using cross-trait depression-derived inputs. Genetic features alone did not outperform the covariates-only baseline, but genotype-derived features performed better than PRS alone, and depression-derived PRS showed useful predictive signal. Overall, EFGPP provides a practical proof-of-concept framework for prioritising and integrating heterogeneous genetic data sources for complex phenotype prediction.
2026-05-02T12:42:40Z
https://github.com/MuhammadMuneeb007/EFGPP
Muhammad Muneeb, David B. Ascher