http://arxiv.org/abs/2605.01378v1 PhenotypeToGeneDownloaderR: automated multi-source retrieval and validation of phenotype-associated genes 2026-05-02T11:01:35Z

Identifying phenotype-associated genes is a common first step in polygenic risk score construction, enrichment testing, target prioritisation and variant interpretation, but relevant evidence is distributed across heterogeneous databases with different interfaces, formats and evidence models. Here, we present PhenotypeToGeneDownloaderR, a phenotype-guided R/Python pipeline for automated gene retrieval, harmonisation, symbol validation and cross-source summary analysis. Given a phenotype term, the pipeline queries integrated biological databases, standardises per-source outputs, combines gene lists, validates retrieved symbols against the NCBI human gene reference and generates summary tables and visualisations. Across 13 clinically relevant phenotypes and 13 databases, PhenotypeToGeneDownloaderR generated 136,487 raw gene retrievals, with at least one source returning genes for every phenotype. Across all 13 phenotypes, 100,175 of 114,345 combined input symbols were retained after direct or synonym-based validation, corresponding to an 87.6\% validation rate. Cross-source overlap was low, supporting the complementarity of integrated evidence sources. Against an HPO/ClinVar/OMIM-derived gold standard, the pipeline recovered 1,039 of 1,056 known phenotype-associated genes, corresponding to 98.4\% recall. PhenotypeToGeneDownloaderR provides a lightweight, reproducible upstream framework for generating candidate gene sets for downstream prioritisation and interpretation. The pipeline is implemented in R and Python, released under the MIT licence, and available at https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR.
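
The two headline percentages in this abstract follow directly from the reported counts. A minimal sketch that recomputes them; the helper name `percentage` is illustrative and is not part of the pipeline:

```python
def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Counts taken from the abstract above.
validation_rate = percentage(100_175, 114_345)  # validated / combined input symbols
recall = percentage(1_039, 1_056)               # recovered / gold-standard genes

print(f"validation rate: {validation_rate}%")
print(f"recall: {recall}%")
```

Both values agree with the reported figures once rounded to one decimal place.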

2026-05-02T11:01:35Z https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR Muhammad Muneeb David B. Ascher http://arxiv.org/abs/2603.06768v2 Benchmarking end-to-end genotype-to-phenotype prediction workflows across 80 openSNP phenotypes 2026-05-02T10:59:05Z

Genotype-to-phenotype prediction is a central goal of statistical genetics, yet practical comparisons of prediction workflows remain limited in small, heterogeneous, participant-shared genomic datasets. Here, we benchmarked end-to-end case-control prediction across 80 curated binary phenotypes from openSNP using machine learning, deep learning, and polygenic score workflows. We evaluated 29 machine-learning algorithms, 80 deep-learning model variants, and 3 polygenic score tools across 675 clumping and pruning configurations. No workflow family dominated universally. Polygenic score workflows achieved the highest observed discrimination for 53 phenotypes, whereas machine-learning or deep-learning workflows achieved the highest for 27. However, many apparent phenotype-level wins were modest, with 41.2\% of comparisons representing practical ties within five discrimination points. Performance was strongly phenotype-dependent and sensitive to modeling and preprocessing choices. Distinct workflow-specific failure modes were also observed, including unstable behaviour in PRSice and non-informative collapse in lassosum for 13 phenotypes. Higher peak performance was concentrated in smaller phenotypes, reinforcing the need for cautious interpretation in limited-data settings. The cohort was predominantly of European ancestry, restricting generalisability. Together, these results position openSNP as a useful stress-test environment for genomic prediction and support benchmark-guided workflow selection under realistic conditions of data scarcity, phenotype heterogeneity, and ancestry imbalance.

2026-03-06T16:36:42Z Muhammad Muneeb David B. Ascher YooChan Myung Samuel F. Feng Andreas Henschel http://arxiv.org/abs/2605.00930v1 CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation 2026-04-30T21:34:16Z

In this work, we introduce CellxPert, a scalable multimodal foundation model that unifies single-cell and spatial multi-omics within a common representation space. CellxPert jointly encodes transcriptomic (scRNA-seq), chromatin-accessibility (ATAC-seq), and surface-proteomic (CITE-seq) measurements, while directly incorporating MERFISH and imaging mass-cytometry data as 2D or 3D spatial-visual layers. CellxPert facilitates four key downstream tasks out of the box: (i) cell-type annotation across a broad ontology of 154 largely overlapping identities -- the largest label space addressed to date and a stringent test of fine-grained discrimination, (ii) efficient fine-tuning using Low Rank Adaptation (LoRA), (iii) genome-wide transcriptomic response prediction to in-silico perturbations (ISP), and (iv) seamless multi-omic integration across various assays and platforms. Unlike current single-cell foundation models, which approximate gene perturbations by deleting or reordering tokenized gene expression ranks, CellxPert employs a Metropolis-Hastings sampler whose proposal kernel uses the model's masked conditional distributions to transition to new transcriptomic states conditioned on the perturbed genes. This Markov-chain procedure mitigates out-of-distribution artifacts introduced by abrupt token manipulation and produces trajectories that are biologically interpretable. Evaluations on PBMC68K, Replogle Perturb-seq, Systema, and BMMC benchmarks show that CellxPert surpasses classical and state-of-the-art baselines in cell-type annotation, perturbation response prediction, and multi-omic integration.

2026-04-30T21:34:16Z ICLR Machine Learning for Genomics Explorations Workshop 2026 Andac Demir Erik W. Anderson Jeremy L. Jenkins Srayanta Mukherjee http://arxiv.org/abs/2605.00074v1 CRC-Screen: Certified DNA-Synthesis Hazard Screening Under Taxonomic Shift 2026-04-30T13:13:16Z

DNA-synthesis providers screen incoming orders by searching the requested sequence against curated hazard lists. We show that this baseline collapses to a 100% false-flag rate when the hazardous sequence comes from a taxonomic family absent from the reference set: under Conformal Risk Control's certified miss-rate constraint, a low-discrimination signal forces the threshold below the entire test-benign mass. We compose three signals derived from a synthesis order's public annotation: $k$-mer Jaccard similarity to known toxins, the trimmed-mean score of a five-LLM judge panel, and cosine similarity to clustered embedding centroids. Fused under a monotone logistic aggregator and calibrated by Conformal Risk Control, the resulting screener certifies $\mathbb{E}[\mathrm{FNR}] \le α$. Across ten leave-one-taxonomic-family-out folds at $α=0.05$ on UniProt KW-0800 reviewed toxins, the calibrated screener achieves 0% test miss rate on every fold and 0% test false-flag rate on nine of ten folds. The bound's finite-sample slack $1/(n_{\mathrm{cal}}+1)$ caps the certifiable miss rate at 1.77% on our 200-hazard subsample; reaching procurement-grade $α=10^{-3}$ requires an $18\times$ larger calibration set, which the full reviewed UniProt KW-0800 corpus is large enough to deliver. The binding constraint on certifiable DNA-synthesis screening is calibration data, not algorithms. Code: https://github.com/najmulhasan-code/crc-screen
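
The calibration step can be sketched as follows, assuming the standard Conformal Risk Control rule for a loss bounded by 1: pick the most permissive threshold whose inflated empirical miss rate, (misses + 1) / (n_cal + 1), stays at or below α. The function names and the toy scores are hypothetical; this is not code from the crc-screen repository.

```python
import math


def crc_threshold(hazard_scores, alpha):
    """Return the largest flagging threshold t such that
    (number of calibration hazards with score < t + 1) / (n + 1) <= alpha.

    An order is flagged when its fused score is >= t, so a hazard is
    missed when its score falls below t. Returns None when even zero
    misses cannot satisfy the bound, i.e. when 1 / (n + 1) > alpha.
    """
    n = len(hazard_scores)
    if 1.0 / (n + 1) > alpha:
        return None  # calibration set too small for this alpha
    allowed_misses = math.floor(alpha * (n + 1)) - 1
    ordered = sorted(hazard_scores)
    # At most `allowed_misses` scores lie strictly below ordered[allowed_misses].
    return ordered[allowed_misses]


def min_calibration_size(alpha):
    """Smallest n with 1 / (n + 1) <= alpha (small epsilon guards float noise)."""
    return math.ceil(1.0 / alpha - 1e-9) - 1


# Toy calibration scores for 100 hypothetical hazards.
scores = [i / 100 for i in range(1, 101)]
threshold = crc_threshold(scores, alpha=0.05)
print(threshold, min_calibration_size(1e-3))
```

Under this rule the slack term alone requires at least 999 calibration hazards before α = 10^{-3} becomes certifiable, which is why calibration-set size, rather than the scoring signal, sets the floor on the achievable miss rate.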

2026-04-30T13:13:16Z 12 pages, 5 figures, 1 table. Code: https://github.com/najmulhasan-code/crc-screen Najmul Hasan http://arxiv.org/abs/2604.26942v1 Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport 2026-04-29T17:52:05Z

We introduce Hyper Input Convex Neural Networks (HyCNNs), a novel neural network architecture designed for learning convex functions. HyCNNs combine the principles of Maxout networks with input convex neural networks (ICNNs) to create a neural network that is always convex in the input, is theoretically capable of leveraging depth, and performs more reliably than ICNNs when trained at scale. Concretely, we prove that HyCNNs require exponentially fewer parameters than ICNNs to approximate quadratic functions up to a given precision. Through a series of synthetic experiments, we demonstrate that HyCNNs outperform existing ICNNs and MLPs in terms of predictive performance for convex regression and interpolation tasks. We further apply HyCNNs to learn high-dimensional optimal transport maps for synthetic examples and for single-cell RNA sequencing data, where they often outperform ICNN-based neural optimal transport methods and other baselines across a wide range of settings.
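
The convexity guarantee rests on a basic fact that Maxout units exploit: the pointwise maximum of affine functions is convex. The sketch below checks that property numerically for a single max-affine unit with random parameters. It illustrates the underlying principle only and is not the HyCNN architecture; all names are hypothetical.

```python
import random


def max_affine(x, weights, biases):
    """Maxout-style unit: the pointwise maximum of affine maps w.x + b."""
    return max(
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )


rng = random.Random(0)
dim, pieces = 3, 5
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pieces)]
biases = [rng.uniform(-1, 1) for _ in range(pieces)]


def convexity_gap(x, y, t):
    """t*f(x) + (1-t)*f(y) - f(t*x + (1-t)*y); non-negative when f is convex."""
    mix = [t * a + (1 - t) * b for a, b in zip(x, y)]
    return (
        t * max_affine(x, weights, biases)
        + (1 - t) * max_affine(y, weights, biases)
        - max_affine(mix, weights, biases)
    )


# Evaluate the convexity inequality on random point pairs and mixing weights.
gaps = []
for _ in range(1000):
    x = [rng.uniform(-5, 5) for _ in range(dim)]
    y = [rng.uniform(-5, 5) for _ in range(dim)]
    gaps.append(convexity_gap(x, y, rng.random()))

print(min(gaps) >= -1e-9)
```

The gap is non-negative (up to floating-point error) for every sampled pair, as the convexity inequality requires.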

2026-04-29T17:52:05Z 65 pages, 13 figures, the first two authors contributed equally Shayan Hundrieser Insung Kong Johannes Schmidt-Hieber http://arxiv.org/abs/2604.25986v1 Robust Clustering Analysis of Genes Related to Age-related Macular Degeneration using RNA-Seq 2026-04-28T17:56:36Z

Identifying genes associated with diseases is crucial to understanding disease mechanisms and developing therapies. However, identification of individual genes associated with a disease often needs to be supplemented with clustering analysis to understand the relationships between genes and identify gene modules beyond individual gene-level relationships. Gene co-expression networks are widely used as a graph theoretic approach to the clustering analysis of genes. In our work, we perform robust clustering analysis on RNA-Seq data of Age-related Macular Degeneration (AMD) patients and controls by generalizing one such framework, Multiscale Embedded Gene Co-Expression Network Analysis (MEGENA). We propose a carefully curated set of module quality evaluation metrics to choose appropriate statistical distance-based or information theoretic similarity measures over simple linear correlation to represent the similarities between genes. Furthermore, we design and implement a stability test to ensure the robustness of the detected hub genes in the presence of noise. Finally, we propose differential module eigengene analysis for a deeper understanding of upregulation and downregulation of each module with respect to the disease and control groups for a comprehensive understanding of the clustering analysis. Besides detecting robust hub genes and modules that are supported by prior findings, we also identify previously undiscovered hub genes that can potentially lead to further biomedical research into understanding the AMD disease mechanism and developing new treatments.

2026-04-28T17:56:36Z Brayan Gutierrez Rinki Ratnapriya Arko Barman http://arxiv.org/abs/2604.25233v1 A Combinatorial Optimisation Approach to Multi-factorial Gap-filling in Genome-scale Metabolic Models (GEMs) 2026-04-28T05:29:03Z

Genome-Scale Metabolic Models (GEMs) describe the interactions between genes, proteins, and the biochemical reactions that underpin an organism's metabolism, aiming to computationally simulate functions at the cellular level. While many metabolic reactions can be inferred from genome analysis, constructing GEMs often involves incorporating reactions unsupported by genomic data to improve prediction accuracy. This is known as gap-filling, a process that can be performed manually (a time-consuming task) or computationally. Traditional computational gap-filling approaches aim to correct GEM predictions for a single environmental condition (medium) by solving a large Integer Linear Programming problem. Sequential application across multiple media can produce a more robust model, but often introduces unrealistic predictions in other media. These approaches are also slow to run. In this paper, we study multi-factorial gap-filling, which aims to gap-fill GEMs across typically 10 or more input media simultaneously, while improving their overall predictive accuracy and minimising unrealistic behaviour. We view the selection of the set of reactions as a combinatorial optimisation problem, and describe a method based on classic metaheuristic approaches which requires the solution of continuous Linear Programming problems only. This paper provides an introduction to this problem for an audience whose speciality lies outside biology, and suggests a practical first-cut solution method. We demonstrate the method by gap-filling GEMs for three bacterial strains, selecting 3000 to 4000 reactions from a database of more than 11000 reactions, while attempting to match the empirically measured performance on 9 to 28 separate media conditions. We show that our method outperforms conventional approaches on multiple metrics, including Kendall's tau and RMS error, by an average of 7.3% and 13.3%, respectively.
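
The two evaluation metrics named at the end of the abstract can be sketched in a few lines. This assumes the tau-a variant of Kendall's tau (no tie correction), which the abstract does not specify, and the growth values below are made up for illustration.

```python
import math
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, ties count as zero."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def rmse(predicted, observed):
    """Root-mean-square error between predicted and observed values."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical growth measurements on four media versus model predictions.
observed = [0.10, 0.40, 0.35, 0.80]
predicted = [0.15, 0.30, 0.45, 0.70]
print(kendall_tau_a(predicted, observed), rmse(predicted, observed))
```

Kendall's tau rewards a model for ranking media in the right order, while RMS error penalises the size of each miss, so the two capture different aspects of fit across media.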

2026-04-28T05:29:03Z Philip Kilby Sevvandi Kandanaarachchi Matthew J. Morgan Amy M. Paten Mariana Velasque Andrew C. Warden Juan P. Molina Ortiz http://arxiv.org/abs/2604.26975v1 T-cell repertoire response in individuals with post-acute sequelae of COVID-19 2026-04-28T01:41:58Z

T-cells are central to SARS-CoV-2 clearance and immunological memory, yet their contribution to the persistence of post-acute sequelae of COVID-19 (PASC) remains poorly understood. The immunological features that distinguish individuals who develop PASC from those who recover fully are unresolved, in part due to the phenotypic heterogeneity of the condition and the likely multiplicity of its underlying mechanisms. Here, we profiled longitudinal bulk TCR$β$ repertoires from 120 individuals in the INCOV cohort--71 with PASC and 49 without--sampled at two to three time points spanning the acute and post-acute phases of infection. Using robust statistical modeling of repertoire composition and clonal dynamics, we found that global statistics such as V, J gene usage and CDR3 length do not differ between groups, but that locally enriched sequence motifs and differentially dynamic clones reveal distinct T-cell signatures associated with PASC status. Clones contracting following the peak of the acute response were significantly enriched for SARS-CoV-2 specificity in both groups. Interestingly, Influenza A-specific TCRs were disproportionately enriched among contracting clones in PASC{$^+$} repertoires, implicating viral co-infection as a potential contributor to early disease severity and, possibly, PASC pathogenesis. Rare public TCR clones were markedly enriched for SARS-CoV-2 specificity, with PASC{$^+$} individuals harboring a modestly but significantly higher proportion than PASC{$^-$} individuals. Together, we identified over 1,000 candidate TCR$β$ receptors potentially discriminating PASC{$^+$} from PASC{$^-$} immune responses, opening a path toward the identification of disease-relevant T-cell specificities and the development of T-cell-based immunological biomarkers for long COVID.

2026-04-28T01:41:58Z Zachary Montague Rhea M Grover Andrew Baumgartner Assya Trofimov Jennifer Hadlock Armita Nourmohammad http://arxiv.org/abs/2604.00763v2 Non-ignorable fuzziness in granular counts: the case of RNA-seq data 2026-04-27T11:21:59Z

RNA-seq count data are often affected by read-to-gene alignment ambiguity, especially in high-dimensional transcriptomics. This type of ambiguity can be conveniently expressed through granular counts, namely fuzzy-valued observations of latent discrete quantities. We study a class of fuzzy-reporting mechanisms and show that, when reporting exploits graded membership, ignorability fails generically, leading to a coarsening-not-at-random structure. A hierarchical model is then introduced as a tractable instance of this construction and illustrated using RNA-seq data.

2026-04-01T11:26:59Z 10 pages, 1 figure, 0 tables. Note: The compressed source folder contains the Supplementary Materials Statistics & Probability Letters, Elsevier, 2026 Antonio Calcagnì Arianna Consiglio Przemyslaw Grzegorzewski Corrado Mencar 10.1016/j.spl.2026.110808 http://arxiv.org/abs/2604.24201v1 CMGL: Confidence-guided Multi-omics Graph Learning for Cancer Subtype Classification 2026-04-27T09:02:50Z

Motivation: Multi-omics integration can improve cancer subtyping, but modality informativeness and noise vary across cancer types and patients. Existing graph-based methods optimize modality weights jointly with the classification objective and therefore lack independent reliability estimates, so low-quality omics distort patient similarity graphs and amplify noise through message passing. Results: We propose CMGL, a two-stage framework that estimates per-sample modality reliability through evidential deep learning and uses the frozen confidence scores to guide cross-omics fusion and graph construction. On four MLOmics cancer-subtype tasks and the 32-class pan-cancer task, CMGL consistently improves over the strongest baseline, surpassing it by 4.03% in average accuracy on the four single-cancer tasks. Its representations recover the PAM50 intrinsic subtypes of breast invasive carcinoma (BRCA), and the BRCA-trained model transfers without fine-tuning to kidney renal clear cell carcinoma (KIRC), stratifying patients into prognostically distinct groups.
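
To show how evidential deep learning can yield a per-sample reliability score, the sketch below uses the common Dirichlet-based formulation: non-negative class evidence e_k gives Dirichlet parameters α_k = e_k + 1, and the uncertainty mass is K / Σα_k. The abstract does not state CMGL's exact formulation, so treating 1 minus the uncertainty as the confidence score is an assumption here, and the function name is hypothetical.

```python
def evidential_summary(evidence):
    """Map non-negative per-class evidence to (probabilities, uncertainty, confidence).

    alpha_k = evidence_k + 1 parameterises a Dirichlet distribution;
    uncertainty = K / sum(alpha) equals 1 with no evidence and shrinks
    toward 0 as evidence accumulates.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probabilities = [a / strength for a in alpha]
    uncertainty = num_classes / strength
    confidence = 1.0 - uncertainty  # assumed reliability score, not from the source
    return probabilities, uncertainty, confidence


# A modality with strong evidence for class 0 versus one with no evidence at all.
informative = evidential_summary([9.0, 0.0, 0.0])
uninformative = evidential_summary([0.0, 0.0, 0.0])
print(informative[2], uninformative[2])
```

A modality that produces no evidence for a patient receives confidence 0, so a fusion step weighted by such scores could down-weight it without consulting the classification loss.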

2026-04-27T09:02:50Z 24 pages, 15 figures, 13 tables, 2 algorithms (main paper + supplementary materials) Boyang Fan Hengchuang Yin Siyu Yi Yifan Wang Zhicheng Li Leijiyu Zhou Jiancheng Lv Wei Ju http://arxiv.org/abs/2602.04058v2 RareCollab: an LLM-powered framework for multimodal reasoning in Mendelian disease diagnosis 2026-04-27T03:09:26Z

Rare disease diagnosis increasingly relies on integrating genomic, phenotypic and transcriptomic evidence, yet these signals remain difficult to reconcile within a common interpretive framework. Here we present RareCollab, an LLM-powered framework for multimodal reasoning in Mendelian disease diagnosis that integrates more than 100 diagnostic evidence signals across DNA, RNA, phenotype, curated variant-level knowledge, and in-silico pathogenicity evidence. This design enables large language models to operate as calibrated, interpretable reasoning modules rather than as a single end-to-end ranker. We applied RareCollab to 890 patients from three cohorts, including 119 Undiagnosed Diseases Network probands with paired DNA and RNA data, constituting a large systematic benchmark for multimodal rare disease diagnosis under paired genomic and transcriptomic evaluation. In this real-world multimodal benchmark, RareCollab prioritized 94% of diagnostic genes within the top 10. Across recall thresholds from top 1 to top 10, it consistently outperformed proprietary phenotype-driven LLM baselines including Claude Sonnet 4.6 and GPT-5-mini by more than 25% on average and surpassed established state-of-the-art variant prioritization methods by 11%-24%. RareCollab also reshapes the diagnostic contribution of RNA evidence, which contributes to prioritization of the diagnostic gene in 35% of cases (42/119). Together, these results establish RareCollab as a scalable and interpretable framework for multimodal rare disease diagnosis.
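
The "top 1 to top 10" figures are top-k recall values: the fraction of patients whose diagnostic gene appears among the first k ranked candidates. A minimal sketch with placeholder gene labels (the rankings are invented, not RareCollab output), plus a check of the reported RNA contribution rate:

```python
def top_k_recall(rankings, diagnostic_genes, k):
    """Fraction of cases whose diagnostic gene is within the top k ranked genes."""
    hits = sum(
        1
        for ranked, truth in zip(rankings, diagnostic_genes)
        if truth in ranked[:k]
    )
    return hits / len(diagnostic_genes)


# Three hypothetical patients, each with a ranked candidate list.
rankings = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_B", "GENE_A", "GENE_C"],
    ["GENE_C", "GENE_B", "GENE_A"],
]
diagnostic_genes = ["GENE_A", "GENE_A", "GENE_A"]

recalls = [top_k_recall(rankings, diagnostic_genes, k) for k in (1, 2, 3)]
rna_contribution_pct = round(100 * 42 / 119)  # 42 of 119 cases, from the abstract

print(recalls, rna_contribution_pct)
```

Top-k recall can only stay flat or rise as k grows, so comparing methods across the whole range from top 1 to top 10 shows whether an advantage holds at strict cut-offs as well as lenient ones.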

2026-02-03T22:45:39Z Guantong Qi Jiasheng Wang Mei Ling Chong Zahid Shaik Shenglan Li Shinya Yamamoto Maura R. Z. Ruzhnikov Devon E. Bonner Jennefer N. Carter Kevin S. Smith Matthew T. Wheeler Stephen B. Montgomery Jonathan A. Bernstein Sasidhar Pasupuleti Undiagnosed Diseases Network Pengfei Liu Hu Chen Zhandong Liu http://arxiv.org/abs/2509.25346v2 SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction 2026-04-26T19:07:19Z

Predicting cellular responses to genetic perturbations represents a fundamental challenge in systems biology, critical for advancing therapeutic discovery and virtual cell modeling. While large language models (LLMs) show promise for biological reasoning, their application to perturbation prediction remains underexplored due to challenges in adapting them to structured experimental data. We present SynthPert, a novel method that enhances LLM performance through supervised fine-tuning on synthetic reasoning traces generated by frontier models. Using the PerturbQA benchmark, we demonstrate that our approach not only achieves state-of-the-art performance but surpasses the capabilities of the frontier model that generated the training data. Our results reveal three key insights: (1) Synthetic reasoning traces effectively distill biological knowledge even when partially inaccurate, (2) This approach enables cross-cell-type generalization with 87% accuracy on unseen RPE1 cells, and (3) Performance gains persist despite using only 2% of quality-filtered training data. This work shows the effectiveness of synthetic reasoning distillation for enhancing domain-specific reasoning in LLMs.

2025-09-29T18:02:41Z Lawrence Phillips Marc Boubnovski Martell Aditya Misra Josefa Lia Stoisser Cesar A. Prada-Medina Rory Donovan-Maiye Kaspar Märtens http://arxiv.org/abs/2604.23679v1 Imaging Exploration of Molecular Subtypes in Tongue Squamous Cell Carcinoma 2026-04-26T12:39:52Z

Tongue squamous cell carcinoma (TSCC) is an aggressive malignancy with marked biological heterogeneity and variable clinical outcomes. Although molecular profiling has improved understanding of TSCC heterogeneity, its clinical use remains constrained by invasive tissue sampling and limited representation of whole-tumor spatial complexity. Meanwhile, most radiomics studies in TSCC have focused on downstream clinical endpoints, and whether imaging can non-invasively reflect intrinsic molecular subtypes remains unclear. In this study, an integrated transcriptomic-radiomics framework was used to investigate the relationship between preoperative imaging phenotypes and molecular subtypes in TSCC. Transcriptomic data from 60 TSCC cases in The Cancer Genome Atlas were analyzed using unsupervised consensus clustering, followed by differential expression and functional enrichment analyses. Matched preoperative imaging data from The Cancer Imaging Archive were manually annotated for primary tumor regions, and radiomic features were extracted using PyRadiomics; group differences were assessed with the U-test. Two stable molecular subtypes, C1 and C2, were identified. Their biological differences were mainly associated with squamous epithelial differentiation, inflammatory signaling, and lipid metabolism, with C2 showing greater enrichment of immune-related pathways. In addition, 10 radiomic features differed significantly between the two subtypes, mainly wavelet-derived texture features from gray-level size zone, dependence, co-occurrence, and run length matrices (P=0.00202-0.0162). These findings support the potential of radiomics as a non-invasive approach for characterizing molecular heterogeneity in TSCC and provide an initial radiogenomic framework for biologically informed preoperative assessment.

2026-04-26T12:39:52Z 15 pages, 5 figures Hao Pan Peipei Wang Yajie Chang Bingyi Lu Yunyan Jiang Mengfan Wang Xinyue Wang Xinrou Yang Jiyuan Zhang Yu Liu Andrei Velichko Yuanjun Wang http://arxiv.org/abs/2603.00678v2 From Syntax to Semantics: Geometric Stability as the Missing Axis of Perturbation Biology 2026-04-25T00:20:45Z

The capacity to precisely edit genomes has outpaced our ability to predict the consequences. A cell can be genetically perfect and therapeutically useless: edited exactly as intended, yet unstable, drifting toward unintended fates, or selected for properties that compromise safety. This paradox reflects a deeper gap in how we evaluate biological intervention. Current frameworks excel at measuring what was done to a cell but remain blind to what the cell has become. We argue that this blindness stems from treating cells as collections of independent variables rather than as dynamical systems occupying positions on high-dimensional state manifolds. Drawing on Waddington's epigenetic landscape, we propose geometric stability as a missing axis of evaluation: the directional coherence of cellular responses to perturbation. This metric distinguishes interventions that guide cells coherently toward stable states from those that scatter them across the state manifold. Validation across diverse perturbation datasets reveals that geometric stability captures regulatory architecture invisible to conventional metrics, discriminating pleiotropic master regulators from lineage-specific factors without prior biological annotation. As precision medicine increasingly relies on cellular reprogramming, the question shifts from ``did the intervention occur?'' to ``is the resulting state stable?'' Geometric stability provides a framework for answering.

2026-02-28T14:42:50Z Prashant C. Raju http://arxiv.org/abs/2508.02061v2 A Bayesian approach to model uncertainty in single-cell genomic data 2026-04-24T16:01:54Z

Network models provide a powerful framework for analysing single-cell count data, facilitating the characterisation of cellular identities, disease mechanisms, and developmental trajectories. However, uncertainty modeling in unsupervised learning with genomic data remains insufficiently explored. Conventional clustering methods assign a singular identity to each cell, potentially obscuring transitional states during differentiation or mutation. This study introduces a variational Bayesian framework for clustering and analysing single-cell genomic data, employing a Bayesian Gaussian mixture model to estimate the probabilistic association of cells with distinct clusters. This approach captures cellular transitions, yielding biologically coherent insights into neurogenesis and breast cancer progression. The inferred clustering probabilities enable further analyses, including Differential Expression Analysis and pseudotime analysis. Furthermore, we propose utilising the misclustering rate and Area Under the Curve in clustering scRNA-seq data as an innovative metric to quantitatively evaluate overall clustering performance. This methodological advancement enhances the resolution of single-cell data analysis, enabling a more nuanced characterisation of dynamic cellular identities in development and disease.

2025-08-04T05:00:15Z Shanshan Ren Thomas E. Bartlett Lina Gerontogianni Swati Chandna