http://arxiv.org/abs/2503.13925v1
Reconstructing Cell Lineage Trees from Phenotypic Features with Metric Learning
2025-03-18T05:41:03Z
How a single fertilized cell gives rise to a complex array of specialized cell types in development is a central question in biology. The cells grow, divide, and acquire differentiated characteristics through poorly understood molecular processes. A key approach to studying developmental processes is to infer the tree graph of cell lineage division and differentiation histories, providing an analytical framework for dissecting individual cells' molecular decisions during replication and differentiation. Although genetically engineered lineage-tracing methods have advanced the field, they are either infeasible or ethically constrained in many organisms. In contrast, modern single-cell technologies can measure high-content molecular profiles (e.g., transcriptomes) in a wide range of biological systems.
Here, we introduce CellTreeQM, a novel deep learning method based on transformer architectures that learns an embedding space with geometric properties optimized for tree-graph inference. By formulating lineage reconstruction as a tree-metric learning problem, we have systematically explored supervised, weakly supervised, and unsupervised training settings and present a Lineage Reconstruction Benchmark to facilitate comprehensive evaluation of our learning method. We benchmarked the method on (1) synthetic data modeled via Brownian motion with independent noise and spurious signals and (2) lineage-resolved single-cell RNA sequencing datasets. Experimental results show that CellTreeQM recovers lineage structures with minimal supervision and limited data, offering a scalable framework for uncovering cell lineage relationships in challenging animal models. To our knowledge, this is the first method to cast cell lineage inference explicitly as a metric learning task, paving the way for future computational models aimed at uncovering the molecular dynamics of cell lineage.
2025-03-18T05:41:03Z
Da Kuang, Guanwen Qiu, Junhyong Kim

http://arxiv.org/abs/2503.13189v1
Causes of evolutionary divergence in prostate cancer
2025-03-17T14:00:02Z
Cancer progression involves the sequential accumulation of genetic alterations that cumulatively shape the tumour phenotype. In prostate cancer, tumours can follow divergent evolutionary trajectories that lead to distinct subtypes, but the causes of this divergence remain unclear. While causal inference could elucidate the factors involved, conventional methods are unsuitable due to the possibility of unobserved confounders and ambiguity in the direction of causality. Here, we propose a method that circumvents these issues and apply it to genomic data from 829 prostate cancer patients. We identify several genetic alterations that drive divergence as well as others that prevent this transition, locking tumours into one trajectory.
Further analysis reveals that these genetic alterations may cause each other, implying a positive-feedback loop that accelerates divergence. Our findings provide insights into how cancer subtypes emerge and offer a foundation for genomic surveillance strategies aimed at monitoring the progression of prostate cancer.
2025-03-17T14:00:02Z
Emre Esenturk, Atef Sahli, Valeriia Haberland, Aleksandra Ziuboniewicz, Christopher Wirth, G. Steven Bova, Robert G Bristow, Mark N Brook, Benedikt Brors, Adam Butler, Géraldine Cancel-Tassin, Kevin CL Cheng, Colin S Cooper, Niall M Corcoran, Olivier Cussenot, Ros A Eeles, Francesco Favero, Clarissa Gerhauser, Abraham Gihawi, Etsehiwot G Girma, Vincent J Gnanapragasam, Andreas J Gruber, Anis Hamid, Vanessa M Hayes, Housheng Hansen He, Christopher M Hovens, Eddie Luidy Imada, G. Maria Jakobsdottir, Chol-hee Jung, Francesca Khani, Zsofia Kote-Jarai, Philippe Lamy, Gregory Leeman, Massimo Loda, Pavlo Lutsik, Luigi Marchionni, Ramyar Molania, Anthony T Papenfuss, Diogo Pellegrina, Bernard Pope, Lucio R Queiroz, Tobias Rausch, Jüri Reimand, Brian Robinson, Thorsten Schlomm, Karina D Sørensen, Sebastian Uhrig, Joachim Weischenfeldt, Yaobo Xu, Takafumi N Yamaguchi, Claudio Zanettini, Andy G Lynch, David C Wedge, Daniel S Brewer, Dan J Woodcock

http://arxiv.org/abs/2503.13078v1
Bayesian Cox model with graph-structured variable selection priors for multi-omics biomarker identification
2025-03-17T11:33:21Z
An important goal in cancer research is the survival prognosis of a patient based on a minimal panel of genomic and molecular markers such as genes or proteins. Purely data-driven models without any biological knowledge can produce non-interpretable results. We propose a penalized semiparametric Bayesian Cox model with graph-structured selection priors for sparse identification of multi-omics features by making use of a biologically meaningful graph via a Markov random field (MRF) prior to capture known relationships between multi-omics features.
Since the fixed graph in the MRF prior is for the prior probability distribution, it is not a hard constraint to determine variable selection, so the proposed model can verify known information and has the potential to identify new and novel biomarkers for drawing new biological knowledge. Our simulation results show that the proposed Bayesian Cox model with graph-based prior knowledge results in more trustable and stable variable selection and non-inferior survival prediction, compared to methods modeling the covariates independently without any prior knowledge. The results also indicate that the performance of the proposed model is robust to a partially correct graph in the MRF prior, meaning that in a real setting where not all the true network information between covariates is known, the graph can still be useful. The proposed model is applied to the primary invasive breast cancer patients data in The Cancer Genome Atlas project.
2025-03-17T11:33:21Z
Tobias Østmo Hermansen, Manuela Zucknick, Zhi Zhao

http://arxiv.org/abs/2503.12377v1
GCBLANE: A graph-enhanced convolutional BiLSTM attention network for improved transcription factor binding site prediction
2025-03-16T06:52:03Z
Identifying transcription factor binding sites (TFBS) is crucial for understanding gene regulation, as these sites enable transcription factors (TFs) to bind to DNA and modulate gene expression. Despite advances in high-throughput sequencing, accurately identifying TFBS remains challenging due to the vast genomic data and complex binding patterns. GCBLANE, a graph-enhanced convolutional bidirectional Long Short-Term Memory (LSTM) attention network, is introduced to address this issue. It integrates convolutional, multi-head attention, and recurrent layers with a graph neural network to detect key features for TFBS prediction.
On 690 ENCODE ChIP-Seq datasets, GCBLANE achieved an average AUC of 0.943, and on 165 ENCODE datasets, it reached an AUC of 0.9495, outperforming advanced models that utilize multimodal approaches, including DNA shape information. This result underscores GCBLANE's effectiveness compared to other methods. By combining graph-based learning with sequence analysis, GCBLANE significantly advances TFBS prediction.
2025-03-16T06:52:03Z
Jonas Chris Ferrao, Dickson Dias, Sweta Morajkar, Manisha Gokuldas Fal Dessai

http://arxiv.org/abs/2503.12330v1
Computational identification of ketone metabolism as a key regulator of sleep stability and circadian dynamics via real-time metabolic profiling
2025-03-16T02:57:32Z
Metabolism plays a crucial role in sleep regulation, yet its effects are challenging to track in real time. This study introduces a machine learning-based framework to analyze sleep patterns and identify how metabolic changes influence sleep at specific time points. We first established that sleep periods in Drosophila melanogaster function independently, with no causal relationship between different sleep episodes. Using gradient boosting models and explainable artificial intelligence techniques, we quantified the influence of time-dependent sleep features. Causal inference and autocorrelation analyses further confirmed that sleep states at different times are statistically independent, providing a robust foundation for exploring metabolic effects on sleep. Applying this framework to flies with altered monocarboxylate transporter 2 expression, we found that changes in ketone transport modified sleep stability and disrupted transitions between day and night sleep. In an Alzheimer's disease model, metabolic interventions such as beta hydroxybutyrate supplementation and intermittent fasting selectively influenced the timing of day to night transitions rather than uniformly altering sleep duration.
Autoencoder based similarity scoring and wavelet analysis reinforced that metabolic effects on sleep were highly time dependent. This study presents a novel approach to studying sleep-metabolism interactions, revealing that metabolic states exert their strongest influence at distinct time points, shaping sleep stability and circadian transitions.
2025-03-16T02:57:32Z
Hao Huang, Kaijing Xu, Michael Lardelli

http://arxiv.org/abs/2309.07261v5
Simultaneous inference for generalized linear models with unmeasured confounders
2025-03-15T13:59:11Z
Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show effective Type-I error control of asymptotic $z$-tests as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods.
By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model.
2023-09-13T18:53:11Z
Main text: 23 pages and 6 figures; appendix: 50 pages and 12 figures
Jin-Hong Du, Larry Wasserman, Kathryn Roeder

http://arxiv.org/abs/2503.14520v1
From de Bruijn graphs to variation graphs-relationships between pangenome models
2025-03-14T15:23:52Z
Pangenomes serve as a framework for joint analysis of genomes of related organisms. Several pangenome models were proposed, offering different functionalities, applications provided by available tools, their efficiency etc. Among them, two graph-based models are particularly widely used: variation graphs and de Bruijn graphs. In the current paper we propose an axiomatization of the desirable properties of a graph representation of a collection of strings. We show the relationship between variation graphs satisfying these criteria and de Bruijn graphs. This relationship can be used to efficiently build a variation graph representing a given set of genomes, transfer annotations between both models, compare the results of analyses based on each model etc.
2025-03-14T15:23:52Z
Submitted to International Symposium on String Processing and Information Retrieval. Cham: Springer Nature Switzerland, 2023 (SPIRE2023). This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in LNCS, volume 14240, and is available online at https://doi.org/10.1007/978-3-031-43980-3_10
Adam Cicherski, Norbert Dojer
10.1007/978-3-031-43980-3_10

http://arxiv.org/abs/2410.17708v2
Proteome-wide prediction of mode of inheritance and molecular mechanism underlying genetic diseases using structural interactomics
2025-03-14T13:40:46Z
Genetic diseases can be classified according to their modes of inheritance and their underlying molecular mechanisms.
Autosomal dominant disorders often result from DNA variants that cause loss-of-function, gain-of-function, or dominant-negative effects, while autosomal recessive diseases are primarily linked to loss-of-function variants. In this study, we introduce a graph-of-graphs approach that leverages protein-protein interaction networks and high-resolution protein structures to predict the mode of inheritance of diseases caused by variants in autosomal genes, and to classify dominant-associated proteins based on their functional effect. Our approach integrates graph neural networks, structural interactomics and topological network features to provide proteome-wide predictions, thus offering a scalable method for understanding genetic disease mechanisms.
2024-10-23T09:33:13Z
Ali Saadat, Jacques Fellay
10.1016/j.isci.2025.112812

http://arxiv.org/abs/2410.10919v2
Fine-tuning the ESM2 protein language model to understand the functional impact of missense variants
2025-03-14T13:26:57Z
Elucidating the functional effect of missense variants is of crucial importance, yet challenging. To understand the impact of such variants, we fine-tuned the ESM2 protein language model to classify 20 protein features at amino acid resolution. We used the resulting models to: 1) identify protein features that are enriched in either pathogenic or benign missense variants, 2) compare the characteristics of proteins with reference or alternate alleles to understand how missense variants affect protein functionality. We show that our model can be used to reclassify some variants of unknown significance.
We also demonstrate the usage of our models for understanding the potential effect of variants on protein features.
2024-10-14T09:37:27Z
Ali Saadat, Jacques Fellay
10.1016/j.csbj.2025.05.022

http://arxiv.org/abs/2503.11180v1
Learnable Group Transform: Enhancing Genotype-to-Phenotype Prediction for Rice Breeding with Small, Structured Datasets
2025-03-14T08:27:19Z
Genotype-to-Phenotype (G2P) prediction plays a pivotal role in crop breeding, enabling the identification of superior genotypes based on genomic data. Rice (Oryza sativa), one of the most important staple crops, faces challenges in improving yield and resilience due to the complex genetic architecture of agronomic traits and the limited sample size in breeding datasets. Current G2P prediction methods, such as GWAS and linear models, often fail to capture complex non-linear relationships between genotypes and phenotypes, leading to suboptimal prediction accuracy. Additionally, population stratification and overfitting are significant obstacles when models are applied to small datasets with diverse genetic backgrounds. This study introduces the Learnable Group Transform (LGT) method, which aims to overcome these challenges by combining the advantages of traditional linear models with advanced machine learning techniques. LGT utilizes a group-based transformation of genotype data to capture spatial relationships and genetic structures across diverse rice populations, offering flexibility to generalize even with limited data. Through extensive experiments on the Rice529 dataset, a panel of 529 rice accessions, LGT demonstrated substantial improvements in prediction accuracy for multiple agronomic traits, including yield and plant height, compared to state-of-the-art baselines such as linear models and recent deep learning approaches.
Notably, LGT achieved an R^2 improvement of up to 15\% for yield prediction, significantly reducing error and demonstrating its ability to extract meaningful signals from high-dimensional, noisy genomic data. These results highlight the potential of LGT as a powerful tool for genomic prediction in rice breeding, offering a promising solution for accelerating the identification of high-yielding and resilient rice varieties.
2025-03-14T08:27:19Z
Yunxuan Dong, Siyuan Chen, Jisen Zhang

http://arxiv.org/abs/2502.13785v2
Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics
2025-03-11T14:21:27Z
mRNA-based vaccines have become a major focus in the pharmaceutical industry. The coding sequence as well as the Untranslated Regions (UTRs) of an mRNA can strongly influence translation efficiency, stability, degradation, and other factors that collectively determine a vaccine's effectiveness. However, optimizing mRNA sequences for those properties remains a complex challenge. Existing deep learning models often focus solely on coding region optimization, overlooking the UTRs. We present Helix-mRNA, a structured state-space-based and attention hybrid model to address these challenges. In addition to a first pre-training, a second pre-training stage allows us to specialise the model with high-quality data. We employ single nucleotide tokenization of mRNA sequences with codon separation, ensuring prior biological and structural information from the original mRNA sequence is not lost. Our model, Helix-mRNA, outperforms existing methods in analysing both UTRs and coding region properties. It can process sequences 6x longer than current approaches while using only 10% of the parameters of existing foundation models. Its predictive capabilities extend to all mRNA regions.
We open-source the model (https://github.com/helicalAI/helical) and model weights (https://huggingface.co/helical-ai/helix-mRNA).
2025-02-19T14:51:41Z
8 pages, 3 figures, 3 tables
Matthew Wood, Mathieu Klop, Maxime Allard

http://arxiv.org/abs/2503.07981v1
Regulatory DNA sequence Design with Reinforcement Learning
2025-03-11T02:33:33Z
Cis-regulatory elements (CREs), such as promoters and enhancers, are relatively short DNA sequences that directly regulate gene expression. The fitness of CREs, measured by their ability to modulate gene expression, highly depends on the nucleotide sequences, especially specific motifs known as transcription factor binding sites (TFBSs). Designing high-fitness CREs is crucial for therapeutic and bioengineering applications. Current CRE design methods are limited by two major drawbacks: (1) they typically rely on iterative optimization strategies that modify existing sequences and are prone to local optima, and (2) they lack the guidance of biological prior knowledge in sequence optimization. In this paper, we address these limitations by proposing a generative approach that leverages reinforcement learning (RL) to fine-tune a pre-trained autoregressive (AR) model. Our method incorporates data-driven biological priors by deriving computational inference-based rewards that simulate the addition of activator TFBSs and removal of repressor TFBSs, which are then integrated into the RL process. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types, demonstrating its ability to generate high-fitness CREs while maintaining sequence diversity.
The code is available at https://github.com/yangzhao1230/TACO.
2025-03-11T02:33:33Z
Zhao Yang, Bing Su, Chuan Cao, Ji-Rong Wen

http://arxiv.org/abs/2503.04347v1
Large Language Models for Zero-shot Inference of Causal Structures in Biology
2025-03-06T11:43:30Z
Genes, proteins and other biological entities influence one another via causal molecular networks. Causal relationships in such networks are mediated by complex and diverse mechanisms, through latent variables, and are often specific to cellular context. It remains challenging to characterise such networks in practice. Here, we present a novel framework to evaluate large language models (LLMs) for zero-shot inference of causal relationships in biology. In particular, we systematically evaluate causal claims obtained from an LLM using real-world interventional data. This is done over one hundred variables and thousands of causal hypotheses. Furthermore, we consider several prompting and retrieval-augmentation strategies, including large, and potentially conflicting, collections of scientific articles. Our results show that with tailored augmentation and prompting, even relatively small LLMs can capture meaningful aspects of causal structure in biological systems. This supports the notion that LLMs could act as orchestration tools in biological discovery, by helping to distil current knowledge in ways amenable to downstream analysis.
Our approach to assessing LLMs with respect to experimental data is relevant for a broad range of problems at the intersection of causal learning, LLMs and scientific discovery.
2025-03-06T11:43:30Z
ICLR 2025 Workshop on Machine Learning for Genomics Explorations
Izzy Newsham, Luka Kovačević, Richard Moulange, Nan Rosemary Ke, Sach Mukherjee

http://arxiv.org/abs/2503.02997v1
Enabling Fast, Accurate, and Efficient Real-Time Genome Analysis via New Algorithms and Techniques
2025-03-04T20:44:37Z
The advent of high-throughput sequencing technologies has revolutionized genome analysis by enabling the rapid and cost-effective sequencing of large genomes. Despite these advancements, the increasing complexity and volume of genomic data present significant challenges related to accuracy, scalability, and computational efficiency. These challenges are mainly due to various forms of unwanted and unhandled variations in sequencing data, collectively referred to as noise. In this dissertation, we address these challenges by providing a deep understanding of different types of noise in genomic data and developing techniques to mitigate the impact of noise on genome analysis.
First, we introduce BLEND, a noise-tolerant hashing mechanism that quickly identifies both exactly matching and highly similar sequences with arbitrary differences using a single lookup of their hash values. Second, to enable scalable and accurate analysis of noisy raw nanopore signals, we propose RawHash, a novel mechanism that effectively reduces noise in raw nanopore signals and enables accurate, real-time analysis by proposing the first hash-based similarity search technique for raw nanopore signals. Third, we extend the capabilities of RawHash with RawHash2, an improved mechanism that 1) provides a better understanding of noise in raw nanopore signals to reduce it more effectively and 2) improves the robustness of mapping decisions. Fourth, we explore the broader implications and new applications of raw nanopore signal analysis by introducing Rawsamble, the first mechanism for all-vs-all overlapping of raw signals using hash-based search. Rawsamble enables the construction of de novo assemblies directly from raw signals without basecalling, which opens up new directions and uses for raw nanopore signal analysis.
2025-03-04T20:44:37Z
PhD Thesis submitted to ETH Zurich
Can Firtina
10.3929/ethz-b-000725492

http://arxiv.org/abs/2405.05998v3
Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity
2025-03-03T21:31:23Z
Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments.
We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats. We not only demonstrate solid predictive performance, but also how sequence-level information of entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow up.
2024-05-09T09:34:51Z
published at AAAI 2025
Zhufeng Li, Sandeep S Cranganore, Nicholas Youngblut, Niki Kilbertus