http://arxiv.org/abs/2503.13925v1 Reconstructing Cell Lineage Trees from Phenotypic Features with Metric Learning 2025-03-18T05:41:03Z

How a single fertilized cell gives rise to a complex array of specialized cell types in development is a central question in biology. The cells grow, divide, and acquire differentiated characteristics through poorly understood molecular processes. A key approach to studying developmental processes is to infer the tree graph of cell lineage division and differentiation histories, providing an analytical framework for dissecting individual cells' molecular decisions during replication and differentiation. Although genetically engineered lineage-tracing methods have advanced the field, they are either infeasible or ethically constrained in many organisms. In contrast, modern single-cell technologies can measure high-content molecular profiles (e.g., transcriptomes) in a wide range of biological systems. Here, we introduce CellTreeQM, a novel deep learning method based on transformer architectures that learns an embedding space with geometric properties optimized for tree-graph inference. By formulating lineage reconstruction as a tree-metric learning problem, we have systematically explored supervised, weakly supervised, and unsupervised training settings and present a Lineage Reconstruction Benchmark to facilitate comprehensive evaluation of our learning method. We benchmarked the method on (1) synthetic data modeled via Brownian motion with independent noise and spurious signals and (2) lineage-resolved single-cell RNA sequencing datasets. Experimental results show that CellTreeQM recovers lineage structures with minimal supervision and limited data, offering a scalable framework for uncovering cell lineage relationships in challenging animal models. To our knowledge, this is the first method to cast cell lineage inference explicitly as a metric learning task, paving the way for future computational models aimed at uncovering the molecular dynamics of cell lineage.
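To make the "tree-metric" idea concrete: a distance matrix corresponds to path lengths on some tree exactly when every quartet of points satisfies the four-point condition, meaning the two largest of the three pairwise distance sums are equal. The sketch below measures how far a matrix is from that condition. It is only an illustration of the classical condition; the function names are hypothetical, and it is not claimed to be the loss or the evaluation that CellTreeQM uses.

```python
from itertools import combinations


def four_point_violation(d, i, j, k, l):
    """Gap between the two largest of the three pairwise sums for one quartet.

    For an additive (tree) metric the two largest sums are equal, so the gap is 0.
    """
    sums = sorted([
        d[i][j] + d[k][l],
        d[i][k] + d[j][l],
        d[i][l] + d[j][k],
    ])
    return sums[2] - sums[1]


def mean_violation(d):
    """Average four-point gap over all quartets of a square distance matrix."""
    quartets = list(combinations(range(len(d)), 4))
    return sum(four_point_violation(d, *q) for q in quartets) / len(quartets)


# Path distances between the leaves of the tree ((A,B),(C,D)) with unit edges.
tree_d = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

# Shortest-path distances on a 4-cycle, which no tree can reproduce.
cycle_d = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

print(mean_violation(tree_d))   # 0.0
print(mean_violation(cycle_d))  # 2.0
```

An embedding whose pairwise distances drive this gap toward zero is one from which a tree can be read off with standard distance-based methods.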

2025-03-18T05:41:03Z Da Kuang Guanwen Qiu Junhyong Kim http://arxiv.org/abs/2503.13189v1 Causes of evolutionary divergence in prostate cancer 2025-03-17T14:00:02Z

Cancer progression involves the sequential accumulation of genetic alterations that cumulatively shape the tumour phenotype. In prostate cancer, tumours can follow divergent evolutionary trajectories that lead to distinct subtypes, but the causes of this divergence remain unclear. While causal inference could elucidate the factors involved, conventional methods are unsuitable due to the possibility of unobserved confounders and ambiguity in the direction of causality. Here, we propose a method that circumvents these issues and apply it to genomic data from 829 prostate cancer patients. We identify several genetic alterations that drive divergence as well as others that prevent this transition, locking tumours into one trajectory. Further analysis reveals that these genetic alterations may cause each other, implying a positive-feedback loop that accelerates divergence. Our findings provide insights into how cancer subtypes emerge and offer a foundation for genomic surveillance strategies aimed at monitoring the progression of prostate cancer.

2025-03-17T14:00:02Z Emre Esenturk Atef Sahli Valeriia Haberland Aleksandra Ziuboniewicz Christopher Wirth G. Steven Bova Robert G Bristow Mark N Brook Benedikt Brors Adam Butler Géraldine Cancel-Tassin Kevin CL Cheng Colin S Cooper Niall M Corcoran Olivier Cussenot Ros A Eeles Francesco Favero Clarissa Gerhauser Abraham Gihawi Etsehiwot G Girma Vincent J Gnanapragasam Andreas J Gruber Anis Hamid Vanessa M Hayes Housheng Hansen He Christopher M Hovens Eddie Luidy Imada G. Maria Jakobsdottir Chol-hee Jung Francesca Khani Zsofia Kote-Jarai Philippe Lamy Gregory Leeman Massimo Loda Pavlo Lutsik Luigi Marchionni Ramyar Molania Anthony T Papenfuss Diogo Pellegrina Bernard Pope Lucio R Queiroz Tobias Rausch Jüri Reimand Brian Robinson Thorsten Schlomm Karina D Sørensen Sebastian Uhrig Joachim Weischenfeldt Yaobo Xu Takafumi N Yamaguchi Claudio Zanettini Andy G Lynch David C Wedge Daniel S Brewer Dan J Woodcock http://arxiv.org/abs/2503.13078v1 Bayesian Cox model with graph-structured variable selection priors for multi-omics biomarker identification 2025-03-17T11:33:21Z

An important goal in cancer research is the survival prognosis of a patient based on a minimal panel of genomic and molecular markers such as genes or proteins. Purely data-driven models without any biological knowledge can produce non-interpretable results. We propose a penalized semiparametric Bayesian Cox model with graph-structured selection priors for sparse identification of multi-omics features. The model makes use of a biologically meaningful graph via a Markov random field (MRF) prior to capture known relationships between multi-omics features. Because the fixed graph enters only through the prior probability distribution, it is not a hard constraint on variable selection, so the proposed model can verify known information and has the potential to identify novel biomarkers that yield new biological knowledge. Our simulation results show that the proposed Bayesian Cox model with graph-based prior knowledge results in more trustworthy and stable variable selection and non-inferior survival prediction, compared to methods modeling the covariates independently without any prior knowledge. The results also indicate that the performance of the proposed model is robust to a partially correct graph in the MRF prior, meaning that in a real setting where not all the true network information between covariates is known, the graph can still be useful. The proposed model is applied to data from primary invasive breast cancer patients in The Cancer Genome Atlas project.
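For intuition about why the graph acts as a soft rather than a hard constraint, a commonly used form of an MRF prior on binary inclusion indicators is shown below. This parameterization is an assumption made for illustration; the abstract does not state the exact form the authors use, and the symbols $a$, $b$, and $G$ are introduced here.

```latex
p(\boldsymbol{\gamma} \mid a, b, G) \;\propto\;
\exp\!\left( a\, \mathbf{1}^{\top} \boldsymbol{\gamma}
           + b\, \boldsymbol{\gamma}^{\top} G\, \boldsymbol{\gamma} \right),
\qquad \gamma_j \in \{0, 1\}
```

Here $\gamma_j = 1$ means feature $j$ is selected, $G$ is the adjacency matrix of the known biological graph, $a$ controls overall sparsity, and $b \ge 0$ controls how strongly connected features are encouraged to be selected together. Since $G$ only reweights prior probabilities, the likelihood can still select a feature with no selected neighbours, or drop one whose neighbours are all selected.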

2025-03-17T11:33:21Z Tobias Østmo Hermansen Manuela Zucknick Zhi Zhao http://arxiv.org/abs/2503.12377v1 GCBLANE: A graph-enhanced convolutional BiLSTM attention network for improved transcription factor binding site prediction 2025-03-16T06:52:03Z

Identifying transcription factor binding sites (TFBS) is crucial for understanding gene regulation, as these sites enable transcription factors (TFs) to bind to DNA and modulate gene expression. Despite advances in high-throughput sequencing, accurately identifying TFBS remains challenging due to the vast genomic data and complex binding patterns. GCBLANE, a graph-enhanced convolutional bidirectional Long Short-Term Memory (LSTM) attention network, is introduced to address this issue. It integrates convolutional, multi-head attention, and recurrent layers with a graph neural network to detect key features for TFBS prediction. On 690 ENCODE ChIP-Seq datasets, GCBLANE achieved an average AUC of 0.943, and on 165 ENCODE datasets, it reached an AUC of 0.9495, outperforming advanced models that utilize multimodal approaches, including DNA shape information. This result underscores GCBLANE's effectiveness compared to other methods. By combining graph-based learning with sequence analysis, GCBLANE significantly advances TFBS prediction.

2025-03-16T06:52:03Z Jonas Chris Ferrao Dickson Dias Sweta Morajkar Manisha Gokuldas Fal Dessai http://arxiv.org/abs/2503.12330v1 Computational identification of ketone metabolism as a key regulator of sleep stability and circadian dynamics via real-time metabolic profiling 2025-03-16T02:57:32Z

Metabolism plays a crucial role in sleep regulation, yet its effects are challenging to track in real time. This study introduces a machine learning-based framework to analyze sleep patterns and identify how metabolic changes influence sleep at specific time points. We first established that sleep periods in Drosophila melanogaster function independently, with no causal relationship between different sleep episodes. Using gradient boosting models and explainable artificial intelligence techniques, we quantified the influence of time-dependent sleep features. Causal inference and autocorrelation analyses further confirmed that sleep states at different times are statistically independent, providing a robust foundation for exploring metabolic effects on sleep. Applying this framework to flies with altered monocarboxylate transporter 2 expression, we found that changes in ketone transport modified sleep stability and disrupted transitions between day and night sleep. In an Alzheimer's disease model, metabolic interventions such as beta-hydroxybutyrate supplementation and intermittent fasting selectively influenced the timing of day-to-night transitions rather than uniformly altering sleep duration. Autoencoder-based similarity scoring and wavelet analysis reinforced that metabolic effects on sleep were highly time-dependent. This study presents a novel approach to studying sleep-metabolism interactions, revealing that metabolic states exert their strongest influence at distinct time points, shaping sleep stability and circadian transitions.

2025-03-16T02:57:32Z Hao Huang Kaijing Xu Michael Lardelli http://arxiv.org/abs/2309.07261v5 Simultaneous inference for generalized linear models with unmeasured confounders 2025-03-15T13:59:11Z

Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show effective Type-I error control of asymptotic $z$-tests as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model.
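The Benjamini-Hochberg procedure mentioned above is a standard step-up rule: sort the p-values, find the largest rank k with p_(k) <= k * alpha / m, and reject the k smallest. The following is a minimal sketch of that textbook rule only; it does not implement the paper's confounder adjustment or bias-corrected test statistics, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank k (1-based) whose p-value is under its threshold k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank

    # Step-up: reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p))
# [True, True, False, False, False, False, False, False]
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected whenever a larger p-value passes a later one; for example, `[0.04, 0.03, 0.045, 0.049]` at `alpha=0.05` rejects all four.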

2023-09-13T18:53:11Z Main text: 23 pages and 6 figures; appendix: 50 pages and 12 figures Jin-Hong Du Larry Wasserman Kathryn Roeder http://arxiv.org/abs/2503.14520v1 From de Bruijn graphs to variation graphs-relationships between pangenome models 2025-03-14T15:23:52Z

Pangenomes serve as a framework for joint analysis of genomes of related organisms. Several pangenome models have been proposed, differing in functionality, in the applications supported by available tools, in efficiency, etc. Among them, two graph-based models are particularly widely used: variation graphs and de Bruijn graphs. In the current paper we propose an axiomatization of the desirable properties of a graph representation of a collection of strings. We show the relationship between variation graphs satisfying these criteria and de Bruijn graphs. This relationship can be used to efficiently build a variation graph representing a given set of genomes, transfer annotations between both models, compare the results of analyses based on each model, etc.
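As a reminder of what a de Bruijn graph of a collection of strings looks like, here is a minimal sketch of one common textbook variant, in which nodes are (k-1)-mers and every k-mer contributes an edge from its prefix to its suffix. The paper may use a different but related definition (for example, k-mers as nodes), and the function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(strings, k):
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it."""
    edges = defaultdict(set)
    for s in strings:
        # Slide a window of length k over the string.
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            edges[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in edges.items()}


graph = de_bruijn_graph(["ACGT", "ACGA"], k=3)
print(graph)  # {'AC': ['CG'], 'CG': ['GA', 'GT']}
```

The two input strings share the path AC -> CG and then branch, which is the kind of shared-then-divergent structure that a variation graph also encodes.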

2025-03-14T15:23:52Z Submitted to International Symposium on String Processing and Information Retrieval. Cham: Springer Nature Switzerland, 2023 (SPIRE2023). This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in LNCS, volume 14240, and is available online at https://doi.org/10.1007/978-3-031-43980-3_10 Adam Cicherski Norbert Dojer 10.1007/978-3-031-43980-3_10 http://arxiv.org/abs/2410.17708v2 Proteome-wide prediction of mode of inheritance and molecular mechanism underlying genetic diseases using structural interactomics 2025-03-14T13:40:46Z

Genetic diseases can be classified according to their modes of inheritance and their underlying molecular mechanisms. Autosomal dominant disorders often result from DNA variants that cause loss-of-function, gain-of-function, or dominant-negative effects, while autosomal recessive diseases are primarily linked to loss-of-function variants. In this study, we introduce a graph-of-graphs approach that leverages protein-protein interaction networks and high-resolution protein structures to predict the mode of inheritance of diseases caused by variants in autosomal genes, and to classify dominant-associated proteins based on their functional effect. Our approach integrates graph neural networks, structural interactomics and topological network features to provide proteome-wide predictions, thus offering a scalable method for understanding genetic disease mechanisms.

2024-10-23T09:33:13Z Ali Saadat Jacques Fellay 10.1016/j.isci.2025.112812 http://arxiv.org/abs/2410.10919v2 Fine-tuning the ESM2 protein language model to understand the functional impact of missense variants 2025-03-14T13:26:57Z

Elucidating the functional effect of missense variants is of crucial importance, yet challenging. To understand the impact of such variants, we fine-tuned the ESM2 protein language model to classify 20 protein features at amino acid resolution. We used the resulting models to: 1) identify protein features that are enriched in either pathogenic or benign missense variants, 2) compare the characteristics of proteins with reference or alternate alleles to understand how missense variants affect protein functionality. We show that our model can be used to reclassify some variants of unknown significance. We also demonstrate the usage of our models for understanding the potential effect of variants on protein features.

2024-10-14T09:37:27Z Ali Saadat Jacques Fellay 10.1016/j.csbj.2025.05.022 http://arxiv.org/abs/2503.11180v1 Learnable Group Transform: Enhancing Genotype-to-Phenotype Prediction for Rice Breeding with Small, Structured Datasets 2025-03-14T08:27:19Z

Genotype-to-Phenotype (G2P) prediction plays a pivotal role in crop breeding, enabling the identification of superior genotypes based on genomic data. Rice (Oryza sativa), one of the most important staple crops, faces challenges in improving yield and resilience due to the complex genetic architecture of agronomic traits and the limited sample size in breeding datasets. Current G2P prediction methods, such as GWAS and linear models, often fail to capture complex non-linear relationships between genotypes and phenotypes, leading to suboptimal prediction accuracy. Additionally, population stratification and overfitting are significant obstacles when models are applied to small datasets with diverse genetic backgrounds. This study introduces the Learnable Group Transform (LGT) method, which aims to overcome these challenges by combining the advantages of traditional linear models with advanced machine learning techniques. LGT utilizes a group-based transformation of genotype data to capture spatial relationships and genetic structures across diverse rice populations, offering flexibility to generalize even with limited data. Through extensive experiments on the Rice529 dataset, a panel of 529 rice accessions, LGT demonstrated substantial improvements in prediction accuracy for multiple agronomic traits, including yield and plant height, compared to state-of-the-art baselines such as linear models and recent deep learning approaches. Notably, LGT achieved an R^2 improvement of up to 15% for yield prediction, significantly reducing error and demonstrating its ability to extract meaningful signals from high-dimensional, noisy genomic data. These results highlight the potential of LGT as a powerful tool for genomic prediction in rice breeding, offering a promising solution for accelerating the identification of high-yielding and resilient rice varieties.

2025-03-14T08:27:19Z Yunxuan Dong Siyuan Chen Jisen Zhang http://arxiv.org/abs/2502.13785v2 Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics 2025-03-11T14:21:27Z

mRNA-based vaccines have become a major focus in the pharmaceutical industry. The coding sequence as well as the Untranslated Regions (UTRs) of an mRNA can strongly influence translation efficiency, stability, degradation, and other factors that collectively determine a vaccine's effectiveness. However, optimizing mRNA sequences for those properties remains a complex challenge. Existing deep learning models often focus solely on coding region optimization, overlooking the UTRs. We present Helix-mRNA, a structured state-space-based and attention hybrid model to address these challenges. In addition to a first pre-training, a second pre-training stage allows us to specialise the model with high-quality data. We employ single nucleotide tokenization of mRNA sequences with codon separation, ensuring prior biological and structural information from the original mRNA sequence is not lost. Our model, Helix-mRNA, outperforms existing methods in analysing both UTRs and coding region properties. It can process sequences 6x longer than current approaches while using only 10% of the parameters of existing foundation models. Its predictive capabilities extend to all mRNA regions. We open-source the model (https://github.com/helicalAI/helical) and model weights (https://huggingface.co/helical-ai/helix-mRNA).
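The phrase "single nucleotide tokenization with codon separation" can be pictured with a small sketch: every nucleotide becomes its own token, and a separator token marks codon boundaries inside the coding region. Everything specific below is an assumption for illustration: the separator symbol `"|"`, the function name, and the use of explicit coding-region coordinates are hypothetical and are not taken from the Helix-mRNA code.

```python
def tokenize_mrna(seq, cds_start, cds_end, sep="|"):
    """Split an mRNA string into single-nucleotide tokens.

    Nucleotides in the UTRs are emitted one by one. Inside the coding region
    seq[cds_start:cds_end], a separator token (hypothetical) follows each codon.
    """
    tokens = list(seq[:cds_start])            # 5' UTR, one token per nucleotide
    cds = seq[cds_start:cds_end]
    for i in range(0, len(cds), 3):
        tokens.extend(cds[i:i + 3])           # the nucleotides of one codon
        tokens.append(sep)                    # mark the codon boundary
    tokens.extend(seq[cds_end:])              # 3' UTR
    return tokens


# 5' UTR "GG", coding region "AUGGCCUAA" (AUG GCC UAA), 3' UTR "UU".
tokens = tokenize_mrna("GGAUGGCCUAAUU", cds_start=2, cds_end=11)
print(" ".join(tokens))  # G G A U G | G C C | U A A | U U
```

Compared with codon-level tokens, this keeps UTR nucleotides and coding nucleotides in one shared vocabulary while still exposing the reading frame to the model.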

2025-02-19T14:51:41Z 8 pages, 3 figures, 3 tables Matthew Wood Mathieu Klop Maxime Allard http://arxiv.org/abs/2503.07981v1 Regulatory DNA sequence Design with Reinforcement Learning 2025-03-11T02:33:33Z

Cis-regulatory elements (CREs), such as promoters and enhancers, are relatively short DNA sequences that directly regulate gene expression. The fitness of CREs, measured by their ability to modulate gene expression, highly depends on the nucleotide sequences, especially specific motifs known as transcription factor binding sites (TFBSs). Designing high-fitness CREs is crucial for therapeutic and bioengineering applications. Current CRE design methods are limited by two major drawbacks: (1) they typically rely on iterative optimization strategies that modify existing sequences and are prone to local optima, and (2) they lack the guidance of biological prior knowledge in sequence optimization. In this paper, we address these limitations by proposing a generative approach that leverages reinforcement learning (RL) to fine-tune a pre-trained autoregressive (AR) model. Our method incorporates data-driven biological priors by deriving computational inference-based rewards that simulate the addition of activator TFBSs and removal of repressor TFBSs, which are then integrated into the RL process. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types, demonstrating its ability to generate high-fitness CREs while maintaining sequence diversity. The code is available at https://github.com/yangzhao1230/TACO.

2025-03-11T02:33:33Z Zhao Yang Bing Su Chuan Cao Ji-Rong Wen http://arxiv.org/abs/2503.04347v1 Large Language Models for Zero-shot Inference of Causal Structures in Biology 2025-03-06T11:43:30Z

Genes, proteins and other biological entities influence one another via causal molecular networks. Causal relationships in such networks are mediated by complex and diverse mechanisms, through latent variables, and are often specific to cellular context. It remains challenging to characterise such networks in practice. Here, we present a novel framework to evaluate large language models (LLMs) for zero-shot inference of causal relationships in biology. In particular, we systematically evaluate causal claims obtained from an LLM using real-world interventional data. This is done over one hundred variables and thousands of causal hypotheses. Furthermore, we consider several prompting and retrieval-augmentation strategies, including large, and potentially conflicting, collections of scientific articles. Our results show that with tailored augmentation and prompting, even relatively small LLMs can capture meaningful aspects of causal structure in biological systems. This supports the notion that LLMs could act as orchestration tools in biological discovery, by helping to distil current knowledge in ways amenable to downstream analysis. Our approach to assessing LLMs with respect to experimental data is relevant for a broad range of problems at the intersection of causal learning, LLMs and scientific discovery.

2025-03-06T11:43:30Z ICLR 2025 Workshop on Machine Learning for Genomics Explorations Izzy Newsham Luka Kovačević Richard Moulange Nan Rosemary Ke Sach Mukherjee http://arxiv.org/abs/2503.02997v1 Enabling Fast, Accurate, and Efficient Real-Time Genome Analysis via New Algorithms and Techniques 2025-03-04T20:44:37Z

The advent of high-throughput sequencing technologies has revolutionized genome analysis by enabling the rapid and cost-effective sequencing of large genomes. Despite these advancements, the increasing complexity and volume of genomic data present significant challenges related to accuracy, scalability, and computational efficiency. These challenges are mainly due to various forms of unwanted and unhandled variations in sequencing data, collectively referred to as noise. In this dissertation, we address these challenges by providing a deep understanding of different types of noise in genomic data and developing techniques to mitigate the impact of noise on genome analysis. First, we introduce BLEND, a noise-tolerant hashing mechanism that quickly identifies both exactly matching and highly similar sequences with arbitrary differences using a single lookup of their hash values. Second, to enable scalable and accurate analysis of noisy raw nanopore signals, we propose RawHash, a novel mechanism that effectively reduces noise in raw nanopore signals and enables accurate, real-time analysis by proposing the first hash-based similarity search technique for raw nanopore signals. Third, we extend the capabilities of RawHash with RawHash2, an improved mechanism that 1) provides a better understanding of noise in raw nanopore signals to reduce it more effectively and 2) improves the robustness of mapping decisions. Fourth, we explore the broader implications and new applications of raw nanopore signal analysis by introducing Rawsamble, the first mechanism for all-vs-all overlapping of raw signals using hash-based search. Rawsamble enables the construction of de novo assemblies directly from raw signals without basecalling, which opens up new directions and uses for raw nanopore signal analysis.

2025-03-04T20:44:37Z PhD Thesis submitted to ETH Zurich Can Firtina 10.3929/ethz-b-000725492 http://arxiv.org/abs/2405.05998v3 Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity 2025-03-03T21:31:23Z

Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats. We not only demonstrate solid predictive performance, but also how sequence-level information of entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow up.

2024-05-09T09:34:51Z published at AAAI 2025 Zhufeng Li Sandeep S Cranganore Nicholas Youngblut Niki Kilbertus