https://arxiv.org/api/I28nj+DTJZOqOqBUi0OrKZMx0CA2026-06-15T00:16:45Z384851015http://arxiv.org/abs/2410.01795v2Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models2025-04-16T05:30:34ZPredicting phenotypes with complex genetic bases based on a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are utilized for this task, yet the high dimensional nature of genotype data makes the analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set to examine the ability of LLMs in feature selection and engineering for tabular genotype data, with a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find this framework outperforms several data-driven methods, particularly on low-shot regimes. FREEFORM is available as open-source framework at GitHub: https://github.com/PennShenLab/FREEFORM.2024-10-02T17:53:08Zaccepted by AMIA-IS'25: AMIA Informatics Summit [Marco Ramoni Distinguished Paper Award for Translational Bioinformatics]Joseph LeeShu YangJae Young BaikXiaoxi LiuZhen TanDawei LiZixuan WenBojian HouDuy Duong-TranTianlong ChenLi Shenhttp://arxiv.org/abs/2504.12353v1TransST: Transfer Learning Embedded Spatial Factor Modeling of Spatial Transcriptomics Data2025-04-15T22:03:38ZBackground: Spatial transcriptomics have emerged as a powerful tool in biomedical research because of its ability to capture both the spatial contexts and abundance of the complete RNA transcript profile in organs of interest. However, limitations of the technology such as the relatively low resolution and comparatively insufficient sequencing depth make it difficult to reliably extract real biological signals from these data. To alleviate this challenge, we propose a novel transfer learning framework, referred to as TransST, to adaptively leverage the cell-labeled information from external sources in inferring cell-level heterogeneity of a target spatial transcriptomics data.
Results: Applications in several real studies as well as a number of simulation settings show that our approach significantly improves existing techniques. For example, in the breast cancer study, TransST successfully identifies five biologically meaningful cell clusters, including the two subgroups of cancer in situ and invasive cancer; in addition, only TransST is able to separate the adipose tissues from the connective issues among all the studied methods.
Conclusions: In summary, the proposed method TransST is both effective and robust in identifying cell subclusters and detecting corresponding driving biomarkers in spatial transcriptomics data.2025-04-15T22:03:38ZShuo Shuo LiuShikun WangYuxuan ChenAnil K. RustgiMing YuanJianhua Huhttp://arxiv.org/abs/2504.10388v1Inferring genotype-phenotype maps using attention models2025-04-14T16:32:17ZPredicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically analyze this problem using methods based on linear regression. These methods generally assume that the genetic architecture of complex traits can be parameterized in terms of an additive model, where the effects of loci are independent, plus (in some cases) pairwise epistatic interactions between loci. However, these models struggle to analyze more complex patterns of epistasis or subtle gene-environment interactions. Recent advances in machine learning, particularly attention-based models, offer a promising alternative. Initially developed for natural language processing, attention-based models excel at capturing context-dependent interactions and have shown exceptional performance in predicting protein structure and function. Here, we apply attention-based models to quantitative genetics. We analyze the performance of this attention-based approach in predicting phenotype from genotype using simulated data across a range of models with increasing epistatic complexity, and using experimental data from a recent quantitative trait locus mapping study in budding yeast. We find that our model demonstrates superior out-of-sample predictions in epistatic regimes compared to standard methods. We also explore a more general multi-environment attention-based model to jointly analyze genotype-phenotype maps across multiple environments and show that such architectures can be used for "transfer learning" - predicting phenotypes in novel environments with limited training data.2025-04-14T16:32:17ZKrishna RijalCaroline M. HolmesSamantha PettiGautam ReddyMichael M. DesaiPankaj Mehtahttp://arxiv.org/abs/2504.10338v1Classifying Copy Number Variations Using State Space Modeling of Targeted Sequencing Data: A Case Study in Thalassemia2025-04-14T15:47:41ZThalassemia, a blood disorder and one of the most prevalent hereditary genetic disorders worldwide, is often caused by copy number variations (CNVs) in the hemoglobin genes. This disorder has incredible diversity, with a large number of distinct profiles corresponding to alterations of different regions in the genes. Correctly classifying an individual's profile is critical as it impacts treatment, prognosis, and genetic counseling. However, genetic classification is challenging due to the large number of profiles worldwide, and often requires a large number of sequential tests. Targeted next generation sequencing (NGS), which characterizes segments of an individual's genome, has the potential to dramatically reduce the cost of testing and increase accuracy. In this work, we introduce a probabilistic state space model for profiling thalassemia from targeted NGS data, which naturally characterize the spatial ordering of the genes along the chromosome. We then use decision theory to choose the best profile among the different options. Due to our use of Bayesian methodology, we are also able to detect low-quality samples to be excluded from consideration, an important component of clinical screening. We evaluate our model on a dataset of 57 individuals, including both controls and cases with a variety of thalassemia profiles. Our model has a sensitivity of 0.99 and specificity of 0.93 for thalassemia detection, and accuracy of 91.5\% for characterizing subtypes. Furthermore, the specificity and accuracy rise to $0.96$ and 93.9\% when low-quality samples are excluded using our automated quality control method. This approach outperforms alternative methods, particularly in specificity, and is broadly applicable to other disorders.2025-04-14T15:47:41ZAustin TalbotAlex KotlarLavanya RishishiwarYue Kehttp://arxiv.org/abs/2504.07881v2An LLM-Driven Multi-Agent Debate System for Mendelian Diseases2025-04-11T07:26:43ZAccurate diagnosis of Mendelian diseases is crucial for precision therapy and assistance in preimplantation genetic diagnosis. However, existing methods often fall short of clinical standards or depend on extensive datasets to build pretrained machine learning models. To address this, we introduce an innovative LLM-Driven multi-agent debate system (MD2GPS) with natural language explanations of the diagnostic results. It utilizes a language model to transform results from data-driven and knowledge-driven agents into natural language, then fostering a debate between these two specialized agents. This system has been tested on 1,185 samples across four independent datasets, enhancing the TOP1 accuracy from 42.9% to 66% on average. Additionally, in a challenging cohort of 72 cases, MD2GPS identified potential pathogenic genes in 12 patients, reducing the diagnostic time by 90%. The methods within each module of this multi-agent debate system are also replaceable, facilitating its adaptation for diagnosing and researching other complex diseases.2025-04-10T15:55:34Z21 pages, 5 figures, 1 tableXinyang ZhouYongyong RenQianqian ZhaoDaoyi HuangXinbo WangTingting ZhaoZhixing ZhuWenyuan HeShuyuan LiYan XuYu SunYongguo YuShengnan WuJian WangGuangjun YuDake HeBo BanHui Luhttp://arxiv.org/abs/2504.03976v2OLAF: An Open Life Science Analysis Framework for Conversational Bioinformatics Powered by Large Language Models2025-04-10T19:32:47ZOLAF (Open Life Science Analysis Framework) is an open-source platform that enables researchers to perform bioinformatics analyses using natural language. By combining large language models (LLMs) with a modular agent-pipe-router architecture, OLAF generates and executes bioinformatics code on real scientific data, including formats like .h5ad. The system includes an Angular front end and a Python/Firebase backend, allowing users to run analyses such as single-cell RNA-seq workflows, gene annotation, and data visualization through a simple web interface. Unlike general-purpose AI tools, OLAF integrates code execution, data handling, and scientific libraries in a reproducible, user-friendly environment. It is designed to lower the barrier to computational biology for non-programmers and support transparent, AI-powered life science research.2025-04-04T22:41:16ZDylan RiffleNima ShirooniCody HeManush MuraliSovit NayakRishikumar GopalanDiego Gonzalez Lopezhttp://arxiv.org/abs/2504.07734v1On-Chip and Off-Chip TIA Amplifiers for Nanopore Signal Readout Design, Performance and Challenges: A Review2025-04-10T13:29:08ZAdvancements in biomedical research have driven continuous innovations in sensing and diagnostic technologies. Among these, nanopore based single molecule sensing and sequencing is rapidly emerging as a powerful and versatile sensing methodology. Advancements in nanopore based approaches require concomitant improvements in the electronic readout methods employed, from the point of low noise, bandwidth and form factor. This article focuses on current sensing circuits designed and employed for ultra low noise nanopore signal readout, addressing the fundamental limitations of traditional off chip transimpedance amplifiers (TIAs), which suffer from high input parasitic capacitance, bandwidth constraints, and increased noise at high frequencies. This review explores the latest design schemes and circuit structures classified into on-chip and off-chip TIA designs, highlighting their design implementation, performance, respective challenges and explores the interplay between noise performance, capacitance, and bandwidth across diverse transimpedance amplifier (TIA) configurations. Emphasis is placed on characterizing noise response under varying parasitic capacitance and operational frequencies, a systematic evaluation not extensively addressed in prior literature while also considering the allowable input current compliance range limitations. The review also compares the widely used Axopatch 200B system to the designs reported in literature. The findings offer valuable insights into optimizing TIA designs for enhanced signal integrity in high speed and high sensitivity applications focusing on noise reduction, impedance matching, DC blocking, and offset cancellation techniques.2025-04-10T13:29:08Z35 pages , 22 figuresK. Ashoka DeepthiManoj VarmaArup Polleyhttp://arxiv.org/abs/2504.07298v1CiMBA: Accelerating Genome Sequencing through On-Device Basecalling via Compute-in-Memory2025-04-09T21:40:46ZAs genome sequencing is finding utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, consuming >40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded ($\sim25$mm$^2$) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead, is capable of a throughput of 4.77 million bases per second, 24x that required for real-time operation, and achieves 17x/27x power/area efficiency over the best prior basecalling embedded accelerator while maintaining a high accuracy comparable to state-of-the-art software basecallers.2025-04-09T21:40:46ZAccepted to IEEE Transactions on Parallel and Distributed SystemsIEEE Transactions on Parallel and Distributed Systems, pp. 1-15, 2025William Andrew SimonIrem BoybatRiselda KodraElena FerroGagandeep SinghMohammed AlserShubham JainHsinyu TsaiGeoffrey W. BurrOnur MutluAbu Sebastian10.1109/TPDS.2025.3550811http://arxiv.org/abs/2504.07065v1Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling2025-04-09T17:30:43ZThe ability to quickly and accurately identify microbial species in a sample, known as metagenomic profiling, is critical across various fields, from healthcare to environmental science. This paper introduces a novel method to profile signals coming from sequencing devices in parallel with determining their nucleotide sequences, a process known as basecalling, via a multi-objective deep neural network for simultaneous basecalling and multi-class genome classification. We introduce a new loss strategy where losses for basecalling and classification are back-propagated separately, with model weights combined for the shared layers, and a pre-configured ranking strategy allowing top-K species accuracy, giving users flexibility to choose between higher accuracy or higher speed at identifying the species. We achieve state-of-the-art basecalling accuracies, while classification accuracies meet and exceed the results of state-of-the-art binary classifiers, attaining an average of 92.5%/98.9% accuracy at identifying the top-1/3 species among a total of 17 genomes in the Wick bacterial dataset. The work presented here has implications for future studies in metagenomic profiling by accelerating the bottleneck step of matching the DNA sequence to the correct genome.2025-04-09T17:30:43ZAccepted as Tiny Paper at MLGenX workshop, ICLR, 2025Riselda KodraHadjer BenmezianeIrem BoybatWilliam Andrew Simonhttp://arxiv.org/abs/2410.21591v2Can Large Language Models Replace Data Scientists in Biomedical Research?2025-04-08T21:48:54ZData science plays a critical role in biomedical research, but it requires professionals with expertise in coding and medical data analysis. Large language models (LLMs) have shown great potential in supporting medical tasks and performing well in general coding tests. However, existing evaluations fail to assess their capability in biomedical data science, particularly in handling diverse data types such as genomics and clinical datasets. To address this gap, we developed a benchmark of data science coding tasks derived from the analyses of 39 published studies. This benchmark comprises 293 coding tasks (128 in Python and 165 in R) performed on real-world TCGA-type genomics and clinical data. Our findings reveal that the vanilla prompting of LLMs yields suboptimal performances due to drawbacks in following input instructions, understanding target data, and adhering to standard analysis practices. Next, we benchmarked six cutting-edge LLMs and advanced adaptation methods, finding two methods to be particularly effective: chain-of-thought prompting, which provides a step-by-step plan for data analysis, which led to a 21% code accuracy improvement (56.6% versus 35.3%); and self-reflection, enabling LLMs to refine the buggy code iteratively, yielding an 11% code accuracy improvement (45.5% versus 34.3%). Building on these insights, we developed a platform that integrates LLMs into the data science workflow for medical professionals. In a user study with five medical professionals, we found that while LLMs cannot fully automate programming tasks, they significantly streamline the programming process. We found that 80% of their submitted code solutions were incorporated from LLM-generated code, with up to 96% reuse in some cases. Our analysis highlights the potential of LLMs to enhance data science efficiency in biomedical research when integrated into expert workflows.2024-10-28T22:48:06ZZifeng WangBenjamin DanekZiwei YangZheng ChenJimeng Sunhttp://arxiv.org/abs/2406.15341v3GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis2025-04-08T17:09:04ZRecent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automated analysis of gene expression data. GenoTEX provides analysis code and results for solving a wide range of gene-trait association problems, encompassing dataset selection, preprocessing, and statistical analysis, in a pipeline that follows computational genomics standards. The benchmark includes expert-curated annotations from bioinformaticians to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgent, a team of LLM-based agents that adopt a multi-step programming workflow with flexible self-correction, to collaboratively analyze gene expression datasets. Our experiments demonstrate the potential of LLM-based methods in analyzing genomic data, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing automated methods for gene expression data analysis. The benchmark is available at https://github.com/Liu-Hy/GenoTEX.2024-06-21T17:55:24Z31 pages, 4 figuresHaoyang LiuShuyu ChenYe ZhangHaohan Wanghttp://arxiv.org/abs/2504.05790v1ViralQC: A Tool for Assessing Completeness and Contamination of Predicted Viral Contigs2025-04-08T08:14:44ZMotivation: Viruses represent the most abundant biological entities on the planet and play vital roles in diverse ecosystems. Cataloging viruses across various environments is essential for understanding their properties and functions. Metagenomic sequencing has emerged as the most comprehensive method for virus discovery, enabling the sequencing of all genetic materials, including viruses, from host or environmental samples. However, distinguishing viral sequences from the vast background of cellular organism-derived reads in metagenomic data remains a significant challenge. While several learning-based tools, such as VirSorter2 and geNomad, have shown promise in identifying viral contigs, they often experience varying degrees of false positive rates due to noise in sequencing and assembly, shared genes between viruses and their hosts, and the formation of proviruses within host genomes. This highlights the urgent need for an accurate and efficient method to evaluate the quality of viral contigs. Results: To address these challenges, we introduce ViralQC, a tool designed to assess the quality of reported viral contigs or bins. ViralQC identifies contamination regions within putative viral sequences using foundation models trained on viral and cellular genomes and estimates viral completeness through protein organization alignment. We evaluate ViralQC on multiple datasets and compare its performance against CheckV, the state-of-the-art in virus quality assessment. Notably, ViralQC correctly identifies 38% more contamination than CheckV, while maintaining a median absolute error of only 3%. In addition, ViralQC delivers more accurate results for medium- to high-quality (>50% completeness) contigs, demonstrating its superior performance in completeness estimation.2025-04-08T08:14:44Z16 pages, 9 figuresCheng PengJiayu ShangJiaojiao GuanYanni Sunhttp://arxiv.org/abs/2405.05734v3On the Coverage Required for Diploid Genome Assembly2025-04-07T17:22:34ZThe repeat content and heterozygosity rate of a target genome are important factors in determining the feasibility of achieving a complete telomere-to-telomere assembly. The mathematical relationship between the required coverage and read length for the purpose of unique reconstruction remains unexplored for diploid genomes. We investigate the information-theoretic conditions that the given set of sequencing reads must satisfy to achieve the complete reconstruction of the true sequence of a diploid genome. We also analyze the standard greedy and de-Bruijn graph-based assembly algorithms. Our results show that the coverage and read length requirements of the assembly algorithms are considerably higher than the lower bound because both algorithms require the double repeats in the genome to be bridged. Finally, we derive the necessary conditions for the overlap graph-based assembly paradigm.2024-05-09T12:50:18ZAccepted at ISIT'24Daanish MahajanChirag JainNavin Kashyaphttp://arxiv.org/abs/2504.03876v1Multiscale Modeling Primer: Focus on Chromatin and Epigenetics2025-04-04T19:01:17ZEssential life processes take place across multiple space and time scales in living organisms but understanding their mechanistic interactions remains an ongoing challenge. Advanced multiscale modeling techniques are providing new opportunities and insights into these complex processes. In cells, meters of chromatin are folded into a nucleus with a diameter on the order of microns. The three-dimensional chromatin structure coupled with biochemical processes that turn genes on or off, specify a given cell type through a complicated set of interactions collectively referred to as epigenetics. Important epigenetic processes include the differential accessibility of genomic loci to transcription factors and chemical modifications to DNA and DNA-binding molecules such as histones. The dynamics of these epigenetic processes span timescales from milliseconds to years. How do chemical modifications consisting of a handful of atoms cooperate to modulate genome folding at the scale of the nucleus and impact organism outcomes? In this review, we highlight the inherently multiscale nature of chromatin organization, with a focus on computational modeling to bridge the gaps in our understanding of biochemical processes across scales. We review relevant chromatin biology, including major types of epigenetic modifications as well as the higher order chromatin structures to present a multiscale view of chromatin. We also review relevant computational methods to simulate chromatin structure, function, and dynamics, as well as experimental techniques that inform and validate said models. Finally, we argue that multiscale modeling provides a path forward towards understanding emergent behavior in this inherently multiscale system.2025-04-04T19:01:17ZAchal MahajanErik J. NavarroWilliam PooleCarlos F Lopezhttp://arxiv.org/abs/2504.03550v1Dimensionality reduction for k-means clustering of large-scale influenza mutation datasets2025-04-04T15:57:48ZViral mutations pose significant threats to public health by increasing infectivity, strengthening vaccine resistance, and altering disease severity. To track these evolving patterns, agencies like the CDC annually evaluate thousands of virus strains, underscoring the urgent need to understand viral mutagenesis and evolution in depth. In this study, we integrate genomic analysis, clustering, and three leading dimensionality reduction approaches, namely, principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP)-to investigate the effects of COVID-19 on influenza virus propagation. By applying these methods to extensive pre- and post-pandemic influenza datasets, we reveal how selective pressures during the pandemic have influenced the diversity of influenza genetics. Our findings indicate that combining robust dimension reduction with clustering yields critical insights into the complex dynamics of viral mutation, informing both future research directions and strategies for public health intervention.2025-04-04T15:57:48ZEmilee WaldenJiahui ChenGuo-Wei Wei