http://arxiv.org/abs/2410.01795v2 Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models 2025-04-16T05:30:34Z

Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are utilized for this task, yet the high-dimensional nature of genotype data makes the analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs in feature selection and engineering for tabular genotype data, with a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.

2024-10-02T17:53:08Z accepted by AMIA-IS'25: AMIA Informatics Summit [Marco Ramoni Distinguished Paper Award for Translational Bioinformatics] Joseph Lee Shu Yang Jae Young Baik Xiaoxi Liu Zhen Tan Dawei Li Zixuan Wen Bojian Hou Duy Duong-Tran Tianlong Chen Li Shen http://arxiv.org/abs/2504.12353v1 TransST: Transfer Learning Embedded Spatial Factor Modeling of Spatial Transcriptomics Data 2025-04-15T22:03:38Z

Background: Spatial transcriptomics has emerged as a powerful tool in biomedical research because of its ability to capture both the spatial contexts and abundance of the complete RNA transcript profile in organs of interest. However, limitations of the technology such as the relatively low resolution and comparatively insufficient sequencing depth make it difficult to reliably extract real biological signals from these data. To alleviate this challenge, we propose a novel transfer learning framework, referred to as TransST, to adaptively leverage the cell-labeled information from external sources in inferring cell-level heterogeneity of target spatial transcriptomics data. Results: Applications in several real studies as well as a number of simulation settings show that our approach significantly improves on existing techniques. For example, in the breast cancer study, TransST successfully identifies five biologically meaningful cell clusters, including the two subgroups of cancer in situ and invasive cancer; in addition, only TransST is able to separate the adipose tissues from the connective tissues among all the studied methods. Conclusions: In summary, the proposed method TransST is both effective and robust in identifying cell subclusters and detecting corresponding driving biomarkers in spatial transcriptomics data.

2025-04-15T22:03:38Z Shuo Shuo Liu Shikun Wang Yuxuan Chen Anil K. Rustgi Ming Yuan Jianhua Hu http://arxiv.org/abs/2504.10388v1 Inferring genotype-phenotype maps using attention models 2025-04-14T16:32:17Z

Predicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically analyze this problem using methods based on linear regression. These methods generally assume that the genetic architecture of complex traits can be parameterized in terms of an additive model, where the effects of loci are independent, plus (in some cases) pairwise epistatic interactions between loci. However, these models struggle to analyze more complex patterns of epistasis or subtle gene-environment interactions. Recent advances in machine learning, particularly attention-based models, offer a promising alternative. Initially developed for natural language processing, attention-based models excel at capturing context-dependent interactions and have shown exceptional performance in predicting protein structure and function. Here, we apply attention-based models to quantitative genetics. We analyze the performance of this attention-based approach in predicting phenotype from genotype using simulated data across a range of models with increasing epistatic complexity, and using experimental data from a recent quantitative trait locus mapping study in budding yeast. We find that our model demonstrates superior out-of-sample predictions in epistatic regimes compared to standard methods. We also explore a more general multi-environment attention-based model to jointly analyze genotype-phenotype maps across multiple environments and show that such architectures can be used for "transfer learning" - predicting phenotypes in novel environments with limited training data.
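
For reference, the additive-plus-pairwise-epistasis model that the abstract describes is commonly written in the following form. The notation is assumed here rather than taken from the paper: $y$ is the phenotype, $x_i$ the genotype at locus $i$ of $L$ loci, $\beta_i$ the additive effects, $\beta_{ij}$ the pairwise epistatic coefficients, and $\varepsilon$ a noise term.

$$
y = \beta_0 + \sum_{i=1}^{L} \beta_i x_i + \sum_{1 \le i < j \le L} \beta_{ij}\, x_i x_j + \varepsilon
$$

Setting all $\beta_{ij} = 0$ recovers the purely additive model; interactions among three or more loci fall outside this parameterization, which is the regime where the abstract reports the attention-based model doing better.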

2025-04-14T16:32:17Z Krishna Rijal Caroline M. Holmes Samantha Petti Gautam Reddy Michael M. Desai Pankaj Mehta http://arxiv.org/abs/2504.10338v1 Classifying Copy Number Variations Using State Space Modeling of Targeted Sequencing Data: A Case Study in Thalassemia 2025-04-14T15:47:41Z

Thalassemia, a blood disorder and one of the most prevalent hereditary genetic disorders worldwide, is often caused by copy number variations (CNVs) in the hemoglobin genes. This disorder has incredible diversity, with a large number of distinct profiles corresponding to alterations of different regions in the genes. Correctly classifying an individual's profile is critical as it impacts treatment, prognosis, and genetic counseling. However, genetic classification is challenging due to the large number of profiles worldwide, and often requires a large number of sequential tests. Targeted next generation sequencing (NGS), which characterizes segments of an individual's genome, has the potential to dramatically reduce the cost of testing and increase accuracy. In this work, we introduce a probabilistic state space model for profiling thalassemia from targeted NGS data, which naturally characterizes the spatial ordering of the genes along the chromosome. We then use decision theory to choose the best profile among the different options. Due to our use of Bayesian methodology, we are also able to detect low-quality samples to be excluded from consideration, an important component of clinical screening. We evaluate our model on a dataset of 57 individuals, including both controls and cases with a variety of thalassemia profiles. Our model has a sensitivity of 0.99 and specificity of 0.93 for thalassemia detection, and accuracy of 91.5\% for characterizing subtypes. Furthermore, the specificity and accuracy rise to 0.96 and 93.9\% when low-quality samples are excluded using our automated quality control method. This approach outperforms alternative methods, particularly in specificity, and is broadly applicable to other disorders.
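
The sensitivity and specificity quoted above follow the standard confusion-matrix definitions. A minimal sketch of that computation, using a hypothetical helper name and toy labels that are not data from the paper:

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels: 1 = case, 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases correctly detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls wrongly flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example (illustrative only): 4 cases and 4 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

Excluding low-quality samples before scoring, as the abstract describes, would amount to filtering both lists with the same mask before calling the function.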

2025-04-14T15:47:41Z Austin Talbot Alex Kotlar Lavanya Rishishiwar Yue Ke http://arxiv.org/abs/2504.07881v2 An LLM-Driven Multi-Agent Debate System for Mendelian Diseases 2025-04-11T07:26:43Z

Accurate diagnosis of Mendelian diseases is crucial for precision therapy and assistance in preimplantation genetic diagnosis. However, existing methods often fall short of clinical standards or depend on extensive datasets to build pretrained machine learning models. To address this, we introduce an innovative LLM-driven multi-agent debate system (MD2GPS) with natural language explanations of the diagnostic results. It utilizes a language model to transform results from data-driven and knowledge-driven agents into natural language, then fosters a debate between these two specialized agents. This system has been tested on 1,185 samples across four independent datasets, enhancing the TOP1 accuracy from 42.9% to 66% on average. Additionally, in a challenging cohort of 72 cases, MD2GPS identified potential pathogenic genes in 12 patients, reducing the diagnostic time by 90%. The methods within each module of this multi-agent debate system are also replaceable, facilitating its adaptation for diagnosing and researching other complex diseases.

2025-04-10T15:55:34Z 21 pages, 5 figures, 1 table Xinyang Zhou Yongyong Ren Qianqian Zhao Daoyi Huang Xinbo Wang Tingting Zhao Zhixing Zhu Wenyuan He Shuyuan Li Yan Xu Yu Sun Yongguo Yu Shengnan Wu Jian Wang Guangjun Yu Dake He Bo Ban Hui Lu http://arxiv.org/abs/2504.03976v2 OLAF: An Open Life Science Analysis Framework for Conversational Bioinformatics Powered by Large Language Models 2025-04-10T19:32:47Z

OLAF (Open Life Science Analysis Framework) is an open-source platform that enables researchers to perform bioinformatics analyses using natural language. By combining large language models (LLMs) with a modular agent-pipe-router architecture, OLAF generates and executes bioinformatics code on real scientific data, including formats like .h5ad. The system includes an Angular front end and a Python/Firebase backend, allowing users to run analyses such as single-cell RNA-seq workflows, gene annotation, and data visualization through a simple web interface. Unlike general-purpose AI tools, OLAF integrates code execution, data handling, and scientific libraries in a reproducible, user-friendly environment. It is designed to lower the barrier to computational biology for non-programmers and support transparent, AI-powered life science research.

2025-04-04T22:41:16Z Dylan Riffle Nima Shirooni Cody He Manush Murali Sovit Nayak Rishikumar Gopalan Diego Gonzalez Lopez http://arxiv.org/abs/2504.07734v1 On-Chip and Off-Chip TIA Amplifiers for Nanopore Signal Readout Design, Performance and Challenges: A Review 2025-04-10T13:29:08Z

Advancements in biomedical research have driven continuous innovations in sensing and diagnostic technologies. Among these, nanopore-based single-molecule sensing and sequencing is rapidly emerging as a powerful and versatile sensing methodology. Advancements in nanopore-based approaches require concomitant improvements in the electronic readout methods employed, in terms of noise, bandwidth, and form factor. This article focuses on current-sensing circuits designed and employed for ultra-low-noise nanopore signal readout, addressing the fundamental limitations of traditional off-chip transimpedance amplifiers (TIAs), which suffer from high input parasitic capacitance, bandwidth constraints, and increased noise at high frequencies. This review explores the latest design schemes and circuit structures, classified into on-chip and off-chip TIA designs, highlighting their design implementation, performance, and respective challenges, and examines the interplay between noise performance, capacitance, and bandwidth across diverse TIA configurations. Emphasis is placed on characterizing noise response under varying parasitic capacitance and operational frequencies, a systematic evaluation not extensively addressed in prior literature, while also considering the limits on the allowable input current compliance range. The review also compares the widely used Axopatch 200B system to the designs reported in the literature. The findings offer valuable insights into optimizing TIA designs for enhanced signal integrity in high-speed and high-sensitivity applications, focusing on noise reduction, impedance matching, DC blocking, and offset cancellation techniques.

2025-04-10T13:29:08Z 35 pages, 22 figures K. Ashoka Deepthi Manoj Varma Arup Polley http://arxiv.org/abs/2504.07298v1 CiMBA: Accelerating Genome Sequencing through On-Device Basecalling via Compute-in-Memory 2025-04-09T21:40:46Z

As genome sequencing is finding utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, which consumes >40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded ($\sim25$mm$^2$) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog-focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead, is capable of a throughput of 4.77 million bases per second, 24x that required for real-time operation, and achieves 17x/27x power/area efficiency over the best prior basecalling embedded accelerator while maintaining a high accuracy comparable to state-of-the-art software basecallers.
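
The two throughput figures in the abstract imply a real-time requirement that is not stated explicitly. A small sketch of that arithmetic, with a hypothetical helper name; the result is derived from the abstract's numbers, not reported by the authors:

```python
def implied_realtime_rate(throughput_bases_per_s, speedup_over_realtime):
    """Throughput needed for real-time operation, given the reported speedup factor."""
    return throughput_bases_per_s / speedup_over_realtime


# 4.77 million bases/s at 24x the real-time requirement (figures from the abstract).
rate = implied_realtime_rate(4.77e6, 24)
print(f"{rate:,.0f} bases/s")  # prints "198,750 bases/s"
```

That is, the abstract's numbers correspond to a real-time requirement of roughly 0.2 million bases per second.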

2025-04-09T21:40:46Z Accepted to IEEE Transactions on Parallel and Distributed Systems IEEE Transactions on Parallel and Distributed Systems, pp. 1-15, 2025 William Andrew Simon Irem Boybat Riselda Kodra Elena Ferro Gagandeep Singh Mohammed Alser Shubham Jain Hsinyu Tsai Geoffrey W. Burr Onur Mutlu Abu Sebastian 10.1109/TPDS.2025.3550811 http://arxiv.org/abs/2504.07065v1 Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling 2025-04-09T17:30:43Z

The ability to quickly and accurately identify microbial species in a sample, known as metagenomic profiling, is critical across various fields, from healthcare to environmental science. This paper introduces a novel method to profile signals coming from sequencing devices in parallel with determining their nucleotide sequences, a process known as basecalling, via a multi-objective deep neural network for simultaneous basecalling and multi-class genome classification. We introduce a new loss strategy where losses for basecalling and classification are back-propagated separately, with model weights combined for the shared layers, and a pre-configured ranking strategy allowing top-K species accuracy, giving users flexibility to choose between higher accuracy or higher speed at identifying the species. We achieve state-of-the-art basecalling accuracies, while classification accuracies meet and exceed the results of state-of-the-art binary classifiers, attaining an average of 92.5%/98.9% accuracy at identifying the top-1/3 species among a total of 17 genomes in the Wick bacterial dataset. The work presented here has implications for future studies in metagenomic profiling by accelerating the bottleneck step of matching the DNA sequence to the correct genome.
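
Top-K accuracy, as used for the 92.5%/98.9% top-1/3 figures, counts a prediction as correct when the true class is among the K highest-scoring classes. A minimal sketch under that standard definition; the function name and the toy scores are illustrative assumptions, not the paper's ranking strategy or data:

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy per-class scores for four reads over four candidate genomes.
scores = [
    [0.7, 0.2, 0.1, 0.0],  # true class 0 ranked 1st
    [0.3, 0.4, 0.2, 0.1],  # true class 0 ranked 2nd
    [0.1, 0.2, 0.3, 0.4],  # true class 0 ranked 4th
    [0.1, 0.6, 0.2, 0.1],  # true class 1 ranked 1st
]
labels = [0, 0, 0, 1]
top1 = top_k_accuracy(scores, labels, 1)
top3 = top_k_accuracy(scores, labels, 3)
print(top1, top3)  # prints "0.5 0.75"
```

A larger K can only keep or raise the accuracy, since the top-K set always contains the top-1 class.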

2025-04-09T17:30:43Z Accepted as Tiny Paper at MLGenX workshop, ICLR, 2025 Riselda Kodra Hadjer Benmeziane Irem Boybat William Andrew Simon http://arxiv.org/abs/2410.21591v2 Can Large Language Models Replace Data Scientists in Biomedical Research? 2025-04-08T21:48:54Z

Data science plays a critical role in biomedical research, but it requires professionals with expertise in coding and medical data analysis. Large language models (LLMs) have shown great potential in supporting medical tasks and performing well in general coding tests. However, existing evaluations fail to assess their capability in biomedical data science, particularly in handling diverse data types such as genomics and clinical datasets. To address this gap, we developed a benchmark of data science coding tasks derived from the analyses of 39 published studies. This benchmark comprises 293 coding tasks (128 in Python and 165 in R) performed on real-world TCGA-type genomics and clinical data. Our findings reveal that the vanilla prompting of LLMs yields suboptimal performance due to drawbacks in following input instructions, understanding target data, and adhering to standard analysis practices. Next, we benchmarked six cutting-edge LLMs and advanced adaptation methods, finding two methods to be particularly effective: chain-of-thought prompting, which provides a step-by-step plan for data analysis and led to a 21% code accuracy improvement (56.6% versus 35.3%); and self-reflection, which enables LLMs to refine buggy code iteratively and yielded an 11% code accuracy improvement (45.5% versus 34.3%). Building on these insights, we developed a platform that integrates LLMs into the data science workflow for medical professionals. In a user study with five medical professionals, we found that while LLMs cannot fully automate programming tasks, they significantly streamline the programming process. We found that 80% of their submitted code solutions were incorporated from LLM-generated code, with up to 96% reuse in some cases. Our analysis highlights the potential of LLMs to enhance data science efficiency in biomedical research when integrated into expert workflows.
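
The quoted 21% and 11% gains are consistent with absolute differences in accuracy (percentage points) between the reported pairs. A short check of that reading, with a hypothetical helper name; the relative gains it also computes are not reported in the abstract and are shown only for contrast:

```python
def accuracy_gain(baseline_pct, improved_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved_pct - baseline_pct
    relative = 100.0 * absolute / baseline_pct
    return round(absolute, 1), round(relative, 1)


# Accuracy pairs quoted in the abstract.
cot_gain = accuracy_gain(35.3, 56.6)         # chain-of-thought prompting
reflection_gain = accuracy_gain(34.3, 45.5)  # self-reflection
print(cot_gain, reflection_gain)
```

The absolute gains come out at about 21.3 and 11.2 points, matching the rounded figures in the abstract.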

2024-10-28T22:48:06Z Zifeng Wang Benjamin Danek Ziwei Yang Zheng Chen Jimeng Sun http://arxiv.org/abs/2406.15341v3 GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis 2025-04-08T17:09:04Z

Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automated analysis of gene expression data. GenoTEX provides analysis code and results for solving a wide range of gene-trait association problems, encompassing dataset selection, preprocessing, and statistical analysis, in a pipeline that follows computational genomics standards. The benchmark includes expert-curated annotations from bioinformaticians to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgent, a team of LLM-based agents that adopt a multi-step programming workflow with flexible self-correction, to collaboratively analyze gene expression datasets. Our experiments demonstrate the potential of LLM-based methods in analyzing genomic data, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing automated methods for gene expression data analysis. The benchmark is available at https://github.com/Liu-Hy/GenoTEX.

2024-06-21T17:55:24Z 31 pages, 4 figures Haoyang Liu Shuyu Chen Ye Zhang Haohan Wang http://arxiv.org/abs/2504.05790v1 ViralQC: A Tool for Assessing Completeness and Contamination of Predicted Viral Contigs 2025-04-08T08:14:44Z

Motivation: Viruses represent the most abundant biological entities on the planet and play vital roles in diverse ecosystems. Cataloging viruses across various environments is essential for understanding their properties and functions. Metagenomic sequencing has emerged as the most comprehensive method for virus discovery, enabling the sequencing of all genetic materials, including viruses, from host or environmental samples. However, distinguishing viral sequences from the vast background of cellular organism-derived reads in metagenomic data remains a significant challenge. While several learning-based tools, such as VirSorter2 and geNomad, have shown promise in identifying viral contigs, they often exhibit varying false positive rates due to noise in sequencing and assembly, shared genes between viruses and their hosts, and the formation of proviruses within host genomes. This highlights the urgent need for an accurate and efficient method to evaluate the quality of viral contigs. Results: To address these challenges, we introduce ViralQC, a tool designed to assess the quality of reported viral contigs or bins. ViralQC identifies contamination regions within putative viral sequences using foundation models trained on viral and cellular genomes and estimates viral completeness through protein organization alignment. We evaluate ViralQC on multiple datasets and compare its performance against CheckV, the state-of-the-art in virus quality assessment. Notably, ViralQC correctly identifies 38% more contamination than CheckV, while maintaining a median absolute error of only 3%. In addition, ViralQC delivers more accurate results for medium- to high-quality (>50% completeness) contigs, demonstrating its superior performance in completeness estimation.

2025-04-08T08:14:44Z 16 pages, 9 figures Cheng Peng Jiayu Shang Jiaojiao Guan Yanni Sun http://arxiv.org/abs/2405.05734v3 On the Coverage Required for Diploid Genome Assembly 2025-04-07T17:22:34Z

The repeat content and heterozygosity rate of a target genome are important factors in determining the feasibility of achieving a complete telomere-to-telomere assembly. The mathematical relationship between the required coverage and read length for the purpose of unique reconstruction remains unexplored for diploid genomes. We investigate the information-theoretic conditions that the given set of sequencing reads must satisfy to achieve the complete reconstruction of the true sequence of a diploid genome. We also analyze the standard greedy and de Bruijn graph-based assembly algorithms. Our results show that the coverage and read length requirements of the assembly algorithms are considerably higher than the lower bound because both algorithms require the double repeats in the genome to be bridged. Finally, we derive the necessary conditions for the overlap graph-based assembly paradigm.

2024-05-09T12:50:18Z Accepted at ISIT'24 Daanish Mahajan Chirag Jain Navin Kashyap http://arxiv.org/abs/2504.03876v1 Multiscale Modeling Primer: Focus on Chromatin and Epigenetics 2025-04-04T19:01:17Z

Essential life processes take place across multiple space and time scales in living organisms, but understanding their mechanistic interactions remains an ongoing challenge. Advanced multiscale modeling techniques are providing new opportunities and insights into these complex processes. In cells, meters of chromatin are folded into a nucleus with a diameter on the order of microns. The three-dimensional chromatin structure, coupled with biochemical processes that turn genes on or off, specifies a given cell type through a complicated set of interactions collectively referred to as epigenetics. Important epigenetic processes include the differential accessibility of genomic loci to transcription factors and chemical modifications to DNA and DNA-binding molecules such as histones. The dynamics of these epigenetic processes span timescales from milliseconds to years. How do chemical modifications consisting of a handful of atoms cooperate to modulate genome folding at the scale of the nucleus and impact organism outcomes? In this review, we highlight the inherently multiscale nature of chromatin organization, with a focus on computational modeling to bridge the gaps in our understanding of biochemical processes across scales. We review relevant chromatin biology, including major types of epigenetic modifications as well as the higher order chromatin structures, to present a multiscale view of chromatin. We also review relevant computational methods to simulate chromatin structure, function, and dynamics, as well as experimental techniques that inform and validate said models. Finally, we argue that multiscale modeling provides a path forward towards understanding emergent behavior in this inherently multiscale system.

2025-04-04T19:01:17Z Achal Mahajan Erik J. Navarro William Poole Carlos F Lopez http://arxiv.org/abs/2504.03550v1 Dimensionality reduction for k-means clustering of large-scale influenza mutation datasets 2025-04-04T15:57:48Z

Viral mutations pose significant threats to public health by increasing infectivity, strengthening vaccine resistance, and altering disease severity. To track these evolving patterns, agencies like the CDC annually evaluate thousands of virus strains, underscoring the urgent need to understand viral mutagenesis and evolution in depth. In this study, we integrate genomic analysis, clustering, and three leading dimensionality reduction approaches, namely principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP), to investigate the effects of COVID-19 on influenza virus propagation. By applying these methods to extensive pre- and post-pandemic influenza datasets, we reveal how selective pressures during the pandemic have influenced the diversity of influenza genetics. Our findings indicate that combining robust dimension reduction with clustering yields critical insights into the complex dynamics of viral mutation, informing both future research directions and strategies for public health intervention.
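
The general pattern named in the title, reducing dimensionality and then running k-means on the reduced coordinates, can be sketched as follows. This is an illustration under stated assumptions and not the authors' pipeline: it covers only PCA (via SVD), uses a plain Lloyd's k-means with a farthest-first initialization chosen here for determinism, and runs on synthetic data rather than influenza sequences. All function names are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np


def pca_project(X, n_components):
    """Project rows of X onto the top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)  # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def farthest_first_init(X, k):
    """Deterministic initialization: start at X[0], then add the farthest point each time."""
    centers = [X[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dists.argmax()])
    return np.array(centers)


def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm: alternate nearest-center assignment and center updates."""
    centers = farthest_first_init(X, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Synthetic stand-in for encoded mutation data: two separated groups in 20 dimensions.
rng = np.random.default_rng(42)
group_a = rng.normal(0.0, 0.1, size=(30, 20))
group_b = rng.normal(3.0, 0.1, size=(30, 20))
X = np.vstack([group_a, group_b])

Z = pca_project(X, n_components=2)  # 20 dimensions reduced to 2
labels, centers = kmeans(Z, k=2)
print(Z.shape, labels)
```

Clustering in the reduced space keeps the per-iteration distance computations cheap for large datasets. t-SNE or UMAP embeddings could be passed to the same `kmeans` function in place of `Z`.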

2025-04-04T15:57:48Z Emilee Walden Jiahui Chen Guo-Wei Wei