http://arxiv.org/abs/2505.24394v1 Refining Platelet Purification Methods: Enhancing Proteomics for Clinical Applications 2025-05-30T09:24:18Z

Background: Platelet proteomics offers valuable insights for clinical research, yet isolating high-purity platelets remains a challenge. Current methods often lead to contamination or platelet loss, compromising data quality and reproducibility. Objectives: This study aimed to optimize a platelet isolation technique that yields high-purity samples with minimal loss and to identify the most effective mass spectrometry-based proteomic method for analyzing platelet proteins with optimal coverage and sensitivity. Methods: We refined an isolation protocol by adjusting centrifugation time to reduce blood volume requirements while preserving platelet yield and purity. Using this optimized method, we evaluated three proteomic approaches: Label-free Quantification with Data-Independent Acquisition (LFQ-DIA), Label-free Quantification with Data-Dependent Acquisition (LFQ-DDA), and Tandem Mass Tag labeling with DDA (TMT-DDA). Results: LFQ-DIA demonstrated superior protein coverage and sensitivity compared to LFQ-DDA and TMT-DDA. The refined isolation protocol effectively minimized contamination and platelet loss. Additionally, age-related differences in platelet protein composition were observed, highlighting the importance of using age-matched controls in biomarker discovery studies. Conclusions: The optimized platelet isolation protocol provides a cost-effective and reliable method for preparing high-purity samples for proteomics. LFQ-DIA is the most suitable approach for comprehensive platelet protein analysis. Age-related variation in platelet proteomes underscores the need for demographic matching in clinical proteomic research.

2025-05-30T09:24:18Z Vibecke Markhus Katarina Fritz-Wallace Olav Mjaavatten Einar K. Kristoffersen Dorota Goplen Frode Selheim http://arxiv.org/abs/2506.00082v1 An AI-powered Knowledge Hub for Potato Functional Genomics 2025-05-30T03:09:59Z

Potato functional genomics lags due to unsystematic gene information curation, gene identifier inconsistencies across reference genome versions, and the increasing volume of research publications. To address these limitations, we developed the Potato Knowledge Hub (http://www.potato-ai.top), leveraging Large Language Models (LLMs) and a systematically curated collection of over 3,200 high-quality potato research papers spanning over 120 years. This platform integrates two key modules: a functional gene database containing 2,571 literature-reported genes, meticulously mapped to the latest DMv8.1 reference genome with resolved nomenclature discrepancies and links to original publications; and a potato knowledge base. The knowledge base, built using a Retrieval-Augmented Generation (RAG) architecture, accurately answers research queries with literature citations, mitigating LLM "hallucination." Users can interact with the hub via a natural language AI agent, "Potato Research Assistant," for querying specialized knowledge, retrieving gene information, and extracting sequences. The continuously updated Potato Knowledge Hub aims to be a comprehensive resource, fostering advancements in potato functional genomics and supporting breeding programs.

2025-05-30T03:09:59Z 11 pages, 4 figures Jia Yuxin Li Jinye Jia Yudong Li Futing Su Xiaoqi Luo Jilin Dong Yarui Sun Chunyan Cui Qinghan Wang Li Li Axiu Shang Yi Zhu Yujuan Huang Sanwen http://arxiv.org/abs/2402.01942v5 Pairwise Rearrangement is Fixed-Parameter Tractable in the Single Cut-and-Join Model 2025-05-28T19:23:15Z

Genome rearrangement is a common model for molecular evolution. In this paper, we consider the Pairwise Rearrangement problem, which takes as input two genomes and asks for the number of minimum-length sequences of permissible operations transforming the first genome into the second. In the Single Cut-and-Join model (Bergeron, Medvedev, & Stoye, J. Comput. Biol. 2010), Pairwise Rearrangement is $\#\textsf{P}$-complete (Bailey et al., COCOON 2023), which implies that exact sampling is intractable. To cope with this intractability, we investigate the parameterized complexity of this problem. We exhibit a fixed-parameter tractable algorithm with respect to the number of components in the adjacency graph that are not cycles of length $2$ or paths of length $1$. As a consequence, we obtain that Pairwise Rearrangement in the Single Cut-and-Join model is fixed-parameter tractable by distance. Our results suggest that the number of nontrivial components in the adjacency graph serves as the key obstacle for efficient sampling.

2024-02-02T22:36:21Z Full version of paper that appeared in SWAT 2024; arXiv admin note: text overlap with arXiv:2305.01851 Lora Bailey Heather Smith Blake Garner Cochran Nathan Fox Michael Levet Reem Mahmoud Inne Singgih Grace Stadnyk Alexander Wiedemann http://arxiv.org/abs/2505.23839v1 GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance 2025-05-28T13:58:32Z

DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Foundation Models have achieved success in designing synthetic functional DNA sequences, even whole genomes, but their susceptibility to jailbreaking remains underexplored, raising concerns that they could generate harmful sequences such as pathogens or toxin-producing genes. In this paper, we introduce GeneBreaker, the first framework to systematically evaluate jailbreak vulnerabilities of DNA foundation models. GeneBreaker employs (1) an LLM agent with customized bioinformatic tools to design high-homology, non-pathogenic jailbreaking prompts, (2) beam search guided by PathoLM and log-probability heuristics to steer generation toward pathogen-like sequences, and (3) a BLAST-based evaluation pipeline against a curated Human Pathogen Database (JailbreakDNABench) to detect successful jailbreaks. Evaluated on our JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60% Attack Success Rate for Evo2-40B). Further case studies on SARS-CoV-2 spike protein and HIV-1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS-CoV-2 underscores biosecurity risks. Our findings also reveal that scaling DNA foundation models amplifies dual-use risks, motivating enhanced safety alignment and tracing mechanisms. Our code is at https://github.com/zaixizhang/GeneBreaker.

2025-05-28T13:58:32Z Zaixi Zhang Zhenghong Zhou Ruofan Jin Le Cong Mengdi Wang http://arxiv.org/abs/2505.20836v1 HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling 2025-05-27T07:57:35Z

Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and fine-tuning has also achieved remarkable progress in the field of DNA sequence modeling. However, previous methods often relied on massive pre-training data or large-scale base models with huge parameter counts, imposing a significant computational burden. To address this, many works attempted to use more compact models to achieve similar outcomes but still fell short by a considerable margin. In this work, we propose a Hybrid Architecture Distillation (HAD) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training. Specifically, we employ the NTv2-500M as the teacher model and devise a grouping masking strategy to align the feature embeddings of visible tokens while concurrently reconstructing the invisible tokens during MLM pre-training. To validate the effectiveness of our proposed method, we conducted comprehensive experiments on the Nucleotide Transformer Benchmark and Genomic Benchmark. Compared to models with similar parameter counts, our model achieved excellent performance. More surprisingly, on some sub-tasks it even surpassed the teacher model, its nominal distillation ceiling, although the teacher is more than 500$\times$ larger. Lastly, we utilize t-SNE for more intuitive visualization, which shows that our model can gain a sophisticated understanding of the intrinsic representation pattern in genomic sequences.

2025-05-27T07:57:35Z Hexiong Yang Mingrui Chen Huaibo Huang Junxian Duan Jie Cao Zhen Zhou Ran He http://arxiv.org/abs/2505.20578v1 Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL 2025-05-26T23:27:50Z

Designing regulatory DNA sequences that achieve precise cell-type-specific gene expression is crucial for advancements in synthetic biology, gene therapy and precision medicine. Although transformer-based language models (LMs) can effectively capture patterns in regulatory DNA, their generative approaches often struggle to produce novel sequences with reliable cell-specific activity. Here, we introduce Ctrl-DNA, a novel constrained reinforcement learning (RL) framework tailored for designing regulatory DNA sequences with controllable cell-type specificity. By formulating regulatory sequence design as a biologically informed constrained optimization problem, we apply RL to autoregressive genomic LMs, enabling the models to iteratively refine sequences that maximize regulatory activity in targeted cell types while constraining off-target effects. Our evaluation on human promoters and enhancers demonstrates that Ctrl-DNA consistently outperforms existing generative and RL-based approaches, generating high-fitness regulatory sequences and achieving state-of-the-art cell-type specificity. Moreover, Ctrl-DNA-generated sequences capture key cell-type-specific transcription factor binding sites (TFBS), short DNA motifs recognized by regulatory proteins that control gene expression, demonstrating the biological plausibility of the generated sequences.

2025-05-26T23:27:50Z 9 pages, 3 figures Xingyu Chen Shihao Ma Runsheng Lin Jiecong Lin Bo Wang http://arxiv.org/abs/2505.19461v1 Fluctuations in DNA Packing Density Drive the Spatial Segregation between Euchromatin and Heterochromatin 2025-05-26T03:34:57Z

In the crowded eukaryotic nucleus, euchromatin and heterochromatin segregate into distinct compartments, a phenomenon often attributed to homotypic interactions mediated by liquid-liquid phase separation of chromatin-associated proteins. Here, we revisit genome compartmentalization by examining the role of in vivo DNA packing density fluctuations driven by ATP-dependent chromatin remodelers. Leveraging DNA accessibility data, we develop a polymer-based model that captures these fluctuations and successfully reproduces genome-wide compartment patterns observed in Hi-C data, without invoking homotypic interactions. Further analysis reveals that density fluctuations in a crowded nuclear environment elevate the system energy, while euchromatin-heterochromatin segregation facilitates energy dissipation, offering a thermodynamic advantage for spontaneous compartment formation. These findings suggest that euchromatin-heterochromatin segregation may arise through a non-equilibrium, self-organizing process, providing new insights into genome organization.

2025-05-26T03:34:57Z Luming Meng Boping Liu Qiong Luo http://arxiv.org/abs/2505.20344v1 Genetic Influences on Brain Aging: Analyzing Sex Differences in the UK Biobank using Structural MRI 2025-05-25T03:59:00Z

Brain aging trajectories differ between males and females, yet the genetic factors underlying these differences remain underexplored. Using structural MRI and genotyping data from 40,940 UK Biobank participants (aged 45-83), we computed Brain Age Gap Estimates (BrainAGE) for total brain, hippocampal, and ventricular volumes. We conducted sex-stratified genome-wide association studies (GWAS) and post-GWAS analyses to identify genetic variants associated with accelerated brain aging. Distinct gene sets emerged by sex: in females, neurotransmitter transport and mitochondrial stress response genes were implicated; in males, immune and inflammation-related genes dominated. Shared genes, including GMNC and OSTN, were consistently linked to brain volumes across sexes, suggesting core roles in neurostructural maintenance. Tissue expression analyses revealed sex-specific enrichment in pathways tied to neurodegeneration. These findings highlight the importance of sex-stratified approaches in aging research and suggest genetic targets for personalized interventions against age-related cognitive decline.
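
The brain age gap mentioned above can be illustrated with a minimal sketch. It assumes the commonly used definition (model-predicted age minus chronological age); the abstract does not state the prediction model or any bias correction, and the function name and participant values below are hypothetical.

```python
def brain_age_gap(predicted_age, chronological_age):
    """Return the gap in years (assumed definition: predicted minus chronological).

    A positive value means the brain appears older than the participant's age.
    """
    return predicted_age - chronological_age


# Hypothetical participants: (model-predicted age, chronological age).
participants = [(68.5, 62.0), (55.0, 58.0)]
gaps = [brain_age_gap(pred, chrono) for pred, chrono in participants]
print(gaps)
```

Under this definition, a gap computed per structure (total brain, hippocampus, ventricles) would be the kind of trait that a GWAS could then be run against.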

2025-05-25T03:59:00Z 7 pages, 5 figures, conference International Society for Magnetic Resonance in Medicine, ISMRM Annual Meeting, May 2025 Karen Ardila Aashka Mohite Abdoljalil Addeh Amanda V. Tyndall Cindy K. Barha Quan Long M. Ethan MacDonald http://arxiv.org/abs/2506.05361v1 Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching 2025-05-25T01:29:19Z

Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior works sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.
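
For readers unfamiliar with flow matching, the training target can be sketched as follows. This is a generic illustration assuming a straight-line path between a noise sample and a data sample, which is one common choice; it is not taken from STFlow, whose path, conditioning, and encoder are not described in the abstract. All function names are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus its velocity.

    Assumed path: x_t = (1 - t) * x0 + t * x1, so the target velocity is x1 - x0.
    """
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a predicted and the target velocity."""
    xt, target = flow_matching_pair(x0, x1, t)
    predicted = predict_velocity(xt, t)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


# Toy example: x0 plays the role of noise, x1 of a gene-expression vector.
noise, expression = [0.0, 0.0], [2.0, 4.0]
xt, target = flow_matching_pair(noise, expression, 0.5)
loss = flow_matching_loss(lambda x, t: [2.0, 4.0], noise, expression, 0.5)
```

A model trained this way generates samples by integrating the learned velocity field from noise toward data; modeling a whole slide jointly would mean the vectors cover all spots at once rather than one spot at a time.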

2025-05-25T01:29:19Z Accepted at ICML 2025 Tinglin Huang Tianyu Liu Mehrtash Babadi Wengong Jin Rex Ying http://arxiv.org/abs/2402.14527v2 Federated Learning in Genetics: Extended Analysis of Accuracy, Performance and Privacy Trade-offs 2025-05-23T08:56:07Z

Machine learning on large-scale genomic or transcriptomic data is important for many novel health applications. For example, precision medicine tailors medical treatments to patients on the basis of individual biomarkers, cellular and molecular states, etc. However, the data required is sensitive, voluminous, heterogeneous, and typically distributed across locations where dedicated machine learning hardware is not available. Due to privacy and regulatory reasons, it is also problematic to aggregate all data at a trusted third party. Federated learning is a promising solution to this dilemma, because it enables decentralized, collaborative machine learning without exchanging raw data. In this paper, we perform comparative experiments with the federated learning frameworks TensorFlow Federated and Flower. Our test case is the training of disease prognosis and cell type classification models. We train the models with distributed transcriptomic data, considering both data heterogeneity and architectural heterogeneity. We measure model quality, robustness against privacy-enhancing noise and computational performance. We evaluate the resource overhead of a federated system from both client and global perspectives and assess benefits and limitations. Each of the federated learning frameworks has different strengths. However, our experiments confirm that both frameworks can readily build models on transcriptomic data, without transferring personal raw data to a third party with abundant computational resources. This paper is the extended version of https://link.springer.com/chapter/10.1007/978-3-031-63772-8_26.
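
The core idea of training without exchanging raw data can be sketched with a sample-weighted parameter average. This assumes federated averaging as the aggregation rule, which the abstract does not name, and it deliberately avoids the TensorFlow Federated and Flower APIs; the function and variable names are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts.

    Only model parameters are shared; the clients' raw data never leave them.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client parameter vector")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for k in range(dim):
            averaged[k] += weights[k] * size / total
    return averaged


# Two hypothetical sites with 1 and 3 local samples respectively.
global_weights = federated_average([[1.0, 0.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

In a real round, each client would first train locally, send its updated parameters, and receive the averaged vector back before the next round.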

2024-02-22T13:21:26Z This paper is the extended version of https://link.springer.com/chapter/10.1007/978-3-031-63772-8_26 Anika Hannemann Jan Ewald Leo Seeger Erik Buchmann http://arxiv.org/abs/2410.19236v4 SHAP zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries 2025-05-22T14:58:50Z

The growing adoption of machine learning models for biological sequences has intensified the need for interpretable predictions, with Shapley values emerging as a theoretically grounded standard for model explanation. While effective for local explanations of individual input sequences, scaling Shapley-based interpretability to extract global biological insights requires evaluating thousands of sequences--incurring exponential computational cost per query. We introduce SHAP zero, a novel algorithm that amortizes the cost of Shapley value computation across large-scale biological datasets. After a one-time model sketching step, SHAP zero enables near-zero marginal cost for future queries by uncovering an underexplored connection between Shapley values, high-order feature interactions, and the sparse Fourier transform of the model. Applied to models of guide RNA efficacy, DNA repair outcomes, and protein fitness, SHAP zero explains predictions orders of magnitude faster than existing methods, recovering rich combinatorial interactions previously inaccessible at scale. This work opens the door to principled, efficient, and scalable interpretability for black-box sequence models in biology.
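
The exponential per-query cost mentioned above is visible in the textbook Shapley computation, sketched here by brute-force subset enumeration. This is not the SHAP zero algorithm; the toy value function and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value, n):
    """Exact Shapley values for a set function over features 0..n-1.

    `value` maps a frozenset of feature indices to a number. Each feature
    requires visiting all 2**(n-1) subsets of the other features.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(s):
    """Hypothetical model: two additive effects plus one pairwise interaction."""
    total = 0.0
    if 0 in s:
        total += 1.0
    if 1 in s:
        total += 2.0
    if 0 in s and 1 in s:
        total += 4.0  # interaction between positions 0 and 1
    return total


phi = exact_shapley(toy_value, 3)
print(phi)
```

The interaction term is split evenly between positions 0 and 1, and the values sum to the difference between the full and empty coalitions. Repeating this enumeration for thousands of sequences is what makes global analysis expensive.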

2024-10-25T00:58:31Z Darin Tsui Aryan Musharaf Yigit Efe Erginbas Justin Singh Kang Amirali Aghazadeh http://arxiv.org/abs/2505.16680v1 Learning Genomic Structure from $k$-mers 2025-05-22T13:46:18Z

Sequencing a genome to determine an individual's DNA produces an enormous number of short nucleotide subsequences known as reads, which must be reassembled to reconstruct the full genome. We present a method for analyzing this type of data using contrastive learning, in which an encoder model is trained to produce embeddings that cluster together sequences from the same genomic region. The sequential nature of genomic regions is preserved in the form of trajectories through this embedding space. Trained solely to reflect the structure of the genome, the resulting model provides a general representation of $k$-mer sequences, suitable for a range of downstream tasks involving read data. We apply our framework to learn the structure of the $E.\ coli$ genome, and demonstrate its use in simulated ancient DNA (aDNA) read mapping and identification of structural variations. Furthermore, we illustrate the potential of using this type of model for metagenomic species identification. We show how incorporating a domain-specific noise model can enhance embedding robustness, and how a supervised contrastive learning setting can be adopted when a linear reference genome is available, by introducing a distance thresholding parameter $\Gamma$. The model can also be trained fully self-supervised on read data, enabling analysis without the need to construct a full genome assembly using specialized algorithms. Small prediction heads based on a pre-trained embedding are shown to perform on par with BWA-aln, the current gold standard approach for aDNA mapping, in terms of accuracy and runtime for short genomes. Given the method's favorable scaling properties with respect to total genome size, inference using our approach is highly promising for metagenomic applications and for mapping to genomes comparable in size to the human genome.
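
The supervised setting can be illustrated with a small sketch that extracts $k$-mers from a reference and labels pairs as positives when their start positions lie within a distance threshold. Treating "within the threshold" as the positive-pair rule is an assumed reading of the thresholding parameter, not a detail given in the abstract; the function names and the toy sequence are hypothetical.

```python
def kmers(sequence, k):
    """Return (start position, k-mer) for every overlapping k-mer."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def positive_pairs(positions, gamma):
    """Index pairs whose reference positions differ by at most gamma.

    Assumption: such pairs are treated as 'same genomic region' examples
    for contrastive training; all other pairs would act as negatives.
    """
    return [
        (a, b)
        for a in range(len(positions))
        for b in range(a + 1, len(positions))
        if abs(positions[a] - positions[b]) <= gamma
    ]


reference = "ACGTACGGT"  # toy reference sequence
indexed = kmers(reference, 4)
pairs = positive_pairs([pos for pos, _ in indexed], gamma=2)
print(len(indexed), len(pairs))
```

An encoder trained on such labels would be pushed to place nearby $k$-mers close together, which is consistent with the trajectory structure the abstract describes.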

2025-05-22T13:46:18Z Filip Thor Carl Nettelblad http://arxiv.org/abs/2505.15866v1 Multi-omic Causal Discovery using Genotypes and Gene Expression 2025-05-21T11:52:23Z

Causal discovery in multi-omic datasets is crucial for understanding the bigger picture of gene regulatory mechanisms, but remains challenging due to high dimensionality, differentiation of direct from indirect relationships, and hidden confounders. We introduce GENESIS (GEne Network inference from Expression SIgnals and SNPs), a constraint-based algorithm that leverages the natural causal precedence of genotypes to infer ancestral relationships in transcriptomic data. Unlike traditional causal discovery methods that start with a fully connected graph, GENESIS initialises an empty ancestrality matrix and iteratively populates it with direct, indirect or non-causal relationships using a series of provably sound marginal and conditional independence tests. By integrating genotypes as fixed causal anchors, GENESIS provides a principled "head start" to classical causal discovery algorithms, restricting the search space to biologically plausible edges. We test GENESIS on synthetic and real-world genomic datasets. This framework offers a powerful avenue for uncovering causal pathways in complex traits, with promising applications to functional genomics, drug discovery, and precision medicine.

2025-05-21T11:52:23Z Stephen Asiedu David Watson http://arxiv.org/abs/2505.14402v1 OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking 2025-05-20T14:16:25Z

The code of nature, embedded in DNA and RNA genomes since the origin of life, holds immense potential to impact both humans and ecosystems through genome modeling. Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome. As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation. We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs. OmniGenBench enables standardized, one-command evaluation of any GFM across five benchmark suites, with seamless integration of over 31 open-source models. Through automated pipelines and community-extensible features, the platform addresses critical reproducibility challenges, including data transparency, model interoperability, benchmark fragmentation, and black-box interpretability. OmniGenBench aims to serve as foundational infrastructure for reproducible genomic AI research, accelerating trustworthy discovery and collaborative innovation in the era of genome-scale modeling.

2025-05-20T14:16:25Z Heng Yang Jack Cole Yuan Li Renzhi Chen Geyong Min Ke Li http://arxiv.org/abs/2505.11610v1 Foundation Models for AI-Enabled Biological Design 2025-05-16T18:17:37Z

This paper surveys foundation models for AI-enabled biological design, focusing on recent developments in applying large-scale, self-supervised models to tasks such as protein engineering, small molecule design, and genomic sequence design. Though this domain is evolving rapidly, this survey presents and discusses a taxonomy of current models and methods. The focus is on challenges and solutions in adapting these models for biological applications, including biological sequence modeling architectures, controllability in generation, and multi-modal integration. The survey concludes with a discussion of open problems and future directions, offering concrete next steps to improve the quality of biological sequence generation.

2025-05-16T18:17:37Z Published as part of the workshop proceedings at AAAI 2025 in the workshop "Foundation Models for Biological Discoveries" Asher Moldwin Amarda Shehu