Temperature reshapes epigenomic diversity in Arabidopsis thaliana -- JSD and Methylator reveal RdDM-CMT2 plasticity

2025-11-03T18:40:21Z

DNA methylation can be associated with phenotypic plasticity, yet how temperature shapes DNA methylation diversity in natural populations is unclear. Analyzing whole-genome bisulfite sequencing from 1075 Arabidopsis thaliana accessions grown at 10°C, 16°C, and 22°C, we quantified single-cytosine diversity using Jensen-Shannon Divergence (JSD). Diversity consistently peaked at intermediate methylation levels across the CpG, CHG, and CHH sequence contexts. Temperature modulated this diversity, primarily impacting intermediately methylated sites, with non-CG contexts (CHG and CHH) exhibiting increased diversity at warmer temperatures. Notably, at 22°C, CHH diversity patterns indicated altered balance between the RdDM and CMT2 pathways that regulate specific transposable element (TE) superfamilies. Furthermore, accessions from Southern Europe displayed higher non-CG diversity at 22°C compared to Northern European accessions. Our findings reveal that temperature influences the epigenomic diversity landscape, highlighting context-dependent plasticity, a dynamic interplay between silencing pathways, and potential geographic adaptation in response to environmental cues.
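
The per-cytosine JSD measure mentioned above can be illustrated with a minimal sketch. It assumes each accession contributes a two-state (methylated/unmethylated) distribution at a site and that all accessions are weighted equally; the abstract does not state the weighting or any coverage filter, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd_at_site(meth_levels):
    """Jensen-Shannon divergence across samples at one cytosine.

    Each sample contributes the distribution (m, 1 - m), where m is its
    methylation level in [0, 1]. Equal sample weights are an assumption.
    """
    n = len(meth_levels)
    mean_m = sum(meth_levels) / n
    # Entropy of the population-average distribution.
    mixture_entropy = entropy((mean_m, 1 - mean_m))
    # Average of the per-sample entropies.
    mean_entropy = sum(entropy((m, 1 - m)) for m in meth_levels) / n
    return mixture_entropy - mean_entropy
```

Under these assumptions a site where half of the samples are fully methylated and half are unmethylated reaches the maximum of 1 bit, while a site where every sample has the same level scores 0, which is consistent with diversity peaking at intermediate population-level methylation.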

Fast, memory-efficient genomic interval tokenizers for modern machine learning

2025-11-03T13:18:36Z

Introduction: Epigenomic datasets from high-throughput sequencing experiments are commonly summarized as genomic intervals. As the volume of this data grows, so does interest in analyzing it through deep learning. However, the heterogeneity of genomic interval data, where each dataset defines its own regions, creates barriers for machine learning methods that require consistent, discrete vocabularies. Methods: We introduce gtars-tokenizers, a high-performance library that maps genomic intervals to a predefined universe or vocabulary of regions, analogous to text tokenization in natural language processing. Built in Rust with bindings for Python, R, CLI, and WebAssembly, gtars-tokenizers implements two overlap methods (BITS and AIList) and integrates seamlessly with modern ML frameworks through Hugging Face-compatible APIs. Results: The gtars-tokenizers package achieves top efficiency for large-scale datasets, while enabling genomic intervals to be processed using standard ML workflows in PyTorch and TensorFlow without ad hoc preprocessing. This token-based approach bridges genomics and machine learning, supporting scalable and standardized analysis of interval data across diverse computational environments. Availability: PyPI and GitHub: https://github.com/databio/gtars.
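
The idea of mapping intervals onto a fixed universe can be sketched in a few lines. This is not the gtars-tokenizers API and implements neither BITS nor AIList; it is a plain binary search that assumes the universe regions on each chromosome do not overlap each other, with half-open coordinates. All names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_universe(regions):
    """Index (chrom, start, end) regions; a token id is the sorted position."""
    by_chrom = {}
    for token_id, (chrom, start, end) in enumerate(sorted(regions)):
        starts, ends, ids = by_chrom.setdefault(chrom, ([], [], []))
        starts.append(start)
        ends.append(end)
        ids.append(token_id)
    return by_chrom


def tokenize(by_chrom, chrom, start, end):
    """Return token ids of universe regions overlapping [start, end)."""
    if chrom not in by_chrom:
        return []
    starts, ends, ids = by_chrom[chrom]
    # Valid because non-overlapping sorted regions also have sorted ends.
    lo = bisect_right(ends, start)  # first region whose end > start
    hi = bisect_left(starts, end)   # first region whose start >= end
    return ids[lo:hi]
```

A query that falls between universe regions yields an empty token list; how a real tokenizer represents such cases (for example with an unknown token) is not specified in the abstract.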

SCUDDO: An unsupervised clustering algorithm for single-cell Hi-C maps using diagonal diffusion operators

2025-10-31T21:57:23Z

Motivation: Advances in high-throughput chromatin conformation capture have provided insight into the three-dimensional structure and organization of chromatin. While bulk Hi-C experiments capture spatio-temporally averaged chromatin interactions across millions of cells, single-cell Hi-C experiments report on the chromatin interactions of individual cells. Supervised and unsupervised algorithms have been developed to embed single-cell Hi-C maps and identify different cell types. However, single-cell Hi-C maps are often difficult to cluster due to their high sparsity, with state-of-the-art algorithms achieving a maximum Adjusted Rand Index (ARI) below 0.4 on several datasets while requiring labels for training. Results: We introduce a novel unsupervised algorithm, Single-cell Clustering Using Diagonal Diffusion Operators (SCUDDO), to embed and cluster single-cell Hi-C maps. We evaluate SCUDDO on three previously difficult-to-cluster single-cell Hi-C datasets, and show that it can outperform other current algorithms in ARI by more than 0.2. Further, SCUDDO outperforms all other tested algorithms even when we restrict the number of intrachromosomal maps for each cell type and when we use only a small fraction of contacts in each Hi-C map. Thus, SCUDDO can capture the underlying latent features of single-cell Hi-C maps and provide accurate labeling of cell types even when cell types are not known a priori. Availability: SCUDDO is freely available at www.github.com/lmaisuradze/scuddo. The tested datasets are publicly available and can be downloaded from the Gene Expression Omnibus.
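
For readers unfamiliar with the evaluation metric, the Adjusted Rand Index can be computed from pair counts as in the sketch below. It follows the standard pair-counting definition and is not taken from the SCUDDO code base; the function name is hypothetical, and at least two cells are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same cells."""
    n = len(labels_true)
    # Pairs of cells that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every cell in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 for identical partitions regardless of how clusters are named, is close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.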

AmpliconHunter2: a SIMD-Accelerated In-Silico PCR Engine

2025-10-31T18:19:05Z

Summary: We present AmpliconHunter2 (AHv2), a highly scalable in silico PCR engine written in C that can handle degenerate primers and uses a highly accurate melting temperature model. AHv2 implements a bit-mask IUPAC matcher with AVX2 SIMD acceleration, supports user-specified mismatches and 3' clamp constraints, calls amplicons in all four primer pair orientations (FR/RF/FF/RR), and optionally trims primers and extracts fixed-length flanking barcodes into FASTA headers. The pipeline packs FASTA into 2-bit batches, streams them in 16 MB chunks, writes amplicons to per-thread temp files and concatenates outputs, minimizing peak RSS during amplicon finding. We also summarize updates to the Python reference (AHv1.1). Availability and Implementation: AmpliconHunter2 is available as a free webserver at https://ah2.uconn.engr.edu, and source code is available at https://github.com/rhowardstone/AmpliconHunter2 under an MIT license. AHv2 was implemented in C; AHv1.1 was implemented in Python 3 with Hyperscan. Contact: rye.howard-stone@uconn.edu
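
The bit-mask approach to IUPAC matching can be illustrated with a scalar sketch. It is not AHv2's implementation: there is no SIMD, no 2-bit packing, no melting temperature model, and only the forward strand is scanned. The function name and the treatment of the clamp as "the last `clamp` primer bases tolerate no mismatch" are assumptions.

```python
# Each base is a 4-bit mask (A=1, C=2, G=4, T=8); ambiguity codes OR them.
IUPAC = {
    "A": 1, "C": 2, "G": 4, "T": 8,
    "R": 5, "Y": 10, "S": 6, "W": 9, "K": 12, "M": 3,
    "B": 14, "D": 13, "H": 11, "V": 7, "N": 15,
}


def find_primer(primer, template, max_mismatches=0, clamp=0):
    """Return template start positions where the 5'->3' primer matches.

    A position matches when the primer and template masks share a bit.
    Mismatches inside the last `clamp` primer bases (the 3' end) are
    never tolerated, even if the mismatch budget is not exhausted.
    """
    pmasks = [IUPAC[b] for b in primer.upper()]
    tmasks = [IUPAC[b] for b in template.upper()]
    k = len(pmasks)
    hits = []
    for i in range(len(tmasks) - k + 1):
        mismatches = 0
        ok = True
        for j, pm in enumerate(pmasks):
            if (pm & tmasks[i + j]) == 0:
                mismatches += 1
                if mismatches > max_mismatches or j >= k - clamp:
                    ok = False
                    break
        if ok:
            hits.append(i)
    return hits
```

Because a degenerate code such as Y is just the OR of C and T, one AND per position replaces a set-membership test, which is what makes the comparison amenable to vectorization.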

Discovering Interpretable Biological Concepts in Single-cell RNA-seq Foundation Models

2025-10-29T08:52:55Z

Single-cell RNA-seq foundation models achieve strong performance on downstream tasks but remain black boxes, limiting their utility for biological discovery. Recent work has shown that sparse dictionary learning can extract concepts from deep learning models, with promising applications in biomedical imaging and protein models. However, interpreting biological concepts remains challenging, as biological sequences are not inherently human-interpretable. We introduce a novel concept-based interpretability framework for single-cell RNA-seq models with a focus on concept interpretation and evaluation. We propose an attribution method with counterfactual perturbations that identifies genes that influence concept activation, moving beyond correlational approaches like differential expression analysis. We then provide two complementary interpretation approaches: an expert-driven analysis facilitated by an interactive interface and an ontology-driven method with attribution-based biological pathway enrichment. Applying our framework to two well-known single-cell RNA-seq models from the literature, we interpret concepts extracted by Top-K Sparse Auto-Encoders trained on two immune cell datasets. With a domain expert in immunology, we show that concepts improve interpretability compared to individual neurons while preserving the richness and informativeness of the latent representations. This work provides a principled framework for interpreting what biological knowledge foundation models have encoded, paving the way for their use for hypothesis generation and discovery.

Multimodal 3D Genome Pre-training

2025-10-28T05:01:44Z

Deep learning techniques have driven significant progress in various analytical tasks within 3D genomics in computational biology. However, a holistic understanding of 3D genomics knowledge remains underexplored. Here, we propose MIX-HIC, the first multimodal foundation model of the 3D genome that integrates both 3D genome structure and epigenomic tracks to obtain unified and comprehensive semantics. For accurate heterogeneous semantic fusion, we design cross-modal interaction and mapping blocks for robust unified representation, yielding accurate aggregation of 3D genome knowledge. In addition, we introduce the first large-scale dataset comprising over 1 million pairwise samples of Hi-C contact maps and epigenomic tracks for high-quality pre-training, enabling the exploration of functional implications in 3D genomics. Extensive experiments show that MIX-HIC can significantly surpass existing state-of-the-art methods in diverse downstream tasks. This work provides a valuable resource for advancing 3D genomics research.

JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model

2025-10-28T03:53:33Z

Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. However, DNA is inherently bidirectional, e.g., bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only masked tokens contribute to the loss per step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture, combining long-range modeling of Attention with efficient sequential learning of Mamba. MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: https://github.com/Qihao-Duan/JanusDNA

Integrated Multi-omics Reveals MEF2C as a Direct Regulator of Microglial Immune and Synaptic Programs

2025-10-28T01:14:29Z

Background: Patients carrying MEF2C haploinsufficiency develop a recognizable neurodevelopmental syndrome featuring intellectual disability, treatment-resistant seizures, and autism spectrum behaviors. While MEF2C's critical roles in cardiac development and neuronal function are well-established, its specific transcriptional operations within microglia (the brain's resident immune cells) have remained surprisingly undefined. This knowledge gap is particularly notable given that MEF2C syndrome patients consistently present with neurological symptoms while cardiac abnormalities are rarely observed. Results: We used human iPSC-derived microglia with MEF2C knockout to perform integrated ChIP-seq and RNA-seq analyses. Our data demonstrate that MEF2C directly binds 1,258 genomic loci and regulates 755 differentially expressed genes (FDR < 0.05). Integration identified 69 high-confidence direct targets with statistically significant overlap (p = 8.87 x 10^-5). The most dramatic changes included ADAMDEC1, a microglia-enriched metalloprotease for extracellular matrix remodeling (log2FC = -4.76, adj. p = 3.30 x 10^-19), and CARD11, an NF-kappaB signaling component (log2FC = -5.16, adj. p = 5.95 x 10^-5). Pathway analysis revealed profound disruption of Fc-gamma receptor signaling (p = 3.11 x 10^-7), alongside widespread changes in immune response and synaptic organization pathways. Conclusion: These findings establish MEF2C as a master transcriptional regulator coordinating both immune effector functions and synaptic interaction programs in microglia. The observed changes, particularly in Fc receptor signaling critical for synaptic pruning, likely underlie the neurological manifestations of MEF2C syndrome. Keywords: MEF2C, microglia, ChIP-seq, RNA-seq, neurodevelopmental disorders

PanDelos-plus: A parallel algorithm for computing sequence homology in pangenomic analysis

2025-10-27T13:10:14Z

The identification of homologous gene families across multiple genomes is a central task in bacterial pangenomics, traditionally requiring computationally demanding all-against-all comparisons. PanDelos addresses this challenge with an alignment-free and parameter-free approach based on k-mer profiles, combining high speed, ease of use, and competitive accuracy with state-of-the-art methods. However, the increasing availability of genomic data requires tools that can scale efficiently to larger datasets. To address this need, we present PanDelos-plus, a fully parallel, gene-centric redesign of PanDelos. The algorithm parallelizes the most computationally intensive phases (Best Hit detection and Bidirectional Best Hit extraction) through data decomposition and a thread pool strategy, while employing lightweight data structures to reduce memory usage. Benchmarks on synthetic datasets show that PanDelos-plus achieves up to 14x faster execution and reduces memory usage by up to 96%, while maintaining accuracy. These improvements enable population-scale comparative genomics to be performed on standard multicore workstations, making large-scale bacterial pangenome analysis accessible for routine use in everyday research.
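
The two phases named above, Best Hit detection and Bidirectional Best Hit extraction, can be sketched serially as follows. This is an illustration only: it scores gene pairs with Jaccard similarity on k-mer sets and uses a fixed k, both of which are assumptions rather than PanDelos's actual scoring or its parameter-free choice of k, and it omits the thread pool entirely. All names are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_hits(queries, targets, k):
    """Map each query gene to its most similar target gene, if any."""
    target_kmers = {name: kmers(seq, k) for name, seq in targets.items()}
    hits = {}
    for qname, qseq in queries.items():
        qk = kmers(qseq, k)
        best = max(target_kmers, key=lambda t: jaccard(qk, target_kmers[t]), default=None)
        # Genes that share no k-mer with any target get no hit.
        if best is not None and jaccard(qk, target_kmers[best]) > 0:
            hits[qname] = best
    return hits


def bidirectional_best_hits(genome_a, genome_b, k=3):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(genome_a, genome_b, k)
    b_to_a = best_hits(genome_b, genome_a, k)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)
```

Each query gene in `best_hits` is processed independently of the others, which is the kind of data decomposition that lends itself to a thread pool.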

ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data

2025-10-26T14:29:28Z

The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast repository of single-cell chromatin accessibility data. While foundation models have achieved significant success in single-cell transcriptomics, there is currently no foundation model for scATAC-seq that supports zero-shot high-quality cell identification and comprehensive multi-omics analysis simultaneously. Key challenges lie in the high dimensionality and sparsity of scATAC-seq data, as well as the lack of a standardized schema for representing open chromatin regions (OCRs). Here, we present ChromFound, a foundation model tailored for scATAC-seq. ChromFound utilizes a hybrid architecture and genome-aware tokenization to effectively capture genome-wide long contexts and regulatory signals from dynamic chromatin landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, ChromFound demonstrates broad applicability across 6 diverse tasks. Notably, it achieves robust zero-shot performance in generating universal cell representations and exhibits excellent transferability in cell type annotation and cross-omics prediction. By uncovering enhancer-gene links undetected by existing computational methods, ChromFound offers a promising framework for understanding disease risk variants in the noncoding genome.

PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis

2025-10-24T18:49:53Z

We introduce a comprehensive framework for modeling single cell transcriptomic responses to perturbations, aimed at standardizing benchmarking in this rapidly evolving field. Our approach includes a modular and user-friendly model development and evaluation platform, a collection of diverse perturbational datasets, and a set of metrics designed to fairly compare models and dissect their performance. Through extensive evaluation of both published and baseline models across diverse datasets, we highlight the limitations of widely used models, such as mode collapse. We also demonstrate the importance of rank metrics which complement traditional model fit measures, such as RMSE, for validating model effectiveness. Notably, our results show that while no single model architecture clearly outperforms others, simpler architectures are generally competitive and scale well with larger datasets. Overall, this benchmarking exercise sets new standards for model evaluation, supports robust model development, and furthers the use of these models to simulate genetic and chemical screens for therapeutic discovery.

Phenome-Wide Multi-Omics Integration Uncovers Distinct Archetypes of Human Aging

2025-10-23T08:18:20Z

Aging is a highly complex and heterogeneous process that progresses at different rates across individuals, making biological age (BA) a more accurate indicator of physiological decline than chronological age. While previous studies have built aging clocks using single-omics data, they often fail to capture the full molecular complexity of human aging. In this work, we leveraged the Human Phenotype Project, a large-scale cohort of 10,000 adults aged 40-70 years, with extensive longitudinal profiling that includes clinical, behavioral, environmental, and multi-omics datasets spanning transcriptomics, lipidomics, metabolomics, and the microbiome. By employing advanced machine learning frameworks capable of modeling nonlinear biological dynamics, we developed and rigorously validated a multi-omics aging clock that robustly predicts diverse health outcomes and future disease risk. Unsupervised clustering of the integrated molecular profiles from multi-omics uncovered distinct biological subtypes of aging, revealing striking heterogeneity in aging trajectories and pinpointing pathway-specific alterations associated with different aging patterns. These findings demonstrate the power of multi-omics integration to decode the molecular landscape of aging and lay the groundwork for personalized healthspan monitoring and precision strategies to prevent age-related diseases.

Integrative Analysis of Epigenetic, Transcriptomic, and Metabolomic Responses to Arsenic Exposure Using Coupled Matrix Factorization

2025-10-22T06:56:31Z

Arsenic (As), a widespread environmental toxin, poses major health risks due to its inorganic forms (iAs), which are linked to cancer, cardiovascular disease, and endocrine disruption. Although its toxic effects have been extensively studied, the molecular mechanisms underlying arsenic-induced perturbations remain incompletely understood. This complexity arises from its ability to reprogram epigenetic landscapes, alter gene expression, and disrupt metabolic balance through interconnected regulatory networks. Existing studies often analyze epigenomic, transcriptomic, and metabolomic datasets independently, overlooking their interdependence. Here, we present a coupled matrix factorization (CMF) framework based on the PARAFAC2-AOADMM model for joint integration of DNA methylation (RRBS), RNA-seq, and metabolomics data from mouse embryonic stem cells (ESCs) and epiblast-like cells (EpiLCs) exposed to arsenic. By jointly decomposing multi-omics matrices, our approach identifies shared and dataset-specific components that capture coordinated molecular responses to arsenic exposure. This integrative methodology demonstrates the potential of CMF-based models in computational toxicology and offers a generalizable framework for dissecting complex multi-layered biological perturbations.

A Multi-Evidence Framework Rescues Low-Power Prognostic Signals and Rejects Statistical Artifacts in Cancer Genomics

2025-10-21T12:27:18Z

Motivation: Standard genome-wide association studies in cancer genomics rely on statistical significance with multiple testing correction, but systematically fail in underpowered cohorts. In TCGA breast cancer (n=967, 133 deaths), low event rates (13.8%) create severe power limitations, producing false negatives for known drivers and false positives for large passenger genes. Results: We developed a five-criteria computational framework integrating causal inference (inverse probability weighting, doubly robust estimation) with orthogonal biological validation (expression, mutation patterns, literature evidence). Applied to TCGA-BRCA mortality analysis, standard Cox+FDR detected zero genes at FDR<0.05, confirming complete failure in underpowered settings. Our framework correctly identified RYR2 -- a cardiac gene with no cancer function -- as a false positive despite nominal significance (p=0.024), while identifying KMT2C as a complex candidate requiring validation despite marginal significance (p=0.047, q=0.954). Power analysis revealed median power of 15.1% across genes, with KMT2C achieving only 29.8% power (HR=1.55), explaining borderline statistical significance despite strong biological evidence. The framework distinguished true signals from artifacts through mutation pattern analysis: RYR2 showed 29.8% silent mutations (passenger signature) with no hotspots, while KMT2C showed 6.7% silent mutations with 31.4% truncating variants (driver signature). This multi-evidence approach provides a template for analyzing underpowered cohorts, prioritizing biological interpretability over purely statistical significance. Availability: All code and analysis pipelines available at github.com/akarlaraytu/causal-inference-for-cancer-genomics
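
The gap between a nominal p-value and its q-value can be made concrete with a sketch of FDR adjustment. The abstract says only "FDR"; the Benjamini-Hochberg procedure used here is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

Because each p-value is scaled by the number of tests divided by its rank, a gene with a nominally significant p-value can receive a q-value near 1 when many genes are tested and few rank ahead of it.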

Topological Sequence Analysis of Genomes: Category theory Approaches

2025-10-21T09:21:44Z

Sequence data, such as DNA, RNA, and protein sequences, exhibit intricate, multi-scale structures that pose significant challenges for conventional analysis methods, particularly those relying on alignment or purely statistical representations. In this work, we introduce category-based topological sequence analysis (CTSA) of genomes. CTSA models a sequence as a resolution category, capturing its hierarchical structure through a categorical construction. Substructure complexes are then derived from this categorical representation, and their persistent homology is computed to extract multi-scale topological features. Our models depart from traditional alignment-free approaches by incorporating structured mathematical formalisms rooted in sequence topology. The resulting topological signatures provide informative representations across a variety of tasks, including the phylogenetic analysis of SARS-CoV-2 variants and the prediction of protein-nucleic acid binding affinities. Comparative studies were carried out against six state-of-the-art methods. Experimental results demonstrate that CTSA achieves excellent and consistent performance in these tasks, suggesting its general applicability and robustness. Beyond sequence analysis, the proposed framework opens new directions for the integration of categorical and homological theories for biological sequence analysis.