http://arxiv.org/abs/2605.18587v2 PACE: Geometry-Aware Bridge Transport for Single-Cell Trajectory Inference 2026-05-28T06:39:10Z

Single-cell trajectory inference from destructive time-course snapshots is fundamentally ill-posed: neither cross-time cell correspondences nor continuous trajectories are observed, so the snapshot distributions alone do not uniquely determine the underlying dynamics. Existing optimal transport and flow-based methods typically couple cells by Euclidean proximity at observed clock times, which can misalign trajectories when development is asynchronous and cells sampled at the same experimental time occupy different latent pseudotime stages. We propose PACE, a trajectory inference framework that recovers geometry-consistent continuous transport dynamics from destructive time-course snapshots through three coupled components. First, PACE constructs a state- and time-dependent anisotropic Riemannian metric that assigns low transport cost along locally supported tangent directions while penalizing normal velocity components. Second, it alternates between refining cross-time couplings under the induced path-action cost and fitting endpoint-preserving neural bridges between adjacent snapshots. Third, it distills the learned bridge dynamics into a global continuous-time velocity field over cellular states. Across seven controlled and biological datasets covering nine held-out reconstruction experiments, PACE achieves the strongest overall reconstruction performance, reducing MMD, Wasserstein-1 distance, and Wasserstein-2 distance by 23.7% on average relative to the strongest competing baseline. PACE also improves RNA-velocity alignment by 15.4% on an embryoid body differentiation benchmark, without requiring explicit cell pairing, lineage tracing, or RNA-velocity supervision during training. Code is available at https://github.com/AI4Science-WestlakeU/PACE.

2026-05-18T16:03:56Z 31 pages, 12 figures Chenglei Yu Chuanrui Wang Bangyan Liao Tailin Wu http://arxiv.org/abs/2605.28200v1 Geometry-First Generative Spatial Single-Cell Reconstruction 2026-05-27T09:24:16Z

Single-cell RNA sequencing (scRNA-seq) profiles large numbers of cells but loses spatial context, whereas spatial transcriptomics (ST) preserves partial spatial structure at lower resolution. Most existing integration methods either deconvolve spot mixtures or map cells onto a measured spot lattice, which ties reconstructions to a fixed grid and slide-specific coordinate systems, a limitation that is especially problematic in unpaired settings. We propose GEARS, a geometry-first framework that reconstructs an intrinsic single-cell spatial geometry guided by ST, without relying on cell-type labels, histological images, or cell-to-spot assignment. GEARS first learns a domain-invariant expression encoder that aligns ST spots and dissociated cells, and then trains a permutation-equivariant generator with a diffusion-based refiner with EDM-style preconditioning to generate local spatial geometries under pose-invariant supervision derived from ST coordinates. At inference, GEARS reconstructs geometry on many overlapping subsets of scRNA-seq cells, aggregates predicted pairwise distances across subsets, and solves a global distance-geometry problem to obtain canonical two-dimensional coordinates and a dense distance matrix. Extensive quantitative and qualitative experiments, including cross-section generalization, show that GEARS consistently improves global distance preservation, local neighborhood fidelity, and spatial distribution alignment compared to strong spatial mapping and deconvolution baselines.

2026-05-27T09:24:16Z 32nd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026) Ehtesamul Azim Muhtasim Noor Alif Tae Hyun Hwang Yanjie Fu Wei Zhang http://arxiv.org/abs/2602.17162v2 JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures 2026-05-25T17:29:16Z

Genomic Foundation Models (GFMs) typically rely on Masked Language Modeling (MLM) or Next-Token Prediction (NTP) to learn the "Laws of Nature". While effective at capturing local syntax, these generative paradigms prioritize token-level reconstruction over high-level functional context. We introduce JEPA-DNA, a model-agnostic continual training framework that integrates a Joint-Embedding Predictive Architecture (JEPA) with traditional generative objectives. By supervising global sequence embeddings in a latent space, JEPA-DNA forces models to predict the functional representations of masked genomic segments, shifting the learning signal from token recovery to semantic alignment. We evaluate JEPA-DNA on 17 diverse genomic benchmark tasks, demonstrating consistent gains in linear probing and zero-shot performance regardless of the underlying GFM architecture or generative objective. Our framework establishes a new state-of-the-art for GFMs, surpassing the best existing models by bridging generative precision with latent semantic grounding. Through extensive ablation studies, we further characterize the synergistic interplay between generative and latent objectives. Our code is publicly available at https://github.com/NVIDIA-Digital-Bio/JEPA-DNA.

2026-02-19T08:20:51Z Ariel Larey Elay Dahan Amit Bleiweiss Raizy Kellerman Guy Leib Omri Nayshool Dan Ofer Tal Zinger Dan Dominissini Gideon Rechavi Nicole Bussola Simon Lee Shane O'Connell Dung Hoang Marissa Wirth Alexander W. Charney Nati Daniel Yoli Shavit http://arxiv.org/abs/2605.25242v1 C3P: Contrastive promoter-protein pretraining yields representations capturing bacterial gene regulation 2026-05-24T20:01:33Z

Despite the increasing scale of genome language models (gLMs), their ability to decode the function of regulatory sequences remains unclear. gLM pretraining relies on sequence reconstruction, which may struggle due to the noisy, rapidly evolving nature of regulatory DNA. Self-supervised contrastive approaches provide a promising alternative. Inspired by language-image architectures like CLIP, we introduce contrastive promoter-protein pretraining (C3P). By learning to align promoters to their corresponding proteins, we leverage the rich representations of proteins learned by protein language models as a supervisory signal for learning promoter representations. After training on 88 million bacterial promoter-protein pairs, we evaluate the predictive power of C3P-learned promoter representations for inference of curated regulatory annotations, finding multi-fold improvement over leading gLMs. We also introduce zero-shot co-regulated gene retrieval, the ability to find co-regulated genes in a genome using no experimental data. We find that compared to a randomly initialized baseline, C3P training consistently provides significant zero-shot performance gains, unlike gLMs. Scaling analysis reveals the potential for further improvement as well as the efficiency of C3P, which achieved strong performance at a fraction of the training cost of leading gLMs. In addition to demonstrating that C3P training is effective for learning representations of bacterial regulatory sequences, our strong zero-shot co-regulated gene retrieval performance suggests the possibility of decoding gene regulation for millions of bacteria from their genomes alone.

2026-05-24T20:01:33Z Cameron Dufault Scott Xu Alan M. Moses http://arxiv.org/abs/2605.20747v2 Multi-Modal Machine Learning for Population- and Subject-Specific lncRNA-Type 2 Diabetes Association Analysis 2026-05-23T19:24:42Z

Long non-coding RNAs (lncRNAs) are emerging regulatory molecules implicated in chronic disease pathogenesis, including Type 2 Diabetes Mellitus (T2D). We investigated ten literature-reported lncRNAs associated with T2D (MALAT1, MEG3, MIAT, ANRIL, GAS5, KCNQ1OT1, H19, BCYRN1, XIST, and HOTAIR) across two independent population-based RNA-seq cohorts. Because single-omics approaches provide an incomplete view of disease biology, an integrative multi-feature framework was developed, extracting expression, secondary-structure, and sequence features for each lncRNA. Eight machine learning (ML) classifiers were evaluated under stratified k-fold, leave-one-out cross-validation (LOOCV), and repeated hold-out schemes to ensure robust performance estimation. SHAP analysis was applied for subject-level association interpretation. In one cohort, GAS5 and XIST expression features, along with GAS5, MEG3, and ANRIL sequence features, were found to be associated with T2D, while MALAT1 expression and KCNQ1OT1, ANRIL, and MEG3 sequence features were found to be associated in the second cohort. MEG3 was identified by SHAP as the dominant lncRNA in both cohorts. ML results were consistent with established statistical methods while additionally providing population- and subject-level disease association profiles linked to specific molecular feature types. The proposed framework advances mechanistic understanding of T2D and supports lncRNA-based precision medicine.

2026-05-20T05:49:42Z This work has been submitted to the IEEE for possible publication Ashwani Siwach Sanjeev Narayan Sharma Sunil Datt Sharma http://arxiv.org/abs/2605.24520v1 AnnotateMissense: a genome-wide annotation and benchmarking framework for missense pathogenicity prediction 2026-05-23T11:15:53Z

Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein-language-model-derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome-wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM-derived features, conservation metrics, population-frequency variables, established pathogenicity predictors and engineered amino acid/codon-context features. Using 132,714 ClinVar-labelled missense variants, we benchmarked machine-learning and deep-learning models under controlled feature configurations. The full 303-feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC-AUC = 0.9950 across stratified five-fold cross-validation. Restricted naive and location-oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity-controlled ablations showed that removing prior-predictor, population-frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM-derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1-score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at https://github.com/MuhammadMuneeb007/CAGI7_Annotate_All_Missense and https://doi.org/10.5281/zenodo.19981867.
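
The headline metric above, the Matthews correlation coefficient (MCC), can be computed directly from binary confusion-matrix counts. The following is a minimal sketch of the standard formula; the function name and the example counts are illustrative assumptions and are not taken from the AnnotateMissense code or results.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero; return 0.0 by convention.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration only (not from the paper).
example_mcc = matthews_corrcoef(tp=90, tn=85, fp=15, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often preferred when pathogenic and benign classes are imbalanced.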

2026-05-23T11:15:53Z Muhammad Muneeb David B. Ascher http://arxiv.org/abs/2602.00156v2 Accelerating De Novo Genome Assembly via Quantum-Assisted Graph Optimization with Bitstring Recovery 2026-05-23T08:36:28Z

Genome sequencing is essential to decode genetic information, identify organisms, understand diseases and advance personalized medicine. A critical step in any genome sequencing technique is genome assembly. However, de novo genome assembly, which involves constructing an entire genome sequence from scratch without a reference genome, presents significant challenges due to its high computational complexity, affecting both time and accuracy. In this study, we propose a hybrid approach utilizing a quantum computing-based optimization algorithm integrated with classical pre-processing to expedite the genome assembly process. Specifically, we present a method to find Hamiltonian and Eulerian paths within the genome assembly graph using gate-based quantum computing through a Higher-Order Binary Optimization (HOBO) formulation with the Variational Quantum Eigensolver (VQE) algorithm, in addition to a novel bitstring recovery mechanism to improve optimizer traversal of the solution space. A comparative analysis with classical optimization techniques was performed to assess the effectiveness of our quantum-based approach in genome assembly. The results indicate that, as quantum hardware continues to evolve and noise levels diminish, our formulation holds significant potential to accelerate genome sequencing by offering faster and more accurate solutions to the complex challenges in genomic research.
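
For orientation, the Eulerian-path formulation mentioned above has a well-known classical counterpart: build a de Bruijn graph from k-mers and walk every edge once with Hierholzer's algorithm. The sketch below shows that classical baseline only; it is not the paper's HOBO/VQE method, the function names are hypothetical, and it assumes the input k-mers admit an Eulerian path.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it leads to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists (not validated)."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = sorted(set(graph) | set(in_deg))
    # Start where out-degree exceeds in-degree by one, else at any node with edges.
    start = next(
        (n for n in nodes if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
        next(iter(graph)),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node
    return path[::-1]


def assemble(kmers):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn_graph(kmers))
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy example: the 3-mers of "ATGGCGT" reassemble into the original string.
toy_kmers = ["ATG", "TGG", "GGC", "GCG", "CGT"]
toy_assembly = assemble(toy_kmers)
```

Real assemblies contain repeats and sequencing errors, so the Eulerian path is generally not unique; this is part of what makes the optimization formulation non-trivial.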

2026-01-29T19:03:55Z Jaya Vasavi Pamidimukkala Himanshu Sahu Ashwini Kannan Janani Ananthanarayanan Kalyan Dasgupta Sanjib Senapati http://arxiv.org/abs/2605.23521v1 Population-Specific Genetic and Non-Genetic Influences on Sleep Traits and Health Outcomes 2026-05-22T11:36:32Z

Sleep traits are shaped by genetic and environmental factors and may influence many health conditions. The All of Us Research Program, which includes EHR, physical measurements, genomic data, and wearable data across ancestry groups, provides an opportunity to study genetic and non-genetic contributors to sleep-related health outcomes. We examined associations between genetic predispositions to chronotype, sleep duration, and short sleep and health outcomes across ancestries, as well as the role of measured sleep duration. We used All of Us genome-wide association study results, including ancestry-specific and meta-analyses for 3,414 phenotypes, to identify phenotypes associated with 455 sleep-related SNPs. Cross-sectional and longitudinal analyses (n = 212,529) evaluated associations between polygenic risk scores (PRS) and anthropometric and metabolic measures from EHR. A subgroup analysis (n = 7,655) assessed sleep duration using Fitbit data. Across six ancestry groups, SNP analysis identified 61 phenotypes linked to 29 sleep-trait-associated SNPs. The chronotype SNP rs1421085 in FTO showed the strongest associations with obesity, diabetes, and cardiovascular conditions, mainly in European, American, and African groups. PRS analysis showed that higher predisposition to shorter sleep duration was associated with increased risk of obesity and diabetes, with ancestry-specific variation. Measured sleep duration attenuated these associations, with relative contributions of 85.6%-99.9% in cross-sectional analyses and 7.1%-44.0% in longitudinal analyses compared with PRS. This study identified health conditions associated with genetic predispositions to sleep traits and suggests that actual sleep duration may play a prominent role in sleep-related health outcomes. Differences among meta-, pooled-, and ancestry-specific analyses highlight the importance of population-specific research.

2026-05-22T11:36:32Z Jiheum Park Stephanie Y. Shue Rocio Barragan Jeong Yun Yang Tian Gu Chin Hur Marie-Pierre St-Onge http://arxiv.org/abs/2601.12805v3 SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding 2026-05-22T04:36:34Z

Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.

2026-01-19T08:06:35Z Accepted by SIGKDD 2026. 12 pages Xiaohan Huang Meng Xiao Chuan Qin Qingqing Long Jinmiao Chen Yuanchun Zhou Hengshu Zhu http://arxiv.org/abs/2605.24034v1 WTKO-CNN: Deep Learning Reveals Sequence Motifs Distinguishing Wild-Type and Knockout ATAC-seq Peaks 2026-05-21T07:46:50Z

Chromatin regulators can alter transcriptional programs by modifying the accessibility of regulatory DNA elements. Understanding how regulatory sequences differ between wild-type (WT) and knockout (KO) conditions is crucial for deciphering transcriptional control. Here, we applied WTKO-CNN, a convolutional neural network with an attention mechanism, to classify DNA sequences as WT or KO, achieving high predictive performance. To interpret the model, we generated saliency maps to identify nucleotide positions most influential for the classification decision. From these high-saliency regions, we extracted and clustered k-mers, enabling de novo motif discovery. Sequence logos and consensus motifs derived from the CNN filters revealed biologically meaningful patterns, which were further validated using MEME, TOMTOM, and HOMER against known transcription factor binding sites. Our analysis identified motifs associated with transcription factor families that discriminate WT from KO sequences, demonstrating that CNN-guided saliency mapping is a powerful approach for uncovering functional sequence features.
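
The step of extracting k-mers from high-saliency regions could look roughly like the sketch below. This is an illustrative assumption about one simple way to do it (rank sliding windows by mean absolute saliency and keep the top fraction); the function name, `k`, and `top_fraction` are hypothetical and do not come from the WTKO-CNN paper.

```python
from collections import Counter


def high_saliency_kmers(seq, saliency, k=6, top_fraction=0.1):
    """Count k-mers from the sliding windows with the highest mean |saliency|."""
    if len(seq) != len(saliency):
        raise ValueError("sequence and saliency must have the same length")
    windows = []
    for i in range(len(seq) - k + 1):
        score = sum(abs(s) for s in saliency[i:i + k]) / k
        windows.append((score, seq[i:i + k]))
    # Rank windows by score, highest first, and keep at least one.
    windows.sort(key=lambda w: w[0], reverse=True)
    n_keep = max(1, int(len(windows) * top_fraction))
    return Counter(kmer for _, kmer in windows[:n_keep])


# Toy input: saliency is concentrated on positions 3..6 of the sequence.
toy_seq = "ACGTACGTAC"
toy_saliency = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
toy_counts = high_saliency_kmers(toy_seq, toy_saliency, k=4, top_fraction=0.1)
```

Counts pooled across many sequences would then be clustered into candidate motifs and compared against known binding sites with external tools.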

2026-05-21T07:46:50Z Lopamudra Dey http://arxiv.org/abs/2605.21634v1 bioETH-PRS: Confidential Polygenic Risk Scoring without a Trusted Evaluator via Fully Homomorphic Encryption on a Programmable Blockchain 2026-05-20T18:45:17Z

Polygenic risk scores (PRSs) aggregate genetic effect estimates to predict disease susceptibility, yet clinical deployment often exposes raw genotype data to third-party compute infrastructure. Prior homomorphic-encryption approaches still require trust in a designated evaluator. We present bioETH-PRS, a protocol that replaces that evaluator role with immutable smart contracts on a blockchain supporting Fully Homomorphic Encryption (fhEVM). Using the integer-exact TFHE scheme, bioETH-PRS computes the PRS dot product entirely within the encrypted domain, keeping both genotype dosage vectors and GWAS weight vectors hidden from external parties throughout execution. We introduce a three-step fixed-point quantisation scheme for representing signed GWAS weights as unsigned 64-bit integers, achieving machine-epsilon reconstruction accuracy on validated fixtures. A four-contract architecture separates data custody, model publication, computation, and output release, and supports both a classic chunked path and a streaming path, with the latter reducing mock-measured gas by 37%. An on-chain noisy output oracle emits an encrypted noisy-score handle and a publicly decryptable ternary category, reducing raw score exposure and probing risk. Prototype evaluation on real GWAS fixtures confirms linear gas scaling and suggests that the approach may be cost-competitive in low-gas deployment environments.
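
To make the idea of representing signed weights as unsigned integers concrete, the sketch below shows one plausible offset-and-scale encoding and an integer-only dot product, computed in plaintext. It is an assumption for illustration, not the paper's three-step scheme: `SCALE`, `OFFSET`, and all function names are hypothetical, and no encryption is performed.

```python
SCALE = 10**6        # hypothetical fixed-point scale (six decimal places)
OFFSET = 2**32       # hypothetical shift that makes encoded weights non-negative
U64_MAX = 2**64 - 1  # largest value an unsigned 64-bit integer can hold


def encode_weight(w: float) -> int:
    """Scale, round, and shift a signed weight into the unsigned 64-bit range."""
    q = round(w * SCALE) + OFFSET
    if not 0 <= q <= U64_MAX:
        raise ValueError("weight outside the representable range")
    return q


def prs_fixed_point(weights, dosages) -> float:
    """Integer-only dot product of encoded weights and dosages, then decode."""
    encoded = [encode_weight(w) for w in weights]
    acc = sum(q * d for q, d in zip(encoded, dosages))
    if acc > U64_MAX:
        raise OverflowError("accumulator exceeds unsigned 64-bit range")
    # Each dosage unit carried one OFFSET into the sum; remove it and rescale.
    return (acc - OFFSET * sum(dosages)) / SCALE


# Toy example with hypothetical effect sizes and dosages in {0, 1, 2}.
toy_score = prs_fixed_point([0.125, -0.5, 0.03], [2, 1, 0])
```

Because every operation on the encoded values is an integer multiply or add, the same arithmetic can in principle be carried out by an integer-exact homomorphic scheme, with the offset correction applied after decryption.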

2026-05-20T18:45:17Z 12 pages, 6 figures Kimon Antonios Provatas Christos Galanopoulos Ilias Georgakopoulos-Soares http://arxiv.org/abs/2605.20989v1 Modeling Temporal scRNA-seq Data with Latent Gaussian Process and Optimal Transport 2026-05-20T10:24:51Z

Single-cell RNA sequencing provides insights into gene expression at single-cell resolution, yet inferring temporal processes from these static snapshot measurements remains a fundamental challenge. Current approaches utilizing neural differential equations and flows are prone to overfitting and lack careful consideration of biological variability. In this work, we propose a generative framework that models population trends using a latent heteroscedastic Gaussian process (GP) approximated by Hilbert space methods. To address the absence of genuine cell trajectories, we leverage an optimal transport (OT) objective that aligns generated and observed population distributions. Our method explicitly captures biological heterogeneity by incorporating cell-specific latent time and cell type conditioning to disentangle temporal asynchrony and trajectories to different cell types. We demonstrate state-of-the-art performance on complex interpolation and extrapolation benchmarks and introduce a novel gradient-based strategy for inferring perturbation trajectories.

2026-05-20T10:24:51Z Mehmet Yigit Balik Harri Lähdesmäki http://arxiv.org/abs/2601.03019v4 DNACHUNKER: Learnable Tokenization for DNA Language Models 2026-05-20T08:36:36Z

DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNAChunker, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNAChunker learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pretrain DNAChunker on the human reference genome and evaluate it across five benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that unlike fixed tokenizations, segmentation is learned in a biologically-informed, mutation-resilient manner.

2026-01-06T13:46:42Z ICML 2026 camera-ready version Taewon Kim Jihwan Shin Hyomin Kim Youngmok Jung Jonghoon Lee Won-Chul Lee Sungsoo Ahn Insu Han http://arxiv.org/abs/2603.20420v2 CRANE: Correcting Errors in Raw Nanopore Signals Using Hidden Markov Models 2026-05-20T05:18:24Z

Nanopore sequencing can read substantially longer sequences of nucleic acid molecules, called reads, than other sequencing methods, which has led to advances in genomic analysis such as the gapless human genome assembly. By analyzing the raw electrical signal reads that nanopore sequencing generates from molecules, existing works can map these reads without translating them into DNA characters (i.e., basecalling), allowing for quick and efficient analysis of sequencing data. However, raw signals often contain errors due to noise and processing errors, which limits the overall accuracy of raw signal analysis. Our goal in this work is to detect and correct errors in raw signals to improve the accuracy of raw signal analyses. To this end, we propose CRANE, a mechanism that trains and utilizes a Hidden Markov Model (HMM) to accurately correct signal errors. Our extensive evaluation on various datasets shows that CRANE 1) consistently improves the overall accuracy of the underlying raw signal analysis tools, 2) minimizes the burden of optimizing analysis pipelines for newer nanopore technologies, and 3) does not introduce substantial computational overhead. We conclude that CRANE provides an effective mechanism to systematically identify and correct the errors in raw nanopore signals before further analysis, which can enable the development of a new class of error correction mechanisms purely designed for raw nanopore signals. Source Code: CRANE is available at https://github.com/STORMgroup/CRANE. We also provide the scripts to fully reproduce our results on our GitHub page.

2026-03-20T18:41:07Z Simon Ambrozak Ulysse McConnell Bhargav Srinivasan Burak Ozkan Ernest Zhang Can Firtina http://arxiv.org/abs/2504.05454v2 GraphPINE: Graph Importance Propagation for Interpretable Drug Response Prediction 2026-05-19T17:55:07Z

Explainability is necessary for many tasks in biomedical research. Recent explainability methods have focused on attention, gradients, and Shapley values. These do not handle data with strong associated prior knowledge and fail to constrain explainability results based on known relationships between predictive features. We propose GraphPINE, a graph neural network (GNN) architecture leveraging domain-specific prior knowledge to initialize node importance optimized during training for drug response prediction. Typically, a manual post-prediction step examines literature (i.e., prior knowledge) to understand returned predictive features. While node importance can be obtained for gradient and attention after prediction, node importance from these methods lacks complementary prior knowledge; GraphPINE seeks to overcome this limitation. GraphPINE differs from other GNN gating methods by utilizing an LSTM-like sequential format. We introduce an importance propagation layer that 1) unifies updates for the feature matrix and node importance and 2) uses GNN-based graph propagation of feature values. This initialization and updating mechanism allows for informed feature learning and improved graph representation. We apply GraphPINE to cancer drug response prediction using drug screening and gene data collected for over 5,000 gene nodes included in a gene-gene graph with a drug-target interaction (DTI) graph for initial importance. The gene-gene graph and DTIs were obtained from curated sources and weighted by article count discussing relationships between drugs and genes. GraphPINE achieves a PR-AUC of 0.894 and ROC-AUC of 0.796 across 952 drugs. Code is available at https://anonymous.4open.science/r/GraphPINE-40DE.

2025-04-07T19:42:12Z Yoshitaka Inoue Tianfan Fu Augustin Luna