https://arxiv.org/api/ggFj3rLAPhmKitBRuGW1aC2jE2c2026-06-13T11:11:46Z3848015http://arxiv.org/abs/2606.13556v1Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation2026-06-11T16:38:38ZPersonalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.2026-06-11T16:38:38Z24 pages, 8 figures, 3 tables. Conceptual framework paperAruna DeySuraj Biswashttp://arxiv.org/abs/2606.04525v3GENEB: Why Genomic Models Are Hard to Compare2026-06-11T13:24:40ZProgress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.2026-06-03T07:06:01Zchange first page figure, fix model sizes, add more consistencyDaria LednevaMikhail NuridinovDenis Kuznetsovhttp://arxiv.org/abs/2606.12838v1OCOO-T : A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction2026-06-11T03:04:38ZPredicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.2026-06-11T03:04:38Z22 pages, 6 figuresDanning JiangZheming AnYalong ZhaoLipeng Laihttp://arxiv.org/abs/2505.23289v2Intermediate State Formation of Topologically Associated Chromatin Domains using Quantum Annealing2026-06-10T21:07:04ZTopologically Associating Chromatin Domains are spatially distinct chromatin regions that regulate transcription by segregating active and inactive genomic elements. Empirical studies show that their formation correlates with local patterns of epigenetic markers, yet the precise mechanisms linking 1D epigenetic landscapes to 3D chromatin folding remain unclear. Recent models represent chromatin as a spin system, where nucleosomes are treated as discrete-state variables coupled by interaction strengths derived from genomic and epigenetic data. Classical samplers struggle with these models due to high frustration and dense couplings. Here, we present a quantum annealing (QA) approach to efficiently sample chromatin states, embedding an epigenetic Ising model into the topology of D-Wave quantum processors. Rather than reconstructing exact TAD size distributions or insulation scores, our method reproduces statistical features, such as mean marker incidences and intra-/inter-nucleosome correlations, while generating configurations that exhibit TAD-like structural motifs. These results demonstrate QA as an alternative to explore the chromatin architecture and provide a foundation in epigenetic modeling.2025-05-29T09:40:39Z16 pages, 19 Figs, Scientific Reports, 2026Tobias KempeS. M. Ali TabeiMohammad H. Ansari10.1038/s41598-026-55306-1http://arxiv.org/abs/2606.12219v1m6A-FORM: A Foundation Model for Decoding N6-methyladenosine Biology2026-06-10T15:32:12ZN6-methyladenosine (m6A) is the most abundant internal modification in eukaryotic mRNA. However, most existing predictors use adenosine-centered formulations that are computationally inefficient and prone to false positives. Here we present m6A-FORM, a transformer-based foundation model for RNA methylation that uses MeRIP-seq peaks as methylation-enriched priors and is pretrained on approximately 22 million peak-derived sequences from 143 human MeRIP-seq studies. After fine-tuning with high-confidence single-nucleotide m6A annotations from m6A-Atlas v2.0 and GLORI, m6A-FORM-sites achieves state-of-the-art m6A site prediction performance, with a PR-AUC of 0.635 and ROC-AUC of 0.988, improving PR-AUC by at least 0.14 over existing methods while enabling substantially faster inference. Task-specific adaptation further supports prediction of binding sites for 19 m6A-associated regulators and identification of YTHDF2-bound m6A sites associated with mRNA degradation. Applying m6A-FORM across 67 datasets from 24 human tissues identifies 19,631 tissue-conserved sites with distinct localization, clustering, methylation, expression, RBP-interaction, and decay-associated signatures.2026-06-10T15:32:12ZTinghe ZhangSumin JoShou-Jiang GaoYufei Huanghttp://arxiv.org/abs/2606.08493v2Querying Counterfactuals on Tissue Graphs with Supervised Disentanglement2026-06-10T08:04:48ZTissue graph counterfactuals ask how a cell's expression would change under altered spatial neighbor contexts. Such queries are central to predicting cell behavior in tissues, but lack a unified definition, with existing methods targeting specific intervention types or treating cells as i.i.d. In this work, we first formalize tissue graph counterfactuals as a class of spatial interventions that either rewire connections between cells (edge perturbation) or modify the expression of their neighbors (node perturbation). We then introduce Cellina (https://cellina.readthedocs.io) - a framework that uses supervised disentanglement to decompose a cell's intrinsic state from its spatial context, using the latter as a conditioning input for counterfactual predictions. Across benchmarks spanning over 2.5 million spatially-resolved cells in colorectal cancer and mouse brain, Cellina outperforms spatially-informed and non-spatial competitors in in-silico graph perturbations, disentanglement, and scalability. Additionally, we show that Cellina reveals biologically distinct cancer subdomains in an unsupervised manner and enables targeted neighbor perturbation simulations.2026-06-07T07:41:00ZAbdul MoeedStefan SchrodMartin RohbeckMarc Jan BonderPavlo LutsikOliver StegleDaniel Dimitrovhttp://arxiv.org/abs/2605.00545v2Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots2026-06-10T02:19:11ZInferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that effectively integrates both stochastic and unbalanced effects which also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.2026-05-01T09:57:13ZJunda YingYuxuan WangBowen YangPeijie ZhouLei Zhanghttp://arxiv.org/abs/2606.11144v1OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib2026-06-09T17:33:24ZResistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.2026-06-09T17:33:24Z24 pages, 7 figures, 4 tables. Code, data, and trained model weights: https://github.com/span-ai-labs/oncotraj. Python package: pip install oncotraj. Dataset: https://huggingface.co/datasets/span-ai-labs/oncotraj-v1Abhijoy SarkarAarchi Singh Thakurhttp://arxiv.org/abs/2601.14653v3Efficient Imputation for Patch-based Missing Single-cell Data via Cluster-regularized Optimal Transport2026-06-09T15:29:22ZMissing data in single-cell sequencing datasets poses significant challenges for extracting meaningful biological insights. However, existing imputation approaches, which often assume uniformity and data completeness, struggle to address cases with large patches of missing data. In this paper, we present CROT (Cluster-Regularized Optimal Transport), an optimal transport-based imputation algorithm designed to handle patch-based missing data in tabular formats. Our approach effectively captures the underlying data structure in the presence of significant missingness. Notably, it achieves superior imputation accuracy while significantly reducing runtime, demonstrating its scalability and efficiency for large-scale datasets. This work introduces a robust solution for imputation in heterogeneous, high-dimensional datasets with structured data absence, addressing critical challenges in both biological and clinical data analysis. Our code is available on GitHub, https://github.com/yuyuliu11037/CROT.2026-01-21T04:58:13ZAccepted to ACM-BCB 2026Yuyu LiuJiannan YangZiyang YuWeishen PanFei WangTengfei Mahttp://arxiv.org/abs/2606.11276v1A mathematical framework for centromere-aware evaluation of human genome assemblies2026-06-09T10:42:49ZAccurate evaluation of genome assemblies within highly repetitive regions, such as centromeres, remains a major open challenge in genomics. Conventional benchmarking relies on sequence alignment, which becomes problematic in regions of high homogeneity and divergence. Here, we framed centromere assembly evaluation as a comparative distribution problem in a compact centeny representation by computing genomic distances between functional motifs, rather than relying on nucleotide sequence. Our distribution-based metric assesses agreement between a query and a target chromosome by comparing their centromeric inter-motif distances rendered by KL divergence. When applied genome-wide to currently available human telomere-to-telomere (T2T) genomes, this approach yields an accuracy ranking for the entire assembly and for each individual chromosome. Altogether, we present a rapid and robust scoring system based on genomes numerical rendering of inter-motif distances, that provides a quantitative standard of assembly integrity in repetitive DNA regions and establishes a bona fide framework for chromosome-level genome-to-genome comparison.2026-06-09T10:42:49ZLuca FrancoMatteo MigliariniMatteo Tommaso UngaroEgnald ÇelaLuca CordaAndreas GiannisEster MondelliFabio GalassoSimona Giuntahttp://arxiv.org/abs/2604.04287v2Entropy, Disagreement, and the Limits of Foundation Models in Genomics2026-06-09T07:51:13ZFoundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences -- from the point of view of unseen token prediction -- leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.2026-04-05T22:04:01ZAccepted to LMLR Workshop at ICLR 2026Maxime RochkouletsLovro VrčekMile Šikićhttp://arxiv.org/abs/2606.09558v1Integrating gene regulatory priors into Transformer attention with scTransformer for interpretable scRNA-seq analysis2026-06-08T14:32:52ZMotivation: Transformer-based models are increasingly applied to large-scale single-cell transcriptomics, showing strong performance through self-supervised learning on millions of cells. However, most existing approaches treat genes as independent features, and largely ignore prior biological knowledge, which limits interpretability and robustness. In this paper, we explore whether explicitly incorporating gene regulatory information can improve both model performance and biological insight. Results: We present scTransformer, the first Transformer-based approach that builds a priori knowledge of biological mechanisms into the model's attention patterns. By constraining information flow according to known regulatory structures, the model learns representations that are more biologically meaningful. We evaluate scTransformer on a disease-relevant single-nucleus RNA-seq dataset using supervised cell-type classification. Compared to standard Transformers, our approach improves classification accuracy, enhances separation of cell types in embedding space, and produces attention patterns consistent with known regulatory programs. Overall, our results demonstrate that embedding biological structure into Transformer models can enhance interpretability without sacrificing performance, offering a principled step toward biologically grounded foundation models for single-cell omics.2026-06-08T14:32:52ZMikele MiliaLouis Fabrice TshimangaHenning MuellerManfredo AtzoriBarbara Di Camillohttp://arxiv.org/abs/2606.00568v2On the Recoverability of Causal Relations from Bulk Gene Expression Data2026-06-08T08:23:45ZBulk gene expression profiling, which aggregates pooled RNA across cells within a biological sample, remains important in the single-cell era because it is typically less noisy, more sensitive, and more cost-effective than single-cell assays. Accordingly, a growing body of computational methods seeks to recover causal relations among genes from bulk expression data. However, aggregation is a lossy, non-invertible coarsening of the underlying cellular system, and it remains unclear whether and under what conditions causal relations are recoverable from aggregated bulk gene expression data. To answer this, we formalize recoverability under aggregation through two notions of consistency: functional-form consistency and conditional-independence consistency. We then derive necessary and sufficient conditions for recoverability, showing that these properties are preserved only under linear aggregations (e.g., sum/mean) coupled with affine structural equations. To assess the practical plausibility of these conditions, analyses of four bulk and four single-cell gene expression datasets further reveal that the estimated pairwise regulatory functions among genes deviate from linearity in both data types, providing limited empirical support for the linearity assumptions required for recoverability. Together, these results caution against recovering causal relations from aggregated bulk expression data without strong additional assumptions.2026-05-30T06:42:19ZGongxu LuoBoyang SunKun Zhanghttp://arxiv.org/abs/2606.08147v1Biological Reasoning-Informed Regression for Interpretable Regulatory DNA Activity Prediction2026-06-06T12:56:08ZDNA cis-regulatory elements (CREs) such as enhancers control gene expression levels. Accurately predicting regulatory activity from DNA sequences is valuable but challenging, as it requires understanding complex biological regulatory processes. Existing methods typically regress activity scores from sequences in a black-box manner, limiting both interpretability and regression performance. Meanwhile, large language models (LLMs) benefit from explicit reasoning processes, yet directly applying LLMs to raw DNA sequences performs poorly. In this paper, we bridge this gap by introducing R3LM, a framework that teaches LLMs reasoning-informed regression on regulatory DNA through structured biological knowledge. Specifically, we design a biologically grounded data format that structures DNA's regulatory information for improved LLM understanding, and construct CRE-ReasonBench, the first dataset that associates DNA sequences and activity scores with mechanistic reasoning traces. Through two-stage training that first teaches LLMs reasoning over structured biological information then performs regression, R3LM achieves state-of-the-art performance on enhancer prediction across three cell types, outperforming both LLMs with raw sequence input and specialized DNA models while providing interpretable mechanistic explanations. We expect R3LM as an interpretable reward model that can effectively assist biologists in CRE design. Code is available at https://github.com/DuanYi516/R3LM.2026-06-06T12:56:08ZAccepted at KDD 2026 AI4Sciences TrackYi DuanZhao YangJiwei ZhuYing BaChuan CaoBing Suhttp://arxiv.org/abs/2602.15253v2Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics2026-06-05T22:59:16ZNeural scaling laws -- power-law relationships between loss, model size, and data -- have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning three orders of magnitude in parameter count (533 to 3.4 x 10^8 parameters), we fit the parametric scaling law to validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of c ~ 1.44, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.2026-02-16T23:20:58ZIhor Kendiukhov