https://arxiv.org/api/OSUeBGfNJmBJZu7cP3NGy0YJv98
2026-06-14T02:32:29Z

http://arxiv.org/abs/2508.04747v2
GRIT: Graph-Regularized Logit Refinement for Zero-shot Cell Type Annotation
2026-02-09T07:59:16Z
Cell type annotation is a fundamental step in the analysis of single-cell RNA sequencing (scRNA-seq) data. In practice, human experts often rely on the structure revealed by principal component analysis (PCA) followed by $k$-nearest neighbor ($k$-NN) graph construction to guide annotation. While effective, this process is labor-intensive and does not scale to large datasets. Recent advances in CLIP-style models offer a promising path toward automating cell type annotation. By aligning scRNA-seq profiles with natural language descriptions, models like LangCell enable zero-shot annotation. While LangCell demonstrates decent zero-shot performance, its predictions remain suboptimal. In this paper, we propose a principled inference-time paradigm for zero-shot cell type annotation (GRIT) which bridges the scalability of pre-trained foundation models with the structural robustness relied upon in human expert annotation workflows. Specifically, we enforce local consistency of the zero-shot CLIP logits over the task-specific PCA-based $k$-NN graph. We evaluate our approach on 14 annotated human scRNA-seq datasets from 4 distinct studies, spanning 11 organs and over 200,000 single cells. Our method consistently improves zero-shot annotation accuracy, achieving accuracy gains of up to 10\%. Further analysis showcases the mechanism by which GRIT effectively propagates correct signals through the graph, pulling mislabeled cells back toward more accurate predictions.
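The idea of enforcing local consistency of logits over a PCA-based $k$-NN graph can be illustrated with a minimal sketch. The abstract does not state the exact objective, so the code below assumes a standard Laplacian-smoothing formulation, minimizing ||F - Z||^2 + lam * tr(F^T L F) with closed form F = (I + lam L)^{-1} Z; the function names `knn_graph` and `refine_logits` and the parameters `k` and `lam` are hypothetical and not taken from the paper.

```python
import numpy as np


def knn_graph(x, k):
    """Symmetric 0/1 adjacency matrix of the Euclidean k-NN graph over rows of x."""
    n = x.shape[0]
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                      # a cell is not its own neighbor
    idx = np.argsort(d2, axis=1)[:, :k]               # k nearest neighbors per cell
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(adj, adj.T)                     # symmetrize


def refine_logits(logits, pcs, k=10, lam=1.0):
    """Smooth logits over the k-NN graph built from PCA coordinates.

    Assumed objective (not confirmed by the source):
        min_F ||F - Z||^2 + lam * tr(F^T L F)
    whose solution is F = (I + lam * L)^{-1} Z, with L the graph Laplacian.
    """
    adj = knn_graph(pcs, k)
    lap = np.diag(adj.sum(axis=1)) - adj
    n = adj.shape[0]
    return np.linalg.solve(np.eye(n) + lam * lap, logits)
```

Here `logits` would be an (n_cells, n_types) array from a zero-shot model and `pcs` an (n_cells, n_components) array of PCA coordinates; predictions are then read off with `refine_logits(logits, pcs).argmax(axis=1)`. The dense solve is only practical for small n; a sparse or iterative solver would be needed at the scale the abstract mentions.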
The method is training-free, model-agnostic, and serves as a simple yet effective plug-in for enhancing zero-shot cell type annotation.
2025-08-06T07:09:46Z
10 pages, 6 figures
Tianxiang Hu, Chenyi Zhou, Jiaxiang Liu, Jiongxin Wang, Ruizhe Chen, Haoxiang Xia, Gaoang Wang, Jian Wu, Zuozhu Liu

http://arxiv.org/abs/2602.09063v1
scBench: Evaluating AI Agents on Single-Cell RNA-seq Analysis
2026-02-09T03:20:31Z
As single-cell RNA sequencing datasets grow in adoption, scale, and complexity, data analysis remains a bottleneck for many research groups. Although frontier AI agents have improved dramatically at software engineering and general data analysis, it remains unclear whether they can extract biological insight from messy, real-world single-cell datasets. We introduce scBench, a benchmark of 394 verifiable problems derived from practical scRNA-seq workflows spanning six sequencing platforms and seven task categories. Each problem provides a snapshot of experimental data immediately prior to an analysis step and a deterministic grader that evaluates recovery of a key biological result. Benchmark data on eight frontier models shows that accuracy ranges from 29-53%, with strong model-task and model-platform interactions. Platform choice affects accuracy as much as model choice, with 40+ percentage point drops on less-documented technologies. scBench complements SpatialBench to cover the two dominant single-cell modalities, serving both as a measurement tool and a diagnostic lens for developing agents that can analyze real scRNA-seq datasets faithfully and reproducibly.
2026-02-09T03:20:31Z
Kenny Workman, Zhen Yang, Harihara Muralidharan, Aidan Abdulali, Hannah Le

http://arxiv.org/abs/2601.19002v4
Y-Trim: Evidence-gated Adaptase tail trimming for single-stranded bisulfite sequencing
2026-02-08T22:53:28Z
Background: Single-stranded whole-genome bisulfite sequencing (ssWGBS) enables DNA methylation profiling in low-input and highly fragmented material, including cell-free DNA.
In widely used post-bisulfite protocols, Adaptase-mediated tailing adds stochastic, template-free end sequence. Unlike adapter-defined junctions, these tails lack a fixed sequence template, so trimming must be decided from FASTQ-stage observables under intrinsic uncertainty.
Results: We show that bisulfite-induced compositional degeneracy implies a strictly positive error floor for any fixed per-read boundary rule under a finite nucleotide alphabet. Guided by this limit, we introduce Y-Trim, an evidence-gated framework that separates admission (should we trim) from inference (where to trim). For Read 2, Y-Trim performs per-read adaptive cut placement via a fixed, chemistry-typed matrix-linear texture scoring scheme; for Read 1, it uses automated sample-level anchoring when read-level localization is feasibility-limited. Across modules, Y-Trim is an explicit, chemistry-specific decision rule with interpretable operating points. On a curated 34-run public cohort (CCGB-34) and simulator stress tests with known latent boundaries, Y-Trim exhibits stable Read 2 operating behavior and Read 1 feasibility-limited behavior consistent with conditional read-through.
Conclusions: Template-free Adaptase tail trimming is best viewed as an evidence-limited FASTQ-stage decision rather than a generic preprocessing knob. By making admissibility and abstention explicit and exposing interpretable genomic-retention versus residual-carryover trade-offs, Y-Trim provides a practical uncertainty-aware preprocessing strategy for ssWGBS.
2026-01-26T22:24:47Z
Yihan Fang

http://arxiv.org/abs/2602.07648v1
Alteration of the Brains Microbiome and Neuroinflammation Associated with Ventricular Catheters
2026-02-07T18:08:58Z
Background and Objectives: Proximal catheter obstruction is the leading cause of ventriculoperitoneal shunt failure, yet the biological triggers of peri-catheter inflammation and tissue ingrowth remain poorly defined. Evidence of bacterial ribosomal RNA in human brain tissue suggests that low-biomass microbial exposure may influence the inflammatory microenvironment surrounding implants. This study examined if microbial signal is detectable in unaltered brain tissue and if catheter implantation produces microbial shifts relevant to shunt dysfunction. Methods: Twenty-nine female mice were assigned to unaltered control (UC), trauma control (TC), plain silicone catheter (PSC), or antibiotic-impregnated catheter (AIC) groups. Brain and cecum tissues were harvested at postoperative days 7 and 28 for 16S rRNA sequencing. Microbial composition and predicted functional pathways were analyzed. A separate cohort underwent longitudinal MRI to assess edema, glial scar formation, and macrophage-associated susceptibility signal. Results: Low-level microbial signal was detected in unaltered brain tissue. Catheter implantation induced material-dependent shifts in brain-associated microbial composition. PSC was associated with enrichment of pro-inflammatory taxa, whereas AIC favored immune-regulatory taxa.
Predicted short-chain fatty acid biosynthesis was highest in AIC and lowest in PSC, while predicted lipopolysaccharide biosynthesis trended higher in PSC. MRI showed similar edema resolution but higher macrophage-associated susceptibility signal in PSC animals. Conclusion: Intracranial catheter implantation produces material-dependent shifts in low-biomass brain-associated microbial signal that parallel differential neuroimmune activation. These findings suggest catheter material may shape a biologically relevant peri-catheter niche with implications for chronic gliosis and proximal shunt obstruction.
2026-02-07T18:08:58Z
24 pages, 10 figures
Zihan Zhu, Dipankar Biswas, Michael Meggyesy, Di Cao, Gwendolyn Williams, Richard Um, Farzad Maroufi, Ryan P. Lee, Jun Hua, Liangliang Zhang, Jeffrey Capadona, Horst V. Recum, Mark G. Luciano

http://arxiv.org/abs/2602.07475v1
Bipartite Graph Attention-based Clustering for Large-scale scRNA-seq Data
2026-02-07T10:10:18Z
scRNA-seq clustering is a critical task for analyzing single-cell RNA sequencing (scRNA-seq) data, as it groups cells with similar gene expression profiles. Transformers, as powerful foundational models, have been applied to scRNA-seq clustering. Their self-attention mechanism automatically assigns higher attention weights to cells within the same cluster, enhancing the distinction between clusters. Existing methods for scRNA-seq clustering, such as graph transformer-based models, treat each cell as a token in a sequence. Their computational and space complexities are $\mathcal{O}(n^2)$ with respect to the number of cells, limiting their applicability to large-scale scRNA-seq datasets. To address this challenge, we propose a Bipartite Graph Transformer-based clustering model (BGFormer) for scRNA-seq data. We introduce a set of learnable anchor tokens as shared reference points to represent the entire dataset.
A bipartite graph attention mechanism is introduced to learn the similarity between cells and anchor tokens, bringing cells of the same class closer together in the embedding space. BGFormer achieves linear computational complexity with respect to the number of cells, making it scalable to large datasets. Experimental results on multiple large-scale scRNA-seq datasets demonstrate the effectiveness and scalability of BGFormer.
2026-02-07T10:10:18Z
Zhuomin Liang, Liang Bai, Xian Yang

http://arxiv.org/abs/2603.02213v1
A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences
2026-02-06T16:30:55Z
Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf's law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens, while in DNA it is reflected in nucleotide composition and long-memory walks under purine-pyrimidine mappings. Existing surrogate models usually preserve either the frequency distribution or the correlation properties, but not both simultaneously.
We introduce a surrogate model that retains both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. Our method generates surrogates of symbolic sequences by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics and long-range scaling while randomising short-range dependencies.
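The mapping of fractional Gaussian noise onto an empirical histogram can be sketched as follows. The abstract does not spell out the assignment rule, so this is only an illustration under assumptions: the noise is generated by Cholesky factorization of the FGN autocovariance (exact but cubic in cost, so suitable for short sequences only), and the frequency-preserving assignment is taken to be rank-based, with the largest noise values receiving the most frequent symbol. The names `fgn_cholesky` and `surrogate` are hypothetical.

```python
import numpy as np


def fgn_cholesky(n, hurst, rng):
    """Sample n points of unit-variance fractional Gaussian noise (0 < hurst < 1)."""
    k = np.arange(n, dtype=float)
    # FGN autocovariance at lag k: 0.5 * (|k+1|^2H - 2|k|^2H + |k-1|^2H).
    gamma = 0.5 * ((k + 1) ** (2 * hurst) - 2 * k ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    cov = gamma[lags]  # Toeplitz covariance matrix
    return np.linalg.cholesky(cov) @ rng.standard_normal(n)


def surrogate(sequence, hurst, seed=0):
    """Return a reordering of `sequence` with the same symbol counts.

    Assumed rule (not confirmed by the source): positions are ranked by the
    noise value, and symbols are handed out in order of decreasing frequency.
    """
    rng = np.random.default_rng(seed)
    symbols, counts = np.unique(np.asarray(sequence), return_counts=True)
    order = np.argsort(-counts, kind="stable")        # most frequent symbol first
    pool = np.repeat(symbols[order], counts[order])   # multiset sorted by frequency rank
    noise = fgn_cholesky(len(pool), hurst, rng)
    out = np.empty_like(pool)
    out[np.argsort(-noise, kind="stable")] = pool     # highest noise gets most frequent symbol
    return out
```

By construction the output has exactly the symbol counts of the input. The sketch does not check how closely the DFA exponent of the result tracks the Hurst parameter of the noise; that would have to be measured separately.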
We validate the model on representative texts in English and Latin, and illustrate its broader applicability with genomic DNA, showing that base composition and DFA scaling are reproduced. This approach provides a principled tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects across language, DNA, and other symbolic domains.
2026-02-06T16:30:55Z
Physica A 683 (2026) 131227
Marcelo A. Montemurro, Mirko Degli Esposti
10.1016/j.physa.2025.131227

http://arxiv.org/abs/2511.08855v2
Path Signatures Enable Model-Free Mapping of RNA Modifications
2026-02-06T14:22:23Z
Detecting chemical modifications on RNA molecules remains a key challenge in epitranscriptomics. Traditional reverse transcription-based sequencing methods introduce enzyme- and sequence-dependent biases and fragment RNA molecules, confounding the accurate mapping of modifications across the transcriptome. Nanopore direct RNA sequencing offers a powerful alternative by preserving native RNA molecules, enabling the detection of modifications at single-molecule resolution. However, current computational tools can identify only a limited subset of modification types within well-characterized sequence contexts for which ample training data exists. Here, we introduce a model-free computational method that reframes modification detection as an anomaly detection problem, requiring only canonical (unmodified) RNA reads without any other annotated data. For each nanopore read, our approach extracts robust, modification-sensitive features from the raw ionic current signal at a site using the signature transform, then computes an anomaly score by comparing the resulting feature vector to its nearest neighbors in an unmodified reference dataset. We convert anomaly scores into statistical p-values to enable anomaly detection at both individual read and site levels. Validation on densely-modified \textit{E. coli} rRNA demonstrates that our approach detects known sites harboring diverse modification types, without prior training on these modifications. We further applied this framework to dengue virus (DENV) transcripts and mammalian mRNAs. For DENV sfRNA, it revealed a novel 2'-O-methylated site, which we validated orthogonally by qRT-PCR assays. These results demonstrate that our model-free approach operates robustly across different types of RNAs and datasets generated with different nanopore sequencing chemistries.
2025-11-12T00:22:56Z
Maud Lemercier, Paola Arrubarrena, Salvatore Di Giorgio, Julia Brettschneider, Thomas Cass, Valerie Griesche, Isabel S. Naarmann-de Vries, Anastasia Papavasiliou, Alessia Ruggieri, Irem Tellioglu, Chia Ching Wu, F. Nina Papavasiliou, Terry Lyons

http://arxiv.org/abs/2602.06394v1
Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization
2026-02-06T05:26:59Z
Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements: genomics (6.7 percentage point F1 gain in variant calling over BPE), finance (30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base-pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%.
We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.
2026-02-06T05:26:59Z
Arvid E. Gollwitzer, Paridhi Latawa, David de Gruijl, Deepak A. Subramanian, Adrián Noriega de la Colina

http://arxiv.org/abs/2506.00597v2
Processing-in-memory for genomics workloads
2026-02-04T07:37:36Z
Low-cost, high-throughput DNA and RNA sequencing (HTS) data is the backbone of the life sciences. Genome sequencing is now becoming a part of Predictive, Preventive, Personalized, and Participatory (termed 'P4') medicine. All genomic data are currently processed in energy-hungry computer clusters and centers, necessitating data transfer, consuming substantial energy, and wasting valuable time. Therefore, there is a need for fast, energy-efficient, and cost-efficient technologies that enable genomics research without requiring data centers and cloud platforms. We recently launched the BioPIM Project to leverage emerging processing-in-memory (PIM) technologies to enable energy- and cost-efficient analysis of bioinformatics workloads. The BioPIM Project focuses on co-designing algorithms and data structures commonly used in genomics with several PIM architectures to achieve the highest cost, energy, and time savings.
2025-05-31T15:13:06Z
IEEE Micro, 46 (2): 70-80, 2026
William Andrew Simon, Leonid Yavits, Konstantina Koliogeorgi, Yann Falevoz, Yoshihiro Shibuya, Dominique Lavenier, Irem Boybat, Klea Zambaku, Berkan Şahin, Mohammad Sadrosadati, Onur Mutlu, Abu Sebastian, Rayan Chikhi, The BioPIM Consortium, Can Alkan
10.1109/MM.2026.3662105

http://arxiv.org/abs/2602.04901v1
Beyond Independent Genes: Learning Module-Inductive Representations for Gene Perturbation Prediction
2026-02-03T16:43:40Z
Predicting transcriptional responses to genetic perturbations is a central problem in functional genomics.
In practice, perturbation responses are rarely gene-independent but instead manifest as coordinated, program-level transcriptional changes among functionally related genes. However, most existing methods do not explicitly model such coordination, due to gene-wise modeling paradigms and reliance on static biological priors that cannot capture dynamic program reorganization. To address these limitations, we propose scBIG, a module-inductive perturbation prediction framework that explicitly models coordinated gene programs. scBIG induces coherent gene programs from data via Gene-Relation Clustering, captures inter-program interactions through a Gene-Cluster-Aware Encoder, and preserves modular coordination using structure-aware alignment objectives. These structured representations are then modeled using conditional flow matching to enable flexible and generalizable perturbation prediction. Extensive experiments on multiple single-cell perturbation benchmarks show that scBIG consistently outperforms state-of-the-art methods, particularly on unseen and combinatorial perturbation settings, achieving an average improvement of 6.7% over the strongest baselines.
2026-02-03T16:43:40Z
Jiafa Ruan, Ruijie Quan, Zongxin Yang, Liyang Xu, Yi Yang

http://arxiv.org/abs/2602.03477v1
ScDiVa: Masked Discrete Diffusion for Joint Modeling of Single-Cell Identity and Expression
2026-02-03T12:50:29Z
Single-cell RNA-seq profiles are high-dimensional, sparse, and unordered, causing autoregressive generation to impose an artificial ordering bias and suffer from error accumulation. To address this, we propose scDiVa, a masked discrete diffusion foundation model that aligns generation with the dropout-like corruption process by defining a continuous-time forward masking mechanism in token space.
ScDiVa features a bidirectional denoiser that jointly models discrete gene identities and continuous values, utilizing entropy-normalized serialization and a latent anchor token to maximize information efficiency and preserve global cell identity. The model is trained via depth-invariant time sampling and a dual denoising objective to simulate varying sparsity levels while ensuring precise recovery of both identity and magnitude. Pre-trained on 59 million cells, scDiVa achieves strong transfer performance across major benchmarks, including batch integration, cell type annotation, and perturbation response prediction. These results suggest that masked discrete diffusion serves as a biologically coherent and effective alternative to autoregression.
2026-02-03T12:50:29Z
19 pages, 11 figures
Mingxuan Wang, Cheng Chen, Gaoyang Jiang, Zijia Ren, Chuangxin Zhao, Lu Shi, Yanbiao Ma

http://arxiv.org/abs/2602.03343v1
MARADONER: Motif Activity Response Analysis Done Right
2026-02-03T10:10:17Z
Inferring the activities of transcription factors from high-throughput transcriptomic or open chromatin profiling, such as RNA-/CAGE-/ATAC-Seq, is a long-standing challenge in systems biology. Identification of highly active master regulators enables mechanistic interpretation of differential gene expression, chromatin state changes, or perturbation responses across conditions, cell types, and diseases. Here, we describe MARADONER, a statistical framework and its software implementation for motif activity response analysis (MARA), utilizing the sequence-level features obtained with pattern matching (motif scanning) of individual promoters and promoter- or gene-level activity or expression estimates. Compared to the classic MARA, MARADONER (MARA-done-right) employs an unbiased variance parameter estimation and a bias-adjusted likelihood estimation of fixed effects, thereby enhancing goodness-of-fit and the accuracy of activity estimation.
Further, MARADONER is capable of accounting for heteroscedasticity of motif scores and activity estimates.
2026-02-03T10:10:17Z
Georgy Meshcheryakov, Andrey I. Buyan

http://arxiv.org/abs/2302.13268v7
Revolutionizing Genomics with Reinforcement Learning Techniques
2026-02-02T04:15:10Z
In recent years, Reinforcement Learning (RL) has emerged as a powerful tool for solving a wide range of problems, including decision-making and genomics. The exponential growth of raw genomic data over the past two decades has exceeded the capacity of manual analysis, leading to a growing interest in automatic data analysis and processing. RL algorithms are capable of learning from experience with minimal human supervision, making them well-suited for genomic data analysis and interpretation. One of the key benefits of using RL is the reduced cost associated with collecting labeled training data, which is required for supervised learning. While there have been numerous studies examining the applications of Machine Learning (ML) in genomics, this survey focuses exclusively on the use of RL in various genomics research fields, including gene regulatory networks (GRNs), genome assembly, and sequence alignment. We present a comprehensive technical overview of existing studies on the application of RL in genomics, highlighting the strengths and limitations of these approaches. We then discuss potential research directions that are worthy of future exploration, including the development of more sophisticated reward functions as RL heavily depends on the accuracy of the reward function, the integration of RL with other machine learning techniques, and the application of RL to new and emerging areas in genomics research.
Finally, we present our findings and conclude by summarizing the current state of the field and the future outlook for RL in genomics.
2023-02-26T08:43:08Z
Mohsen Karami, Hoda Khadijeh, Hoda Jahanian, Roohallah Alizadehsani, Iman Dehzangi, Juan M Gorriz, Yudong Zhang, Jia Wang, Farshid Hajati, Min Yang, Thantrira Porntaveetus, Hamid Alinejad-Rokny

http://arxiv.org/abs/2602.01230v1
Toward Interpretable and Generalizable AI in Regulatory Genomics
2026-02-01T13:46:00Z
Deciphering how DNA sequence encodes gene regulation remains a central challenge in biology. Advances in machine learning and functional genomics have enabled sequence-to-function (seq2func) models that predict molecular regulatory readouts directly from DNA sequence. These models are now widely used for variant effect prediction, mechanistic interpretation, and regulatory sequence design. Despite strong performance on held-out genomic regions, their ability to generalize across genetic variation and cellular contexts remains inconsistent. Here we examine how architectural choices, training data, and prediction tasks shape the behavior of seq2func models. We synthesize how interpretability methods and evaluation practices have probed learned cis-regulatory organization and highlighted systematic failure modes, clarifying why strong predictive accuracy can fail to translate into robust regulatory understanding. We argue that progress will require reframing seq2func models as continually refined systems, in which targeted perturbation experiments, systematic evaluation, and iterative model updates are tightly coupled through AI-experiment feedback loops. Under this framework, seq2func models become self-improving tools that progressively deepen their mechanistic grounding and more reliably support biological discovery.
2026-02-01T13:46:00Z
5 figures, 1 table
Masayuki Nagai, Alan E. Murphy, Kaeli Rizzo, Peter K. Koo

http://arxiv.org/abs/2502.07272v5
GENERator: A Long-Context Generative Genomic Foundation Model
2026-01-31T01:53:58Z
The rapid advancement of DNA sequencing has produced vast genomic datasets, yet interpreting and engineering genomic function remain fundamental challenges. Recent large language models have opened new avenues for genomic analysis, but existing approaches are often limited by restricted training scope, constrained generative capability, or prohibitive computational cost. We introduce GENERator, a generative genomic foundation model for long-context DNA modeling, with a context length of 98k nucleotides, pre-trained on 386 billion nucleotides of eukaryotic DNA. Without task-specific fine-tuning, GENERator exhibits strong intrinsic capabilities: unsupervised embedding analyses reveal phylogenetically coherent structure, and sequence recovery benchmarks demonstrate generative accuracy comparable to or exceeding state-of-the-art models with substantially improved computational efficiency. In a zero-shot setting, GENERator achieves competitive variant effect prediction performance relative to alignment-based methods, while remaining fully alignment-free and broadly applicable across species. With task-specific fine-tuning, the model attains leading performance on established genomic benchmarks. We further demonstrate practical generative applications. GENERator can generate protein-coding DNA sequences that translate into structurally plausible proteins and, through a prompt-guided design framework, design cis-regulatory elements with targeted activity profiles, including synthetic super-enhancers validated by high-throughput UMI-STARR-seq assays. Together, these results establish GENERator as an efficient and biologically grounded framework for genomic interpretation and programmable sequence design.
Code and supplementary resources are available at https://github.com/GenerTeam/GENERator.
2025-02-11T05:39:49Z
Wei Wu, Qiuyi Li, Yuanyuan Zhang, Zhihao Zhan, Ruipu Chen, Mingyang Li, Kun Fu, Junyan Qi, Yongzhou Bao, Chao Wang, Yiheng Zhu, Zhiyun Zhang, Jian Tang, Fuli Feng, Jieping Ye, Yuwen Liu, Hui Xiong, Zheng Wang