https://arxiv.org/api/wMsL+FUI5A9GEPDq7nsCLDd6KAo 2026-06-21T07:57:07Z 13258 30 15 http://arxiv.org/abs/2606.16044v1 Circuit Tracing in Autoregressive Protein Language Models 2026-06-14T22:28:50Z

Protein language models (pLMs) can generate novel protein sequences with properties beyond those observed in nature, yet the mechanisms underlying protein generation remain poorly understood. Existing mechanistic interpretability methods based on sparse autoencoders and transcoders primarily focus on protein representation learning models and do not capture the computation required for autoregressive generation. Here, we introduce ProGenMech, a mechanistic interpretability framework for generative protein language models that extends cross-layer transcoders (CLTs) to ProGen3, a sparse Mixture-of-Experts model trained for both causal generation and span infilling. Unlike per-layer approaches, CLTs reconstruct each layer using sparse latent variables from all preceding layers, enabling faithful recovery of inter-layer generative computation. We further develop a zero-shot circuit discovery framework to identify sparse latent circuits responsible for protein generation and fitness prediction. In causal generation and zero-shot fitness estimation tasks, ProGenMech outperforms local transcoder baselines in recovering ProGen3's probability distribution and functional scoring behavior, while matching the original model's generative distribution in span infilling tasks. Moreover, the recovered circuits reveal biologically meaningful motifs and functional regions associated with conserved sequence patterns and protein fitness landscapes, establishing a foundation for interpretable and steerable protein generation.

2026-06-14T22:28:50Z Accepted into the Mechanistic Interpretability Workshop at ICML 2026. 24 pages, 14 figures Darin Tsui William Deinzer Daniel Saeedi Amirali Aghazadeh http://arxiv.org/abs/2602.22673v2 Forecasting Bacterial Antimicrobial Resistance Trends Using Machine Learning on WHO GLASS Surveillance Data: A Retrieval-Augmented Generation Approach for Policy Decision Support 2026-06-13T19:25:09Z

Background: Antimicrobial resistance (AMR) is a global health threat. While the WHO Global Antimicrobial Resistance and Use Surveillance System (GLASS) provides standardized data, population-level machine learning forecasting of resistance trends remains limited. Translating computational forecasts into policy requires transparent interpretation mechanisms. Methods: Surveillance data (2021-2023) comprising 5,909 observations across 44 countries and five WHO regions were processed. A rigorous temporal split prevented data leakage. Six models (Naive, Linear, Ridge, XGBoost, LightGBM, LSTM) were benchmarked to forecast one-year-ahead resistance rates using features including prior-year resistance and antibiotic consumption. Evaluation metrics (MAE, RMSE, sMAPE) were computed, with 95% bootstrap confidence intervals for MAE. A local Retrieval-Augmented Generation (RAG) system utilizing Gemma 4 was implemented to translate forecast findings into policy guidance grounded in retrieved WHO documents. Results: XGBoost achieved the best performance (test MAE = 6.13% [95% CI: 5.83-6.44]), an 85.3% error reduction versus the naive baseline (MAE = 41.79%). SHAP analysis identified prior-year resistance as the dominant predictor (50.5% gain), confirming strong autoregressive behavior. Regional forecast error tracked closely with surveillance coverage, ranging from 3.65% in the European Region to 8.61% in South-East Asia. The RAG pipeline generated accurate, source-attributed policy responses without fabricated citations. Conclusion: Short-term resistance rates exhibit strong temporal autocorrelation and can be accurately forecast using gradient boosting. Coupling these forecasts with a hallucination-resistant RAG system provides a scalable, evidence-based decision-support framework for AMR governance.
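
The headline figures in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the percentile bootstrap shown is one common way to obtain a confidence interval for MAE, not necessarily the procedure the authors used.

```python
import random

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_error_reduction(mae_model, mae_baseline):
    """Fractional reduction in MAE relative to a baseline model."""
    return (mae_baseline - mae_model) / mae_baseline

def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for MAE, resampling observations with replacement.

    Assumption: a plain percentile bootstrap; the abstract does not say
    which bootstrap variant was applied.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(abs(y_true[i] - y_pred[i]) for i in idx) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# MAE values quoted in the abstract: XGBoost 6.13, naive baseline 41.79.
reduction = relative_error_reduction(6.13, 41.79)
print(f"{100 * reduction:.1f}%")  # 85.3%
```

The 85.3% figure is therefore a relative reduction in MAE, (41.79 - 6.13) / 41.79, rather than a difference in percentage points.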

2026-02-26T06:45:08Z 20 pages, 8 figures, code and data available at https://github.com/TanvirTurja/amr-forecasting-rag Md Tanvir Hasan Turja http://arxiv.org/abs/2602.13121v2 LinkedNN: a neural model of linkage disequilibrium decay for recent effective population size inference 2026-06-13T16:47:46Z

Summary: A bioinformatics tool is presented for estimating recent effective population size by using a neural network to automatically compute linkage disequilibrium-related features as a function of genomic distance between polymorphisms. The new method outperforms existing deep learning and summary statistic-based approaches using relatively few sequenced individuals and variant sites, making it particularly valuable for molecular ecology applications with sparse, unphased data. Availability and implementation: The program is available as an easily installable Python package with documentation here: https://pypi.org/project/linkedNN/. The open source code is available from: https://github.com/the-smith-lab/LinkedNN.

2026-02-13T17:18:21Z Chris C R Smith http://arxiv.org/abs/2606.15012v1 A Kuramoto-von Mises Time Series Model for Probabilistic Modeling of Coupled Oscillators 2026-06-12T23:17:08Z

A system of coupled oscillators provides a fundamental framework for modeling a wide range of physical and biological phenomena. In neuroscience, the central nervous system exhibits synchronized oscillatory activity across adjacent brain regions, giving rise to traveling wave dynamics, for instance during sleep. Similarly, in the gastrointestinal system, neuromuscular cells coordinate their oscillations to generate propagating waves of slow wave activity. To estimate probability distributions of multivariate phase relationships, existing approaches typically rely on equilibrium thermodynamics, expressing the system in a Boltzmann form through a pairwise exponential family distribution. However, these assumptions are often violated in real-world systems, which are inherently dynamic and frequently transition between equilibrium and non-equilibrium regimes. To address this, we propose an efficient method for estimating the probability distribution of coupled oscillators that does not assume thermodynamic equilibrium. Using a Langevin dynamics-based construction, the approach enables accurate modeling even in non-equilibrium regimes. The maximum likelihood estimation method is shown to have a closed-form algebraic solution in the high-sampling-rate regime, a condition commonly satisfied by modern data acquisition systems, which makes it readily applicable in practice. We demonstrate its robustness on simulated data, where it outperforms existing approaches in non-equilibrium settings, and further illustrate its utility for characterizing dynamic brain traveling waves in response to brain stimulation and in hypothesis testing within the context of electrophysiologic recordings of the human stomach.
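
For readers unfamiliar with Langevin-type oscillator models, the following is a minimal sketch of a noisy Kuramoto system integrated with the Euler-Maruyama scheme. It is a generic textbook construction, not the authors' Kuramoto-von Mises estimator; all function names and parameter values are hypothetical.

```python
import cmath
import math
import random

def simulate_noisy_kuramoto(theta0, omega, coupling, noise_sd, dt, n_steps, seed=0):
    """Euler-Maruyama integration of the noisy Kuramoto model

        d(theta_i) = [omega_i + (K/N) * sum_j sin(theta_j - theta_i)] dt
                     + noise_sd * dW_i

    Returns the phases after n_steps steps of size dt.
    """
    rng = random.Random(seed)
    theta = list(theta0)
    n = len(theta)
    sqrt_dt = math.sqrt(dt)
    for _ in range(n_steps):
        drift = [
            omega[i]
            + (coupling / n) * sum(math.sin(theta[j] - theta[i]) for j in range(n))
            for i in range(n)
        ]
        theta = [
            theta[i] + drift[i] * dt + noise_sd * sqrt_dt * rng.gauss(0.0, 1.0)
            for i in range(n)
        ]
    return theta

def order_parameter(theta):
    """Kuramoto order parameter r in [0, 1]; r = 1 means perfect phase synchrony."""
    return abs(sum(cmath.exp(1j * t) for t in theta) / len(theta))

# Two identical oscillators, weak noise, started 2 radians apart (toy values).
final = simulate_noisy_kuramoto(
    theta0=[0.0, 2.0], omega=[1.0, 1.0], coupling=1.0,
    noise_sd=0.05, dt=0.01, n_steps=5000,
)
print(order_parameter(final))
```

With identical natural frequencies and weak noise, the phase difference relaxes toward zero and the order parameter ends close to 1.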

2026-06-12T23:17:08Z 15 pages, 4 figures Yun Hwang Todd P. Coleman http://arxiv.org/abs/2606.14692v1 Implications of hierarchical Markov models of behavior: on irreversibility, predictability, and dimensionality 2026-06-12T17:55:28Z

The maturation of quantitative tools for studying the high-level structure of animal behavior, and especially tools which represent spontaneous behavior as a sequence of stereotyped and neurally well-defined 'syllables', demands that the field revisit a fundamental theoretical question: if the coarse structure of behavior can be accurately described by Markov models, what do these models really tell us about behavior? In this work, we explore the theoretical implications of these models and discuss how they allow us to quantitatively formulate questions about the sequence-like nature and effective dimensionality of behavior. One important insight is that the eigenvalues and eigenvectors of various model-associated matrices furnish interpretable time scales and modifications of behavior that occur on those time scales. We illustrate our points using both toy examples and Markov models fit to real data. By analyzing the consequences of Markov representations, we clarify the theoretical meaning of progress in quantifying behavior.
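
The link between eigenvalues and time scales can be made concrete with the simplest possible case. The sketch below is an illustration under stated assumptions, not the authors' analysis: it uses a hypothetical two-state chain with transition matrix P = [[1-a, a], [b, 1-b]], whose eigenvalues are 1 and 1 - a - b. The second eigenvalue sets how fast any initial distribution relaxes to the stationary one.

```python
import math

def two_state_markov_summary(a, b):
    """Eigen-summary of P = [[1-a, a], [b, 1-b]], with 0 < a, b < 1 and a + b != 1.

    Returns (second eigenvalue, stationary distribution, relaxation time in steps).
    """
    lam2 = 1.0 - a - b                           # the non-trivial eigenvalue
    stationary = (b / (a + b), a / (a + b))      # left eigenvector for eigenvalue 1
    timescale = -1.0 / math.log(abs(lam2))       # steps for deviations to shrink by 1/e
    return lam2, stationary, timescale

def step(dist, a, b):
    """Advance a row-vector distribution by one step of the chain."""
    p0, p1 = dist
    return (p0 * (1 - a) + p1 * b, p0 * a + p1 * (1 - b))

# Toy switching probabilities between two behavioral "syllables".
lam2, stationary, timescale = two_state_markov_summary(a=0.1, b=0.3)

# The deviation from stationarity shrinks by exactly lam2 on every step.
dist = (1.0, 0.0)
dev_before = dist[0] - stationary[0]
dist = step(dist, 0.1, 0.3)
dev_after = dist[0] - stationary[0]
print(dev_after / dev_before, lam2)
```

For larger state spaces the same reading applies to each eigenvalue of the transition matrix, which would normally be obtained with a numerical eigensolver instead of a closed form.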

2026-06-12T17:55:28Z Accepted to the Proceedings Track of the 9th annual conference on Cognitive Computational Neuroscience (CCN, 2026) John J. Vastola Kanaka Rajan 10.32470/06d965e7 http://arxiv.org/abs/2606.14603v1 Towards In Silico Cancer Therapy Design: An Agent-Based Approach for GPU-Accelerated Molecular Pathway Simulation 2026-06-12T16:25:01Z

Agent-based modelling is gaining recognition as a powerful approach for simulating complex cellular pathways, owing to its ability to reproduce emergent biological behaviours without requiring extensive kinetic parameterisation. In this article, we present a GPU-accelerated agent-based simulator specifically designed to model and analyse signalling pathways involved in cancer progression, and to evaluate therapeutic interventions. Our approach leverages the computing capabilities of FLAME GPU 2, a GPU-accelerated agent-based modelling framework, to efficiently manage simulations involving millions of molecules interacting within a three-dimensional environment. Each molecule is represented as an autonomous agent with defined physical properties, capable of binding, releasing reaction products, migrating between compartments, and interacting based on spatial proximity. An intuitive graphical interface supports model construction, parameter setup, and real-time modification of treatment strategies. As the primary focus of this paper, we validate the simulator on the MAPK/ERK cascade affected by the BRAFV600E mutation, demonstrating that it accurately reproduces dose-response trends observed in clinical data and outperforms both deterministic models and our prior agent-based implementations. A second case study extends the approach to nuclear signalling by reproducing the dynamics of cFos expression and phosphorylation. This demonstrates the simulator's ability to capture compartmentalised regulation, reproducing transient mRNA responses and protein accumulation, including the effect of an unresolved negative transcriptional regulator. Together, these results show that GPU-accelerated ABM can faithfully replicate both drug response and emergent gene expression dynamics, providing a scalable and biologically grounded computational tool for supporting precision oncology.

2026-06-12T16:25:01Z 16 pages, 7 figures, 2 tables. A preliminary version of this work appeared in the Collections of Short Papers of CIBB 2025 (20th International Conference on Computational Intelligence Methods for Bioinformatics and Biostatistics, Milan, 10-12 September 2025) Stefano Maestri http://arxiv.org/abs/2606.14835v1 The Essential Role Of Ribosomal Feedback In Bacterial Cell Growth And Metabolic Load -- A Systems Biology Approach For Unveiling Shared Resources Regulation Within Synthetic Genetic Circuits 2026-06-12T15:35:39Z

Modeling growth in bacterial cells is a major issue in systems and synthetic biology. Despite several growth rate functions proposed in the literature, most focus on nutrient composition without explicitly accounting for the possible perturbation provided by the expression of recombinant genes, an effect known as cell load or burden. On the other hand, mathematical models that attempt to provide mechanistic details on the phenomena, leveraging ribosome partitioning and nutrient availability, are generally too detailed and complex to be easily applied to the rational design of synthetic genetic circuits. A bottom-up approach is adopted herein to identify and analyze the minimal model structure, thereby unveiling the fundamental role of negative feedback in ribosomal synthesis in predicting the effects of cell load on both gene expression and growth rate. Indeed, to ensure cellular efficiency, ribosome synthesis must be finely regulated. While an increased number of ribosomes generally enhances protein production and cellular performance, their synthesis incurs a high energetic cost. For this reason, cells have evolved mechanisms to tightly control ribosome synthesis, avoiding unnecessary accumulation. One of the key regulatory strategies, usually neglected in previous cell models, involves a negative feedback loop that modulates the production of ribosomal components. This feedback ensures that ribosomes are produced only in the amount strictly needed, balancing functionality and energy expenditure. This work evaluates the individual contribution of this feedback under heterologous expression conditions using minimal gene-circuit models, explicitly linking ribosome allocation, hidden couplings between protein synthesis levels, and growth rate.

2026-06-12T15:35:39Z Chiara Cimolato Elisa Gaetan Lorenzo Pasotti Luca Schenato Massimo Bellato http://arxiv.org/abs/2606.14449v1 Measurement-limited learning of conformational heterogeneity in cryo-electron microscopy 2026-06-12T13:33:56Z

Cryogenic electron microscopy images sample individual biomolecules from their conformational landscapes, offering a route to infer the distributions underlying molecular mechanisms. However, because images are indirect measurements, they limit which features of an underlying landscape are statistically identifiable. In ensemble reweighting, this problem appears as a choice of resolution: conformational space is discretized into representative structures whose population weights are inferred from images. Adding structures increases nominal resolution, but nearby conformations may generate overlapping image distributions and indistinguishable weights. Here, we develop an information-theoretic framework that selects representative conformations by maximizing mutual information between ensemble weights and images under a probabilistic forward model. Analytically, we show in a one-dimensional Gaussian model that measurement noise sets the optimal spacing. Applied to molecular conformations sampled from simulation, the framework constructs near-optimal ensembles that span heterogeneity while avoiding redundancy. Thus, the measurement process induces a maximally learnable coarse graining of conformation space.

2026-06-12T13:33:56Z 35 pages (7 of main text and 28 of Appendices), 3 figures Henry H. Mattingly Luke Evans Pilar Cossio http://arxiv.org/abs/2511.06426v4 Robust Parametric Estimation of Avian Cranial Morphology 2026-06-12T13:10:07Z

Understanding the growth and form of complex morphological structures is one of the most fundamental problems in biology. While many prior works have analyzed the beak morphology of Darwin's finches, other cranial features are relatively less explored. In this work, we develop geometric and statistical methods for analyzing the skull morphology of Darwin's finches and their relatives, focusing on the relationship between their skull dimensions, orbit curvature, and neurocranial geometries. Unlike traditional landmark-based approaches that scale linearly with human labor, our framework is fully unsupervised. Specifically, by utilizing tools in computational geometry, differential geometry, and numerical optimization, we develop efficient algorithms for quantifying various key geometric features of the skull. We then perform a statistical analysis and discover a strong correlation between skull size and orbit curvature. Based on our findings, we further establish a predictive model that can estimate the orbit curvature using easily obtainable linear skull measurements. Our results show that the predictive model is highly effective and capable of explaining 85.48% of the variance in curvature with an average prediction error of only 6.35%. Altogether, our work establishes a rigorous foundation for the digital estimation and high-throughput phenotyping of large-scale museum collections, overcoming the scalability bottlenecks of manual methods.

2025-11-09T15:46:26Z Kaikwan Lau Gary P. T. Choi http://arxiv.org/abs/2605.14998v3 Learning Developmental Scaffoldings to Guide Self-Organisation 2026-06-12T11:37:36Z

From subcellular structures to entire organisms, many natural systems generate complex organisation through self-organisation: local interactions that collectively give rise to global structure without any blueprint of the outcome. Yet a significant portion of the information driving such processes is not produced by self-organisation itself; instead, it is often offloaded to the initial conditions of the system. Biological development is a prime example, where maternal pre-patterns encode positional and symmetry-breaking information that scaffolds the self-organising process. From maternal morphogen gradients in early embryogenesis to tissue-level morphogenetic pre-patterns guiding organ formation, this transfer of information to initial conditions, analogous to a memory-compute trade-off in computational systems, is a fundamental part of developmental processes. In this work, we study this offloading phenomenon by introducing a model that jointly learns both the self-organisation rules and the pre-patterns, allowing their interplay to be varied and measured under controlled conditions: a Neural Cellular Automaton (NCA) paired with a learned coordinate-based pattern generator (SIREN), both trained simultaneously to generate a set of patterns. We provide information-theoretic analyses of how information is distributed between pre-patterns and the self-organising process, and show that jointly learning both components yields improvements in robustness, encoding capacity, and symmetry breaking over purely self-organising alternatives. Our analysis further suggests that effective pre-patterns do not simply approximate their targets; rather, they bias the developmental dynamics in ways that facilitate convergence, pointing to a non-trivial relationship between the structure of initial conditions and the dynamics of self-organisation.

2026-05-14T16:01:25Z 8 pages + acknowledgements and references, 5 figures. Camera-ready version for ALife 2026 Milton L. Montero Elias Najarro Jakob Schauser Sebastian Risi http://arxiv.org/abs/2606.13620v1 Balancing label resolution and computational cost in dynamical models of lipid metabolism 2026-06-11T17:34:33Z

Lipid metabolism is a central biological process that is commonly studied using destructive mass-spectrometry experiments. A recently proposed strategy uses multiple labels to extract temporal information about lipid metabolism from a single destructive measurement. However, the computational complexity of the model-based data analysis increases rapidly with the number of labels, creating a fundamental trade-off between the information content of the measurements and the cost of analysis. Here, we examine how the number of modelled labels affects parameter estimation accuracy, trajectory recovery, and computational cost, and whether modelling fewer labels than are experimentally available can mitigate this trade-off. Using synthetic data from a five-label experiment, we find that modelling three of the five labels provides a practical balance between experimental feasibility, inferential power, and computational tractability. In an application to hepatocyte triglyceride cycling, we further show that the most cost-efficient, single-label model can yield biologically implausible predictions for unobserved species, whereas models that resolve more labels better constrain these latent dynamics. These results provide practical guidance for selecting model resolution in multi-label experiments and establish a quantitative basis for balancing inferential power against computational cost.

2026-06-11T17:34:33Z 3 Supplementary Files Paul Jonas Jost Christoph Thiele Jan Hasenauer http://arxiv.org/abs/2606.13475v1 A likelihood-based framework for simultaneously learning both noise and growth dynamics using biologically-informed neural networks 2026-06-11T15:29:14Z

In recent years, neural ordinary differential equation frameworks such as Biologically-Informed Neural Networks (BINNs) have shown promise for learning mechanistic laws from sparse data. However, most existing approaches implicitly assume homoscedastic Gaussian noise, and therefore do not account for potentially meaningful structure in biological variability. Here, we present an extension to the existing BINNs framework that includes a learnable noise model, allowing discovery of the noise model directly from data. Using population growth as an example, we demonstrate that the framework accurately recovers the underlying noise structure and improves predictions of the underlying growth laws compared to existing approaches. As such, this work establishes a general likelihood-based framework for jointly learning dynamics and heteroscedastic noise within mechanistic neural network approaches.
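
The difference between a homoscedastic and a heteroscedastic Gaussian likelihood can be sketched in a few lines. This is an illustration only: the proportional noise model and all names below are hypothetical, and the abstract does not specify which noise parameterisation the extended BINNs framework learns.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of observations y under independent N(mu_i, sigma_i^2)."""
    return sum(
        0.5 * math.log(2.0 * math.pi * s * s) + (yi - mi) ** 2 / (2.0 * s * s)
        for yi, mi, s in zip(y, mu, sigma)
    )

def proportional_noise(mu, cv):
    """Hypothetical noise model: standard deviation proportional to the predicted mean.

    In a learning setting, cv would be a trainable parameter fitted jointly
    with the dynamics by minimising gaussian_nll.
    """
    return [cv * abs(m) for m in mu]

mu = [1.0, 10.0, 100.0]   # model predictions (toy values)
y = [1.1, 11.0, 110.0]    # observations whose residuals grow with the mean

# Best single constant sigma (maximum-likelihood estimate under homoscedasticity).
sigma_const = math.sqrt(sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y))

nll_homo = gaussian_nll(y, mu, [sigma_const] * len(y))
nll_hetero = gaussian_nll(y, mu, proportional_noise(mu, cv=0.1))
print(nll_hetero < nll_homo)  # True
```

When the spread of the data scales with the signal, as in this toy example, even the best constant-variance fit is penalised relative to a noise model that tracks the mean.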

2026-06-11T15:29:14Z 28 pages (including one page SI), 6 figures (one in SI) Rebecca M. Crossley Ruth E. Baker http://arxiv.org/abs/2606.13386v1 Mathematical Modeling of HDV RNA, HBV DNA, and HBsAg Dynamics during Lonafarnib-Based Therapy: Insights from the LOWR HDV-1 Study 2026-06-11T14:16:09Z

Lonafarnib (LNF) is an investigational drug targeting hepatitis delta virus (HDV) but not hepatitis B virus (HBV), providing a unique opportunity to model HDV kinetics and how changes in HDV affect HBV. We performed a detailed kinetic analysis and developed a mathematical model to explain serum HBV DNA, HDV RNA and hepatitis B surface antigen (HBsAg) kinetics in 15 HBV/HDV coinfected patients receiving LNF-based treatment. After a delay of 0-2 days, patients experienced a rapid 1st-phase HDV-decline followed by either a viral plateau, 2nd slower-decline phase, or viral breakthrough (VB). LNF monotherapy led to a flat-partial-response (often followed by VB), while LNF combination therapy with ritonavir or pegylated interferon-$α$ (PEG-IFN$α$) was associated with a biphasic HDV decline (without VB). All treatments except LNF+PEG-IFN$α$ had at least one patient experiencing an increase in HBV on-treatment. Our model successfully reproduced the observed HDV and HBV kinetics. We estimated an HDV RNA half-life of 1.26 days [95% confidence interval, CI: 1.05--1.47] in serum and treatment efficacy of 94% in inhibiting HDV RNA production across all treatments [95% CI: 89%--97%], as reflected by the 1st phase HDV decline. The 2nd phase of HDV decline was explained by a time-dependent increase in efficacy, reaching a maximum of 98.9%. The model explained the increase in serum HBV DNA by a median 4-fold [interquartile range, IQR: 1--28] increase in HBV DNA production rate when HDV declined below an inhibitory threshold. The stability of serum HBsAg was explained by a constant number of HBsAg-producing cells.
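
The reported half-life and efficacy can be translated into the quantities usually used in viral kinetic models. The sketch below rests on labeled assumptions: it uses the standard first-phase approximation V(t)/V0 = 1 - ε + ε·exp(-c·t), which holds when the number of virus-producing cells is treated as constant, and it ignores the 0-2 day delay. It is not necessarily the model the authors fitted, and the function names are hypothetical.

```python
import math

def clearance_rate(half_life_days):
    """Convert a serum half-life (days) into a first-order clearance rate c (per day)."""
    return math.log(2.0) / half_life_days

def first_phase_ratio(t_days, efficacy, c):
    """V(t)/V0 when production is cut by `efficacy` at t = 0.

    Assumption: producing-cell numbers stay constant over the first phase.
    """
    return 1.0 - efficacy + efficacy * math.exp(-c * t_days)

def plateau_log10_drop(efficacy):
    """Log10 viral load change once the first phase has levelled off."""
    return math.log10(1.0 - efficacy)

c = clearance_rate(1.26)                      # half-life from the abstract
print(f"c = {c:.2f} per day")                 # c = 0.55 per day
print(f"{plateau_log10_drop(0.94):.2f}")      # -1.22
print(f"{plateau_log10_drop(0.989):.2f}")     # -1.96
```

Under these assumptions, 94% efficacy corresponds to a first-phase drop of roughly 1.2 log10, and the maximum efficacy of 98.9% to roughly 2 log10.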

2026-06-11T14:16:09Z Adquate Mhlanga Louis Shekhtman Rami Zakh Sarah Duehren Ashish Goyal Alexander Churkin Vladimir Reinharz Danny Barash Jeffrey Glenn Ohad Etzion Scott J. Cotler Cihan Yurdaydin Harel Dahari http://arxiv.org/abs/2606.12854v1 Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization 2026-06-11T03:38:46Z

Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs: Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA surpasses both GPT-4o and GPT-5 (up to 12% F1 gain) at a fractional cost using just 1,008 training examples. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.

2026-06-11T03:38:46Z 8 pages, 2 figures, 12 tables. To appear at BioNLP Workshop, ACL 2026 Gaurav Kumar http://arxiv.org/abs/2606.12838v1 OCOO-T : A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction 2026-06-11T03:04:38Z

Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.

2026-06-11T03:04:38Z 22 pages, 6 figures Danning Jiang Zheming An Yalong Zhao Lipeng Lai