https://arxiv.org/api/+OfNR8x/tHKml8cSWGpufOBfEQM 2026-06-13T11:03:01Z 6754 0 15 http://arxiv.org/abs/2606.13556v1 Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation 2026-06-11T16:38:38Z

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point $\hat{G} = \mu + \sum_i \beta_i g_i$, where $\beta_i$ are GWAS-derived effect sizes and $g_i$ are risk-allele counts. Each incoming physiological measurement $P$ produces a non-constitutional deviation $\delta = P - \hat{G}$ that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to $\hat{G}_t = w(t)\,\hat{G}_{\mathrm{genomic}} + [1 - w(t)]\,\bar{P}_t$, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
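The set-point, deviation, and decay formulas in this abstract can be traced with a small Python sketch. It is only an illustration under stated assumptions: the abstract does not give the form of w(t), so the exponential half-life schedule in `prior_weight` is hypothetical, as are all function names and the 14-day default. The 55 ms, 80 ms, and 30 ms values are the HRV example from the abstract.

```python
def genomic_set_point(mu, betas, allele_counts):
    """Genomic prior: G-hat = mu + sum(beta_i * g_i)."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))


def prior_weight(t_days, half_life_days=14.0):
    """Hypothetical decay schedule w(t); the abstract does not specify one.

    Here the genomic prior loses half its weight every `half_life_days`.
    """
    return 0.5 ** (t_days / half_life_days)


def blended_set_point(g_genomic, p_bar, t_days, half_life_days=14.0):
    """G-hat_t = w(t) * G-hat_genomic + [1 - w(t)] * P-bar_t."""
    w = prior_weight(t_days, half_life_days)
    return w * g_genomic + (1.0 - w) * p_bar


def deviation(p, g_hat):
    """Non-constitutional deviation: delta = P - G-hat."""
    return p - g_hat


# HRV example from the abstract: the same 55 ms reading, two different priors.
observed_hrv = 55.0
dev_high_prior = deviation(observed_hrv, 80.0)  # negative: suppression hypothesis
dev_low_prior = deviation(observed_hrv, 30.0)   # positive: enhancement hypothesis
```

The sign of the deviation flips between the two individuals even though the measurement is identical, which is the reversal the abstract describes.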

2026-06-11T16:38:38Z 24 pages, 8 figures, 3 tables. Conceptual framework paper Aruna Dey Suraj Biswas http://arxiv.org/abs/2606.13047v1 Irregular curvature at focal adhesions modulates Piezo1 activity and low frequency ultrasound induced apoptosis in cancer cells 2026-06-11T08:32:52Z

Low-frequency, low-intensity ultrasound (LIUS) has emerged as a promising physical modality capable of inducing selective apoptosis of cancer cells while sparing healthy epithelial cells and fibroblasts. The mechanism underlying this selectivity has so far been unclear; we propose and develop a theoretical framework linking the distinct mechanical behaviours of cancer and healthy cells to their differential responses to LIUS. Cancer cells exhibit inhomogeneous ventral stress-fiber networks, which can produce irregular focal adhesion geometry and inward membrane curvature near focal adhesions under LIUS. These curvature irregularities can favor loose packing of Piezo1 channels, thereby preserving their activity. In contrast, healthy epithelial cells and fibroblasts display more homogeneous cytoskeletal organization, which can result in more regular curvature profiles adjacent to focal adhesions. This regularity leads to curvature-driven cholesterol redistribution, which alters the spatial organization of Piezo1 clusters, reduces coordinated channel activity, and allows the cells to remain in their active, proliferative state when exposed to LIUS. Based on theoretical modeling and previous experimental findings, we propose that differences in cytoskeletal organization and membrane curvature can contribute to distinct Piezo1 activation patterns between healthy and cancerous cells. Our analysis identifies curvature-mediated Piezo1 redistribution as a potential physical basis for LIUS selectivity and provides a mechanistic foundation for designing ultrasound-based therapies to exploit the intrinsic cytoskeletal vulnerabilities of cancer cells.

2026-06-11T08:32:52Z 38 pages, 4 figures Physics of Life Reviews, June 2026 Ivana Pajic-Lijakovic Milan Milivojevic Boris Martinac Peter V. E. McClintock 10.1016/j.plrev.2026.06.004 http://arxiv.org/abs/2604.20782v2 LAFA: A Framework for Reproducible Longitudinal Assessment of Protein Function Annotation Models 2026-06-10T19:08:20Z

Motivation: Protein function prediction is a challenging task and an open problem in computational biology. The Critical Assessment of protein Function Annotation (CAFA) is a triennial, community-driven initiative that provides an independent, large-scale evaluation of computational methods for protein function prediction through time-delayed benchmarking experiments. CAFA has played a key role in highlighting high-performing methodologies and fostering detailed analysis and exchange of ideas. However, outside the periodic CAFA challenges, there is no platform for the continuous evaluation of newly developed methods and tracking performance as function annotations accumulate. Results: Here we introduce the Longitudinal Assessment of Protein Function Annotation Models server (LAFA) as a persistent benchmarking system for protein function prediction methods. LAFA provides a continuous evaluation of containerized function prediction methods, enabling up-to-date and robust comparative assessment of method performance under evolving ground truth. LAFA accelerates methodological iteration, supports reproducibility, and offers a more dynamic and fine-grained view of progress in protein function prediction. Code and Data Availability: LAFA is available at https://functionbench.net/. Detailed evaluation results can be found at https://github.com/anphan0828/CAFA_forever

2026-04-22T17:09:36Z An Phan Yanli Wang Frimpong Boadu Jianlin Cheng Predrag Radivojac Iddo Friedberg http://arxiv.org/abs/2606.11382v1 GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction 2026-06-09T19:05:58Z

Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden of developing and deploying state-of-the-art models continues to increase, limiting their scalability. Most large-scale models are unimodal and overlook the potential of complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules (a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors); (2) we fuse these student modalities using a novel Finsler geometry-aware module; and (3) we distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks. Our code is publicly available at https://github.com/eemokey/glacier.

2026-06-09T19:05:58Z Emily Nguyen Yongchan Hong Harsh Toshniwal Yan Liu Andreas Luttens http://arxiv.org/abs/2604.25701v2 Bayesian Rate Inference for Sequence Motif Dynamics in Systems of Reactive Nucleic Acids 2026-06-09T18:43:22Z

The RNA world hypothesis suggests a pathway by which life emerged on early Earth. It assumes that life started with RNA-based systems capable of storing, transmitting, and replicating information, and envisions that monomers and short RNA oligomers interact to form longer strands, eventually becoming catalytically active ribozymes. Key reactions in RNA pools are hybridization, dehybridization, templated ligation, and cleavage. These reactions depend on many environmental parameters and on the wide range of possible configurations among interacting strands. Efficient descriptions are needed to scan such high-dimensional parameter spaces. Motif rate equations project complex strand reactor dynamics onto sequence motif space. Here we present a Bayesian inference framework to infer their parameters from ligation count data produced by strand reactor simulations. This provides a way to match the simpler motif rate equations to more complex simulations. Additionally, it is a step towards inferring reaction rate constants directly from experimental data, including rigorous uncertainty estimation. This could be an essential procedure to connect theory and experiment and to deepen our understanding of the essential features necessary for life to emerge.

2026-04-28T14:28:37Z 18 pages, 8 figures, pre-submission Johannes Harth-Kitzerow Ulrich Gerland Torsten A. Enßlin http://arxiv.org/abs/2606.11057v1 Flexible Kernels for Protein Property Prediction 2026-06-09T16:20:36Z

Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore, by learning what are in effect structure-aware substitution matrices, we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.

2026-06-09T16:20:36Z 50 pages; to appear at ICML 2026 Martin Jankowiak Yerdos Ordabayev Rudraksh Tuwani Henry N. Ward Hunter Nisonoff James M. McFarland Gevorg Grigoryan http://arxiv.org/abs/2606.10955v1 A kinetic model of shear-induced rupture of short dsDNA 2026-06-09T14:58:44Z

Force-induced dissociation of short double-stranded DNA (dsDNA) is central to single-molecule biophysics and DNA nanotechnology, yet a physically grounded kinetic description of shear-induced rupture for finite-length constructs remains lacking. Here we develop a master equation framework built on a force-dependent nucleation-zipper pathway with single-base transitions, enabling direct calculation of dissociation rates and transition state distances over a broad force range. Applied to a DNA-gold nanoparticle-DNA construct under constant shear force, the model accurately reproduces the experimental room-temperature data in the covered force regime and provides a unified interpretation of prior measurements on similarly sheared duplexes across all force regimes. A central result is that the three-dimensional helical geometry of dsDNA is essential for correctly defining the end-to-end distance under shear in the rod-like polymer model of short dsDNA. We further show that the extracted transition state distances are robust to variations in ssDNA polymer parameters within the experimentally relevant regime. Finally, we analyze the temperature dependence of the transition state distance and discuss how our framework captures globally heated rupture while identifying the additional complications introduced by localized plasmonic heating in gold nanoparticle-coupled constructs. These results provide a predictive kinetic foundation for interpreting force-rupture experiments and for designing force- and temperature-actuated DNA nanostructures.

2026-06-09T14:58:44Z Supporting Information is provided at the end of the main text Ayman Hussein Ralf Bundschuh http://arxiv.org/abs/2605.31498v3 Scalable Inference-Time Annealing with Surrogate Likelihood Estimators 2026-06-08T17:55:28Z

A long-standing challenge in computational chemistry and biophysics is efficiently sampling the Boltzmann distribution of molecules. Generative models have been proposed to address the limitations of conventional sampling techniques by eliminating the computational cost of simulation. A promising direction is to iteratively finetune diffusion models along a temperature ladder, with training data generated via importance sampling during inference-time annealing. Unfortunately, these methods require computing a divergence over the score field to estimate importance weights, rendering them intractable for larger systems. Here we present scalable inference-time annealing (SITA), which retrains flow-based models to generate samples at progressively lower temperatures, using an energy-based model to provide fast surrogate likelihoods. We demonstrate state-of-the-art performance on both Alanine Dipeptide and Alanine Tripeptide while avoiding costly divergence terms. Our code is available at https://github.com/countrsignal/sita.git
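The importance-sampling step mentioned in this abstract can be illustrated with the generic self-normalized estimator for reweighting model samples toward a Boltzmann target. This is a minimal sketch, not the SITA implementation: the function names are hypothetical, and `log_q` stands for whatever model log-likelihood is available (the quantity that is expensive to compute exactly and that a surrogate estimator would approximate).

```python
import math


def importance_weights(energies, log_q, beta_target):
    """Self-normalized weights w_i proportional to exp(-beta_target * E_i) / q(x_i).

    energies:    potential energies E(x_i) of the generated samples
    log_q:       log-likelihoods of the samples under the proposal model
                 (assumed given; only needs to be correct up to a constant)
    beta_target: inverse temperature of the target Boltzmann distribution
    """
    log_w = [-beta_target * e - lq for e, lq in zip(energies, log_q)]
    shift = max(log_w)  # subtract the max for numerical stability
    unnorm = [math.exp(lw - shift) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def effective_sample_size(weights):
    """Kish effective sample size of normalized weights."""
    return 1.0 / sum(w * w for w in weights)


# Toy check: a proposal that is exactly Boltzmann at beta = 1.
energies = [0.0, 1.0, 2.0]
log_q = [-1.0 * e for e in energies]
same_temp = importance_weights(energies, log_q, beta_target=1.0)  # uniform
colder = importance_weights(energies, log_q, beta_target=2.0)     # favors low energy
```

When the target is colder than the proposal, weight concentrates on low-energy samples and the effective sample size drops, which is one reason annealing proceeds in small temperature steps.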

2026-05-29T16:20:59Z 26 pages, 5 figures, submitted to JMLR 2026 Daniel Peñaherrera Rishal Aggarwal David Ryan Koes http://arxiv.org/abs/2606.08647v1 Protein Dynamics Beyond Structure Prediction 2026-06-07T14:23:58Z

The ability to predict protein three-dimensional structures from amino acid sequences is a landmark achievement in molecular biology, where recent deep learning approaches such as AlphaFold are the culmination of decades of work. Yet, the quantitative understanding of how protein sequences give rise to dynamic conformational changes and higher-order assemblies remains unsolved. Folding and conformational states are dynamic, stochastic processes, shaped by sequence, energy, co-translational constraints, chaperone machineries, and the physicochemical conditions of the cellular environment. Recent advances now position the field to move beyond static structural endpoints toward a mechanistic understanding of folding dynamics in living systems. Single-molecule techniques enable time-resolved observation of folding trajectories and intermediate states hitherto hidden by traditional structural biology approaches, while computational innovations and data-driven approaches offer new ways to integrate heterogeneous data across scales. In this Roadmap, we review the current conceptual landscape of protein folding, examine the experimental and theoretical gaps that remain, and discuss emerging strategies that integrate high-resolution measurements with multiscale modeling. We outline a roadmap toward a quantitative and predictive science of protein folding dynamics, conformational kinetics, and macromolecular self-assembly. Realizing this vision would transform our understanding of the dynamics of molecular self-organization, from the folding of individual polypeptides to the emergence of dynamic macromolecular complexes. This will enable rational control of folding and misfolding in health and disease, extend protein engineering principles beyond static structural design, and establish a mechanistic foundation for predictive and personalized interventions in proteostasis-related disorders.

2026-06-07T14:23:58Z 53 pages, 4 figures Juliette Griffié Sviatlana Shashkova Antonio Ciarlo Sreekanth K. Manikandan Claes Andréasson Malin Bäckström Tristan Bereau Hjalmar Brismar Carlos Bustamante Marta Carroni Roberto Covino Andreas Dahlin Sebastian Deindl Lucie Delemotte Arne Elofsson John Eriksson Giovanna Fragneto Anders Gunnarsson Per Hammarström Caroline Ingre Christian Kaiser Petronella Kettunen Mark C. Leake Benjamin Loos Anna Månberg Antonia S. J. S. Mey Richard Neutze Thomas Nyström Karl Palmås Charley Schaefer Markus J. Tamás Nicola Ticozzi Tomás S. Pilvelic Jacopo Sacquegno B. M. Betty Tijms Gunnar von Heijne Björn Wallner Vitali Zhaunerchyk Simon Olsson Joana B. Pereira Julia Fernandez-Rodriguez Fredrik Westerlund Giovanni Volpe http://arxiv.org/abs/2606.02462v2 APLSuite: An Integrated Suite for CD4+ T Cell Epitope Prediction via Antigen Processing Likelihood 2026-06-05T16:44:49Z

Computational epitope prediction is a critical tool for exploring and understanding CD4+ T cell-mediated immune responses, a key aspect of adaptive immunity. Existing computational methods rely primarily on supervised learning and often overlook the essential role of antigen processing in determining binding specificity. To address this limitation, our group developed Antigen Processing Likelihood (APL), an algorithm that integrates crystallographic B-factor, solvent accessible surface area (SASA), hydrogen exchange protection factors (COREX), and sequence entropy. In this paper we introduce APLSuite, a comprehensive and lightweight software suite designed to streamline APL-based epitope prediction. APLSuite integrates distributed RESTful API services, a Python client for data aggregation and processing, a data science tool for efficient epitope computation, and a user-friendly graphical user interface for non-coding users. It provides a seamless and efficient pipeline for APL calculation and epitope prediction that can be completed in minutes with GPU acceleration, a capability not offered by existing tools. This flexible and extensible software suite is deployable on desktop and cloud environments, offering both guided and customizable workflows to meet diverse research needs in immunology and immunotherapy development. (The project page for this work is available at: https://tulane-mettu-landry-lab.github.io/blogs/APLSuite/)

2026-06-01T16:35:12Z Application Note; The source code for this work is available at: https://github.com/Jiarui0923/APL The project page for this work is available at: https://tulane-mettu-landry-lab.github.io/blogs/APLSuite/ Jiarui Li Marco K. Carbullido Jai Bansal Samuel J. Landry Ramgopal R. Mettu http://arxiv.org/abs/2507.08920v4 AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model 2026-06-05T11:04:35Z

We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology that encompasses pretraining scaling laws, emergent capability analysis, an in-context learning mechanism, and a test-time scaling algorithm. To guarantee robust scalability, we establish a predictive scaling law and reveal the progressive emergence of structural understanding from a loss perspective, culminating in a strong 1.7-billion-parameter model. Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework, where AMix-1 recognizes deep evolutionary signals among MSAs and consistently generates structurally and functionally coherent proteins. This framework enables the successful design of a dramatically improved AmeR variant with an up to $50\times$ activity increase over its wild type. Pushing the boundaries of protein engineering, we further empower AMix-1 with an evolutionary test-time scaling algorithm for in silico directed evolution that delivers substantial, scalable performance gains as verification budgets are increased, laying the groundwork for next-generation lab-in-the-loop protein design.

2025-07-11T17:02:25Z Changze Lv Jiang Zhou Siyu Long Lihao Wang Jiangtao Feng Dongyu Xue Yu Pei Hao Wang Zherui Zhang Yuchen Cai Zhiqiang Gao Ziyuan Ma Jiakai Hu Chaochen Gao Jingjing Gong Yuxuan Song Shuyi Zhang Xiaoqing Zheng Deyi Xiong Lei Bai Wanli Ouyang Ya-Qin Zhang Wei-Ying Ma Bowen Zhou Hao Zhou http://arxiv.org/abs/2606.06717v1 ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets 2026-06-04T21:06:31Z

While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets, such as the historically "undruggable" oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow-pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape "lid" volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state-of-the-art generative models reveals weaker predicted binding affinity on these low-concavity interfaces. ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.

2026-06-04T21:06:31Z Saket Reddy Shiwei Liu http://arxiv.org/abs/2606.05541v1 Methods for Inferring Interaction Potentials from Cross-Linking Mass Spectrometry Data 2026-06-04T00:52:12Z

Cross-linking mass spectrometry (XL-MS) has emerged as a powerful quantitative technique for probing intra-protein structural information as well as protein-protein interactions at an unprecedented scale. XL-MS data yield information on the pairwise spatial proximity of proteins through inter-molecular linkers. However, systematic methods for adapting such data to coarse-grained interacting particle models remain limited. Most work focuses on directly fitting radial distribution functions (RDFs), whereas numerous observables that are functionals of the RDF, e.g. coordination numbers, cannot be uniquely inverted. In this work, we develop a framework for parameterizing interaction potentials from such observables in potentially phase-separated mixtures, as encountered in XL-MS results. We establish a connection between this problem and the inverse Henderson problem and adapt algorithms such as Iterative Boltzmann Inversion and Iterative Monte Carlo to its numerical solution. We derive exact and low-density-limit gradient approximations and propose two new algorithms based on an adaptation of the predictor-corrector framework. In total, we evaluate several optimization algorithms on biologically realistic ten-component test systems. We demonstrate that for homogeneous fluids, all methods achieve exceptional efficiency and accuracy. Critically, we further demonstrate successful parametrization in a challenging three-phase system. Here, three algorithms reliably recover the correct parameters: Adam and gradient descent, both employing the low-density derivative, and Newton's method with the exact gradient. These results establish a clear pathway from XL-MS experiments to coarse-grained protein models for systems where phase separation governs biological function, potentially enabling new investigations of biomolecular condensates and protein aggregation.
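For orientation, the textbook Iterative Boltzmann Inversion update that the abstract names as a starting point is U_{n+1}(r) = U_n(r) + alpha * kT * ln(g_n(r) / g_target(r)). The sketch below shows that standard RDF-matching step only; it is not the authors' adaptation to coordination-number-type observables. The function name, the `damping` factor, and the `floor` guard are illustrative assumptions.

```python
import math


def ibi_update(potential, rdf_sim, rdf_target, kT, damping=1.0, floor=1e-8):
    """One textbook Iterative Boltzmann Inversion step on a radial grid.

    potential:  current tabulated pair potential U_n(r_j)
    rdf_sim:    RDF g_n(r_j) measured from a simulation run with U_n
    rdf_target: target RDF g_target(r_j)
    kT:         thermal energy in the same units as the potential
    damping:    step-size factor alpha in (0, 1]; an illustrative choice
    floor:      bins where either RDF is below this are left unchanged,
                because the log ratio is ill-defined there
    """
    updated = []
    for u, g_sim, g_tgt in zip(potential, rdf_sim, rdf_target):
        if g_sim < floor or g_tgt < floor:
            updated.append(u)
        else:
            updated.append(u + damping * kT * math.log(g_sim / g_tgt))
    return updated


# Where the simulated RDF already matches the target, the potential is a fixed point.
converged = ibi_update([0.0, -1.0], [1.0, 2.0], [1.0, 2.0], kT=1.0)

# Where the simulation over-populates a distance, the potential there is raised.
over_structured = ibi_update([0.0], [2.0], [1.0], kT=1.0)
```

In practice each update is followed by a new simulation to measure the next RDF, and the loop repeats until the mismatch is below a tolerance.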

2026-06-04T00:52:12Z 19 pages, 10 figures, 5 tables Börries von Seggern Mohsen Sadeghi http://arxiv.org/abs/2606.05474v1 AlloGen: Conformation-Selective Binder Generation with Differential State Scoring 2026-06-03T21:53:17Z

Protein binder design has largely optimized for affinity alone, leaving conformational selectivity unaddressed: for allosteric targets such as kinases, nuclear receptors, and GPCRs, a binder that engages both active and inactive states provides no functional specificity regardless of how tightly it binds. We introduce AlloGen, a modular framework that decouples backbone generation from a learned state-selectivity scorer $Q_θ$, an SE(3)-invariant interface graph transformer trained via a two-phase curriculum that first learns interface geometry before imposing conformational discrimination. Because $Q_θ$ is fully differentiable and generator-agnostic, it integrates with any backbone generator as a passive reranker or an active gradient-based guide without retraining. Across a diverse benchmark of proteins spanning multiple families and conformational mechanisms, AlloGen consistently identifies binders that preferentially recognize desired structural states while rejecting alternative conformations. Experimental validation on calmodulin further demonstrates that these computational selectivity signals translate to physical molecules, yielding de novo peptides that bind the desired holo conformation while exhibiting no detectable binding to the apo state. Together, these results establish conformational selectivity as a learnable property and provide a general framework for state-selective protein binder design.

2026-06-03T21:53:17Z Hanqun Cao Zachary Quinn Aastha Pal Sumi Kimura Jingjie Zhang Pheng Ann Heng Pranam Chatterjee http://arxiv.org/abs/2605.16331v2 Retrieval and competition: how a protein foundation model starts a protein 2026-06-03T13:43:31Z

Protein language models are increasingly used to guide experimental and clinical decisions, yet it is often unclear whether a confident prediction reflects recognition of biological evidence or retrieval of a statistical default. We examine this distinction for a near-universal biological rule, that proteins begin with methionine, by tracing the computational pathway through which ESM2-8M produces this prediction. The model does not detect methionine at the masked position. Instead, it retrieves a methionine-favouring signal from a reference representation at the beginning-of-sequence token via a position-specific query assembled across layers, with the final output emerging through competition with context-dependent circuits. To understand how positional information reaches the readout, we introduce a norm-direction decomposition of attention scores within rotary frequency bands. Positional encoding operates through coupled changes in query norm and angular alignment distributed across these bands. On sequences whose true N-terminus is not methionine, where the biological question matters, the model predicts methionine anyway. This is not a correct prediction produced by an unexpected mechanism, but the output of a positional-prior retrieval circuit that matches the statistical average and fails where biology diverges from it. Distinguishing the two requires resolution at the level of individual circuits, frequency bands, and query composition, suggesting that mechanistic verification will be necessary, and challenging, for predictions where the biological stakes are higher. Even for the simplest biological rule, the model's prediction is mediated by a distributed computational circuit rather than direct recognition, suggesting that increasing task complexity will further obscure the relationship between model confidence and underlying biological evidence.
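One way to read the "norm-direction decomposition" described in this abstract is through the identity that a dot-product attention score splits into per-band terms, each a product of norms times a cosine. The sketch below illustrates that identity only; the authors' exact decomposition may differ. It assumes each rotary band is the adjacent coordinate pair (2i, 2i+1), whereas some implementations instead pair dimension i with i + d/2. All names are hypothetical.

```python
import math


def band_decomposition(q, k):
    """Split the score q.k into per-band (norm_product, cosine) pairs.

    Assumes an even-length query/key whose adjacent coordinates (2i, 2i+1)
    form one two-dimensional rotary band. For each band b:
        q_b . k_b = |q_b| * |k_b| * cos(angle between q_b and k_b)
    """
    bands = []
    for i in range(0, len(q), 2):
        q_norm = math.hypot(q[i], q[i + 1])
        k_norm = math.hypot(k[i], k[i + 1])
        dot = q[i] * k[i] + q[i + 1] * k[i + 1]
        # The cosine is undefined for a zero vector; report 0.0 in that case.
        cosine = dot / (q_norm * k_norm) if q_norm > 0 and k_norm > 0 else 0.0
        bands.append((q_norm * k_norm, cosine))
    return bands


def score_from_bands(bands):
    """Recombine the per-band terms into the full (unscaled) attention score."""
    return sum(norm_product * cosine for norm_product, cosine in bands)


# Toy vectors: band 0 is perfectly aligned, band 1 is orthogonal.
q = [3.0, 4.0, 1.0, 0.0]
k = [3.0, 4.0, 0.0, 2.0]
bands = band_decomposition(q, k)
score = score_from_bands(bands)
```

Separating the two factors makes it possible to ask whether a position-dependent change in score comes from a larger query norm in a band or from better angular alignment with the key.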

2026-05-05T17:51:21Z updated figure 4 Piotr Jedryszek Oliver M. Crook