http://arxiv.org/abs/2606.13556v1
Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation
2026-06-11T16:38:38Z
Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2).
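The set-point formulas quoted in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not specify the decay weight w(t), so the exponential form and the `tau_days` parameter below are hypothetical, as are all function names.

```python
import math


def genomic_set_point(mu, betas, allele_counts):
    """Genomic prior: G-hat = mu + sum(beta_i * g_i)."""
    if len(betas) != len(allele_counts):
        raise ValueError("betas and allele_counts must have the same length")
    return mu + sum(b * g for b, g in zip(betas, allele_counts))


def prior_weight(t_days, tau_days=30.0):
    """Hypothetical w(t) = exp(-t / tau); the abstract gives no explicit form."""
    return math.exp(-t_days / tau_days)


def blended_set_point(g_hat_genomic, p_bar_t, t_days, tau_days=30.0):
    """G-hat_t = w(t) * G-hat_genomic + (1 - w(t)) * P-bar_t."""
    w = prior_weight(t_days, tau_days)
    return w * g_hat_genomic + (1.0 - w) * p_bar_t


def deviation(p, g_hat):
    """Non-constitutional deviation: delta = P - G-hat."""
    return p - g_hat


def hypothesis(delta):
    """Label the sign of the deviation using the abstract's terminology."""
    if delta < 0:
        return "suppression"
    if delta > 0:
        return "enhancement"
    return "no deviation"


if __name__ == "__main__":
    observed_hrv = 55.0  # ms, the observation used in the abstract
    for prior_hrv in (80.0, 30.0):  # ms, the two priors used in the abstract
        delta = deviation(observed_hrv, prior_hrv)
        print(f"prior {prior_hrv} ms: delta = {delta} ms, {hypothesis(delta)}")
```

With the abstract's numbers, the same 55 ms reading gives a deviation of -25 ms against an 80 ms prior and +25 ms against a 30 ms prior, which is the sign reversal the authors describe.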
We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
2026-06-11T16:38:38Z
24 pages, 8 figures, 3 tables. Conceptual framework paper
Aruna Dey, Suraj Biswas

http://arxiv.org/abs/2606.13047v1
Irregular curvature at focal adhesions modulates Piezo1 activity and low frequency ultrasound induced apoptosis in cancer cells
2026-06-11T08:32:52Z
Low-frequency, low-intensity ultrasound (LIUS) has emerged as a promising physical modality capable of inducing selective apoptosis of cancer cells, while sparing healthy epithelial cells and fibroblasts. Hitherto, the mechanism underlying this selectivity has been unclear, but we now propose and develop a theoretical framework linking the distinct mechanical behaviours of cancer versus healthy cells to their differential responses to LIUS. We point out that cancer cells exhibit inhomogeneous ventral stress-fiber networks, which can produce irregular focal adhesion geometry and inward membrane curvature near focal adhesions under LIUS. These curvature irregularities can favor loose packing of Piezo1 channels, thereby preserving their activity. In contrast, healthy epithelial cells and fibroblasts display more homogeneous cytoskeletal organization, which can result in more regular curvature profiles adjacent to focal adhesions. This leads to curvature-driven cholesterol redistribution, resulting in altered spatial organization of Piezo1 clusters and reduced coordinated channel activity, and allowing cells to remain in their active, proliferative state when exposed to LIUS.
Based on theoretical modeling and previous experimental findings, we propose that differences in cytoskeletal organization and membrane curvature can contribute to distinct Piezo1 activation patterns between healthy and cancerous cells. Our analysis identifies curvature-mediated Piezo1 redistribution as a potential physical basis for LIUS selectivity and provides a mechanistic foundation for designing ultrasound-based therapies to exploit the intrinsic cytoskeletal vulnerabilities of cancer cells.
2026-06-11T08:32:52Z
38 pages, 4 figures
Physics of Life Reviews, June 2026
Ivana Pajic-Lijakovic, Milan Milivojevic, Boris Martinac, Peter V. E. McClintock
10.1016/j.plrev.2026.06.004

http://arxiv.org/abs/2604.20782v2
LAFA: A Framework for Reproducible Longitudinal Assessment of Protein Function Annotation Models
2026-06-10T19:08:20Z
Motivation: Protein function prediction is a challenging task and an open problem in computational biology. The Critical Assessment of protein Function Annotation (CAFA) is a triennial, community-driven initiative that provides an independent, large-scale evaluation of computational methods for protein function prediction through time-delayed benchmarking experiments. CAFA has played a key role in highlighting high-performing methodologies and fostering detailed analysis and exchange of ideas. However, outside the periodic CAFA challenges, there is no platform for the continuous evaluation of newly developed methods and tracking performance as function annotations accumulate.
Results: Here we introduce the Longitudinal Assessment of Protein Function Annotation Models server (LAFA) as a persistent benchmarking system for protein function prediction methods. LAFA provides a continuous evaluation of containerized function prediction methods, enabling up-to-date and robust comparative assessment of method performance under evolving ground truth. LAFA accelerates methodological iteration, supports reproducibility, and offers a more dynamic and fine-grained view of progress in protein function prediction.
Code and Data Availability: LAFA is available at https://functionbench.net/. Detailed evaluation results can be found at https://github.com/anphan0828/CAFA_forever
2026-04-22T17:09:36Z
An Phan, Yanli Wang, Frimpong Boadu, Jianlin Cheng, Predrag Radivojac, Iddo Friedberg

http://arxiv.org/abs/2606.11382v1
GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction
2026-06-09T19:05:58Z
Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases, limiting their scalability. Most large-scale models are unimodal in nature and overlook the potential to leverage complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules: a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors, (2) we fuse these student modalities using a novel Finsler geometry-aware module, and (3) we distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks.
Our code is publicly available at https://github.com/eemokey/glacier.
2026-06-09T19:05:58Z
Emily Nguyen, Yongchan Hong, Harsh Toshniwal, Yan Liu, Andreas Luttens

http://arxiv.org/abs/2604.25701v2
Bayesian Rate Inference for Sequence Motif Dynamics in Systems of Reactive Nucleic Acids
2026-06-09T18:43:22Z
The RNA world hypothesis suggests a pathway of how life emerged on early Earth. It assumes that life started with RNA-based systems, capable of storing, transmitting and replicating information, envisioning that monomers and short RNA oligomers interact to form longer strands, eventually becoming catalytically active ribozymes. Key reactions in RNA pools are hybridization, dehybridization, templated ligation, and cleavage. Those reactions depend on many environmental parameters and the wide range of possible configurations among interacting strands. In order to scan such high-dimensional parameter spaces, efficient descriptions are needed. Motif rate equations project complex strand reactor dynamics onto sequence motif space. Here we present a Bayesian inference framework to infer their parameters from ligation count data produced by strand reactor simulations. This provides a framework to match the simpler motif rate equations to more complex simulations. Additionally, it is a step towards inferring reaction rate constants directly from experimental data, including rigorous uncertainty estimation. This could be an essential procedure to connect theory and experiment, and deepen our understanding of the essential features necessary for life to emerge.
2026-04-28T14:28:37Z
18 pages, 8 figures, pre-submission
Johannes Harth-Kitzerow, Ulrich Gerland, Torsten A. Enßlin

http://arxiv.org/abs/2606.11057v1
Flexible Kernels for Protein Property Prediction
2026-06-09T16:20:36Z
Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge.
Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore--by learning what are in effect structure-aware substitution matrices--we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.
2026-06-09T16:20:36Z
50 pages; to appear at ICML 2026
Martin Jankowiak, Yerdos Ordabayev, Rudraksh Tuwani, Henry N. Ward, Hunter Nisonoff, James M. McFarland, Gevorg Grigoryan

http://arxiv.org/abs/2606.10955v1
A kinetic model of shear-induced rupture of short dsDNA
2026-06-09T14:58:44Z
Force-induced dissociation of short double-stranded DNA (dsDNA) is central to single-molecule biophysics and DNA nanotechnology, yet a physically grounded kinetic description of shear-induced rupture for finite-length constructs remains lacking. Here we develop a master equation framework built on a force-dependent nucleation-zipper pathway with single-base transitions, enabling direct calculation of dissociation rates and transition state distances over a broad force range. Applied to a DNA-gold nanoparticle-DNA construct under constant shear force, the model accurately reproduces the experimental room-temperature data in the covered force regime and provides a unified interpretation of prior measurements on similarly sheared duplexes across all force regimes. A central result is that the three-dimensional helical geometry of dsDNA is essential for correctly defining the end-to-end distance under shear in the rod-like polymer model of short dsDNA.
We further show that the extracted transition state distances are robust to variations in ssDNA polymer parameters within the experimentally relevant regime. Finally, we analyze the temperature dependence of the transition state distance and discuss how our framework captures globally-heated rupture while identifying the additional complications introduced by localized plasmonic heating in gold nanoparticle-coupled constructs. These results provide a predictive kinetic foundation for interpreting force-rupture experiments and for designing force- and temperature-actuated DNA nanostructures.
2026-06-09T14:58:44Z
Supporting Information is provided at the end of the main text
Ayman Hussein, Ralf Bundschuh

http://arxiv.org/abs/2605.31498v3
Scalable Inference-Time Annealing with Surrogate Likelihood Estimators
2026-06-08T17:55:28Z
A long-standing challenge in computational chemistry and biophysics is efficiently sampling the Boltzmann distribution of molecules. Advances in generative modeling have been proposed to address the limitations of conventional sampling techniques by eliminating the computational cost of simulation. A promising direction is iteratively finetuning diffusion models along a temperature ladder whereby training data is generated via importance sampling during inference-time annealing. Unfortunately, these methods require computing a divergence over the score field to estimate importance weights, rendering them intractable for larger systems. Here we present scalable inference-time annealing (SITA), which retrains flow-based models to generate samples at progressively lower temperatures using an energy-based model to facilitate fast surrogate likelihoods. We demonstrate state-of-the-art performance on both Alanine Dipeptide and Alanine Tripeptide while avoiding costly divergence terms.
Our code is available at https://github.com/countrsignal/sita.git
2026-05-29T16:20:59Z
26 pages, 5 figures, submitted to JMLR 2026
Daniel Peñaherrera, Rishal Aggarwal, David Ryan Koes

http://arxiv.org/abs/2606.08647v1
Protein Dynamics Beyond Structure Prediction
2026-06-07T14:23:58Z
The ability to predict protein three-dimensional structures from amino acid sequences is a landmark achievement in molecular biology, where recent deep learning approaches such as AlphaFold are the culmination of decades of work. Yet, the quantitative understanding of how protein sequences give rise to dynamic conformational changes and higher-order assemblies remains unsolved. Folding and conformational states are dynamic, stochastic processes, shaped by sequence, energy, co-translational constraints, chaperone machineries, and the physicochemical conditions of the cellular environment. Recent advances now position the field to move beyond static structural endpoints toward a mechanistic understanding of folding dynamics in living systems. Single-molecule techniques enable time-resolved observation of folding trajectories and intermediate states hitherto hidden by traditional structural biology approaches, while computational innovations and data-driven approaches offer new ways to integrate heterogeneous data across scales. In this Roadmap, we review the current conceptual landscape of protein folding, examine the experimental and theoretical gaps that remain, and discuss emerging strategies that integrate high-resolution measurements with multiscale modeling. We outline a roadmap toward a quantitative and predictive science of protein folding dynamics, conformational kinetics, and macromolecular self-assembly. Realizing this vision would transform our understanding of the dynamics of molecular self-organization, from the folding of individual polypeptides to the emergence of dynamic macromolecular complexes.
This will enable rational control of folding and misfolding in health and disease, extend protein engineering principles beyond static structural design, and establish a mechanistic foundation for predictive and personalized interventions in proteostasis-related disorders.
2026-06-07T14:23:58Z
53 pages, 4 figures
Juliette Griffié, Sviatlana Shashkova, Antonio Ciarlo, Sreekanth K. Manikandan, Claes Andréasson, Malin Bäckström, Tristan Bereau, Hjalmar Brismar, Carlos Bustamante, Marta Carroni, Roberto Covino, Andreas Dahlin, Sebastian Deindl, Lucie Delemotte, Arne Elofsson, John Eriksson, Giovanna Fragneto, Anders Gunnarsson, Per Hammarström, Caroline Ingre, Christian Kaiser, Petronella Kettunen, Mark C. Leake, Benjamin Loos, Anna Månberg, Antonia S. J. S. Mey, Richard Neutze, Thomas Nyström, Karl Palmås, Charley Schaefer, Markus J. Tamás, Nicola Ticozzi, Tomás S. Pilvelic, Jacopo Sacquegno, B. M. Tijms, Gunnar von Heijne, Björn Wallner, Vitali Zhaunerchyk, Simon Olsson, Joana B. Pereira, Julia Fernandez-Rodriguez, Fredrik Westerlund, Giovanni Volpe

http://arxiv.org/abs/2606.02462v2
APLSuite: An Integrated Suite for CD4+ T Cell Epitope Prediction via Antigen Processing Likelihood
2026-06-05T16:44:49Z
Computational epitope prediction is a critical tool for exploring and understanding CD4+ T cell-mediated immune responses, a key aspect of adaptive immunity. While existing computational methods primarily focus on supervised learning approaches, they often overlook the essential role of antigen processing in determining binding specificity. To address this limitation, our group developed Antigen Processing Likelihood (APL), an algorithm that integrates crystallographic B-factor, solvent accessible surface area (SASA), hydrogen exchange protection factors (COREX), and sequence entropy.
In this paper we introduce APLSuite, a comprehensive and lightweight software suite designed to streamline APL-based epitope prediction. APLSuite integrates distributed RESTful API services, a Python client for data aggregation and processing, a data science tool for efficient epitope computation, and a user-friendly graphical user interface for non-coding users. It provides a seamless and efficient pipeline for APL calculation and epitope prediction that can be finished in minutes with GPU acceleration, which has not been implemented by existing tools. This flexible and extensible software suite is deployable on desktop and cloud environments, offering both guided and customizable workflows to meet diverse needs in immunology research and immunotherapy development. (The project page for this work is available at: https://tulane-mettu-landry-lab.github.io/blogs/APLSuite/)
2026-06-01T16:35:12Z
Application Note; The source code for this work is available at: https://github.com/Jiarui0923/APL The project page for this work is available at: https://tulane-mettu-landry-lab.github.io/blogs/APLSuite/
Jiarui Li, Marco K. Carbullido, Jai Bansal, Samuel J. Landry, Ramgopal R. Mettu

http://arxiv.org/abs/2507.08920v4
AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model
2026-06-05T11:04:35Z
We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology, encompassing pretraining scaling laws, emergent capability analysis, in-context learning mechanism, and test-time scaling algorithm. To guarantee robust scalability, we establish a predictive scaling law and reveal the progressive emergence of structural understanding via loss perspective, culminating in a strong 1.7-billion model.
Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework, where AMix-1 recognizes deep evolutionary signals among MSAs and consistently generates structurally and functionally coherent proteins. This framework enables the successful design of a dramatically improved AmeR variant with an up to $50\times$ activity increase over its wild type. Pushing the boundaries of protein engineering, we further empower AMix-1 with an evolutionary test-time scaling algorithm for in silico directed evolution that delivers substantial, scalable performance gains as verification budgets are intensified, laying the groundwork for next-generation lab-in-the-loop protein design.
2025-07-11T17:02:25Z
Changze Lv, Jiang Zhou, Siyu Long, Lihao Wang, Jiangtao Feng, Dongyu Xue, Yu Pei, Hao Wang, Zherui Zhang, Yuchen Cai, Zhiqiang Gao, Ziyuan Ma, Jiakai Hu, Chaochen Gao, Jingjing Gong, Yuxuan Song, Shuyi Zhang, Xiaoqing Zheng, Deyi Xiong, Lei Bai, Wanli Ouyang, Ya-Qin Zhang, Wei-Ying Ma, Bowen Zhou, Hao Zhou

http://arxiv.org/abs/2606.06717v1
ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets
2026-06-04T21:06:31Z
While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets, such as the historically "undruggable" oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow-pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape "lid" volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state-of-the-art generative models reveals weaker predicted binding affinity on these low-concavity interfaces.
ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.
2026-06-04T21:06:31Z
Saket Reddy, Shiwei Liu

http://arxiv.org/abs/2606.05541v1
Methods for Inferring Interaction Potentials from Cross-Linking Mass Spectrometry Data
2026-06-04T00:52:12Z
Cross-linking mass spectrometry (XL-MS) has emerged as a powerful quantitative technique for probing intra-protein structural information as well as protein-protein interactions at an unprecedented scale. XL-MS data yield information on the pairwise spatial proximity of proteins through inter-molecular linkers. However, systematic methods for adapting such data for coarse-grained interacting particle models remain limited. Predominant focus is put on directly fitting radial distribution functions (RDFs), while numerous observables, e.g. coordination numbers, which are functionals of the RDF, cannot be uniquely inverted. In this work, we develop a framework for parameterizing interaction potentials from such observables in potentially phase-separated mixtures, as encountered in XL-MS results. We establish a connection between this problem and the inverse Henderson problem and adapt algorithms such as Iterative Boltzmann Inversion and Iterative Monte Carlo to its numerical solution. We derive exact and low-density limit gradient approximations and propose two new algorithms based on an adaptation of the predictor-corrector framework. In total, we evaluate several optimization algorithms on biologically realistic ten-component test systems. We demonstrate that for homogeneous fluids, all methods achieve exceptional efficiency and accuracy. Critically, we further demonstrate successful parametrization in a challenging three-phase system.
Here, three algorithms, namely Adam and gradient descent employing the low-density derivative as well as Newton's method with the exact gradient, reliably recover the correct parameters. These results establish a clear pathway from XL-MS experiments to coarse-grained protein models for systems where phase separation governs biological function, potentially enabling new investigations of biomolecular condensates and protein aggregation.
2026-06-04T00:52:12Z
19 pages, 10 figures, 5 tables
Börries von Seggern, Mohsen Sadeghi

http://arxiv.org/abs/2606.05474v1
AlloGen: Conformation-Selective Binder Generation with Differential State Scoring
2026-06-03T21:53:17Z
Protein binder design has largely optimized for affinity alone, leaving conformational selectivity unaddressed: for allosteric targets such as kinases, nuclear receptors, and GPCRs, a binder that engages both active and inactive states provides no functional specificity regardless of how tightly it binds. We introduce AlloGen, a modular framework that decouples backbone generation from a learned state-selectivity scorer $Q_\theta$, an SE(3)-invariant interface graph transformer trained via a two-phase curriculum that first learns interface geometry before imposing conformational discrimination. Because $Q_\theta$ is fully differentiable and generator-agnostic, it integrates with any backbone generator as a passive reranker or an active gradient-based guide without retraining. Across a diverse benchmark of proteins spanning multiple families and conformational mechanisms, AlloGen consistently identifies binders that preferentially recognize desired structural states while rejecting alternative conformations. Experimental validation on calmodulin further demonstrates that these computational selectivity signals translate to physical molecules, yielding de novo peptides that bind the desired holo conformation while exhibiting no detectable binding to the apo state.
Together, these results establish conformational selectivity as a learnable property and provide a general framework for state-selective protein binder design.
2026-06-03T21:53:17Z
Hanqun Cao, Zachary Quinn, Aastha Pal, Sumi Kimura, Jingjie Zhang, Pheng Ann Heng, Pranam Chatterjee

http://arxiv.org/abs/2605.16331v2
Retrieval and competition: how a protein foundation model starts a protein
2026-06-03T13:43:31Z
Protein language models are increasingly used to guide experimental and clinical decisions, yet it is often unclear whether a confident prediction reflects recognition of biological evidence or retrieval of a statistical default. We examine this distinction for a near-universal biological rule, that proteins begin with methionine, by tracing the computational pathway through which ESM2-8M produces this prediction. The model does not detect methionine at the masked position. Instead, it retrieves a methionine-favouring signal from a reference representation at the beginning-of-sequence token via a position-specific query assembled across layers, with the final output emerging through competition with context-dependent circuits. To understand how positional information reaches the readout, we introduce a norm-direction decomposition of attention scores within rotary frequency bands. Positional encoding operates through coupled changes in query norm and angular alignment distributed across these bands. On sequences whose true N-terminus is not methionine, where the biological question matters, the model predicts methionine anyway. This is not a correct prediction produced by an unexpected mechanism, but the output of a positional-prior retrieval circuit that matches the statistical average and fails where biology diverges from it. Distinguishing the two requires resolution at the level of individual circuits, frequency bands, and query composition, suggesting that mechanistic verification will be necessary, and challenging, for predictions where the biological stakes are higher.
Even for the simplest biological rule, the model's prediction is mediated by a distributed computational circuit rather than direct recognition, suggesting that increasing task complexity will further obscure the relationship between model confidence and underlying biological evidence.
2026-05-05T17:51:21Z
updated figure 4
Piotr Jedryszek, Oliver M. Crook
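
The last abstract mentions a norm-direction decomposition of attention scores within rotary frequency bands but does not define it. One possible reading, offered only as a sketch and not as the authors' method, is that an unscaled dot-product score between an already-rotated query and key can be written as a sum over 2-D bands of (query norm) x (key norm) x (cosine of the angle between them). The function names below are hypothetical, and pairing adjacent dimensions into bands is an assumption, since rotary implementations may pair dimensions differently.

```python
import math


def band_terms(q, k):
    """Return (q_norm, k_norm, cos_angle) for each 2-D band of q and k.

    Assumes bands are formed from adjacent dimension pairs (0,1), (2,3), ...
    """
    if len(q) != len(k) or len(q) % 2 != 0:
        raise ValueError("q and k must have the same even length")
    terms = []
    for i in range(0, len(q), 2):
        q_norm = math.hypot(q[i], q[i + 1])
        k_norm = math.hypot(k[i], k[i + 1])
        dot = q[i] * k[i] + q[i + 1] * k[i + 1]
        # The angle is undefined for a zero-length band; report 0 alignment.
        cos_angle = dot / (q_norm * k_norm) if q_norm > 0 and k_norm > 0 else 0.0
        terms.append((q_norm, k_norm, cos_angle))
    return terms


def score_from_terms(terms):
    """Recombine band terms into the unscaled dot-product score."""
    return sum(q_norm * k_norm * cos_angle for q_norm, k_norm, cos_angle in terms)


if __name__ == "__main__":
    q = [3.0, 4.0, 1.0, 0.0]
    k = [4.0, 3.0, 0.0, 2.0]
    for band, (qn, kn, c) in enumerate(band_terms(q, k)):
        print(f"band {band}: |q|={qn}, |k|={kn}, cos={c}")
    print("score:", score_from_terms(band_terms(q, k)))
```

Separating each band into a norm factor and an alignment factor makes it possible to ask whether a change in the score comes from a larger query or from a better-aligned one, which is the kind of distinction the abstract attributes to positional encoding.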