https://arxiv.org/api/2/9hjabclSdi8dLWTsk+mF/a6bU
Feed updated: 2026-03-16T09:46:42Z

http://arxiv.org/abs/2602.23179v1
Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models
Updated: 2026-02-26T16:39:04Z
Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work has shown that protein language models (PLMs) identify repeats, by examining their behavior in masked-token prediction. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.
Published: 2026-02-26T16:39:04Z
Authors: Gal Kesten-Pomeranz, Yaniv Nikankin, Anja Reusch, Tomer Tsaban, Ora Schueler-Furman, Yonatan Belinkov

http://arxiv.org/abs/2511.21476v2
Steering Generative Models for Protein Design: Aligning and Conditioning Strategies
Updated: 2026-02-26T14:31:20Z
Generative artificial intelligence models learn probability distributions from data and produce novel samples that capture the salient properties of their training sets. Proteins are particularly attractive for such approaches given their abundant data and the versatility of their representations, ranging from sequences to structures and functions.
This versatility has motivated the rapid development of generative models for protein design, enabling the generation of functional proteins and enzymes with unprecedented success. However, because these models mirror their training distribution, they tend to sample from its most probable modes, while low-probability regions, often encoding valuable properties, remain underexplored. To address this challenge, recent work has proposed strategies for steering generative models toward user-specified properties. In this review, we survey and categorize these strategies, distinguishing approaches that modify model parameters, such as reinforcement learning or supervised fine-tuning, from those that keep the model's parameters fixed, including conditional generation, retrieval-augmented strategies, Bayesian guidance, and tailored sampling methods. Together, these developments are beginning to enable the steering of generative models toward proteins with desired properties.
Published: 2025-11-26T15:09:13Z
Authors: Filippo Stocco, Michele Garibbo, Noelia Ferruz

http://arxiv.org/abs/2602.21931v1
From quantitative modeling of fluorescence experiments on biomolecules to the prediction of spectroscopic dye properties
Updated: 2026-02-25T14:11:56Z
Fluorescence spectroscopy and modeling provide powerful means to characterize biomacromolecular structures, dynamics, and interactions. Förster resonance energy transfer serves as a key technique for this due to its nanometer-scale distance sensitivity. Quantitative interpretation of fluorescence data relies on models that link molecular structure to observable spectroscopic quantities and vice versa. Integrative modelling frameworks combine fluorescence observables with complementary structural information to infer molecular structures and conformational ensembles. This review outlines conceptual components of fluorescence-based modeling, discusses dye representations, and highlights advances toward refined models enabling quantitative structural analysis.
Finally, we discuss the prediction of spectroscopic properties of dyes based on biomolecular structures and fluorescence assay design beyond traditional FRET applications.
Published: 2026-02-25T14:11:56Z
Comment: 14 pages, 4 figures: FRET, integrative structural biology, structural modelling, dye models, fluorescence spectroscopy, PIFE, quenching
Authors: Thomas-Otavio Peulen, Daria Maksutova, Thorben Cordes

http://arxiv.org/abs/2602.21787v1
Spectral entropy of the discrete Hasimoto effective potential exposes sub-residue geometric transitions in protein secondary structure
Updated: 2026-02-25T11:08:58Z
Characterizing the geometric boundaries of protein secondary structures is fundamental to understanding macromolecular folding. By applying the discrete Hasimoto map to translate backbone geometry into a one-dimensional discrete nonlinear Schrödinger potential $V_{\mathrm{re}}[n]$, we establish a frequency-domain framework for protein conformations. Short-time Fourier transform analysis across 320,453 residues from 1,986 non-redundant proteins defines a local spectral entropy $H_{\mathrm{spec}}$ that consistently orders structural states. Helical segments emerge as narrow-band low-entropy regimes dominated by zero-frequency components, whereas coils manifest as broadband noise. We demonstrate that boundaries separating these states exhibit step-like sharpness characteristic of a first-order-like geometric transition with a sub-residue median width of 0.145 residues. This abrupt kinematic transition provides a spatial counterpart to the cooperative Zimm--Bragg thermodynamic model of helix nucleation. The extreme spatial narrowness exposes an intrinsic limitation governed by the Gabor uncertainty principle, explaining why the pointwise integrability residual $E[n]$ acts as an effective high-pass filter for boundary detection.
Guided by this limit, we introduce a dual-probe approach combining the high-pass residual for local torsion discontinuities with a low-frequency energy ratio $R_{\mathrm{LF}}$ measuring the DC-dominated flatness of helical interiors. Unifying these complementary signals improves the detection area under the curve from 0.783 to 0.815. Because high-entropy broadband regions coincide with the flexible loops and hinges implicated in allostery, the spectral entropy of the Hasimoto potential may serve as a sequence-agnostic geometric proxy for mapping functional dynamics from backbone coordinates.
Published: 2026-02-25T11:08:58Z
Authors: Yiquan Wang

http://arxiv.org/abs/2509.02060v3
Morphology-Aware Peptide Discovery via Masked Conditional Generative Modeling
Updated: 2026-02-24T10:17:07Z
Peptide self-assembly prediction offers a powerful bottom-up strategy for designing biocompatible, low-toxicity materials for large-scale synthesis in a broad range of biomedical and energy applications. However, screening the vast sequence space for categorization of aggregate morphology remains intractable. We introduce PepMorph, an end-to-end peptide discovery pipeline that generates novel sequences that are not only prone to aggregate but whose self-assembly is steered toward fibrillar or spherical morphologies by conditioning on isolated peptide descriptors that serve as morphology proxies. To this end, we compiled a new dataset by leveraging existing aggregation propensity datasets and extracting geometric and physicochemical descriptors. This dataset is then used to train a Transformer-based Conditional Variational Autoencoder with a masking mechanism, which generates novel peptides under arbitrary conditioning.
After filtering to ensure design specifications and validation of generated sequences through coarse-grained molecular dynamics (CG-MD) simulations, PepMorph yielded an 83% success rate under our CG-MD validation protocol and morphology criterion for the targeted class, showcasing its promise as a framework for application-driven peptide discovery.
Published: 2025-09-02T07:58:12Z
Comment: 46 pages, 4 figures, 6 tables
Authors: Nuno Costa, Julija Zavadlav

http://arxiv.org/abs/2512.24192v2
SeedProteo: Accurate De Novo All-Atom Design of Protein Binders
Updated: 2026-02-24T06:31:20Z
We present SeedProteo, a diffusion-based model for de novo all-atom protein design. We demonstrate how to repurpose a cutting-edge folding architecture into a powerful generative design framework by effectively integrating self-conditioning features. Extensive benchmarks highlight the model's capabilities across two distinct tasks: in unconditional generation, SeedProteo exhibits superior length generalization and structural diversity, maintaining robustness for long sequences and complex topologies; in binder design, it achieves state-of-the-art performance among open-source methods, attaining the highest in-silico design success rates, structural diversity and novelty. Finally, we validate SeedProteo through wet-lab assays on two therapeutic targets, achieving hit rates of 70%-80% and picomolar-level binding affinities, establishing leading results. To facilitate community adoption, we provide public access to SeedProteo via a webserver (https://seedfold.io/proteinDesign).
Published: 2025-12-30T12:50:38Z
Authors: Wei Qu, Yiming Ma, Fei Ye, Chan Lu, Yi Zhou, Kexin Zhang, Lan Wang, Minrui Gui, Quanquan Gu

http://arxiv.org/abs/2602.20449v1
Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference
Updated: 2026-02-24T01:18:30Z
Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties.
However, protein language has key differences from natural language, such as a rich functional space despite a vocabulary of only 20 amino acids. These differences motivate research into how transformer-based architectures operate differently in the protein domain and how we can better leverage PLMs to solve protein-related tasks. In this work, we begin by directly comparing how the distribution of information stored across layers of attention heads differs between the protein and natural language domains. Furthermore, we adapt a simple early-exit technique, originally used in the natural language domain to improve efficiency at the cost of performance, to achieve both increased accuracy and substantial efficiency gains in protein non-structural property prediction by allowing the model to automatically select protein representations from the intermediate layers of the PLMs for the specific task and protein at hand. We achieve performance gains ranging from 0.4 to 7.01 percentage points while simultaneously improving efficiency by over 10 percent across models and non-structural prediction tasks. Our work opens up an area of research directly comparing how language models change behavior when moved into the protein domain and advances language modeling in biological domains.
Published: 2026-02-24T01:18:30Z
Authors: Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji

http://arxiv.org/abs/2509.15796v2
Monte Carlo Tree Diffusion with Multiple Experts for Protein Design
Updated: 2026-02-23T19:33:01Z
The goal of protein design is to generate amino acid sequences that fold into functional structures with desired properties. Prior methods combining autoregressive language models with Monte Carlo Tree Search (MCTS) struggle with long-range dependencies and suffer from an impractically large search space.
We propose MCTD-ME, Monte Carlo Tree Diffusion with Multiple Experts, which integrates masked diffusion models with tree search to enable multi-token planning and efficient exploration under the guidance of multiple experts. Unlike autoregressive planners, MCTD-ME uses biophysical-fidelity-enhanced diffusion denoising as the rollout engine, jointly revising multiple positions and scaling to large sequence spaces. It further leverages experts of varying capacities to enrich exploration, guided by a pLDDT-based masking schedule that targets low-confidence regions while preserving reliable residues. We propose a novel multi-expert selection rule (PH-UCT-ME) that extends Shannon-entropy-based UCT to expert ensembles with mutual information. MCTD-ME achieves superior performance on the CAMEO and PDB benchmarks, excelling in protein design tasks such as inverse folding, folding, and conditional design challenges like motif scaffolding on lead optimization tasks. Our framework is model-agnostic, plug-and-play, and extensible to de novo protein engineering and beyond.
Published: 2025-09-19T09:24:42Z
Authors: Xuefeng Liu, Mingxuan Cao, Songhao Jiang, Xiao Luo, Xiaotian Duan, Mengdi Wang, Tobin R. Sosnick, Jinbo Xu, Rick Stevens

http://arxiv.org/abs/2602.16255v2
Piecewise integrability of the discrete Hasimoto map for analytic prediction and design of helical peptides
Updated: 2026-02-23T07:01:59Z
The representation of protein backbone geometry through the discrete nonlinear Schrödinger equation provides a theoretical connection between biological structure and integrable systems. Although the global application of this framework is constrained by chiral degeneracies and non-local interactions, helical peptides can be modeled as piecewise integrable systems where the discrete Hasimoto map remains applicable within specific geometric boundaries. We delineate these boundaries through an analytic mapping $(φ,ψ) \rightarrow (κ,τ)$ between biochemical dihedral angles and Frenet frame parameters for 50 helical peptide chains.
This transformation is globally information-preserving but ill-conditioned within the helical basin (median Jacobian condition number 31), suggesting chiral information loss arises primarily from local coordinate compression rather than topological singularities. Using a local integrability error $E[n]$ derived from the discrete dispersion relation, we show deviations from integrability are driven predominantly by torsion non-uniformity, while curvature remains rigid. This metric identifies integrable islands where the analytic dispersion relation predicts backbone coordinates with sub-angstrom accuracy (median RMSD 0.77\,Å), enabling a segmentation strategy that isolates structural defects and trims non-integrable terminal fraying. Evaluating only these integrable islands, the dispersion relation extracts high-accuracy structural cores for 88\% of the dataset. Inverse backbone design is feasible within a defined integrability zone where the design constraint reduces essentially to controlling torsion uniformity. These findings advance the Hasimoto formalism from a qualitative descriptor toward a precise quantitative framework for analyzing and designing local protein geometry within the limits of piecewise integrability.
Published: 2026-02-18T08:11:23Z
Authors: Yiquan Wang

http://arxiv.org/abs/2602.14005v2
Physical principles of building protein megacomplexes in a crowded milieu
Updated: 2026-02-22T21:15:55Z
Multiple phenotypic protein expressions arising from one genome represent variations in the protein relative abundance and their stoichiometry. A lack of definite compositional parts challenges the modeling of protein megacomplexes and cellular architectures. Despite the advances in protein structural predictions with AI, the mechanism of protein interactions and the emergence of megacomplexes they assemble remain unclear.
Here, we present a statistical physics framework based on the grand canonical ensemble to explore the protein interactions that drive the emergent assembly of a megacomplex, using observational mass spectrometry datasets that include protein relative abundance and cross-linked connections. Using the chromatin remodeler megacomplex INO80 as an example, we discovered a class of divergent proteins that play a critical role in orchestrating the assembly beyond nearest neighbors, dependent on the excluded volumes exerted by others. With the constraints of the excluded volumes by varying crowding contents, these divergent subunits orchestrate and form clusters with selective components growing into configurationally distinct architectures. We propose a machinery view for the INO80 chromatin remodeler complex in which each loosely associated subunit can occasionally be recruited as an attached part of a core assembly, driven by excluded volumes. Our computational framework provides mechanistic insight into taking macromolecular crowding as a necessary physicochemical variable representing cell states to remodel the configurations of protein megacomplexes with structurally loose modules.
Published: 2026-02-15T06:10:59Z
Authors: Jiayi Wang, Jules Nde, Andrei G. Gasic, Jacob Haseley, Margaret S. Cheung

http://arxiv.org/abs/2602.18203v1
Metrology of Complexity and Implications for the Study of the Emergence of Life
Updated: 2026-02-20T13:30:00Z
One of the longest standing open problems in science is how life arises from non-living matter. If it is possible to measure this transition in the lab, then it might be possible to understand the physical mechanisms by which the emergence of life occurs, which so far have evaded scientific understanding. A significant hurdle is the lack of standards or a framework for cross comparison across different experimental contexts and planetary environments.
In this essay, I review current challenges in experimental approaches to origin of life chemistry, focusing on those associated with quantifying experimental selectivity versus de novo generation of molecular complexity, and I highlight new methods using molecular assembly theory to measure molecular complexity. This metrology-centered approach can enable rigorous testing of hypotheses about the cascade of major transitions in molecular order marking the emergence of life, while potentially bridging traditional divides between metabolism-first and genetics-first scenarios. Grounding the study of life's origins in measurable complexity has significant implications for the search for life beyond Earth, suggesting paths toward theory-driven detection of biological complexity in diverse planetary contexts. As the field moves forward, standardized measurements of molecular complexity may help unify currently disparate approaches to understanding how matter transforms to life. Much remains to be done in this exciting frontier.
Published: 2026-02-20T13:30:00Z
Authors: Sara Imari Walker

http://arxiv.org/abs/2502.16189v2
Co-Evolution-Based Metal-Binding Residue Prediction with Graph Neural Networks
Updated: 2026-02-20T06:37:32Z
Understanding protein-metal interactions is central to structural biology, with metal ions being vital for catalysis, stability, and signal transduction. Predicting metal-binding residues and metal types remains challenging due to the structural and evolutionary complexity of proteins. Conventional sequence- and structure-based methods often fail to capture co-evolutionary constraints that reflect how residues evolve together to maintain metal-binding functionality. Recent co-evolution-based methods capture part of this information, but still underutilize the complete co-evolved residue network.
To address this limitation, we introduce the Metal-Binding Graph Neural Network (MBGNN), which leverages the complete co-evolved residue network to better capture complex dependencies within protein structures. Experimental results show that MBGNN substantially outperforms the state-of-the-art co-evolution-based method MetalNet2, achieving F1 score improvements of 2.5% for binding residue identification and 3.3% for metal type classification on the MetalNet2 dataset. Its superiority is further demonstrated on both the MetalNet2 and MIonSite datasets, where it outperforms two co-evolution-based and two sequence-based methods, achieving the highest mean F1 scores across both prediction tasks. These findings highlight how integrating co-evolutionary residue networks with graph-based learning advances our ability to decode protein-metal interactions, thereby facilitating functional annotation and rational metalloprotein design. The code and data are released at https://github.com/SRastegari/MBGNN.
Published: 2025-02-22T11:22:08Z
Comment: 10 pages, 6 figures
Authors: Sayedmohammadreza Rastegari, Sina Tabakhi, Xianyuan Liu, Tianyi Jiang, Wei Sang, Haiping Lu

http://arxiv.org/abs/2603.03342v1
Cryo-SWAN: the Multi-Scale Wavelet-decomposition-inspired Autoencoder Network for molecular density representation of molecular volumes
Updated: 2026-02-18T18:05:08Z
Learning robust representations of 3D shapes from voxelized data is essential for advancing AI methods in biomedical imaging. However, most contemporary 3D computer vision approaches operate on point clouds, meshes, or octrees, while volumetric density maps, the native format of structural biology and cryo-EM, remain comparatively underexplored. We present Cryo-SWAN, a voxel-based variational autoencoder inspired by multi-scale wavelet decomposition.
The model performs conditional coarse-to-fine latent encoding and recursive residual quantization across perception scales, enabling accurate capture of both global geometry and high-frequency structural detail in molecular density volumes. Evaluated on ModelNet40, BuildingNet, and a newly curated dataset of cryo-EM volumes, ProteinNet3D, Cryo-SWAN consistently improves reconstruction quality over state-of-the-art 3D autoencoders. We demonstrate that the molecular densities organize in learned latent space according to shared geometric features, while integration with diffusion models enables denoising and conditional shape generation. Together, Cryo-SWAN is a practical framework for data-driven structural biology and volumetric imaging.
Published: 2026-02-18T18:05:08Z
Comment: 16 pages, 5 figures
Authors: Rui Li, Artsemi Yushkevich, Mikhail Kudryashev, Artur Yakimovich

http://arxiv.org/abs/2602.00663v2
SEISMO: Increasing Sample Efficiency in Molecular Optimization with a Trajectory-Aware LLM Agent
Updated: 2026-02-18T10:50:04Z
Optimizing the structure of molecules to achieve desired properties is a central bottleneck across the chemical sciences, particularly in the pharmaceutical industry where it underlies the discovery of new drugs. Since molecular property evaluation often relies on costly and rate-limited oracles, such as experimental assays, molecular optimization must be highly sample-efficient. To address this, we introduce SEISMO, an LLM agent that performs strictly online, inference-time molecular optimization, updating after every oracle call without the need for population-based or batched learning. SEISMO conditions each proposal on the full optimization trajectory, combining natural-language task descriptions with scalar scores and, when available, structured explanatory feedback. Across the Practical Molecular Optimization benchmark of 23 tasks, SEISMO achieves a 2-3 times higher area under the optimisation curve than prior methods, often reaching near-maximal task scores within 50 oracle calls.
Our additional medicinal-chemistry tasks show that providing explanatory feedback further improves efficiency, demonstrating that leveraging domain knowledge and structured information is key to sample-efficient molecular optimization.
Published: 2026-01-31T11:23:48Z
Comment: Fabian P. Krüger and Andrea Hunklinger contributed equally to this work
Authors: Fabian P. Krüger, Andrea Hunklinger, Adrian Wolny, Tim J. Adler, Igor Tetko, Santiago David Villalba

http://arxiv.org/abs/2410.14355v2
Pore-level Quantitative Structure-Activity Relationship (QSAR) for Water Permeation Rate in Aquaporins
Updated: 2026-02-17T22:09:13Z
Aquaporins (AQPs) and aquaglyceroporins (AQGPs) play a crucial role in regulating water transport and solute selectivity across biological membranes. Besides their biological relevance, AQPs have attracted growing interest as models for the design of next-generation biomimetic membranes for water filtration. In this work, we present a pore-level Quantitative Structure-Activity Relationship (QSAR) approach that relates structural and physicochemical pore descriptors with experimentally reported water permeation rates across a set of AQ(G)Ps with high-resolution 3D structures. This data-driven methodology, presented here as a proof of concept, introduces a multi-feature framework for determining pore descriptors associated with water transport efficiency in AQ(G)Ps. Applied to two compiled permeation rate datasets, this framework recapitulates determinants previously reported in single-feature studies, while also highlighting additional pore descriptors that emerge as relevant in a multi-variable context. The insights gained through this approach may, in perspective, contribute to advancing the rational design of AQP-based filtration devices and to deepening the molecular understanding of the function of these valuable macromolecules in health and disease.
Published: 2024-10-18T10:21:25Z
Authors: Juan José Galano-Frutos, Luca Bergamasco, Paolo Vigo, Matteo Morciano, Matteo Fasano, Davide Pirolli, Eliodoro Chiavazzo, Maria Cristina de Rosa