Functional bottlenecks can emerge from non-epistatic underlying traits

2026-02-17T21:29:38Z

Protein fitness landscapes frequently exhibit epistasis, where the effect of a mutation depends on the genetic context in which it occurs, i.e., the rest of the protein sequence. Epistasis increases landscape complexity, often resulting in multiple fitness peaks. In its simplest form, known as global epistasis, fitness is modeled as a non-linear function of an underlying additive trait. In contrast, more complex epistasis arises from a network of (pairwise or many-body) interactions between residues, which cannot be removed by a single non-linear transformation. Recent studies have explored how global and network epistasis contribute to the emergence of functional bottlenecks - fitness landscape topologies where two broad high-fitness basins, representing distinct phenotypes, are separated by a bottleneck that can only be crossed via one or a few mutational paths. Here, we introduce and analyze a stylized model of global epistasis with an additive underlying trait. We demonstrate that functional bottlenecks arise with high probability if the model is properly calibrated. Furthermore, our results underscore that a proper balance between neutral and non-neutral mutations is needed for the emergence of functional bottlenecks.
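
The definition of global epistasis above can be illustrated with a minimal sketch. It is not the stylized model analyzed in the paper: the sigmoidal nonlinearity, the per-site effects, and the names `additive_trait`, `fitness`, and `pairwise_epistasis` are assumptions chosen for illustration. It shows that two mutations combine additively in the underlying trait but not in fitness.

```python
import numpy as np

def additive_trait(sequence, effects):
    """Underlying additive trait: a sum of independent per-site contributions."""
    return float(np.dot(sequence, effects))

def fitness(sequence, effects, steepness=4.0, midpoint=0.0):
    """Global epistasis: fitness is a nonlinear (here sigmoidal) function of the trait.

    The sigmoid and its parameters are illustrative assumptions.
    """
    phi = additive_trait(sequence, effects)
    return float(1.0 / (1.0 + np.exp(-steepness * (phi - midpoint))))

def pairwise_epistasis(measure, wild_type, single_a, single_b, double):
    """Deviation of the double mutant from the sum of the two single-mutant effects."""
    return measure(double) - measure(single_a) - measure(single_b) + measure(wild_type)

# Hypothetical per-site effects on the trait (not taken from any dataset).
effects = np.array([1.0, 1.0, -0.5])

wild_type = np.array([0, 0, 0])
single_a = np.array([1, 0, 0])
single_b = np.array([0, 1, 0])
double = np.array([1, 1, 0])

trait_epistasis = pairwise_epistasis(
    lambda s: additive_trait(s, effects), wild_type, single_a, single_b, double
)
fitness_epistasis = pairwise_epistasis(
    lambda s: fitness(s, effects), wild_type, single_a, single_b, double
)

print("epistasis in the trait:  ", trait_epistasis)
print("epistasis in the fitness:", fitness_epistasis)
```

With these assumed values the trait shows no epistasis, while the saturating nonlinearity produces negative (diminishing-returns) epistasis in fitness.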

Exploring Drug Safety Through Knowledge Graphs: Protein Kinase Inhibitors as a Case Study

2026-02-17T12:30:33Z

Adverse Drug Reactions (ADRs) are a leading cause of morbidity and mortality. Existing prediction methods rely mainly on chemical similarity, machine learning on structured databases, or isolated target profiles, but often fail to integrate heterogeneous, partly unstructured evidence effectively. We present a knowledge graph-based framework that unifies diverse sources, namely drug-target data (ChEMBL), clinical trial literature (PubMed), trial metadata (ClinicalTrials.gov), and post-marketing safety reports (FAERS), into a single evidence-weighted bipartite network of drugs and medical conditions. Applied to 400 protein kinase inhibitors, the resulting network enables contextual comparison of efficacy (HR, PFS, OS), phenotypic and target similarity, and ADR prediction via target-to-adverse-event correlations. A non-small cell lung cancer case study correctly highlights established and candidate drugs, target communities (ErbB, ALK, VEGF), and tolerability differences. Designed as an orthogonal, extensible analysis and search tool rather than a replacement for current models, the framework excels at revealing complex patterns, supporting hypothesis generation, and enhancing pharmacovigilance. Code and data are publicly available at https://github.com/davidjackson99/PKI_KG.

Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation

2026-02-16T18:58:55Z

Many generative tasks in chemistry and science involve distributions invariant to group symmetries (e.g., permutation and rotation). A common strategy enforces invariance and equivariance through architectural constraints such as equivariant denoisers and invariant priors. In this paper, we challenge this tradition through the alternative canonicalization perspective: first map each sample to an orbit representative with a canonical pose or order, train an unconstrained (non-equivariant) diffusion or flow model on the canonical slice, and finally recover the invariant distribution by sampling a random symmetry transform at generation time. Building on a formal quotient-space perspective, our work provides a comprehensive theory of canonical diffusion by proving: (i) the correctness, universality and superior expressivity of canonical generative models over invariant targets; (ii) canonicalization accelerates training by removing diffusion score complexity induced by group mixtures and reducing conditional variance in flow matching. We then show that aligned priors and optimal transport act complementarily with canonicalization and further improve training efficiency. We instantiate the framework for molecular graph generation under $S_n \times SE(3)$ symmetries. By leveraging geometric spectra-based canonicalization and mild positional encodings, canonical diffusion significantly outperforms equivariant baselines in 3D molecule generation tasks, with similar or even less computation. Moreover, with a novel architecture Canon, CanonFlow achieves state-of-the-art performance on the challenging GEOM-DRUG dataset, and the advantage remains large in few-step generation.
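
The canonicalize-then-randomize recipe described above can be sketched for the permutation part of the symmetry alone. This is a deliberately simplified stand-in: it uses a lexicographic sort of coordinates rather than the geometric spectra-based canonicalization of the paper, it ignores $SE(3)$, and the names `canonicalize` and `sample_invariant` are hypothetical.

```python
import numpy as np

def canonicalize(points):
    """Map a point set to an orbit representative under permutations.

    Rows are sorted lexicographically by (x, y, z). np.lexsort treats the
    LAST key as the primary one, hence the reversed key order.
    """
    order = np.lexsort((points[:, 2], points[:, 1], points[:, 0]))
    return points[order]

def sample_invariant(canonical_sample, rng):
    """Recover a permutation-invariant sample by applying a random permutation."""
    perm = rng.permutation(len(canonical_sample))
    return canonical_sample[perm]

rng = np.random.default_rng(0)
cloud = rng.normal(size=(5, 3))            # toy 5-atom point cloud
shuffled = cloud[rng.permutation(5)]       # same set, different atom order

# Both orderings collapse to the same canonical representative, so an
# unconstrained model trained on canonical samples never sees the shuffle.
same_representative = np.array_equal(canonicalize(cloud), canonicalize(shuffled))
print("same canonical form:", same_representative)

# At generation time, a random group element is applied to a canonical sample.
generated = sample_invariant(canonicalize(cloud), rng)
```

In this sketch the model itself is omitted; `canonicalize(cloud)` merely stands in for a sample drawn from a model trained on the canonical slice.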

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

2026-02-16T03:38:47Z

Large language models (LLMs) have gained significant attention in chemistry. However, most existing datasets center on molecular-level property prediction and overlook the role of fine-grained functional group (FG) information. Incorporating FG-level data can provide valuable prior knowledge that links molecular structures with textual descriptions, which can be used to build more interpretable, structure-aware LLMs for reasoning on molecule-related tasks. Moreover, LLMs can learn from such fine-grained information to uncover hidden relationships between specific functional groups and molecular properties, thereby advancing molecular design and drug discovery. Here, we introduce FGBench, a dataset comprising 625K molecular property reasoning problems with functional group information. Functional groups are precisely annotated and localized within the molecule, which ensures the dataset's interoperability, thereby facilitating further multimodal applications. FGBench includes both regression and classification tasks on 245 different functional groups across three categories for molecular property reasoning: (1) single functional group impacts, (2) multiple functional group interactions, and (3) direct molecular comparisons. In a benchmark of state-of-the-art LLMs on 7K curated data points, the results indicate that current LLMs struggle with FG-level property reasoning, highlighting the need to enhance reasoning capabilities in LLMs for chemistry tasks. We anticipate that the methodology employed in FGBench to construct datasets with functional group-level information will serve as a foundational framework for generating new question-answer pairs, enabling LLMs to better understand fine-grained molecular structure-property relationships. The dataset and evaluation code are available at https://github.com/xuanliugit/FGBench.

Conformational landscapes in cryo-ET data based on MD simulations

2026-02-15T22:51:11Z

Cryo-electron tomography (cryo-ET) provides a unique window into molecular organization in cellular environments (in situ). However, the interpretation of molecular structural information is complicated by several intrinsic properties of cryo-ET data, such as noise, the missing wedge, and continuous conformational variability of the molecules. Additionally, in crowded in situ environments, the number of particles extracted is sometimes small and precludes extensive classification into discrete states. These challenges shift the emphasis from high-resolution structure determination toward validation and interpretation of low-resolution density maps, and analysis of conformational flexibility. Molecular Dynamics (MD) simulations are particularly well suited to this task, as they provide a physically grounded way to explore continuous conformational transitions consistent with both experimental data and molecular energetics. This review focuses on the roles of MD simulations in cryo-ET, emphasizing their use in emerging methods for conformational landscape determination and their contribution to gaining new biological insight.

A Novel 4-D Dataset Paradigm for Studying Complete Ligand-Protein Dissociation Dynamics

2026-02-14T12:55:15Z

The kinetics and dynamics of drug-protein binding and dissociation are crucial to understanding drug absorption and metabolism. Despite advances in artificial intelligence (AI) tools for drug-protein interaction studies, existing training datasets remain limited to static structures or quasi-static conformations. This paper proposes a novel computational approach for rapidly generating drug-protein dissociation trajectories and presents the inaugural dynamically time-resolved 4-D (t, x, y, z) trajectory database DD-13M. This dataset captures over 26,000 complete dissociation processes for 565 ligand-protein complexes, providing nearly 13 million frames of all-atom simulation trajectories. A deep equivariant generative model, UnbindingFlow, was trained using the DD-13M dataset. This model has the capacity to produce dissociation trajectories for novel targets whilst accurately predicting their rate constants (koff). DD-13M introduces a new type of training dataset for AI models, establishing a de novo paradigm for studying the dynamics of drug-protein interactions.

Hermes: Large DEL Datasets Train Generalizable Protein-Ligand Binding Prediction Models

2026-02-13T22:27:52Z

The quality and consistency of training data remain critical bottlenecks for protein-ligand binding prediction. Public affinity datasets, aggregated from thousands of labs and assay formats, introduce biases that limit model generalization and complicate evaluation. DNA-encoded chemical libraries (DELs) offer a potential solution: unified experimental protocols generating massive binding datasets across diverse chemical and protein target space. We present Hermes, a lightweight transformer trained exclusively on DEL data from screens against hundreds of protein targets, representing one of the largest and most protein-diverse DEL training sets applied to protein-ligand interaction (PLI) modeling to date. Despite never seeing traditional affinity measurements during training, Hermes generalizes to held-out targets, novel chemical scaffolds, and external benchmarks derived from public binding data and high-throughput screens. Our results demonstrate that DEL data alone captures transferable protein-ligand interaction representations, while Hermes' minimal architecture enables inference speeds suitable for large-scale virtual screening.

Protect$^*$: Steerable Retrosynthesis through Neuro-Symbolic State Encoding

2026-02-13T19:41:55Z

Large Language Models (LLMs) have shown remarkable potential in scientific domains like retrosynthesis; yet, they often lack the fine-grained control necessary to navigate complex problem spaces without error. A critical challenge is directing an LLM to avoid specific, chemically sensitive sites on a molecule - a task where unconstrained generation can lead to invalid or undesirable synthetic pathways. In this work, we introduce Protect$^*$, a neuro-symbolic framework that grounds the generative capabilities of LLMs in rigorous chemical logic. Our approach combines automated rule-based reasoning - using a comprehensive database of 55+ SMARTS patterns and 40+ characterized protecting groups - with the generative intuition of neural models. The system operates via a hybrid architecture: an ``automatic mode'' where symbolic logic deterministically identifies and guards reactive sites, and a ``human-in-the-loop mode'' that integrates expert strategic constraints. Through ``active state tracking,'' we inject hard symbolic constraints into the neural inference process via a dedicated protection state linked to canonical atom maps. We demonstrate this neuro-symbolic approach through case studies on complex natural products, including the discovery of a novel synthetic pathway for Erythromycin B, showing that grounding neural generation in symbolic logic enables reliable, expert-level autonomy.

Structural barriers of the discrete Hasimoto map applied to protein backbone geometry

2026-02-13T18:16:56Z

Determining the three-dimensional structure of a protein from its amino-acid sequence remains a fundamental problem in biophysics. The discrete Frenet geometry of the C$_α$ backbone can be mapped, via a Hasimoto-type transform, onto a complex scalar field $ψ=κ\,e^{i\sum τ}$ satisfying a discrete nonlinear Schrödinger equation (DNLS), whose soliton solutions reproduce observed secondary-structure motifs. Whether this mapping, which provides an elegant geometric description of folded states, can be extended to a predictive framework for protein folding remains an open question. We derive an exact closed-form decomposition of the DNLS effective potential $V_{\text{eff}}=V_{\text{re}}+iV_{\text{im}}$ in terms of curvature ratios and torsion angles, validating the result to machine precision across 856 non-redundant proteins. Our analysis identifies three structural barriers to forward prediction: (i)~$V_{\text{im}}$ encodes chirality via the odd symmetry of $\sin τ$, accounting for ${\sim}31\%$ of the total information and implying a $2^N$ degeneracy if neglected; (ii)~$V_{\text{re}}$ is determined primarily (${\sim}95\%$) by local geometry, rendering it effectively sequence-agnostic; and (iii)~self-consistent field iterations fail to recover native structures (mean RMSD $= 13.1$\,Å) even with hydrogen-bond terms, yielding torsion correlations indistinguishable from zero. Constructively, we demonstrate that the residual of the DNLS dispersion relation serves as a geometric order parameter for $α$-helices (ROC AUC $= 0.72$), defining them as regions of maximal integrability. These findings establish that the Hasimoto map functions as a kinematic identity rather than a dynamical governing equation, presenting fundamental obstacles to its use as a predictive framework for protein folding.
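
As a rough illustration of the mapping from backbone geometry to the complex field, the sketch below computes discrete bond angles $κ$ and torsion angles $τ$ from a chain of points and forms $κ\,e^{i\sum τ}$. The index alignment between $κ$ and the cumulative torsion, the idealized helix parameters, and the function names are assumptions, not the paper's conventions. The mirror-image check shows why torsion sign carries chirality.

```python
import numpy as np

def discrete_frenet(coords):
    """Bond angles (kappa) and signed torsion angles (tau) of a discrete chain."""
    bonds = np.diff(coords, axis=0)
    t = bonds / np.linalg.norm(bonds, axis=1, keepdims=True)    # unit tangents
    kappa = np.arccos(np.clip(np.sum(t[:-1] * t[1:], axis=1), -1.0, 1.0))
    b = np.cross(t[:-1], t[1:])                                 # binormals
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    # Signed angle between consecutive binormals, measured about the shared tangent.
    cos_tau = np.sum(b[:-1] * b[1:], axis=1)
    sin_tau = np.sum(np.cross(b[:-1], b[1:]) * t[1:-1], axis=1)
    tau = np.arctan2(sin_tau, cos_tau)
    return kappa, tau

def hasimoto_field(kappa, tau):
    """psi_n = kappa_n * exp(i * cumulative torsion); alignment is an assumed convention."""
    return kappa[1:] * np.exp(1j * np.cumsum(tau))

# Idealized helix of 12 points (radius 2.3, 100 degrees per step, rise 1.5);
# these are illustrative values for a toy curve, not fitted to any structure.
n = np.arange(12)
angle = np.deg2rad(100.0) * n
helix = np.column_stack([2.3 * np.cos(angle), 2.3 * np.sin(angle), 1.5 * n])

kappa, tau = discrete_frenet(helix)
psi = hasimoto_field(kappa, tau)

# Mirror image: reflect z. Bond angles are unchanged, torsion flips sign.
kappa_mirror, tau_mirror = discrete_frenet(helix * np.array([1.0, 1.0, -1.0]))

print("kappa constant:", np.allclose(kappa, kappa[0]))
print("tau constant:  ", np.allclose(tau, tau[0]))
print("tau flips under reflection:", np.allclose(tau_mirror, -tau))
```

Because $κ$ is identical for the two mirror images, any quantity built from $κ$ and $\cos τ$ alone cannot tell them apart, which is consistent with the role the abstract assigns to the $\sin τ$ term.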

Cross-Chirality Generalization by Axial Vectors for Hetero-Chiral Protein-Peptide Interaction Design

2026-02-13T02:46:29Z

D-peptide binders targeting L-proteins have promising therapeutic potential. Despite rapid advances in machine learning-based target-conditioned peptide design, generating D-peptide binders remains largely unexplored. In this work, we show that by injecting axial features into $E(3)$-equivariant (polar) vector features, it is feasible to achieve cross-chirality generalization from homo-chiral (L--L) training data to hetero-chiral (D--L) design tasks. By implementing this method within a latent diffusion model, we achieved D-peptide binder design that not only outperforms existing tools in in silico benchmarks, but also demonstrates efficacy in wet-lab validation. To our knowledge, our approach represents the first wet-lab validated generative AI for the de novo design of D-peptide binders, offering new perspectives on handling chirality in protein design.

Distinguishing life from non-life via molecular frontier orbital energy gaps

2026-02-11T23:48:37Z

Amino acids (AAs) are a key target in the search for life beyond Earth due to their extensive role in the machinery of all known life, persistence over geologic timescales, and analytical detectability. However, AAs can also arise from abiotic processes on planets and in space. For example, material from asteroid Bennu contained 33 AAs, including 15 of the 20 proteinogenic AAs that are fundamental to life's functions. Distinguishing life from non-life based on AAs in a sample remains an unsolved problem, particularly when their isotopic and structural signatures (e.g., chirality) could be altered via physicochemical processes. Here we introduce LUMOS (Life Unveiled via Molecular Orbital Signatures), a statistical framework that distinguishes life from non-life by analyzing the distribution of abundance-weighted HOMO-LUMO gap (HLG) values of AAs within a sample. Compilation of AA datasets from diverse environments and provenances revealed that abiotic samples display highly uniform distributions of AA HLGs. In contrast, biotic samples show greater variance and a preference towards AAs with lower HLG, likely reflecting the need for life to control when, where, and how chemical reactions occur. LUMOS achieves >95% accuracy in distinguishing biotic versus abiotic provenance across diverse environmental and extraterrestrial conditions. These results suggest that varied molecular reactivity within biochemical systems may be a universal feature of life, representing an agnostic biosignature unlinked to the specific set of AAs used by life as we know it. LUMOS is compatible with existing analytical instrumentation and applicable to returned samples or in situ analyses. Broader characterization of abiotic and biotic environments will further refine the chemical boundaries separating biotic from abiotic chemical systems.

Out-of-equilibrium selection pressure enhances inference from protein sequence data

2026-02-11T23:00:31Z

Homologous proteins have similar three-dimensional structures and biological functions that shape their sequences. The resulting coevolution-driven correlations underlie methods from Potts models to AlphaFold, which infer protein structure and function from sequences. Using a minimal model, we show that fluctuating selection strength and the onset of new selection pressures improve coevolution-based inference of structural contacts. Our conclusions extend to realistic synthetic data and to the inference of interaction partners. Out-of-equilibrium noise arising from ubiquitous variations in natural selection thus enhances, rather than hinders, the success of inference from protein sequences.

Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics

2026-02-11T16:42:29Z

Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics. We present STAR-MD (Spatio-Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics--substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR-MD successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long-horizon generation, while demonstrating that STAR-MD's joint spatio-temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.

Towards Universal Spatial Transcriptomics Super-Resolution: A Generalist Physically Consistent Flow Matching Framework

2026-02-11T08:44:40Z

Spatial transcriptomics provides an unprecedented perspective for deciphering tissue spatial heterogeneity. However, high-resolution spatial transcriptomic technology remains constrained by limited gene coverage, technical complexity, and high cost. Existing methods for spatial transcriptomics super-resolution from low-resolution data suffer from two fundamental limitations: poor out-of-distribution generalization stemming from a neglect of inherent biological heterogeneity, and a lack of physical consistency. To address these challenges, we propose SRast, a novel physically constrained generalist framework designed for robust spatial transcriptomics super-resolution. To tackle heterogeneity, SRast employs a decoupling architecture that explicitly separates gene semantics representation from spatial geometry deconvolution, utilizing self-supervised learning to align latent distributions and mitigate cross-sample shifts. Regarding physical priors, SRast reformulates the task as ratio prediction on the simplex, using a flow matching model to learn optimal transport-based geometric transformations that strictly enforce local mass conservation. Extensive experiments across diverse species, tissues, and platforms demonstrate that SRast achieves state-of-the-art performance, exhibiting superior zero-shot generalization capabilities and ensuring physical consistency in recovering fine-grained biological structures.
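
The idea of ratio prediction on the simplex can be sketched as follows. This only illustrates why predicting ratios conserves mass by construction; it is not SRast's flow matching model. The function `split_spot`, the softmax parameterization, and the random logits standing in for a model output are all assumptions.

```python
import numpy as np

def split_spot(spot_counts, logits):
    """Distribute one low-resolution spot's counts over its sub-spots.

    spot_counts: array of shape (genes,), the measured counts for the spot.
    logits:      array of shape (sub_spots, genes), hypothetical model outputs.

    A softmax over the sub-spot axis puts each gene's ratios on the simplex,
    so the sub-spot counts sum exactly to the original spot count.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    ratios = np.exp(shifted)
    ratios /= ratios.sum(axis=0, keepdims=True)
    return ratios * spot_counts

rng = np.random.default_rng(0)
spot_counts = np.array([12.0, 0.0, 7.0])      # toy counts for three genes
logits = rng.normal(size=(4, 3))              # stand-in for predictions on 4 sub-spots

sub_spots = split_spot(spot_counts, logits)
print("per-gene totals:", sub_spots.sum(axis=0))
```

Whatever the logits are, the per-gene totals match the input counts and no sub-spot receives a negative value, which is the sense of local mass conservation used here.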

Drug Release Modeling using Physics-Informed Neural Networks

2026-02-10T16:51:50Z

Accurate modeling of drug release is essential for designing and developing controlled-release systems. Classical models (Fick, Higuchi, Peppas) rely on simplifying assumptions that limit their accuracy in complex geometries and release mechanisms. Here, we propose a novel approach using Physics-Informed Neural Networks (PINNs) and Bayesian PINNs (BPINNs) for predicting release from planar, 1D-wrinkled, and 2D-crumpled films. This approach uniquely integrates Fick's diffusion law with limited experimental data to enable accurate long-term predictions from short-term measurements, and is systematically benchmarked against classical drug release models. We embedded Fick's second law into the PINN loss using 10,000 Latin-hypercube collocation points and utilized previously published experimental datasets to assess drug release performance through mean absolute error (MAE) and root mean square error (RMSE), considering noisy conditions and limited-data scenarios. Our approach reduced mean error by up to 40% relative to classical baselines across all film types. The PINN formulation achieved RMSE <0.05 utilizing only the first 6% of the release time data (reducing 94% of release time required for the experiments) for the planar film. For wrinkled and crumpled films, the PINN reached RMSE <0.05 with 33% of the release time data. BPINNs provide tighter and more reliable uncertainty quantification under noise. By combining physical laws with experimental data, the proposed framework yields highly accurate long-term release predictions from short-term measurements, offering a practical route for accelerated characterization and more efficient early-stage drug release system formulation.
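
To make concrete what embedding Fick's second law as a loss term means, the sketch below evaluates the residual of $\partial c/\partial t = D\,\partial^2 c/\partial x^2$ at random collocation points. It makes several simplifying assumptions that differ from the paper: a closed-form candidate function replaces the neural network, central finite differences replace automatic differentiation, uniform random points replace Latin-hypercube sampling, and the diffusivity and domain are arbitrary.

```python
import numpy as np

D = 0.1  # assumed diffusion coefficient (arbitrary units)

def exact_solution(x, t):
    """A known solution of Fick's second law on the unit interval."""
    return np.exp(-D * np.pi**2 * t) * np.sin(np.pi * x)

def wrong_solution(x, t):
    """Same spatial profile but an incorrect decay rate, so it violates the PDE."""
    return np.exp(-3.0 * t) * np.sin(np.pi * x)

def fick_residual(c, x, t, h=1e-3):
    """Residual dc/dt - D * d2c/dx2, estimated with central finite differences."""
    dc_dt = (c(x, t + h) - c(x, t - h)) / (2.0 * h)
    d2c_dx2 = (c(x + h, t) - 2.0 * c(x, t) + c(x - h, t)) / h**2
    return dc_dt - D * d2c_dx2

def physics_loss(c, x, t):
    """Mean squared PDE residual over the collocation points."""
    return float(np.mean(fick_residual(c, x, t) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)   # collocation points in space
t = rng.uniform(0.0, 1.0, size=1000)   # collocation points in time

exact_loss = physics_loss(exact_solution, x, t)
wrong_loss = physics_loss(wrong_solution, x, t)
print("loss for a true solution: ", exact_loss)
print("loss for a wrong candidate:", wrong_loss)
```

In a PINN, a term of this kind is typically added to the data-fitting error so that candidates inconsistent with the diffusion law are penalized even where no measurements exist.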