https://arxiv.org/api/4H8PqHCH1CXwTTEqPMKl4qYK/aI
2026-03-22T08:56:29Z
664213515

http://arxiv.org/abs/2601.00505v1
Effect of Electric Charge on Biotherapeutic Transport, Binding and Absorption: A Computational Study
2026-01-01T23:11:29Z
This study explores the effects of electric charge on the dynamics of drug transport and absorption in subcutaneous injections of monoclonal antibodies (mAbs). We develop a novel mathematical and computational model, based on the Nernst-Planck equations and porous media flow theory, to investigate the complex interactions between mAbs and charged species in subcutaneous tissue. The model enables us to study short-term transport dynamics and long-term binding and absorption for two mAbs with different electric properties. We examine the influence of buffer pH, body mass index, injection depth, and formulation concentration on drug distribution and compare our numerical results with experimental data from the literature.
2026-01-01T23:11:29Z
27 pages, 13 figures
Mario de Lucio, Pavlos P. Vlachos, Hector Gomez

http://arxiv.org/abs/2512.24643v2
Diagnosing Heteroskedasticity and Resolving Multicollinearity Paradoxes in Physicochemical Property Prediction
2026-01-01T10:32:53Z
Lipophilicity (logP) prediction remains central to drug discovery, yet linear regression models for this task frequently violate statistical assumptions in ways that invalidate their reported performance metrics. We analyzed 426,850 bioactive molecules from a rigorously curated intersection of PubChem, ChEMBL, and eMolecules databases, revealing severe heteroskedasticity in linear models predicting computed logP values (XLOGP3): residual variance increases 4.2-fold in lipophilic regions (logP greater than 5) compared to balanced regions (logP 2 to 4). Classical remediation strategies (Weighted Least Squares and Box-Cox transformation) failed to resolve this violation (Breusch-Pagan p-value less than 0.0001 for all variants).
Tree-based ensemble methods (Random Forest R-squared of 0.764, XGBoost R-squared of 0.765) proved inherently robust to heteroskedasticity while delivering superior predictive performance. SHAP analysis resolved a critical multicollinearity paradox: despite a weak bivariate correlation of 0.146, molecular weight emerged as the single most important predictor (mean absolute SHAP value of 0.573), with its effect suppressed in simple correlations by confounding with topological polar surface area (TPSA). These findings demonstrate that standard linear models face fundamental challenges for computed lipophilicity prediction and provide a principled framework for interpreting ensemble models in QSAR applications.
2025-12-31T05:32:13Z
7 pages, 4 figures, 3 tables, to be published in KST 2026, unabridged version exists as arXiv:2512.24643v1
Proc. 2026 Int. Conf. Knowl. Smart Technol. (KST), 2026, pp. 645-651
Malikussaid, Septian Caesar Floresko, Ade Romadhony, Isman Kurniawan, Warih Maharani, Hilal Hudan Nuha
10.1109/KST67832.2026.11431952

http://arxiv.org/abs/2512.24354v1
SeedFold: Scaling Biomolecular Structure Prediction
2025-12-30T17:05:01Z
Highly accurate biomolecular structure prediction is a key component of developing biomolecular foundation models, and one of the most critical aspects of building foundation models is identifying the recipes for scaling the model. In this work, we present SeedFold, a folding model that successfully scales up the model capacity. Our contributions are threefold: first, we identify an effective width-scaling strategy for the Pairformer to increase representation capacity; second, we introduce a novel linear triangular attention that reduces computational complexity to enable efficient scaling; finally, we construct a large-scale distillation dataset to substantially enlarge the training set.
Experiments on FoldBench show that SeedFold outperforms AlphaFold3 on most protein-related tasks.
2025-12-30T17:05:01Z
Yi Zhou, Chan Lu, Yiming Ma, Wei Qu, Fei Ye, Kexin Zhang, Lan Wang, Minrui Gui, Quanquan Gu

http://arxiv.org/abs/2601.00863v1
Selective Imperfection as a Generative Framework for Analysis, Creativity and Discovery
2025-12-30T11:14:51Z
We introduce materiomusic as a generative framework linking the hierarchical structures of matter with the compositional logic of music. Across proteins, spider webs and flame dynamics, vibrational and architectural principles recur as tonal hierarchies, harmonic progressions, and long-range musical form. Using reversible mappings, from molecular spectra to musical tones and from three-dimensional networks to playable instruments, we show how sound functions as a scientific probe, an epistemic inversion where listening becomes a mode of seeing and musical composition becomes a blueprint for matter. These mappings excavate deep time: patterns originating in femtosecond molecular vibrations or billion-year evolutionary histories become audible. We posit that novelty in science and art emerges when constraints cannot be satisfied within existing degrees of freedom, forcing expansion of the space of viable configurations. Selective imperfection provides the mechanism restoring balance between coherence and adaptability. Quantitative support comes from exhaustive enumeration of all 2^12 musical scales, revealing that culturally significant systems cluster in a mid-entropy, mid-defect corridor, directly paralleling the Hall-Petch optimum where intermediate defect densities maximize material strength. Iterating these mappings creates productive collisions between human creativity and physics, generating new information as musical structures encounter evolutionary constraints.
We show how swarm-based AI models compose music exhibiting human-like structural signatures such as small-world connectivity, modular integration, and long-range coherence, suggesting a route beyond interpolation toward invention. We show that science and art are generative acts of world-building under constraint, with vibration as a shared grammar organizing structure across scales.
2025-12-30T11:14:51Z
Markus J. Buehler

http://arxiv.org/abs/2512.23784v1
Sheaf-theoretic representation of the proteolipid code
2025-12-29T16:25:09Z
Membrane particles such as proteins and lipids organize into zones that perform unique functions. Here, I introduce a topological and category-theoretic framework to represent particle and zone intra-scale interactions and inter-scale coupling. This involves carefully demarcating between different presheaf- or sheaf-assigned data levels to preserve functorial structure and account for particle and zone generalized poses. The framework can accommodate Hamiltonian mechanics, enabling dynamical modeling. This amounts to a versatile mathematical formalism for membrane structure and multiscale coupling.
2025-12-29T16:25:09Z
16 pages, 3 figures
Troy A. Kervin

http://arxiv.org/abs/2512.23175v1
HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction
2025-12-29T03:29:54Z
Therapeutic peptides have emerged as a pivotal modality in modern drug discovery, occupying a chemically and topologically rich space. While accurate prediction of their physicochemical properties is essential for accelerating peptide development, existing molecular language models rely on representations that fail to capture this complexity. Atom-level SMILES notation generates long token sequences and obscures cyclic topology, whereas amino-acid-level representations cannot encode the diverse chemical modifications central to modern peptide design.
To bridge this representational gap, the Hierarchical Editing Language for Macromolecules (HELM) offers a unified framework enabling precise description of both monomer composition and connectivity, making it a promising foundation for peptide language modeling. Here, we propose HELM-BERT, the first encoder-based peptide language model trained on HELM notation. Based on DeBERTa, HELM-BERT is specifically designed to capture hierarchical dependencies within HELM sequences. The model is pre-trained on a curated corpus of 39,079 chemically diverse peptides spanning linear and cyclic structures. HELM-BERT significantly outperforms state-of-the-art SMILES-based language models in downstream tasks, including cyclic peptide membrane permeability prediction and peptide-protein interaction prediction. These results demonstrate that HELM's explicit monomer- and topology-aware representations offer substantial data-efficiency advantages for modeling therapeutic peptides, bridging a long-standing gap between small-molecule and protein language models.
2025-12-29T03:29:54Z
35 pages; includes Supplementary Information
Seungeon Lee, Takuto Koyama, Itsuki Maeda, Shigeyuki Matsumoto, Yasushi Okuno

http://arxiv.org/abs/2512.23080v1
QSAR-Guided Generative Framework for the Discovery of Synthetically Viable Odorants
2025-12-28T21:06:01Z
The discovery of novel odorant molecules is key for the fragrance and flavor industries, yet efficiently navigating the vast chemical space to identify structures with desirable olfactory properties remains a significant challenge. Generative artificial intelligence offers a promising approach for de novo molecular design but typically requires large sets of molecules to learn from. To address this problem, we present a framework combining a variational autoencoder (VAE) with a quantitative structure-activity relationship (QSAR) model to generate novel odorants from limited training sets of odor molecules.
The self-supervised learning capabilities of the VAE allow it to learn SMILES grammar from the ChEMBL database, while its training objective is augmented with a loss term derived from an external QSAR model to structure the latent representation according to odor probability. While the VAE demonstrated high internal consistency in learning the QSAR supervision signal, validation against an external, unseen ground truth dataset (Unique Good Scents) confirms the model generates syntactically valid structures (100% validity achieved via rejection sampling) and 94.8% unique structures. The latent space is effectively structured by odor likelihood, evidenced by a Fréchet ChemNet Distance (FCD) of ≈ 6.96 between generated molecules and known odorants, compared to ≈ 21.6 for the ChEMBL baseline. Structural analysis via Bemis-Murcko scaffolds reveals that 74.4% of candidates possess novel core frameworks distinct from the training data, indicating the model performs extensive chemical space exploration beyond simple derivatization of known odorants. Generated candidates display physicochemical properties ....
2025-12-28T21:06:01Z
Tim C. Pearce, Ahmed Ibrahim

http://arxiv.org/abs/2512.22820v1
Epigenetic state encodes locus-specific chromatin mechanics
2025-12-28T07:15:13Z
Chromatin is repeatedly deformed in vivo during transcription, nuclear remodeling, and confined migration - yet how mechanical response varies from locus to locus, and how it relates to epigenetic state, remains unclear. We develop a theory to infer locus-specific viscoelasticity from three-dimensional genome organization. Using chromatin structures derived from contact maps, we calculate frequency-dependent storage and loss moduli for individual loci and establish that the mechanical properties are determined both by chromatin epigenetic marks and organization.
On large length scales, chromatin exhibits Rouse-like viscoelastic scaling, but this coarse behavior masks extensive heterogeneity at the single-locus level. Loci segregate into two mechanical subpopulations with distinct longest relaxation times: one characterized by single-timescale and another by multi-timescale relaxation. The multi-timescale loci are strongly enriched in active marks, and the longest relaxation time for individual loci correlates inversely with effective local stiffness. Pull-release simulations further predict a time-dependent susceptibility: H3K27ac-rich loci deform more under sustained forcing yet can resist brief, large impulses. At finer genomic scales, promoters, enhancers, and gene bodies emerge as "viscoelastic islands" aligned with their focal interactions. Together, these results suggest that chromatin viscoelasticity is an organized, epigenetically coupled property of the 3D genome, providing a mechanistic layer that may influence enhancer-promoter communication, condensate-mediated organization, and response to cellular mechanical stress. The prediction that locus-specific mechanics in chromatin are controlled by 3D structures as well as the epigenetic states is amenable to experimental test.
2025-12-28T07:15:13Z
Also available on bioRxiv (doi: 10.64898/2025.12.27.696709)
Guang Shi, D. Thirumalai

http://arxiv.org/abs/2512.20924v1
Clever Hans in Chemistry: Chemist Style Signals Confound Activity Prediction on Public Benchmarks
2025-12-24T04:04:20Z
Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure-activity relationships. We test this by linking ChEMBL assays to publication authors and training a 1,815-class classifier to predict authors from molecular fingerprints, achieving 60% top-5 accuracy under scaffold-based splitting.
We then train an activity model that receives only a protein identifier and an author-probability vector derived from structure, with no direct access to molecular descriptors. This author-only model achieves predictive power comparable to a simple baseline that has access to structure. This reveals a "Clever Hans" failure mode: models can predict bioactivity largely by inferring chemist goals and favorite targets without requiring a lab-independent understanding of chemistry. We analyze the sources of this leakage, propose author-disjoint splits, and recommend dataset practices to decouple chemist intent from biological outcomes.
2025-12-24T04:04:20Z
Andrew D. Blevins, Ian K. Quigley

http://arxiv.org/abs/2508.12029v3
BConformeR: A Conformer Based on Mutual Sampling for Unified Prediction of Continuous and Discontinuous Antibody Binding Sites
2025-12-23T13:32:11Z
Accurate prediction of antibody-binding sites (epitopes) on antigens is crucial for vaccine design, immunodiagnostics, therapeutic antibody development, antibody engineering, research into autoimmune and allergic diseases, and advancing our understanding of immune responses. Although in silico methods have been proposed to predict both linear (continuous) and conformational (discontinuous) epitopes, they consistently underperform on conformational epitopes. In this work, we propose Conformer-based models trained separately on AlphaFold-predicted structures and experimentally determined structures, leveraging convolutional neural networks (CNNs) to extract local features and Transformers to capture long-range dependencies within antigen sequences. Ablation studies demonstrate that the CNN enhances the prediction of linear epitopes, and the Transformer module improves the prediction of conformational epitopes.
Experimental results show that our model outperforms existing baselines in terms of MCC, ROC-AUC, PR-AUC, and F1 scores on both linear and conformational epitopes.
2025-08-16T12:31:39Z
Zhangyu You, Jiahao Ma, Hongzong Li, Ye-Fan Hu, Jian-Dong Huang

http://arxiv.org/abs/2512.20263v1
Drug-like antibodies with low immunogenicity in human panels designed with Latent-X2
2025-12-23T11:17:59Z
Drug discovery has long sought computational systems capable of designing drug-like molecules directly: developable and non-immunogenic from the start. Here we introduce Latent-X2, a frontier generative model that achieves this goal through zero-shot design of antibodies with strong binding affinities, drug-like properties, and, for the first time for any de novo generated antibody, confirmed low immunogenicity in human donor panels. Latent-X2 is an all-atom model conditioned on target structure, epitope specification, and optional antibody framework, jointly generating sequences and structures while modelling the bound complex. Testing only 4 to 24 designs per target in each modality, we successfully generated VHH and scFv antibodies against 9 of 18 evaluated targets, achieving a 50% target-level success rate with picomolar to nanomolar binding affinities. Designed molecules exhibit developability profiles that match or exceed those of approved antibody therapeutics, including expression yield, aggregation propensity, polyreactivity, hydrophobicity, and thermal stability, without optimization, filtering, or selection. In the first immunogenicity assessment of any AI-generated antibody, representative de novo VHH binders targeting TNFL9 exhibit both potent target engagement and low immunogenicity across T-cell proliferation and cytokine release assays. The model generalizes beyond antibodies: against K-Ras, long considered undruggable, we generated macrocyclic peptide binders competitive with trillion-scale mRNA display screens.
These properties emerge directly from the model, demonstrating the therapeutic viability of zero-shot molecular design, now available without AI infrastructure or coding expertise at https://platform.latentlabs.com.
2025-12-23T11:17:59Z
Robin Rombach and Alexander W. R. Nelson contributed to this work as advisors to Latent Labs
Latent Labs Team, Henry Kenlay, Daniella Pretorius, Jonathan Crabbé, Alex Bridgland, Sebastian M. Schmon, Agrin Hilmkil, James Vuckovic, Simon Mathis, Tomas Matteson, Rebecca Bartke-Croughan, Amir Motmaen, Robin Rombach, Mária Vlachynská, Alexander W. R. Nelson, David Yuan, Annette Obika, Simon A. A. Kohl

http://arxiv.org/abs/2512.19939v1
Methods for Analyzing RNA Pseudoknots via Chord Diagrams and Intersection Graphs
2025-12-23T00:13:18Z
RNA molecules are known to form complex secondary structures including pseudoknots. A systematic framework for the enumeration, classification and prediction of secondary structures is critical to determine the biological significance of the molecular configurations of RNA. Chord diagrams are mathematical objects widely used to represent RNA secondary structures and to analyze structural motifs; however, a mathematically rigorous enumeration of pseudoknots remains a challenge. We introduce a method that incorporates a distance-based metric τ to analyze the intersection graph of a chord diagram associated with a pseudoknotted structure. In particular, our method formally defines a pseudoknot in terms of a weighted vertex cover of a certain intersection graph constructed from a partition of the chord diagram representing the nucleotide sequence of the RNA molecule. In this graph-theoretic context, we introduce a rigorous algorithm that enumerates pseudoknots, classifies secondary structures, and is sensitive to three-dimensional topological features. We implement our methods in MATLAB and test the algorithm on pseudoknotted structures from the bpRNA-1m database.
Our findings confirm that genus is a robust quantifier of pseudoknot complexity.
2025-12-23T00:13:18Z
26 pages, 14 figures, 4 tables
Rayan Ibrahim, Allison H. Moore

http://arxiv.org/abs/2507.18545v2
The unreasonable likelihood of being: origin of life, terraforming, and AI
2025-12-21T12:07:19Z
The origin of life on Earth via the spontaneous emergence of a protocell prior to Darwinian evolution remains a fundamental open question in physics and chemistry. Here, we develop a conceptual framework based on information theory and algorithmic complexity. Using estimates grounded in modern computational models, we evaluate the difficulty of assembling structured biological information under plausible prebiotic conditions. Our results highlight the formidable entropic and informational barriers to forming a viable protocell within the available window of Earth's early history. While the idea of Earth being terraformed by advanced extraterrestrials might violate Occam's razor from within mainstream science, directed panspermia -- originally proposed by Francis Crick and Leslie Orgel -- remains a speculative but logically open alternative. Ultimately, uncovering physical principles for life's spontaneous emergence remains a grand challenge for biological physics.
2025-07-24T16:10:46Z
18 pages, 4 figures
Robert G. Endres

http://arxiv.org/abs/2512.17815v1
Structure-Aware Antibody Design with Affinity-Optimized Inverse Folding
2025-12-19T17:20:05Z
Motivation: The clinical efficacy of antibody therapeutics critically depends on high-affinity target engagement, yet laboratory affinity-maturation campaigns are slow and costly. In computational settings, most protein language models (PLMs) are not trained to favor high-affinity antibodies, and existing preference optimization approaches introduce substantial computational overhead without clear affinity gains.
Therefore, this work proposes SimBinder-IF, which converts the inverse folding model ESM-IF into an antibody sequence generator by freezing its structure encoder and training only its decoder to prefer experimentally stronger binders through preference optimization.
Results: On the 11-assay AbBiBench benchmark, SimBinder-IF achieves a 55 percent relative improvement in mean Spearman correlation between log-likelihood scores and experimentally measured binding affinity compared to vanilla ESM-IF (from 0.264 to 0.410). In zero-shot generalization across four unseen antigen-antibody complexes, the correlation improves by 156 percent (from 0.115 to 0.294). SimBinder-IF also outperforms baselines in top-10 precision for ten-fold or greater affinity improvements. A case study redesigning antibody F045-092 for A/California/04/2009 (pdmH1N1) shows that SimBinder-IF proposes variants with substantially lower predicted binding free energy changes than ESM-IF (mean ΔΔG -75.16 vs -46.57). Notably, SimBinder-IF trains only about 18 percent of the parameters of the full ESM-IF model, highlighting its parameter efficiency for high-affinity antibody generation.
2025-12-19T17:20:05Z
Xinyan Zhao, Yi-Ching Tang, Rivaaj Monsia, Victor J. Cantu, Ashwin Kumar Ramesh, Xiaozhong Liu, Zhiqiang An, Xiaoqian Jiang, Yejin Kim

http://arxiv.org/abs/2508.12060v2
Applied causality to infer protein dynamics and kinetics
2025-12-19T12:38:07Z
The use of generative machine learning models, trained on the experimentally resolved structures deposited in the protein data bank, is an attractive approach to sampling conformational ensembles of proteins. However, the ensembles generated by these models lack timescale or causal information. We use the structural ensembles generated from AlphaFold2 at a range of MSA depths to parameterize the potential of mean force of an overdamped, memory-free, coarse-grained Langevin equation. This approach couples the AlphaFold2 ensembles to a causal model, allowing us to estimate the timescales spanned by the ensembles generated at each MSA depth. Performing this analysis on six variants of HIV-1 protease, we confirm an inverse relationship between MSA depth and the timescale of an ensemble's conformational fluctuations.
The MSA depth essentially serves as a conformational restraint, and AlphaFold2 is generally able to probe timescales at or below those seen in microsecond-long, unbiased molecular dynamics simulations. We conclude by generalizing this approach to other generative structural ensemble-prediction methods as well as co-folding models, in this case the biologically functional HIV-1 protease dimer.
2025-08-16T14:25:55Z
24 pages; 8 figures; TOC figure
Akashnathan Aranganathan, Eric R. Beyerle