http://arxiv.org/abs/2601.00505v1 Effect of Electric Charge on Biotherapeutic Transport, Binding and Absorption: A Computational Study 2026-01-01T23:11:29Z

This study explores the effects of electric charge on the dynamics of drug transport and absorption in subcutaneous injections of monoclonal antibodies (mAbs). We develop a novel mathematical and computational model, based on the Nernst-Planck equations and porous media flow theory, to investigate the complex interactions between mAbs and charged species in subcutaneous tissue. The model enables us to study short-term transport dynamics and long-term binding and absorption for two mAbs with different electric properties. We examine the influence of buffer pH, body mass index, injection depth, and formulation concentration on drug distribution and compare our numerical results with experimental data from the literature.
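For orientation, the Nernst-Planck flux named above is commonly written in the textbook form below. This is a sketch, not the authors' model: the abstract does not say which terms or closures they retain, and every symbol is an assumption introduced here. $c_i$, $D_i$ and $z_i$ are the concentration, diffusivity and valence of species $i$, $\phi$ is the electric potential, $\mathbf{u}$ is the interstitial fluid velocity, $F$ is the Faraday constant, $R$ the gas constant, and $T$ the temperature.

```latex
\mathbf{J}_i = -D_i \nabla c_i \;-\; \frac{z_i F}{R T}\, D_i\, c_i \nabla \phi \;+\; c_i \mathbf{u},
\qquad
\frac{\partial c_i}{\partial t} + \nabla \cdot \mathbf{J}_i = 0
```

The three flux terms are diffusion, electromigration and advection. Source or sink terms for binding and absorption would appear on the right-hand side of the conservation law and are omitted here.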

2026-01-01T23:11:29Z 27 pages, 13 figures Mario de Lucio Pavlos P. Vlachos Hector Gomez

http://arxiv.org/abs/2512.24643v2 Diagnosing Heteroskedasticity and Resolving Multicollinearity Paradoxes in Physicochemical Property Prediction 2026-01-01T10:32:53Z

Lipophilicity (logP) prediction remains central to drug discovery, yet linear regression models for this task frequently violate statistical assumptions in ways that invalidate their reported performance metrics. We analyzed 426,850 bioactive molecules from a rigorously curated intersection of PubChem, ChEMBL, and eMolecules databases, revealing severe heteroskedasticity in linear models predicting computed logP values (XLOGP3): residual variance increases 4.2-fold in lipophilic regions (logP greater than 5) compared to balanced regions (logP 2 to 4). Classical remediation strategies (Weighted Least Squares and Box-Cox transformation) failed to resolve this violation (Breusch-Pagan p-value less than 0.0001 for all variants). Tree-based ensemble methods (Random Forest R-squared of 0.764, XGBoost R-squared of 0.765) proved inherently robust to heteroskedasticity while delivering superior predictive performance. SHAP analysis resolved a critical multicollinearity paradox: despite a weak bivariate correlation of 0.146, molecular weight emerged as the single most important predictor (mean absolute SHAP value of 0.573), with its effect suppressed in simple correlations by confounding with topological polar surface area (TPSA). These findings demonstrate that standard linear models face fundamental challenges for computed lipophilicity prediction and provide a principled framework for interpreting ensemble models in QSAR applications.
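The Breusch-Pagan diagnostic mentioned above can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the single predictor, the noise model, and the helper names `ols_fit`, `r_squared` and `breusch_pagan_lm` are hypothetical and do not come from the paper. The statistic computed is the studentized (Koenker) variant, n times the R-squared of regressing squared residuals on the predictor.

```python
import random

def ols_fit(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(x, y):
    """Coefficient of determination of the simple regression of y on x."""
    intercept, slope = ols_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

def breusch_pagan_lm(x, y):
    """Studentized Breusch-Pagan statistic: n * R^2 of squared residuals on x."""
    intercept, slope = ols_fit(x, y)
    resid_sq = [(b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)]
    return len(x) * r_squared(x, resid_sq)

# Synthetic data only (not the paper's dataset): the noise standard
# deviation doubles once x exceeds 5, mimicking a high-variance region.
rng = random.Random(0)
x = [rng.uniform(0.0, 8.0) for _ in range(5000)]
y_hetero = [0.5 * a + rng.gauss(0.0, 2.0 if a > 5.0 else 1.0) for a in x]
y_homo = [0.5 * a + rng.gauss(0.0, 1.0) for a in x]

lm_hetero = breusch_pagan_lm(x, y_hetero)
lm_homo = breusch_pagan_lm(x, y_homo)
print(f"LM (heteroskedastic): {lm_hetero:.1f}")
print(f"LM (homoskedastic):   {lm_homo:.1f}")
```

With one predictor, the statistic is compared against a chi-square distribution with one degree of freedom (5% critical value about 3.84); a value far above that rejects constant residual variance.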

2025-12-31T05:32:13Z 7 pages, 4 figures, 3 tables, to be published in KST 2026, unabridged version exists as arXiv:2512.24643v1 Proc. 2026 Int. Conf. Knowl. Smart Technol. (KST), 2026, pp. 645-651 Malikussaid Septian Caesar Floresko Ade Romadhony Isman Kurniawan Warih Maharani Hilal Hudan Nuha 10.1109/KST67832.2026.11431952

http://arxiv.org/abs/2512.24354v1 SeedFold: Scaling Biomolecular Structure Prediction 2025-12-30T17:05:01Z

Highly accurate biomolecular structure prediction is a key component of developing biomolecular foundation models, and one of the most critical aspects of building foundation models is identifying the recipes for scaling the model. In this work, we present SeedFold, a folding model that successfully scales up the model capacity. Our contributions are threefold: first, we identify an effective width-scaling strategy for the Pairformer to increase representation capacity; second, we introduce a novel linear triangular attention that reduces computational complexity to enable efficient scaling; finally, we construct a large-scale distillation dataset to substantially enlarge the training set. Experiments on FoldBench show that SeedFold outperforms AlphaFold3 on most protein-related tasks.

2025-12-30T17:05:01Z Yi Zhou Chan Lu Yiming Ma Wei Qu Fei Ye Kexin Zhang Lan Wang Minrui Gui Quanquan Gu

http://arxiv.org/abs/2601.00863v1 Selective Imperfection as a Generative Framework for Analysis, Creativity and Discovery 2025-12-30T11:14:51Z

We introduce materiomusic as a generative framework linking the hierarchical structures of matter with the compositional logic of music. Across proteins, spider webs and flame dynamics, vibrational and architectural principles recur as tonal hierarchies, harmonic progressions, and long-range musical form. Using reversible mappings, from molecular spectra to musical tones and from three-dimensional networks to playable instruments, we show how sound functions as a scientific probe, an epistemic inversion where listening becomes a mode of seeing and musical composition becomes a blueprint for matter. These mappings excavate deep time: patterns originating in femtosecond molecular vibrations or billion-year evolutionary histories become audible. We posit that novelty in science and art emerges when constraints cannot be satisfied within existing degrees of freedom, forcing expansion of the space of viable configurations. Selective imperfection provides the mechanism restoring balance between coherence and adaptability. Quantitative support comes from exhaustive enumeration of all 2^12 musical scales, revealing that culturally significant systems cluster in a mid-entropy, mid-defect corridor, directly paralleling the Hall-Petch optimum where intermediate defect densities maximize material strength. Iterating these mappings creates productive collisions between human creativity and physics, generating new information as musical structures encounter evolutionary constraints. We show how swarm-based AI models compose music exhibiting human-like structural signatures such as small-world connectivity, modular integration, and long-range coherence, suggesting a route beyond interpolation toward invention. We show that science and art are generative acts of world-building under constraint, with vibration as a shared grammar organizing structure across scales.
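The exhaustive enumeration of 2^12 scales can be sketched by treating each scale as a 12-bit mask over the pitch classes. The abstract does not define its entropy or defect measures, so the step-size entropy below is only an illustrative stand-in, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def scale_steps(mask):
    """Cyclic step sizes (in semitones) between consecutive notes of a 12-bit scale mask."""
    notes = [p for p in range(12) if (mask >> p) & 1]
    return [((notes[(i + 1) % len(notes)] - n - 1) % 12) + 1
            for i, n in enumerate(notes)]

def step_entropy(mask):
    """Shannon entropy (bits) of the step-size distribution; 0.0 for the empty set."""
    steps = scale_steps(mask)
    if not steps:
        return 0.0
    counts = Counter(steps)
    return -sum((c / len(steps)) * log2(c / len(steps)) for c in counts.values())

all_scales = range(2 ** 12)  # every subset of the 12 pitch classes
major = sum(1 << p for p in (0, 2, 4, 5, 7, 9, 11))  # C major as a bitmask
chromatic = 2 ** 12 - 1  # all twelve pitch classes

# Illustrative measure only; not the paper's definition of entropy.
entropies = {mask: step_entropy(mask) for mask in all_scales}
print(len(entropies), scale_steps(major), round(entropies[major], 3))
```

The major scale mixes whole and half steps and so lands at an intermediate value, while the chromatic scale, with only one step size, scores zero under this measure.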

2025-12-30T11:14:51Z Markus J. Buehler

http://arxiv.org/abs/2512.23784v1 Sheaf-theoretic representation of the proteolipid code 2025-12-29T16:25:09Z

Membrane particles such as proteins and lipids organize into zones that perform unique functions. Here, I introduce a topological and category-theoretic framework to represent particle and zone intra-scale interactions and inter-scale coupling. This involves carefully demarcating between different presheaf- or sheaf-assigned data levels to preserve functorial structure and account for particle and zone generalized poses. The framework can accommodate Hamiltonian mechanics, enabling dynamical modeling. This amounts to a versatile mathematical formalism for membrane structure and multiscale coupling.

2025-12-29T16:25:09Z 16 pages, 3 figures Troy A. Kervin

http://arxiv.org/abs/2512.23175v1 HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction 2025-12-29T03:29:54Z

Therapeutic peptides have emerged as a pivotal modality in modern drug discovery, occupying a chemically and topologically rich space. While accurate prediction of their physicochemical properties is essential for accelerating peptide development, existing molecular language models rely on representations that fail to capture this complexity. Atom-level SMILES notation generates long token sequences and obscures cyclic topology, whereas amino-acid-level representations cannot encode the diverse chemical modifications central to modern peptide design. To bridge this representational gap, the Hierarchical Editing Language for Macromolecules (HELM) offers a unified framework enabling precise description of both monomer composition and connectivity, making it a promising foundation for peptide language modeling. Here, we propose HELM-BERT, the first encoder-based peptide language model trained on HELM notation. Based on DeBERTa, HELM-BERT is specifically designed to capture hierarchical dependencies within HELM sequences. The model is pre-trained on a curated corpus of 39,079 chemically diverse peptides spanning linear and cyclic structures. HELM-BERT significantly outperforms state-of-the-art SMILES-based language models in downstream tasks, including cyclic peptide membrane permeability prediction and peptide-protein interaction prediction. These results demonstrate that HELM's explicit monomer- and topology-aware representations offer substantial data-efficiency advantages for modeling therapeutic peptides, bridging a long-standing gap between small-molecule and protein language models.

2025-12-29T03:29:54Z 35 pages; includes Supplementary Information Seungeon Lee Takuto Koyama Itsuki Maeda Shigeyuki Matsumoto Yasushi Okuno

http://arxiv.org/abs/2512.23080v1 QSAR-Guided Generative Framework for the Discovery of Synthetically Viable Odorants 2025-12-28T21:06:01Z

The discovery of novel odorant molecules is key for the fragrance and flavor industries, yet efficiently navigating the vast chemical space to identify structures with desirable olfactory properties remains a significant challenge. Generative artificial intelligence offers a promising approach for de novo molecular design but typically requires large sets of molecules to learn from. To address this problem, we present a framework combining a variational autoencoder (VAE) with a quantitative structure-activity relationship (QSAR) model to generate novel odorants from limited training sets of odor molecules. The self-supervised learning capabilities of the VAE allow it to learn SMILES grammar from the ChEMBL database, while its training objective is augmented with a loss term derived from an external QSAR model to structure the latent representation according to odor probability. While the VAE demonstrated high internal consistency in learning the QSAR supervision signal, validation against an external, unseen ground truth dataset (Unique Good Scents) confirms the model generates syntactically valid structures (100% validity achieved via rejection sampling) and 94.8% unique structures. The latent space is effectively structured by odor likelihood, evidenced by a Fréchet ChemNet Distance (FCD) of approximately 6.96 between generated molecules and known odorants, compared to approximately 21.6 for the ChEMBL baseline. Structural analysis via Bemis-Murcko scaffolds reveals that 74.4% of candidates possess novel core frameworks distinct from the training data, indicating the model performs extensive chemical space exploration beyond simple derivatization of known odorants. Generated candidates display physicochemical properties ....

2025-12-28T21:06:01Z Tim C. Pearce Ahmed Ibrahim

http://arxiv.org/abs/2512.22820v1 Epigenetic state encodes locus-specific chromatin mechanics 2025-12-28T07:15:13Z

Chromatin is repeatedly deformed in vivo during transcription, nuclear remodeling, and confined migration - yet how mechanical response varies from locus to locus, and how it relates to epigenetic state, remains unclear. We develop a theory to infer locus-specific viscoelasticity from three-dimensional genome organization. Using chromatin structures derived from contact maps, we calculate frequency-dependent storage and loss moduli for individual loci and establish that the mechanical properties are determined both by chromatin epigenetic marks and organization. On large length scales, chromatin exhibits Rouse-like viscoelastic scaling, but this coarse behavior masks extensive heterogeneity at the single-locus level. Loci segregate into two mechanical subpopulations with distinct longest relaxation times: one characterized by single-timescale and another by multi-timescale relaxation. The multi-timescale loci are strongly enriched in active marks, and the longest relaxation time for individual loci correlates inversely with effective local stiffness. Pull-release simulations further predict a time-dependent susceptibility: H3K27ac-rich loci deform more under sustained forcing yet can resist brief, large impulses. At finer genomic scales, promoters, enhancers, and gene bodies emerge as "viscoelastic islands" aligned with their focal interactions. Together, these results suggest that chromatin viscoelasticity is an organized, epigenetically coupled property of the 3D genome, providing a mechanistic layer that may influence enhancer-promoter communication, condensate-mediated organization, and response to cellular mechanical stress. The prediction that locus-specific mechanics in chromatin are controlled by 3D structures as well as the epigenetic states is amenable to experimental test.

2025-12-28T07:15:13Z Also available on bioRxiv (doi: 10.64898/2025.12.27.696709) Guang Shi D. Thirumalai

http://arxiv.org/abs/2512.20924v1 Clever Hans in Chemistry: Chemist Style Signals Confound Activity Prediction on Public Benchmarks 2025-12-24T04:04:20Z

Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure-activity relationships. We test this by linking CHEMBL assays to publication authors and training a 1,815-class classifier to predict authors from molecular fingerprints, achieving 60% top-5 accuracy under scaffold-based splitting. We then train an activity model that receives only a protein identifier and an author-probability vector derived from structure, with no direct access to molecular descriptors. This author-only model achieves predictive power comparable to a simple baseline that has access to structure. This reveals a "Clever Hans" failure mode: models can predict bioactivity largely by inferring chemist goals and favorite targets without requiring a lab-independent understanding of chemistry. We analyze the sources of this leakage, propose author-disjoint splits, and recommend dataset practices to decouple chemist intent from biological outcomes.
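The author-disjoint split proposed above can be sketched as a group-wise partition. This is a simplified illustration, not the authors' procedure: it assumes each record carries a single author label (real publications have several authors, which would call for grouping connected author sets), and the records, field names and function name are hypothetical.

```python
import random
from collections import defaultdict

def author_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split records so that no author label appears in both train and test."""
    by_author = defaultdict(list)
    for record in records:
        by_author[record["author"]].append(record)
    authors = sorted(by_author)
    random.Random(seed).shuffle(authors)
    target = test_fraction * len(records)
    train, test = [], []
    for author in authors:
        # Move whole author groups into the test set until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(by_author[author])
    return train, test

# Hypothetical toy records; SMILES strings and lab names are placeholders.
records = [
    {"smiles": "CCO", "author": "lab_a"},
    {"smiles": "CCN", "author": "lab_a"},
    {"smiles": "c1ccccc1", "author": "lab_b"},
    {"smiles": "CC(=O)O", "author": "lab_b"},
    {"smiles": "CCCC", "author": "lab_c"},
    {"smiles": "CCOC", "author": "lab_c"},
]
train, test = author_disjoint_split(records)
train_authors = {r["author"] for r in train}
test_authors = {r["author"] for r in test}
print(sorted(train_authors), sorted(test_authors))
```

Because groups are assigned whole, the test fraction is only approximate, but any signal tied to an individual author's style cannot leak from training to evaluation.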

2025-12-24T04:04:20Z Andrew D. Blevins Ian K. Quigley

http://arxiv.org/abs/2508.12029v3 BConformeR: A Conformer Based on Mutual Sampling for Unified Prediction of Continuous and Discontinuous Antibody Binding Sites 2025-12-23T13:32:11Z

Accurate prediction of antibody-binding sites (epitopes) on antigens is crucial for vaccine design, immunodiagnostics, therapeutic antibody development, antibody engineering, research into autoimmune and allergic diseases, and advancing our understanding of immune responses. Although in silico methods have been proposed to predict both linear (continuous) and conformational (discontinuous) epitopes, they consistently underperform on conformational epitopes. In this work, we propose Conformer-based models trained separately on AlphaFold-predicted structures and experimentally determined structures, leveraging convolutional neural networks (CNNs) to extract local features and Transformers to capture long-range dependencies within antigen sequences. Ablation studies demonstrate that the CNN module enhances the prediction of linear epitopes, and the Transformer module improves the prediction of conformational epitopes. Experimental results show that our model outperforms existing baselines in terms of MCC, ROC-AUC, PR-AUC, and F1 scores on both linear and conformational epitopes.

2025-08-16T12:31:39Z Zhangyu You Jiahao Ma Hongzong Li Ye-Fan Hu Jian-Dong Huang

http://arxiv.org/abs/2512.20263v1 Drug-like antibodies with low immunogenicity in human panels designed with Latent-X2 2025-12-23T11:17:59Z

Drug discovery has long sought computational systems capable of designing drug-like molecules directly: developable and non-immunogenic from the start. Here we introduce Latent-X2, a frontier generative model that achieves this goal through zero-shot design of antibodies with strong binding affinities, drug-like properties, and, for the first time for any de novo generated antibody, confirmed low immunogenicity in human donor panels. Latent-X2 is an all-atom model conditioned on target structure, epitope specification, and optional antibody framework, jointly generating sequences and structures while modelling the bound complex. Testing only 4 to 24 designs per target in each modality, we successfully generated VHH and scFv antibodies against 9 of 18 evaluated targets, achieving a 50% target-level success rate with picomolar to nanomolar binding affinities. Designed molecules exhibit developability profiles that match or exceed those of approved antibody therapeutics, including expression yield, aggregation propensity, polyreactivity, hydrophobicity, and thermal stability, without optimization, filtering, or selection. In the first immunogenicity assessment of any AI-generated antibody, representative de novo VHH binders targeting TNFL9 exhibit both potent target engagement and low immunogenicity across T-cell proliferation and cytokine release assays. The model generalizes beyond antibodies: against K-Ras, long considered undruggable, we generated macrocyclic peptide binders competitive with trillion-scale mRNA display screens. These properties emerge directly from the model, demonstrating the therapeutic viability of zero-shot molecular design, now available without AI infrastructure or coding expertise at https://platform.latentlabs.com.

2025-12-23T11:17:59Z Robin Rombach and Alexander W. R. Nelson contributed to this work as advisors to Latent Labs Latent Labs Team Henry Kenlay Daniella Pretorius Jonathan Crabbé Alex Bridgland Sebastian M. Schmon Agrin Hilmkil James Vuckovic Simon Mathis Tomas Matteson Rebecca Bartke-Croughan Amir Motmaen Robin Rombach Mária Vlachynská Alexander W. R. Nelson David Yuan Annette Obika Simon A. A. Kohl

http://arxiv.org/abs/2512.19939v1 Methods for Analyzing RNA Pseudoknots via Chord Diagrams and Intersection Graphs 2025-12-23T00:13:18Z

RNA molecules are known to form complex secondary structures including pseudoknots. A systematic framework for the enumeration, classification and prediction of secondary structures is critical to determine the biological significance of the molecular configurations of RNA. Chord diagrams are mathematical objects widely used to represent RNA secondary structures and to analyze structural motifs; however, a mathematically rigorous enumeration of pseudoknots remains a challenge. We introduce a method that incorporates a distance-based metric $τ$ to analyze the intersection graph of a chord diagram associated with a pseudoknotted structure. In particular, our method formally defines a pseudoknot in terms of a weighted vertex cover of a certain intersection graph constructed from a partition of the chord diagram representing the nucleotide sequence of the RNA molecule. In this graph-theoretic context, we introduce a rigorous algorithm that enumerates pseudoknots, classifies secondary structures, and is sensitive to three-dimensional topological features. We implement our methods in MATLAB and test the algorithm on pseudoknotted structures from the bpRNA-1m database. Our findings confirm that genus is a robust quantifier of pseudoknot complexity.
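The intersection graph of a chord diagram can be sketched in a few lines: each base pair is a chord, and two chords are adjacent when they cross. The sketch is written in Python rather than the authors' MATLAB, covers only the basic crossing test, and does not reproduce the paper's metric $τ$, its partitioning step, or its weighted vertex cover. The two-bracket dot-bracket parser and all function names are assumptions made for illustration.

```python
from itertools import combinations

def parse_pairs(structure):
    """Base pairs (i, j), i < j, from dot-bracket notation using () and []."""
    closers = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(j)
        elif symbol in closers:
            pairs.append((stacks[closers[symbol]].pop(), j))
    return sorted(pairs)

def chords_cross(p, q):
    """True when chords p and q interleave: a < c < b < d."""
    (a, b), (c, d) = sorted((p, q))
    return a < c < b < d

def intersection_graph(pairs):
    """Edges between every pair of crossing chords."""
    return [(p, q) for p, q in combinations(pairs, 2) if chords_cross(p, q)]

nested_edges = intersection_graph(parse_pairs("((..))"))
knotted_edges = intersection_graph(parse_pairs("(([[))]]"))
print(len(nested_edges), len(knotted_edges))
```

A fully nested structure yields a graph with no edges, while any crossing pair of chords signals a pseudoknotted structure.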

2025-12-23T00:13:18Z 26 pages, 14 figures, 4 tables Rayan Ibrahim Allison H. Moore

http://arxiv.org/abs/2507.18545v2 The unreasonable likelihood of being: origin of life, terraforming, and AI 2025-12-21T12:07:19Z

The origin of life on Earth via the spontaneous emergence of a protocell prior to Darwinian evolution remains a fundamental open question in physics and chemistry. Here, we develop a conceptual framework based on information theory and algorithmic complexity. Using estimates grounded in modern computational models, we evaluate the difficulty of assembling structured biological information under plausible prebiotic conditions. Our results highlight the formidable entropic and informational barriers to forming a viable protocell within the available window of Earth's early history. While the idea of Earth being terraformed by advanced extraterrestrials might violate Occam's razor from within mainstream science, directed panspermia -- originally proposed by Francis Crick and Leslie Orgel -- remains a speculative but logically open alternative. Ultimately, uncovering physical principles for life's spontaneous emergence remains a grand challenge for biological physics.

2025-07-24T16:10:46Z 18 pages, 4 figures Robert G. Endres

http://arxiv.org/abs/2512.17815v1 Structure-Aware Antibody Design with Affinity-Optimized Inverse Folding 2025-12-19T17:20:05Z

Motivation: The clinical efficacy of antibody therapeutics critically depends on high-affinity target engagement, yet laboratory affinity-maturation campaigns are slow and costly. In computational settings, most protein language models (PLMs) are not trained to favor high-affinity antibodies, and existing preference optimization approaches introduce substantial computational overhead without clear affinity gains. Therefore, this work proposes SimBinder-IF, which converts the inverse folding model ESM-IF into an antibody sequence generator by freezing its structure encoder and training only its decoder to prefer experimentally stronger binders through preference optimization. Results: On the 11-assay AbBiBench benchmark, SimBinder-IF achieves a 55 percent relative improvement in mean Spearman correlation between log-likelihood scores and experimentally measured binding affinity compared to vanilla ESM-IF (from 0.264 to 0.410). In zero-shot generalization across four unseen antigen-antibody complexes, the correlation improves by 156 percent (from 0.115 to 0.294). SimBinder-IF also outperforms baselines in top-10 precision for ten-fold or greater affinity improvements. A case study redesigning antibody F045-092 for A/California/04/2009 (pdmH1N1) shows that SimBinder-IF proposes variants with substantially lower predicted binding free energy changes than ESM-IF (mean Delta Delta G -75.16 vs -46.57). Notably, SimBinder-IF trains only about 18 percent of the parameters of the full ESM-IF model, highlighting its parameter efficiency for high-affinity antibody generation.
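The relative improvements quoted above follow directly from the reported Spearman correlations. The short check below reproduces the arithmetic; the helper name is hypothetical, and the inputs are the four values stated in the abstract.

```python
def relative_improvement_percent(baseline, improved):
    """Percentage change of `improved` relative to `baseline`."""
    return 100.0 * (improved - baseline) / baseline

# Mean Spearman correlations as reported in the abstract.
benchmark_gain = relative_improvement_percent(0.264, 0.410)
zero_shot_gain = relative_improvement_percent(0.115, 0.294)
print(f"benchmark: {benchmark_gain:.1f}%  zero-shot: {zero_shot_gain:.1f}%")
```

Both results round to the stated figures of 55 percent and 156 percent.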

2025-12-19T17:20:05Z Xinyan Zhao Yi-Ching Tang Rivaaj Monsia Victor J. Cantu Ashwin Kumar Ramesh Xiaozhong Liu Zhiqiang An Xiaoqian Jiang Yejin Kim

http://arxiv.org/abs/2508.12060v2 Applied causality to infer protein dynamics and kinetics 2025-12-19T12:38:07Z

The use of generative machine learning models, trained on the experimentally resolved structures deposited in the Protein Data Bank, is an attractive approach to sampling conformational ensembles of proteins. However, the ensembles generated by these models lack timescale or causal information. We use the structural ensembles generated from AlphaFold2 at a range of MSA depths to parameterize the potential of mean force of an overdamped, memory-free, coarse-grained Langevin equation. This approach couples the AlphaFold2 ensembles to a causal model, allowing us to estimate the timescales spanned by the ensembles generated at each MSA depth. Performing this analysis on six variants of HIV-1 protease, we confirm an inverse relationship between MSA depth and the timescale of an ensemble's conformational fluctuations. The MSA depth essentially serves as a conformational restraint, and AlphaFold2 is generally able to probe timescales at or below those seen in microsecond-long, unbiased molecular dynamics simulations. We conclude by generalizing this approach to other generative structural ensemble-prediction methods as well as co-folding models, in this case the biologically functional HIV-1 protease dimer.
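As a reminder of what an overdamped, memory-free Langevin equation looks like, a standard one-dimensional form is given below. This is a generic textbook sketch rather than the authors' parameterization: the coarse-grained coordinate $x$, the constant diffusion coefficient $D$, and the harmonic example with stiffness $\kappa$ are all assumptions introduced here. $F(x)$ is the potential of mean force and $\xi(t)$ is Gaussian white noise.

```latex
\frac{dx}{dt} = -\frac{D}{k_B T}\,\frac{dF(x)}{dx} + \sqrt{2D}\,\xi(t),
\qquad
\langle \xi(t)\,\xi(t') \rangle = \delta(t - t')

% Harmonic example, F(x) = \tfrac{1}{2}\kappa x^2:
\langle x(t)\,x(0) \rangle = \frac{k_B T}{\kappa}\, e^{-t/\tau},
\qquad
\tau = \frac{k_B T}{D\,\kappa}
```

In the harmonic case a stiffer well (larger $\kappa$) gives a shorter relaxation time $\tau$, which is the sense in which a tighter conformational restraint corresponds to faster fluctuations.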

2025-08-16T14:25:55Z 24 pages; 8 figures; TOC figure Akashnathan Aranganathan Eric R. Beyerle