http://arxiv.org/abs/2512.17169v1 Application of machine learning to predict food processing level using Open Food Facts 2025-12-19T02:10:59Z

Ultra-processed foods are increasingly linked to health issues like obesity, cardiovascular disease, type 2 diabetes, and mental health disorders due to poor nutritional quality. This study, the first of its kind at this scale, uses machine learning to classify food processing levels (NOVA) based on the Open Food Facts dataset of over 900,000 products. Models including LightGBM, Random Forest, and CatBoost were trained on nutrient concentration data. LightGBM performed best, achieving 80-85% accuracy across different nutrient panels and effectively distinguishing minimally from ultra-processed foods. Exploratory analysis revealed strong associations between higher NOVA classes and lower Nutri-Scores, indicating poorer nutritional quality. Products in NOVA 3 and 4 also had higher carbon footprints and lower Eco-Scores, suggesting greater environmental impact. Allergen analysis identified gluten and milk as common in ultra-processed items, posing risks to sensitive individuals. Categories like Cakes and Snacks were dominant in higher NOVA classes, which also had more additives, highlighting the role of ingredient modification. This study, leveraging the largest dataset of NOVA-labeled products, emphasizes the health, environmental, and allergenic implications of food processing and showcases machine learning's value in scalable classification. A user-friendly web tool is available for NOVA prediction using nutrient data: https://cosylab.iiitd.edu.in/foodlabel/.

2025-12-19T02:10:59Z 27 Pages (22 Pages of Main Manuscript + Supplementary Material), 7 Figures, 1 Table Nalin Arora Aviral Chauhan Siddhant Rana Mahansh Aditya Sumit Bhagat Aditya Kumar Akash Kumar Akanksh Semar Ayush Vikram Singh Ganesh Bagler

http://arxiv.org/abs/2510.00774v2 GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins 2025-12-18T11:13:18Z

While deep learning has revolutionized the prediction of rigid protein structures, modelling the conformational ensembles of Intrinsically Disordered Proteins (IDPs) remains a key frontier. Current AI paradigms present a trade-off: Protein Language Models (PLMs) capture evolutionary statistics but lack explicit physical grounding, while generative models trained to model full ensembles are computationally expensive. In this work we critically assess these limits and propose a path forward. We introduce GeoGraph, a simulation-informed surrogate trained to predict ensemble-averaged statistics of residue-residue contact-map topology directly from sequence. By featurizing coarse-grained molecular dynamics simulations into residue- and sequence-level graph descriptors, we create a robust and information-rich learning target. Our evaluation demonstrates that this approach yields representations that are more predictive of key biophysical properties than existing methods.

2025-10-01T11:13:53Z Accepted at AI4Science and ML4PS NeurIPS Workshops 2025. v2: comparison with Human-IDRome model and link to GitHub added Eoin Quinn Marco Carobene Jean Quentin Sebastien Boyer Miguel Arbesú Oliver Bent

http://arxiv.org/abs/2512.15984v1 Lifting Biomolecular Data Acquisition 2025-12-17T21:30:44Z

One strategy to scale up ML-driven science is to increase wet lab experiments' information density. We present a method based on a neural extension of compressed sensing to function space. We measure the activity of multiple different molecules simultaneously, rather than individually. Then, we deconvolute the molecule-activity map during model training. Co-design of wet lab experiments and learning algorithms provably leads to orders-of-magnitude gains in information density. We demonstrate on antibodies and cell therapies.
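The idea of measuring pooled activities and deconvoluting them during training can be illustrated with a deliberately simplified toy. The sketch below is not the authors' neural method: it assumes an additive pooled readout and a linear activity model, and every name in it (`X`, `A`, `w_true`, the pool sizes) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_molecules, n_features, n_pools = 60, 5, 20

# Hypothetical per-molecule features and a hidden linear activity model.
X = rng.normal(size=(n_molecules, n_features))
w_true = rng.normal(size=n_features)
activity = X @ w_true  # individual activities, never observed directly

# Row p of A marks which molecules are mixed into pool p.
A = (rng.random((n_pools, n_molecules)) < 0.3).astype(float)

# Assumption: the assay reports the summed activity of each pool.
y_pooled = A @ activity

# Fit the model through the pooling operator: y_pooled ~ (A @ X) @ w.
w_hat, *_ = np.linalg.lstsq(A @ X, y_pooled, rcond=None)
max_error = float(np.max(np.abs(w_hat - w_true)))
```

In this noise-free toy, 20 pooled measurements are enough to recover a model that predicts all 60 individual activities, because the model has only 5 parameters. The gains claimed in the abstract rest on the authors' own analysis, not on this sketch.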

2025-12-17T21:30:44Z Eli N. Weinstein Andrei Slabodkin Mattia G. Gollub Kerry Dobbs Xiao-Bing Cui Fang Zhang Kristina Gurung Elizabeth B. Wood

http://arxiv.org/abs/2504.16941v5 Mathematical Insights into Protein Architecture: Persistent Homology and Machine Learning Applied to the Flagellar Motor 2025-12-17T19:20:29Z

We present a machine learning approach that leverages persistent homology to classify bacterial flagellar motors into two functional states: rotated and stalled. By embedding protein structural data into a topological framework, we extract multiscale features from filtered simplicial complexes constructed over atomic coordinates. These topological invariants, specifically persistence diagrams and barcodes, capture critical geometric and connectivity patterns that correlate with motor function. The extracted features are vectorized and integrated into a machine learning pipeline that includes dimensionality reduction and supervised classification. Applied to a curated dataset of experimentally characterized flagellar motors from diverse bacterial species, our model demonstrates high classification accuracy and robustness to structural variation. This approach highlights the power of topological data analysis in revealing functionally relevant patterns beyond the reach of traditional geometric descriptors, offering a novel computational tool for protein function prediction.
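As a rough illustration of how a barcode is extracted from coordinates, the sketch below computes only the zero-dimensional (connected-component) bars of a Vietoris-Rips filtration using union-find. It assumes an edge enters the filtration at a scale equal to its length; the function name `h0_barcode` and the toy coordinates are hypothetical, and the paper's pipeline presumably uses higher-dimensional features and a dedicated library.

```python
from itertools import combinations
import math

def h0_barcode(points):
    """Return death scales of H0 classes in a Vietoris-Rips filtration.

    Every component is born at scale 0. The one component that never
    dies (the infinite bar) is omitted from the result.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one bar ends
            deaths.append(length)
    return deaths

# Three toy "atoms" on a line at x = 0, 1, 3.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
bars = h0_barcode(toy_coords)
```

For the toy input the finite bars end at scales 1.0 and 2.0. A list like this can be vectorized (for example by sorting and padding) before it is passed to a classifier.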

2025-04-08T19:21:44Z Zakaria Lamine Abdelatif Hafid Mohamed Rahouti

http://arxiv.org/abs/2512.15645v1 Machine learning for RNA-targeting drug design 2025-12-17T17:52:14Z

Targeting RNA with small molecules offers significant therapeutic potential. Machine learning could substantially accelerate preclinical drug discovery, from hit identification to lead optimization. Yet a fundamental limitation emerges: drug design machine learning models, tailored for proteins, are not readily applicable to RNAs because of fundamental differences between RNAs and proteins in both structural characteristics and interactions with small molecules. RNA-specific approaches have consequently emerged, primarily focusing on binding site identification and virtual screening. In this review, we comprehensively compare machine learning tools for RNA-targeting drug design according to the tasks they address, their methodology and their relevance in RNA-specific contexts. As open challenges will catalyze new method development, we emphasize the need for standardized, drug design-specific evaluation approaches. We provide clear guidelines to establish these standards along with a benchmark assessing the ability of current machine learning models to predict specific drug-RNA interactions.

2025-12-17T17:52:14Z 21 pages (38 with references and appendix), 4 figures Wissam Karroucha Carlos Oliver Veronique Stoven Vincent Mallet

http://arxiv.org/abs/2512.22160v1 On the comparison of models and experiments in the study of DNA open states: the problem of degrees of freedom 2025-12-16T11:02:16Z

Simple mechanical models of DNA play an important role in studying the dynamics of its open states. The main requirement when developing a DNA model is the correct selection of its effective potentials and parameters based on experimental data. At the same time, various experiments allow us to "see" different types of DNA open states. Consideration of this feature is one of the most important conditions in the development, optimization, and parameterization of any mechanical model. Violation of this condition, i.e., the comparison of incomparable characteristics, leads to critical errors. The present investigation is devoted to the problem of degrees of freedom of DNA bases taken into account in mechanical models. Using the Peyrard-Bishop-Dauxois model as an example, two types of errors in interpreting experimental data when compared with the model are examined. The first one is a mismatch between the open state types in the model and experiment. The second one is an incorrect specification of the "threshold coordinate" of the open state. The concept of the effective total threshold coordinate of the radial separation of DNA strands for registration of opening is introduced. It is shown that correct interpretation of experimental data can actually eliminate discrepancies with theory.

2025-12-16T11:02:16Z 26 pages, 6 figures Alexey S. Shigaev Victor D. Lakhno

http://arxiv.org/abs/2512.01417v2 Active Force Dynamics in Red Blood Cells Under Non-Invasive Optical Tweezers 2025-12-15T09:59:54Z

Red blood cells (RBCs) sustain mechanical stresses associated with microcirculatory flow through ATP-driven plasma membrane flickering. This is an active phenomenon driven by motor proteins that regulate interactions between the spectrin cytoskeleton and the lipid bilayer; it is manifested in RBC shape fluctuations reflecting the cell's mechanical and metabolic state. Yet, direct quantification of the forces and energetic costs underlying this non-equilibrium behavior remains challenging due to the invasiveness of existing techniques. Here, a minimally invasive method that combines bead-free, low-power optical tweezers with high-speed video microscopy was employed to track local membrane forces and displacements in single RBCs during the same time window. This independent dual-channel measurement enabled the construction of a mechano-dynamic phase space for RBCs under different chemical treatments, which allowed metabolic and structural states to be differentiated based on their fluctuation-force signatures. Quantification of mechanical work during flickering demonstrated that membrane softening enhanced fluctuations while elevating energy dissipation. The proposed optical tweezers methodology provides a robust framework for mapping the active mechanics of living cells, enabling precise probing of cellular physiology and detection of biomechanical dysfunction in diseases.

2025-12-01T08:48:15Z Arnau Dorn Clara Luque-Rioja Macarena Calero Diego Herráez-Aguilar Francisco Monroy Niccolò Caselli

http://arxiv.org/abs/2512.12134v1 Modeling Dabrafenib Response Using Multi-Omics Modality Fusion and Protein Network Embeddings Based on Graph Convolutional Networks 2025-12-13T02:00:56Z

Cancer cell response to targeted therapy arises from complex molecular interactions, making single-omics data insufficient for accurate prediction. This study develops a model to predict Dabrafenib sensitivity by integrating multiple omics layers (genomics, transcriptomics, proteomics, epigenomics, and metabolomics) with protein network embeddings generated using Graph Convolutional Networks (GCN). Each modality is encoded into low-dimensional representations through neural network preprocessing. Protein interaction information from STRING is incorporated using GCN to capture biological topology. An attention-based fusion mechanism assigns adaptive weights to each modality according to its relevance. Using GDSC cancer cell line data, the model shows that selective integration of two modalities, especially proteomics and transcriptomics, achieves the best test performance (R² around 0.96), outperforming all single-omics and full multimodal settings. Genomic and epigenomic data were less informative, while proteomic and transcriptomic layers provided stronger phenotypic signals related to MAPK inhibitor activity. These results show that attention-guided multi-omics fusion combined with GCN improves drug response prediction and reveals complementary molecular determinants of Dabrafenib sensitivity. The approach offers a promising computational framework for precision oncology and predictive modeling of targeted therapies.

2025-12-13T02:00:56Z La Ode Aman A Mu'thi Andy Suryadi Dizky Ramadani Putri Papeo Hamsidar Hasan Ariani H Hutuba Netty Ino Ischak Yuszda K. Salimi

http://arxiv.org/abs/2512.11412v1 Task-Specific Sparse Feature Masks for Molecular Toxicity Prediction with Chemical Language Models 2025-12-12T09:41:04Z

Reliable in silico molecular toxicity prediction is a cornerstone of modern drug discovery, offering a scalable alternative to experimental screening. However, the black-box nature of state-of-the-art models remains a significant barrier to adoption, as high-stakes safety decisions demand verifiable structural insights alongside predictive performance. To address this, we propose a novel multi-task learning (MTL) framework designed to jointly enhance accuracy and interpretability. Our architecture integrates a shared chemical language model with task-specific attention modules. By imposing an L1 sparsity penalty on these modules, the framework is constrained to focus on a minimal set of salient molecular fragments for each distinct toxicity endpoint. The resulting framework is trained end-to-end and is readily adaptable to various transformer-based backbones. Evaluated on the ClinTox, SIDER, and Tox21 benchmark datasets, our approach consistently outperforms both single-task and standard MTL baselines. Crucially, the sparse attention weights provide chemically intuitive visualizations that reveal the specific fragments influencing predictions, thereby enhancing insight into the model's decision-making process.
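The role of the L1 penalty can be made concrete with a small objective function. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes binary toxicity labels scored with cross-entropy and one mask vector per task, and the names `sparse_multitask_loss`, `masks`, and `lam` are hypothetical.

```python
import numpy as np

def sparse_multitask_loss(logits, labels, masks, lam=0.01):
    """Mean binary cross-entropy over all tasks plus an L1 penalty on masks.

    logits, labels: arrays of shape (n_samples, n_tasks)
    masks: array of shape (n_tasks, n_tokens), one attention mask per task
    lam: weight of the sparsity penalty (hypothetical default)
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # guards against log(0)
    bce = -np.mean(
        labels * np.log(probs + eps) + (1.0 - labels) * np.log(1.0 - probs + eps)
    )
    l1_penalty = np.sum(np.abs(masks))
    return bce + lam * l1_penalty

# Toy inputs: 4 molecules, 2 toxicity endpoints, 3 tokens per mask.
logits = np.zeros((4, 2))
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
dense_masks = np.ones((2, 3))
sparse_masks = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

loss_dense = sparse_multitask_loss(logits, labels, dense_masks, lam=0.1)
loss_sparse = sparse_multitask_loss(logits, labels, sparse_masks, lam=0.1)
```

With identical predictions, the sparser masks give the lower loss, which is the pressure that pushes each task toward a small set of fragments.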

2025-12-12T09:41:04Z 6 pages, 4 figures Kwun Sy Lee Jiawei Chen Fuk Sheng Ford Chung Tianyu Zhao Zhenyuan Chen Debby D. Wang

http://arxiv.org/abs/2507.02925v3 Large Language Model Agent for Modular Task Execution in Drug Discovery 2025-12-12T03:52:58Z

We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, literature-grounded question answering via retrieval-augmented generation, molecular generation, multi-property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. The agent autonomously retrieved relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answered mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generated chemically diverse seed molecules and predicted 75 properties, including ADMET-related and general physicochemical descriptors, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug-likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.

2025-06-26T00:19:01Z Janghoon Ock Radheesh Sharma Meda Srivathsan Badrinarayanan Neha S. Aluru Achuth Chandrasekhar Amir Barati Farimani

http://arxiv.org/abs/2512.11244v1 Model Reduction of Multicellular Communication Systems via Singular Perturbation: Sender Receiver Systems 2025-12-12T03:06:00Z

We investigate multicellular sender-receiver systems embedded in hydrogel beads, where diffusible signals mediate interactions among heterogeneous cells. Such systems are modeled by PDE-ODE couplings that combine three-dimensional diffusion with nonlinear intracellular dynamics, making analysis and simulation challenging. We show that the diffusion dynamics converges exponentially to a quasi-steady spatial profile and use singular perturbation theory to reduce the model to a finite-dimensional multiagent network. A closed-form communication matrix derived from the spherical Green's function captures the effective sender-receiver coupling. Numerical results show the reduced model closely matches the full dynamics while enabling scalable simulation of large cell populations.
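To give a feel for what a communication matrix built from a Green's function looks like, the sketch below uses the free-space steady-state diffusion kernel 1/(4πD r) for point sources. This is a swapped-in simplification: the abstract refers to a spherical Green's function for the bead geometry, whose boundary conditions are not reproduced here. The function name, the zeroed diagonal, and the cell positions are all assumptions.

```python
import numpy as np

def free_space_coupling(cell_positions, diffusivity):
    """Steady-state point-source kernel G[i, j] = 1 / (4*pi*D*|x_i - x_j|).

    Self-coupling diverges for a point source, so the diagonal is set to 0.
    """
    deltas = cell_positions[:, None, :] - cell_positions[None, :, :]
    dist = np.linalg.norm(deltas, axis=-1)
    G = np.zeros_like(dist)
    off_diag = ~np.eye(len(cell_positions), dtype=bool)
    G[off_diag] = 1.0 / (4.0 * np.pi * diffusivity * dist[off_diag])
    return G

# Three hypothetical cells; only cell 0 secretes signal.
positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
G = free_space_coupling(positions, diffusivity=1.0)
secretion = np.array([1.0, 0.0, 0.0])
signal_at_cells = G @ secretion  # quasi-steady signal seen by each cell
```

Once the diffusion field is replaced by a matrix like `G`, the coupled system reduces to ordinary differential equations for the intracellular states, with `G @ secretion` supplying each receiver's input.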

2025-12-12T03:06:00Z Taishi Kotsuka Enoch Yeung

http://arxiv.org/abs/2512.10515v1 UNAAGI: Atom-Level Diffusion for Generating Non-Canonical Amino Acid Substitutions 2025-12-11T10:35:10Z

Proposing beneficial amino acid substitutions, whether for mutational effect prediction or protein engineering, remains a central challenge in structural biology. Recent inverse folding models, trained to reconstruct sequences from structure, have had considerable impact in identifying functional mutations. However, current approaches are constrained to designing sequences composed exclusively of natural amino acids (NAAs). The larger set of non-canonical amino acids (NCAAs), which offer greater chemical diversity and are frequently used in in vivo protein engineering, remains largely inaccessible to current variant effect prediction methods. To address this gap, we introduce UNAAGI, a diffusion-based generative model that reconstructs residue identities from atomic-level structure using an E(3)-equivariant framework. By modeling side chains in full atomic detail rather than as discrete tokens, UNAAGI enables the exploration of both canonical and non-canonical amino acid substitutions within a unified generative paradigm. We evaluate our method on experimentally benchmarked mutation effect datasets and demonstrate that it achieves substantially improved performance on NCAA substitutions compared to the current state-of-the-art. Furthermore, our results suggest a shared methodological foundation between protein engineering and structure-based drug design, opening the door for a unified training framework across these domains.

2025-12-11T10:35:10Z Han Tang Wouter Boomsma

http://arxiv.org/abs/2512.01976v3 Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design 2025-12-11T00:59:32Z

High-quality training datasets are crucial for the development of effective protein design models, but existing synthetic datasets often include unfavorable sequence-structure pairs, impairing generative model performance. We leverage ProteinMPNN, whose sequences are experimentally favorable as well as amenable to folding, together with structure prediction models to align high-quality synthetic structures with recoverable synthetic sequences. In that way, we create a new dataset designed specifically for training expressive, fully atomistic protein generators. By retraining La-Proteina, which models discrete residue type and side chain structure in a continuous latent space, on this dataset, we achieve new state-of-the-art results, with improvements of +54% in structural diversity and +27% in co-designability. To validate the broad utility of our approach, we further introduce Proteina Atomistica, a unified flow-based framework that jointly learns the distribution of protein backbone structure, discrete sequences, and atomistic side chains without latent variables. We again find that training on our new sequence-structure data dramatically boosts benchmark performance, improving Proteina Atomistica's structural diversity by +73% and co-designability by +5%. Our work highlights the critical importance of aligned sequence-structure data for training high-performance de novo protein design models. Our new dataset, the Consistency Distilled Synthetic Protein Database (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/proteina-atomistica_data/files?version=release), is made available as an open-source resource.

2025-12-01T18:34:16Z Danny Reidenbach Zhonglin Cao Zuobai Zhang Kieran Didi Tomas Geffner Guoqing Zhou Jian Tang Christian Dallago Arash Vahdat Emine Kucukbenli Karsten Kreis

http://arxiv.org/abs/2512.08508v1 Fused Gromov-Wasserstein Contrastive Learning for Effective Enzyme-Reaction Screening 2025-12-09T11:49:24Z

Enzymes are crucial catalysts that enable a wide range of biochemical reactions. Efficiently identifying specific enzymes from vast protein libraries is essential for advancing biocatalysis. Traditional computational methods for enzyme screening and retrieval are time-consuming and resource-intensive. Recently, deep learning approaches have shown promise. However, these methods focus solely on the interaction between enzymes and reactions, overlooking the inherent hierarchical relationships within each domain. To address these limitations, we introduce FGW-CLIP, a novel contrastive learning framework based on optimizing the fused Gromov-Wasserstein distance. FGW-CLIP incorporates multiple alignments, including inter-domain alignment between reactions and enzymes and intra-domain alignment within enzymes and reactions. By introducing a tailored regularization term, our method minimizes the Gromov-Wasserstein distance between enzyme and reaction spaces, which enhances information integration across these domains. Extensive evaluations demonstrate the superiority of FGW-CLIP in challenging enzyme-reaction tasks. On the widely-used EnzymeMap benchmark, FGW-CLIP achieves state-of-the-art performance in enzyme virtual screening, as measured by BEDROC and EF metrics. Moreover, FGW-CLIP consistently outperforms across all three splits of ReactZyme, the largest enzyme-reaction benchmark, demonstrating robust generalization to novel enzymes and reactions. These results position FGW-CLIP as a promising framework for enzyme discovery in complex biochemical settings, with strong adaptability across diverse screening scenarios.
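For readers unfamiliar with the quantity being optimized, the sketch below evaluates the standard fused Gromov-Wasserstein objective (with squared loss on the structure term) for a given coupling. It only scores a coupling and does not optimize one, and it does not show how FGW-CLIP embeds this term in its contrastive loss, which the abstract does not specify. The function name and the toy matrices are hypothetical.

```python
import numpy as np

def fgw_objective(M, C1, C2, T, alpha=0.5):
    """Fused Gromov-Wasserstein cost of a coupling T.

    M:  (n, m) cross-domain feature cost, e.g. enzyme i vs reaction j
    C1: (n, n) pairwise structure within the first domain
    C2: (m, m) pairwise structure within the second domain
    T:  (n, m) coupling whose entries sum to 1
    alpha: trade-off between feature term (0) and structure term (1)
    """
    feature_cost = np.sum(M * T)
    # diff[i, j, k, l] = C1[i, k] - C2[j, l]
    diff = C1[:, None, :, None] - C2[None, :, None, :]
    structure_cost = np.einsum("ijkl,ij,kl->", diff**2, T, T)
    return (1.0 - alpha) * feature_cost + alpha * structure_cost

# Two items per domain, matched one-to-one with equal mass.
M = np.zeros((2, 2))
C1 = np.array([[0.0, 1.0], [1.0, 0.0]])
C2 = np.array([[0.0, 2.0], [2.0, 0.0]])
T = np.eye(2) / 2.0

cost_mismatched = fgw_objective(M, C1, C2, T, alpha=0.5)
cost_identical = fgw_objective(M, C1, C1, T, alpha=0.5)
```

The structure term penalizes couplings that map pairs which are close in one domain onto pairs which are far apart in the other, so it vanishes when the two intra-domain geometries agree under the coupling.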

2025-12-09T11:49:24Z Gengmo Zhou Feng Yu Wenda Wang Zhifeng Gao Guolin Ke Zhewei Wei Zhen Wang

http://arxiv.org/abs/2512.07692v2 Mapping Still Matters: Coarse-Graining with Machine Learning Potentials 2025-12-09T10:32:38Z

Coarse-grained (CG) modeling enables molecular simulations to reach time and length scales inaccessible to fully atomistic methods. For classical CG models, the choice of mapping, that is, how atoms are grouped into CG sites, is a major determinant of accuracy and transferability. At the same time, the emergence of machine learning potentials (MLPs) offers new opportunities to build CG models that can in principle learn the true potential of the mean force for any mapping. In this work, we systematically investigate how the choice of mapping influences the representations learned by equivariant MLPs by studying liquid hexane, amino acids, and polyalanine. We find that when the length scales of bonded and nonbonded interactions overlap, unphysical bond permutations can occur. We also demonstrate that correctly encoding species and maintaining stereochemistry are crucial, as neglecting either introduces unphysical symmetries. Our findings provide practical guidance for selecting CG mappings compatible with modern architectures and guide the development of transferable CG models.
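A CG mapping in the sense used here, an assignment of atoms to CG sites, can be written as a matrix acting on atomic positions. The sketch below assumes center-of-mass mapping and a toy chain of six carbon-like atoms on a line; the function name `cg_map` and the two example mappings are hypothetical and are not taken from the paper.

```python
import numpy as np

def cg_map(positions, masses, site_of_atom, n_sites):
    """Map atom positions to CG site positions by center of mass.

    positions: (n_atoms, 3) atomic coordinates
    masses: (n_atoms,) atomic masses
    site_of_atom: length n_atoms, index of the CG site each atom belongs to
    """
    n_atoms = len(masses)
    mapping = np.zeros((n_sites, n_atoms))
    mapping[site_of_atom, np.arange(n_atoms)] = masses
    mapping /= mapping.sum(axis=1, keepdims=True)  # rows become mass fractions
    return mapping @ positions

# Toy hexane-like backbone: six equal-mass atoms at x = 0..5.
positions = np.zeros((6, 3))
positions[:, 0] = np.arange(6.0)
masses = np.full(6, 12.011)

two_site = cg_map(positions, masses, [0, 0, 0, 1, 1, 1], n_sites=2)
three_site = cg_map(positions, masses, [0, 0, 1, 1, 2, 2], n_sites=3)
```

The same atomistic configuration yields site spacings of 3 units under the two-site mapping and 2 units under the three-site mapping, which shows how the mapping choice alone changes the length scales a CG model must represent.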

2025-12-08T16:31:57Z Franz Görlich Julija Zavadlav