An Evolutionary Approach for Designing Stable and Highly Expressible Low-Immunogenicity Therapeutic mRNA Sequences

2026-05-27T05:20:17Z

Therapeutic messenger RNA (mRNA) sequences require optimized design to ensure efficient translation, structural stability, and minimal immunogenicity. This study presents a two-stage in-silico framework that integrates deep learning and evolutionary computation for rational mRNA optimization, as an alternative to existing state-of-the-art models. In the first stage, a pretrained CodonTransformer (a BERT-like large language model) generates biologically coherent mRNA sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves these candidate sequences through codon-aware crossover and synonymous mutation guided by human codon usage preferences. The fitness function combines translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global MFE via RNAfold, GC content), and reduced immunogenicity (CpG/UpA motif frequency). Over successive generations (38th, 40th, and 42nd), the GA improved CAI and tAI by over 6%, reaching CAI values of 0.73 to 0.74 and tAI values of 0.63 to 0.64, while codon-pair bias remained high and consistent (0.97). Ribosomal accessibility at the 5' end also improved, with the unpaired_30 fraction reaching 0.87. Global minimum free energy (MFE) converged to a balanced range of -346 to -356 kcal/mol, achieving approximately 84% base-paired structural stability, and immune-stimulatory motifs were reduced, lowering the average immune penalty to 27.3 in the final generation. Whereas Linear Design produces hyper-stable transcripts (MFE < -2000 kcal/mol) that risk translation inefficiency due to extreme rigidity, and BiLSTM-CRF focuses solely on high CAI (0.96 to 0.98) without structural constraints, our framework achieves an optimal translation-stability equilibrium, highlighting the proposed BERT-GA framework as an effective, data-driven approach for the in-silico design and optimization of mRNA sequences.
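The two GA ingredients named in this abstract, a CAI-style fitness term and synonymous mutation, can be illustrated with a minimal sketch. The codon weights below are hypothetical toy values (not a real human codon usage table), only three amino acids are covered, and the function names are invented for illustration; the abstract does not describe the authors' actual implementation.

```python
import math
import random

# Hypothetical relative-adaptiveness weights: codon frequency divided by the
# frequency of the most-used synonymous codon. Toy values, NOT real human data.
TOY_WEIGHTS = {
    "GCC": 1.0, "GCU": 0.7, "GCA": 0.6, "GCG": 0.3,  # Ala
    "AAG": 1.0, "AAA": 0.8,                          # Lys
    "GAG": 1.0, "GAA": 0.7,                          # Glu
}
SYNONYMS = {
    "A": ["GCC", "GCU", "GCA", "GCG"],
    "K": ["AAG", "AAA"],
    "E": ["GAG", "GAA"],
}
CODON_TO_AA = {c: aa for aa, codons in SYNONYMS.items() for c in codons}


def cai(codons):
    """Codon adaptation index: geometric mean of the per-codon weights."""
    return math.exp(sum(math.log(TOY_WEIGHTS[c]) for c in codons) / len(codons))


def translate(codons):
    """Map a codon list to its one-letter amino acid string."""
    return "".join(CODON_TO_AA[c] for c in codons)


def synonymous_mutation(codons, rng):
    """Swap one randomly chosen codon for a different codon of the same amino acid."""
    i = rng.randrange(len(codons))
    aa = CODON_TO_AA[codons[i]]
    choices = [c for c in SYNONYMS[aa] if c != codons[i]]
    mutated = list(codons)
    mutated[i] = rng.choice(choices)
    return mutated
```

Because every mutation stays within a synonymous codon family, the encoded protein is unchanged while the CAI term is free to move, which is the property a codon-aware GA relies on.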

Computational Modeling of Antibody-Antigen Complexes: PLM-Based and MSA-Based Approaches

2026-05-27T03:43:52Z

Antibodies play a central role in the immune response by specifically recognizing and neutralizing antigens, and therapeutic antibodies have become major drugs for cancer and autoimmune diseases. However, their discovery still relies on extensive in vitro screening, and accurate computational modeling of antibody structures and antibody-antigen interactions can prioritize candidates, reduce experimental burden, and accelerate rational design. Despite recent advances in high-accuracy protein and complex prediction, a persistent performance gap remains for antibody-related tasks compared with general protein-protein interactions, limiting downstream design. This thesis investigates why antibody-related tasks are harder and proposes improvements along two complementary directions. First, we investigate protein language model (PLM)-based methods for antibody and antibody-antigen structure prediction. Using embeddings from multiple PLMs, our approach achieves the best CDR-H3 accuracy among compared PLM-based methods on antibody monomer prediction. Extending it to complex prediction does not generalize: without co-evolutionary signals between antibody and antigen, single-sequence PLM representations do not reliably identify binding interfaces. Second, we develop two MSA-based interventions for antibody-antigen complex prediction: MSA refinement, which combines CDR-focused filtering with depth recovery from a larger sequence database, and convergence-aware recycling, which selects a stable intermediate recycle state for final diffusion sampling. Together, these interventions provide consistent gains over the AlphaFold3 baseline on a held-out antibody-antigen test set. Because the methods modify MSA construction and recycling behavior rather than model parameters, they apply without retraining or weight access.

From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction: An Ablation Study with Acetylsalicylic Acid Validation

2026-05-27T02:22:44Z

Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that interaction (multi-class classification). This study presents a systematic ablation study of three Graph Neural Network (GNN) architectures for drug-drug interaction (DDI) prediction on a publicly available benchmark dataset comprising 38,337 positive pairs across 86 interaction types. Three architectures are compared under identical training conditions (n = 61,339 pairs): a siamese dual Message Passing Neural Network (MPNN) with concatenation (Concat), a dual MPNN with four-head cross-attention (CrossAtt), and a ternary MPNN incorporating an interaction graph (Ternary). CrossAtt improves multi-class F1-macro by +0.186 absolute (+45%) over Concat, while improving binary AUC by only +0.012 (+1.3%), confirming that atom-level inter-molecular communication specifically enables mechanism-type classification. The ternary architecture underperforms despite equivalent training data, with its failure consistent with a training instability hypothesis. Validation on ten acetylsalicylic acid (ASA) drug pairs, held out prior to training, demonstrates 10/10 correct DDI-type predictions for CrossAtt versus 0/10 for Ternary. Two consistent failure cases are identified across all architectures, linking to structural limits established in a companion toxicity study.

La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching

2026-05-27T00:51:14Z

Recently, many generative models for de novo protein structure design have emerged. Yet, only a few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina's scalability and robustness.

Cycle Based Computational Pipeline for Extracting Instantaneous Whisking Frequency

2026-05-26T18:45:33Z

Whisking is a rhythmic and adaptive behavior that rodents use to probe and interact with their environment, and the frequency of movement reflects both sensorimotor processing and internal brain states. A robust and traditional method of whisker frequency estimation uses power spectral analysis of whisker position spanning several cycles. To improve the temporal resolution of whisker movement, we here estimate the period for each cycle, hence indirectly extracting an instantaneous frequency. We do this using markerless estimation of whisker position and identifying the peak and trough for each cycle. The cycle period is extracted, and artifacts are rejected with a ripple exclusion validator based on peak prominence and sequential amplitude filtering. The method is compared with power spectral estimation, using the Fourier transform of a temporal window of 0.5 seconds. We find that frequency estimation using a fixed window does not capture transient variability, while the cycle-by-cycle method recovers higher, time-resolved frequencies. The cycle-by-cycle approach also reveals the expected cycle-level variability. Artifact rejection through subsequence filtering removed spurious frequencies above 30 Hz, aligning refined frequencies with established physiological bounds (4 to 28 Hz). This pipeline provides an alternative solution for real-time-compatible frequency estimation, which better captures temporal variation at the expense of precision in frequency estimation.
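A cycle-by-cycle estimate of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a simple hysteresis-based peak/trough detector stands in for the prominence and sequential-amplitude validator described above, and the function names and the `delta` threshold are hypothetical.

```python
def detect_cycles(signal, delta):
    """Alternately confirm peaks and troughs with hysteresis.

    A peak is confirmed once the trace falls `delta` below the running maximum,
    and a trough once it rises `delta` above the running minimum, so ripples
    smaller than `delta` cannot start a new cycle.
    """
    peaks, troughs = [], []
    max_val, max_idx = signal[0], 0
    min_val, min_idx = signal[0], 0
    looking_for_peak = True
    for i, x in enumerate(signal):
        if x > max_val:
            max_val, max_idx = x, i
        if x < min_val:
            min_val, min_idx = x, i
        if looking_for_peak:
            if x < max_val - delta:
                peaks.append(max_idx)
                min_val, min_idx = x, i  # start searching for the next trough
                looking_for_peak = False
        elif x > min_val + delta:
            troughs.append(min_idx)
            max_val, max_idx = x, i  # start searching for the next peak
            looking_for_peak = True
    return peaks, troughs


def instantaneous_frequency(peaks, sampling_rate_hz):
    """One frequency per cycle: sampling rate divided by the peak-to-peak interval."""
    return [sampling_rate_hz / (b - a) for a, b in zip(peaks, peaks[1:])]
```

Each returned value describes a single cycle, so the temporal resolution is one whisk rather than a fixed analysis window; the cost is that each estimate rests on only two peak locations.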

Mechanistic Interpretability of Antibody Language Models Using SAEs

2026-05-26T15:50:47Z

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models and to steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
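For readers unfamiliar with the TopK variant, the sketch below shows the core idea on plain Python lists: encode, keep only the k largest pre-activations, decode. The weight layout, function names, and the `steer` helper are assumptions made for illustration (real implementations use a tensor library and may also apply a ReLU); nothing here reproduces the models studied in the abstract.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and set every other latent to zero."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x, sparsify with TopK, then reconstruct x from the sparse latents."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    latents = topk_activation(pre, k)
    recon = [r + b for r, b in zip(matvec(w_dec, latents), b_dec)]
    return latents, recon


def steer(latents, feature, value, w_dec, b_dec):
    """Clamp one latent feature to `value` and decode the edited representation."""
    edited = list(latents)
    edited[feature] = value
    return [r + b for r, b in zip(matvec(w_dec, edited), b_dec)]
```

Steering in this sense means overwriting a latent and decoding; whether the edited representation actually changes model output is the causal question the abstract distinguishes from mere feature-concept correlation.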

$\textit{BlockFormer}$ : Transformer-based inference from interaction maps

2026-05-26T12:41:09Z

Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques -- notably Hi-C -- can be formulated as a generic inverse problem: infer a set of parameters given a map summarizing pairwise interactions between entities through blocks of variable numbers and sizes. In this work, we introduce a data-driven approach that leverages shared structure between these maps, such as global alignment between localized patterns, while handling the variability in number and size of entities arising in real-world data. Our approach relies on a transformer architecture capable of handling such variability and a custom simulator to generate abundant, yet computationally cheap synthetic data for training. Applied to the problem of centromere localization, the method accurately recovers their genomic positions across a wide range of species of various genome sizes.

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

2026-05-26T08:29:36Z

Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS accounts for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git
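The UCB-based selection step can be illustrated with a generic upper-confidence-bound rule over an ensemble of proxy models. This is a minimal sketch, assuming the common form "ensemble mean plus beta times ensemble standard deviation"; the function names and `beta` are hypothetical, the alanine-scan fitness score is omitted, and the SILO repository should be consulted for the actual scoring.

```python
import statistics


def ucb_scores(ensemble_predictions, beta):
    """Score each candidate by ensemble mean plus beta times ensemble spread.

    ensemble_predictions[m][i] is the prediction of proxy model m for candidate i.
    """
    n_candidates = len(ensemble_predictions[0])
    scores = []
    for i in range(n_candidates):
        preds = [model[i] for model in ensemble_predictions]
        scores.append(statistics.mean(preds) + beta * statistics.pstdev(preds))
    return scores


def select_for_oracle(candidates, scores, budget):
    """Return the `budget` highest-scoring candidates for oracle evaluation."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:budget]]
```

A larger `beta` favors candidates on which the proxies disagree, spending oracle calls on exploration; `beta = 0` reduces the rule to ranking by the ensemble mean.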

AI-Driven SERS for Non-invasive and Label-Free Extracellular Vesicle Detection Across Cellular Origins in Tears and Sweat

2026-05-25T06:17:24Z

Wearable sensing technology capable of point-of-care, continuous, and non-invasive analysis of exosomes in biofluids such as tears and sweat is an essential part of future personalized medicine. Major methods for detecting and identifying cell-secreted extracellular vesicles (EVs) often require labeling and are time-consuming, resulting in low efficiency in EV mechanism research and disease diagnosis. While label-free surface-enhanced Raman spectroscopy (SERS) has been combined with deep learning models for EV identification in blood, its application to non-invasive detection of EVs in tears and sweat is missing. Here, we fill this gap by developing an artificial intelligence (AI)-assisted SERS method based on salt-induced nanoparticle aggregation for fast EV identification in tears and sweat with high accuracy. Significantly, our label-free detection and AI differentiation of EVs from 6 cell lines (HepG2, Hela, 143B, LO-2, BMSC, H8) achieved the identification of EVs in tear fluids from 7 different disease sources with accuracies >92%. Our results showed that this platform can not only distinguish EVs from multiple cell sources but also generate highly reproducible and selective EV signals in tear fluids without the need for chemical labeling or separation steps. Molecular dynamics simulations revealed that silver atoms (Ag) form electrostatic interactions with oxygen atoms of multiple amino acid residues in proteins, suggesting a high affinity. This strategy realizes ultra-sensitive and anti-interference detection of EVs, offering a new approach to the rapid diagnosis of clinical diseases.

FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning

2026-05-25T05:20:26Z

Molecular representation learning methods typically tokenize molecules as individual atoms or use rigid, rule-based fragment decompositions, limiting their ability to capture meaningful chemical substructure context. We introduce FragmentNet, a graph-to-sequence model built around a novel adaptive, learned tokenizer that decomposes molecular graphs into chemically valid fragments of adjustable granularity, complemented by chemically aware spatial positional encodings that preserve molecular topology in the resulting sequence. Extending masked pre-training strategies from natural language processing to the molecular domain, we mask and reconstruct molecules at the level of chemically meaningful fragments rather than individual atoms. Evaluating across multiple property prediction benchmarks, we find that pre-training at fragment granularity leads to improved downstream performance on the majority of tasks, demonstrating that tokenization granularity is an important design choice for molecular representation learning.

ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks

2026-05-25T03:31:46Z

Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive and large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models across two critical dimensions: biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three critical conclusions. Firstly, NFMs exhibit a performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Secondly, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Thirdly, controlled ablation studies reveal that taxonomic diversity in pretraining data outweighs parameter scale. Specifically, a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at https://github.com/QIANJINYDX/ViroBench.

Accelerated Simulation Algorithms for Extreme First-Passage Problems with General Emission Profiles

2026-05-24T23:17:38Z

Fastest arrival events, where the first among many diffusing particles reaches a target, are central in triggering signal initiation in molecular stochastic systems. Classical approaches to simulate such events rely on full trajectory generation of all particles, leading to prohibitive computational costs in the large particle number regime. In this work, we present a general simulation framework for efficiently generating order statistics of arrival times by exploiting asymptotic first-passage distributions. This framework applies to diffusion processes in bounded domains with localized absorbing targets, for which short-time first-passage asymptotics are available, such as Brownian motion in dimensions one, two, and three. Starting with the case of instantaneous emission, we derive and implement a recursive inverse transform algorithm to simulate the first $k$ arrivals without tracking particle trajectories. We extend this algorithm to time-dependent emission profiles via an iterative approach, enabling the simulation of extreme statistics in systems with temporal injection, ranging from rapid to prolonged emission. Additionally, we provide asymptotic estimates of the mean fastest arrival time. To conclude, the present acceleration algorithm which bypasses Brownian simulations of trajectories can be used for spatial reaction networks, rare event detection, or diffusion-controlled activation.
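The idea of drawing the first $k$ arrival times without simulating trajectories can be sketched with the standard recursion for uniform order statistics followed by an inverse CDF. As a stand-in for the short-time asymptotics used in the paper, this example assumes the exact first-passage law of one-dimensional Brownian motion to a point at distance `x0`, $F(t) = \mathrm{erfc}\left(x_0 / \sqrt{4Dt}\right)$, with instantaneous emission; the function names are hypothetical and the sketch is intended for large particle numbers.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def fpt_cdf(t, x0, D):
    """First-passage CDF of 1D Brownian motion to a target at distance x0."""
    return math.erfc(x0 / math.sqrt(4.0 * D * t))


def fpt_inverse_cdf(u, x0, D):
    """Invert fpt_cdf using erfc(z) = u  <=>  z = -Phi^{-1}(u / 2) / sqrt(2)."""
    z = -_STD_NORMAL.inv_cdf(u / 2.0) / math.sqrt(2.0)
    return x0 * x0 / (4.0 * D * z * z)


def first_k_arrivals(n_particles, k, x0, D, rng):
    """Sample the k fastest of n_particles i.i.d. arrival times (k <= n_particles)."""
    times = []
    log_survival = 0.0  # log(1 - U_(j)), with U_(j) the j-th uniform order statistic
    for j in range(k):
        v = rng.random()
        while v == 0.0:  # keep v strictly inside (0, 1) so log(v) is finite
            v = rng.random()
        # The next order statistic is the minimum over the n_particles - j
        # particles that have not arrived yet: S_(j+1) = S_(j) * v**(1 / (n - j)).
        log_survival += math.log(v) / (n_particles - j)
        u = -math.expm1(log_survival)
        times.append(fpt_inverse_cdf(u, x0, D))
    return times
```

The cost is $O(k)$ random draws regardless of the particle number, whereas trajectory-based simulation scales with the number of particles and time steps.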

Automated multi-dataset INST $^{13}$C metabolic flux analysis at microliter scale reveals robust fluxes but variable metabolite pools in $Corynebacterium~glutamicum$

2026-05-24T16:27:40Z

Isotopically non-stationary metabolic flux analysis (INST $^{13}$C-MFA) provides unique insights into cellular physiology but is typically limited by low throughput and high experimental costs. Here, we present a miniaturized and automated workflow that integrates transient isotope labeling experiments with advanced computational modeling to enable parallel INST $^{13}$C-MFA at microliter scale. The approach is demonstrated for an evolved $Corynebacterium~glutamicum$ strain capable of efficient growth on ethanol, a substrate for which isotopically stationary $^{13}$C-MFA is inherently limited due to low labeling diversity. Using robotic liquid handling, rapid hot isopropanol quenching, and LC-QToF-MS-based analytics, highly informative datasets were generated from parallel 48-well experiments with different ethanol tracers. Multi-dataset INST $^{13}$C-MFA unlocked joint estimation of intracellular fluxes and metabolite pool sizes and significantly improved flux precision compared to single-dataset analyses. While net fluxes were robust across datasets, pool size estimates exhibited variability and did not converge under joint inference, highlighting a fundamental methodological difference from single-dataset INST $^{13}$C-MFA. The resulting multi-dataset flux map reveals a central role of the glyoxylate shunt during growth on ethanol, consistent with metabolic adaptation to C2-based substrate utilization. Overall, this work demonstrates that automated multi-dataset INST $^{13}$C-MFA is technically feasible and provides high-quality flux analysis at a fraction of the cost of conventional lab-scale bioreactor-based approaches. The presented workflow establishes a scalable framework for high-throughput quantitative fluxomics in microbial biotechnology and supports integration into iterative strain engineering and biofoundry pipelines.

Systemic physiological cliff at menopause revealed by temporal deconvolution of 300 million lab tests: a multi-cohort retrospective study

2026-05-24T16:19:26Z

Menopause fundamentally reshapes female physiology, yet current understanding is limited by small longitudinal cohorts that characterize it as a gradual transition. Large-scale biomedical datasets remain underutilized because the age of the final menstrual period (FMP) is rarely recorded. Here, we present a computational framework that leverages cross-sectional data to reconstruct systemic physiology as a function of time relative to FMP. We adapted a deconvolution framework from astronomy to recover systemic biological trajectories by deconvolving the population distribution of FMP age from chronological data. Applying this to two national cohorts with 300 million laboratory tests from 1.3 million females, we transformed cross-sectional measurements into high-resolution timelines anchored to the FMP. Our analysis reveals a step-like physiological cliff at the FMP across endocrine, skeletal, hepatic, renal, inflammatory, and lipid systems. These discontinuities are absent in males and highly concordant across independent populations. We demonstrate that systemic dysregulation begins over a decade prior to FMP, significantly expanding the window for preventive intervention. Furthermore, hormone replacement therapy (HRT) appears to markedly attenuate these abrupt physiological shifts. These findings further support a systemic and quantifiable view of the menopausal transition and provide a generalizable strategy for recovering hidden biological trajectories from human datasets, applicable to other life stages such as puberty or disease progression.

Trans-dimensional Bayesian model averaging for $^{13}$C-based metabolic flux analysis: Evidence-based flux inference under structural model uncertainty

2026-05-24T13:42:07Z

Accurate quantification of intracellular metabolic fluxes is central to systems biology and biotechnology. Flux estimation relies on biochemical network models, with $^{13}$C metabolic flux analysis (MFA) being the state-of-the-art approach. However, isotope labeling data are often insufficient to uniquely support a single network formulation. In such cases, flux estimates become model-dependent, highlighting the need for methods that explicitly account for structural uncertainty. Bayesian model averaging (BMA) provides a principled framework for this purpose, but its application to $^{13}$C-MFA has so far been restricted to uncertainty in reaction bidirectionality within fixed network topologies. We introduce a scalable Bayesian inference framework for $^{13}$C-MFA, Bayesian model set averaging, that applies BMA to encompass uncertainty in reactions and pathways. Our approach combines reversible jump Markov chain Monte Carlo for trans-dimensional exploration of model spaces with diffusive nested sampling for robust estimation of model evidences, enabling averaging over large families of metabolic network models. Using illustrative and application-scale synthetic case studies, we demonstrate that the method yields robust flux estimates, reveals when multiple network configurations are statistically indistinguishable, and recovers data-supported model structures. Importantly, rather than committing to a single model, the framework manages structural uncertainty: under limited data, competing models are retained, whereas increasing data informativeness improves model and flux recovery. The approach scales to billions of model variants, providing a practical foundation for uncertainty- and misspecification-aware quantitative flux inference in $^{13}$C-MFA.
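The averaging step itself is simple once model evidences are available. The sketch below shows only that identity: posterior model probabilities proportional to prior times evidence, and a flux estimate averaged under those probabilities. It assumes the log evidences have already been estimated (the abstract uses diffusive nested sampling and reversible jump MCMC for that part, which is not reproduced here), and the function names are hypothetical.

```python
import math


def posterior_model_probs(log_evidences, log_priors=None):
    """Posterior model probabilities p(M_i | data) from log evidences and log priors."""
    if log_priors is None:
        log_priors = [0.0] * len(log_evidences)  # uniform prior over models
    logs = [le + lp for le, lp in zip(log_evidences, log_priors)]
    shift = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(value - shift) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def bma_mean(model_probs, per_model_means):
    """Model-averaged posterior mean of one flux across candidate networks."""
    return sum(p * m for p, m in zip(model_probs, per_model_means))
```

When two network variants have nearly equal evidence, both keep substantial weight and the averaged flux reflects that ambiguity instead of committing to one topology.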