https://arxiv.org/api/mW1hAhjPdAeYk5ANHNyfzRyfTXg
2026-06-21T17:21:20Z
1325815015

http://arxiv.org/abs/2605.27986v1
An Evolutionary Approach for Designing Stable and Highly Expressible Low-Immunogenicity Therapeutic mRNA Sequences
2026-05-27T05:20:17Z
Messenger RNA (mRNA) sequences as therapeutics require optimized design to ensure efficient translation, structural stability, and minimal immunogenicity. This study presents a two-stage in-silico framework that integrates deep learning and evolutionary computation for rational mRNA optimization as an alternative to existing state-of-the-art models. In the first stage, a pretrained CodonTransformer (BERT-like Large Language Model) generates biologically coherent mRNA sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves these candidate sequences through codon-aware crossover and synonymous mutation guided by human codon usage preferences. The fitness function combines translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global MFE via RNAfold, GC content), and reduced immunogenicity (CpG/UpA motif frequency). Over successive generations (38th, 40th, and 42nd), the GA improved CAI and tAI by over 6% (reaching CAI values of 0.73 to 0.74 and tAI values of 0.63 to 0.64), kept codon-pair bias high and consistent (0.97), and improved ribosomal accessibility at the 5' end, with an unpaired_30 fraction reaching 0.87. Global minimum free energy (MFE) converged to a balanced range of -346 to -356 kcal/mol, corresponding to approximately 84% base pairing, and immune-stimulatory motifs were reduced, lowering the average immune penalty to 27.3 in the final generation.
Whereas Linear Design produces hyper-stable transcripts (MFE < -2000 kcal/mol) that risk translation inefficiency due to extreme rigidity, and BiLSTM-CRF focuses solely on high CAI (0.96 to 0.98) without structural constraints, our framework achieves an optimal translation-stability equilibrium, highlighting the proposed BERT-GA framework as an effective, data-driven approach for the design and optimization of in-silico mRNA sequences.
2026-05-27T05:20:17Z
Dhawa Sang Dong, Mausam Gurung, Suraj Kandel

http://arxiv.org/abs/2605.28886v1
Computational Modeling of Antibody-Antigen Complexes: PLM-Based and MSA-Based Approaches
2026-05-27T03:43:52Z
Antibodies play a central role in the immune response by specifically recognizing and neutralizing antigens, and therapeutic antibodies have become major drugs for cancer and autoimmune diseases. However, their discovery still relies on extensive in vitro screening, and accurate computational modeling of antibody structures and antibody-antigen interactions can prioritize candidates, reduce experimental burden, and accelerate rational design. Despite recent advances in high-accuracy protein and complex prediction, a persistent performance gap remains for antibody-related tasks compared with general protein-protein interactions, limiting downstream design.
This thesis investigates why antibody-related tasks are harder and proposes improvements along two complementary directions. First, we investigate protein language model (PLM)-based methods for antibody and antibody-antigen structure prediction. Using embeddings from multiple PLMs, our approach achieves the best CDR-H3 accuracy among compared PLM-based methods on antibody monomer prediction. Extending it to complex prediction does not generalize: without co-evolutionary signals between antibody and antigen, single-sequence PLM representations do not reliably identify binding interfaces.
Second, we develop two MSA-based interventions for antibody-antigen complex prediction: MSA refinement, which combines CDR-focused filtering with depth recovery from a larger sequence database, and convergence-aware recycling, which selects a stable intermediate recycle state for final diffusion sampling. Together, these interventions provide consistent gains over the AlphaFold3 baseline on a held-out antibody-antigen test set. Because the methods modify MSA construction and recycling behavior rather than model parameters, they apply without retraining or weight access.
2026-05-27T03:43:52Z
PhD thesis
Xiao Luo

http://arxiv.org/abs/2605.27861v1
From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction: An Ablation Study with Acetylsalicylic Acid Validation
2026-05-27T02:22:44Z
Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that interaction (multi-class classification). This study presents a systematic ablation study of three Graph Neural Network (GNN) architectures for drug-drug interaction (DDI) prediction on a publicly available benchmark dataset comprising 38,337 positive pairs across 86 interaction types. Three architectures are compared under identical training conditions (n = 61,339 pairs): a siamese dual Message Passing Neural Network (MPNN) with concatenation (Concat), a dual MPNN with four-head cross-attention (CrossAtt), and a ternary MPNN incorporating an interaction graph (Ternary). CrossAtt improves multi-class F1-macro by +0.186 absolute (+45%) over Concat, while improving binary AUC by only +0.012 (+1.3%), confirming that atom-level inter-molecular communication specifically enables mechanism-type classification. The ternary architecture underperforms despite equivalent training data, with its failure consistent with a training instability hypothesis.
Validation on ten acetylsalicylic acid (ASA) drug pairs, held out prior to training, demonstrates 10/10 correct DDI-type predictions for CrossAtt versus 0/10 for Ternary. Two consistent failure cases are identified across all architectures, linking to structural limits established in a companion toxicity study.
2026-05-27T02:22:44Z
12 pages, 1 figure
Juergen Dietrich

http://arxiv.org/abs/2507.09466v2
La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching
2026-05-27T00:51:14Z
Recently, many generative models for de novo protein structure design have emerged. Yet, only a few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks.
Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina's scalability and robustness.
2025-07-13T03:01:50Z
Tomas Geffner, Kieran Didi, Zhonglin Cao, Danny Reidenbach, Zuobai Zhang, Christian Dallago, Emine Kucukbenli, Karsten Kreis, Arash Vahdat

http://arxiv.org/abs/2605.27573v1
Cycle Based Computational Pipeline for Extracting Instantaneous Whisking Frequency
2026-05-26T18:45:33Z
Whisking is a rhythmic and adaptive behavior that rodents use to probe and interact with their environment, and the frequency of movement reflects both sensorimotor processing and internal brain states. A robust and traditional method of whisker frequency estimation uses power spectral analysis of whisker position spanning several cycles. To improve the temporal resolution of whisker movement, we here estimate the period for each cycle, hence indirectly extracting an instantaneous frequency. We do this using markerless estimation of whisker position and identifying the peak and trough for each cycle. The cycle period is extracted, and artifacts are rejected with a ripple exclusion validator based on peak prominence and sequential amplitude filtering. The method is compared with power spectral estimation, using the Fourier transform of a temporal window of 0.5 seconds. We find that frequency estimation using a fixed window does not capture transient variability, while the cycle-by-cycle method recovers higher, time-resolved frequencies. The cycle-by-cycle approach also reveals the expected cycle-level variability. Artifact rejection through subsequence filtering removed spurious frequencies above 30 Hz, aligning refined frequencies with established physiological bounds (4 to 28 Hz).
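The cycle-by-cycle idea in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it measures peak-to-peak periods only (the abstract uses peaks and troughs), replaces the ripple exclusion validator with a plain prominence threshold, and applies a hard 30 Hz cap. The function names, the threshold value, and the synthetic 10 Hz trace are all assumptions made for illustration.

```python
import math

def prominence(signal, i):
    """Height of the peak at index i above the higher of the two lowest
    points reached before the signal rises above the peak on either side."""
    peak = signal[i]
    left_min = peak
    j = i - 1
    while j >= 0 and signal[j] <= peak:
        left_min = min(left_min, signal[j])
        j -= 1
    right_min = peak
    j = i + 1
    while j < len(signal) and signal[j] <= peak:
        right_min = min(right_min, signal[j])
        j += 1
    return peak - max(left_min, right_min)

def cycle_frequencies(signal, fs, min_prominence, max_hz=30.0):
    """Return (time_s, frequency_hz) for each pair of successive accepted peaks.

    Peaks below min_prominence are treated as ripples and ignored; cycles
    faster than max_hz are dropped (a simplified stand-in for artifact rejection).
    """
    peaks = [
        i for i in range(1, len(signal) - 1)
        if signal[i - 1] < signal[i] >= signal[i + 1]
        and prominence(signal, i) >= min_prominence
    ]
    freqs = []
    for a, b in zip(peaks, peaks[1:]):
        f = fs / (b - a)  # one cycle spans (b - a) samples
        if f <= max_hz:
            freqs.append((b / fs, f))
    return freqs

# Synthetic whisker-angle trace: a 10 Hz sine sampled at 1 kHz for 1 s.
fs = 1000.0
trace = [math.sin(2 * math.pi * 10.0 * n / fs) for n in range(1000)]
result = cycle_frequencies(trace, fs, min_prominence=0.5)
```

Each entry of `result` is one frequency estimate per cycle, whereas a fixed 0.5 s Fourier window yields one estimate per window.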
This pipeline provides an alternative solution for real-time compatible frequency estimation, which better captures temporal variation at the expense of precision in frequency estimation.
2026-05-26T18:45:33Z
Guanghui Li, Fangyuan Li, Barbara Lykke Lind, Rune W Berg

http://arxiv.org/abs/2512.05794v3
Mechanistic Interpretability of Antibody Language Models Using SAEs
2026-05-26T15:50:47Z
Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models, and steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
2025-12-05T15:18:50Z
v3: 15 pages; corrected author list and affiliations in the main text; minor text changes; updated steering results following minor code changes; conclusions and findings remain unchanged; included link to data and code in the Data Availability section
Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane

http://arxiv.org/abs/2605.21617v2
$\textit{BlockFormer}$: Transformer-based inference from interaction maps
2026-05-26T12:41:09Z
Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques -- notably Hi-C -- can be formulated as a generic inverse problem: infer a set of parameters given a map summarizing pairwise interactions between entities through blocks of variable numbers and sizes. In this work, we introduce a data-driven approach that leverages shared structure between these maps, such as global alignment between localized patterns, while handling the variability in number and size of entities arising in real-world data. Our approach relies on a transformer architecture capable of handling such variability and a custom simulator to generate abundant, yet computationally cheap synthetic data for training. Applied to the problem of centromere localization, the method accurately recovers their genomic positions across a wide range of species of various genome sizes.
2026-05-20T18:28:43Z
Eloïse Touron, Pedro L. C. Rodrigues, Julyan Arbel, Nelle Varoquaux, Michael Arbel

http://arxiv.org/abs/2605.26690v1
Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets
2026-05-26T08:29:36Z
Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice.
In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git
2026-05-26T08:29:36Z
Ashima Khanna, Dominik Grimm

http://arxiv.org/abs/2605.25465v1
AI-Driven SERS for Non-invasive and Label-Free Extracellular Vesicle Detection Across Cellular Origins in Tears and Sweat
2026-05-25T06:17:24Z
Wearable sensing technology capable of point-of-care, continuous, and non-invasive analysis of exosomes in biofluids such as tears and sweat is an essential part of future personalized medicine. Major detection and identification methods for cell-secreted extracellular vesicles (EVs) often require labeling and are time-consuming, resulting in low efficiency in EV mechanism research and disease diagnosis. While label-free surface-enhanced Raman spectroscopy (SERS) has been combined with deep learning models for EV identification in blood, its application to non-invasive detection of EVs in tears and sweat is missing.
Here, we filled this gap by developing an artificial intelligence (AI)-assisted surface-enhanced Raman spectroscopy (SERS) method based on salt-induced nanoparticle aggregation for fast EV identification in tears and sweat with high accuracy. Notably, our label-free detection and AI differentiation of EVs from 6 cell lines (HepG2, Hela, 143B, LO-2, BMSC, H8) achieved the identification of EVs in tear fluids from 7 different disease sources with accuracies >92%. Our results showed that this platform can not only distinguish EVs from multiple cell sources but also generate highly reproducible and selective EV signals in tear fluids without a need for chemical labeling or separation steps. Molecular dynamics simulations revealed that silver atoms (Ag) form electrostatic interactions with oxygen atoms of multiple amino acid residues in proteins, suggesting a high affinity. This strategy realizes ultra-sensitive and anti-interference detection of EVs, offering a new approach to the rapid diagnosis of clinical diseases.
2026-05-25T06:17:24Z
7 figures, 26 pages
Yang Li, Xiaoming Lyu, Ling Xia, Kuo Zhan, Haoyu Ji, Lei Qin, Seppo J. Vainio, Jian-An Huang

http://arxiv.org/abs/2502.01184v2
FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning
2026-05-25T05:20:26Z
Molecular representation learning methods typically tokenize molecules as individual atoms or use rigid, rule-based fragment decompositions, limiting their ability to capture meaningful chemical substructure context. We introduce FragmentNet, a graph-to-sequence model built around a novel adaptive, learned tokenizer that decomposes molecular graphs into chemically valid fragments of adjustable granularity, complemented by chemically aware spatial positional encodings that preserve molecular topology in the resulting sequence.
Extending masked pre-training strategies from natural language processing to the molecular domain, we mask and reconstruct molecules at the level of chemically meaningful fragments rather than individual atoms. Evaluating across multiple property prediction benchmarks, we find that pre-training at fragment granularity leads to improved downstream performance on the majority of tasks, demonstrating that tokenization granularity is an important design choice for molecular representation learning.
2025-02-03T09:21:49Z
22 pages, 13 figures, 5 tables
Ankur Samanta, Rohan Gupta, Aditi Misra, Christian McIntosh Clarke, Jayakumar Rajadas

http://arxiv.org/abs/2605.25388v1
ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks
2026-05-25T03:31:46Z
Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive and large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models across two critical dimensions: biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three critical conclusions. Firstly, NFMs exhibit a performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Secondly, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Thirdly, controlled ablation studies reveal that taxonomic diversity in pretraining data outweighs parameter scale.
Specifically, a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at https://github.com/QIANJINYDX/ViroBench.
2026-05-25T03:31:46Z
42 pages, 15 figures
Dongxin Ye, Fang Hu, Han Hu, Shu Hu, Yang Tan, Wanli Ouyang, Stan Z. Li, Jie Cui, Nanqing Dong
10.1145/3770855.3819057

http://arxiv.org/abs/2605.25295v1
Accelerated Simulation Algorithms for Extreme First-Passage Problems with General Emission Profiles
2026-05-24T23:17:38Z
Fastest arrival events, where the first among many diffusing particles reaches a target, are central in triggering signal initiation in molecular stochastic systems. Classical approaches to simulate such events rely on full trajectory generation of all particles, leading to prohibitive computational costs in the large particle number regime. In this work, we present a general simulation framework for efficiently generating order statistics of arrival times by exploiting asymptotic first-passage distributions. This framework applies to diffusion processes in bounded domains with localized absorbing targets, for which short-time first-passage asymptotics are available, such as Brownian motion in dimensions one, two, and three. Starting with the case of instantaneous emission, we derive and implement a recursive inverse transform algorithm to simulate the first $k$ arrivals without tracking particle trajectories. We extend this algorithm to time-dependent emission profiles via an iterative approach, enabling the simulation of extreme statistics in systems with temporal injection, ranging from rapid to prolonged emission. Additionally, we provide asymptotic estimates of the mean fastest arrival time.
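A recursive inverse transform for the first $k$ of $N$ arrivals can be sketched with the standard sequential construction of uniform order statistics. This is an illustration of the general technique, not the authors' algorithm. The exponential arrival-time law is a placeholder assumption standing in for the short-time first-passage asymptotics the abstract refers to, and the names `first_k_arrivals` and `inverse_survival` are hypothetical.

```python
import math
import random

def first_k_arrivals(n, k, inverse_survival, rng):
    """Sample the k earliest of n i.i.d. arrival times, in increasing order,
    using k random draws instead of n simulated trajectories.

    inverse_survival(s) must return the time t with P(T > t) = s.
    """
    arrivals = []
    survival = 1.0  # P(T > t) evaluated at the most recent arrival time
    for j in range(k):
        remaining = n - j  # particles that have not yet arrived
        # The minimum of `remaining` i.i.d. times, conditioned on all exceeding
        # the previous arrival, scales the survival level by V ** (1 / remaining).
        v = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0) below
        survival *= v ** (1.0 / remaining)
        arrivals.append(inverse_survival(survival))
    return arrivals

# Placeholder law (assumption): unit-rate exponential arrival times.
rng = random.Random(0)
fastest = first_k_arrivals(10**6, 5, lambda s: -math.log(s), rng)
```

The cost is linear in $k$ and independent of $N$. Substituting a different `inverse_survival` changes the arrival-time law without touching the recursion.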
To conclude, the present acceleration algorithm, which bypasses Brownian simulations of trajectories, can be used for spatial reaction networks, rare event detection, or diffusion-controlled activation.
2026-05-24T23:17:38Z
28 pages, 7 figures
Emmanuel Akame Mfoumou, David Holcman

http://arxiv.org/abs/2605.25155v1
Automated multi-dataset INST $^{13}$C metabolic flux analysis at microliter scale reveals robust fluxes but variable metabolite pools in $Corynebacterium~glutamicum$
2026-05-24T16:27:40Z
Isotopically non-stationary metabolic flux analysis (INST $^{13}$C-MFA) provides unique insights into cellular physiology but is typically limited by low throughput and high experimental costs. Here, we present a miniaturized and automated workflow that integrates transient isotope labeling experiments with advanced computational modeling to enable parallel INST $^{13}$C-MFA at microliter scale. The approach is demonstrated for an evolved $Corynebacterium~glutamicum$ strain capable of efficient growth on ethanol, a substrate for which isotopically stationary $^{13}$C-MFA is inherently limited due to low labeling diversity. Using robotic liquid handling, rapid hot isopropanol quenching, and LC-QToF-MS-based analytics, highly informative datasets were generated from parallel 48-well experiments with different ethanol tracers. Multi-dataset INST $^{13}$C-MFA unlocked joint estimation of intracellular fluxes and metabolite pool sizes and significantly improved flux precision compared to single-dataset analyses. While net fluxes were robust across datasets, pool size estimates exhibited variability and did not converge under joint inference, highlighting a fundamental methodological difference to single-dataset INST $^{13}$C-MFA. The resulting multi-dataset flux map reveals a central role of the glyoxylate shunt during growth on ethanol, consistent with metabolic adaptation to C2-based substrate utilization.
Overall, this work demonstrates that automated multi-dataset INST $^{13}$C-MFA is technically feasible and provides high-quality flux analysis at a fraction of the cost of conventional lab-scale bioreactor-based approaches. The presented workflow establishes a scalable framework for high-throughput quantitative fluxomics in microbial biotechnology and supports integration into iterative strain engineering and biofoundry pipelines.
2026-05-24T16:27:40Z
Jochen Nießer, Anton Stratmann, Martin Beyß, Wolfgang Wiechert, Katharina Nöh, Stephan Noack

http://arxiv.org/abs/2511.05906v3
Systemic physiological cliff at menopause revealed by temporal deconvolution of 300 million lab tests: a multi-cohort retrospective study
2026-05-24T16:19:26Z
Menopause fundamentally reshapes female physiology, yet current understanding is limited by small longitudinal cohorts that characterize it as a gradual transition. Large-scale biomedical datasets remain underutilized because the age of the final menstrual period (FMP) is rarely recorded. Here, we present a computational framework that leverages cross-sectional data to reconstruct systemic physiology as a function of time relative to FMP. We adapted a deconvolution framework from astronomy to recover systemic biological trajectories by deconvolving the population distribution of FMP age from chronological data. Applying this to two national cohorts with 300 million laboratory tests from 1.3 million females, we transformed cross-sectional measurements into high-resolution timelines anchored to the FMP. Our analysis reveals a step-like physiological cliff at the FMP across endocrine, skeletal, hepatic, renal, inflammatory, and lipid systems. These discontinuities are absent in males and highly concordant across independent populations. We demonstrate that systemic dysregulation begins over a decade prior to FMP, significantly expanding the window for preventive intervention.
Furthermore, hormone replacement therapy (HRT) appears to markedly attenuate these abrupt physiological shifts. These findings further support a systemic and quantifiable view of the menopausal transition and provide a generalizable strategy for recovering hidden biological trajectories from human datasets, applicable to other life stages such as puberty or disease progression.
2025-11-08T08:10:29Z
main text: pages 1-23, 5 figures, 1 table. supplemental: pages 24-51 8 figures, 5 tables, supplemental discussion
Glen Pridham, Yoav Hayut, Noa Lavi-Shoseyov, Michal Neeman, Noa Hovav, Yoel Toledano, Uri Alon

http://arxiv.org/abs/2605.25079v1
Trans-dimensional Bayesian model averaging for $^{13}$C-based metabolic flux analysis: Evidence-based flux inference under structural model uncertainty
2026-05-24T13:42:07Z
Accurate quantification of intracellular metabolic fluxes is central to systems biology and biotechnology. Flux estimation relies on biochemical network models, with $^{13}$C metabolic flux analysis (MFA) being the state-of-the-art approach. However, isotope labeling data are often insufficient to uniquely support a single network formulation. In such cases, flux estimates become model-dependent, highlighting the need for methods that explicitly account for structural uncertainty. Bayesian model averaging (BMA) provides a principled framework for this purpose, but its application to $^{13}$C-MFA has so far been restricted to uncertainty in reaction bidirectionality within fixed network topologies. We introduce a scalable Bayesian inference framework for $^{13}$C-MFA, Bayesian model set averaging, that applies BMA to encompass uncertainty in reactions and pathways. Our approach combines reversible jump Markov chain Monte Carlo for trans-dimensional exploration of model spaces with diffusive nested sampling for robust estimation of model evidences, enabling averaging over large families of metabolic network models.
Using illustrative and application-scale synthetic case studies, we demonstrate that the method yields robust flux estimates, reveals when multiple network configurations are statistically indistinguishable, and recovers data-supported model structures. Importantly, rather than committing to a single model, the framework manages structural uncertainty: under limited data, competing models are retained, whereas increasing data informativeness improves model and flux recovery. The approach scales to billions of model variants, providing a practical foundation for uncertainty- and misspecification-aware quantitative flux inference in $^{13}$C-MFA.
2026-05-24T13:42:07Z
Johann F. Jadebeck, Anton Stratmann, Martin Beyß, Katharina Nöh
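
The averaging step behind Bayesian model averaging, as named in the last abstract, can be sketched in a few lines. This covers only the final combination of per-model results, assuming the log evidences and per-model posterior means are already available. It does not reproduce the reversible jump MCMC or diffusive nested sampling machinery, and the function names and numbers are illustrative assumptions, not values from the paper.

```python
import math

def posterior_model_probabilities(log_evidences, log_priors=None):
    """Convert log model evidences (plus optional log prior model weights)
    into normalized posterior model probabilities via a stable log-sum-exp."""
    if log_priors is None:
        log_priors = [0.0] * len(log_evidences)  # uniform prior over models
    log_post = [le + lp for le, lp in zip(log_evidences, log_priors)]
    shift = max(log_post)  # subtract the maximum to avoid underflow in exp
    weights = [math.exp(v - shift) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def model_averaged_mean(per_model_means, probabilities):
    """Posterior mean of a quantity under BMA: the probability-weighted sum
    of its per-model posterior means."""
    return sum(p * mu for p, mu in zip(probabilities, per_model_means))

# Hypothetical example: two indistinguishable models and one poorly supported one.
probs = posterior_model_probabilities([-100.0, -100.0, -110.0])
flux = model_averaged_mean([1.0, 3.0, 100.0], probs)
```

In this example the first two models share almost all of the posterior mass, so the averaged estimate stays near their midpoint instead of committing to either one.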