https://arxiv.org/api/5m7d7vIhMKQGQIHXk+dnyrye36w 2026-06-21T20:40:41Z 13258 195 15 http://arxiv.org/abs/2603.20420v2 CRANE: Correcting Errors in Raw Nanopore Signals Using Hidden Markov Models 2026-05-20T05:18:24Z

Nanopore sequencing can read substantially longer sequences of nucleic acid molecules, called reads, than other sequencing methods, which has led to advances in genomic analysis such as the gapless human genome assembly. By analyzing the raw electrical signal reads that nanopore sequencing generates from molecules, existing works can map these reads without translating them into DNA characters (i.e., basecalling), allowing for quick and efficient analysis of sequencing data. However, raw signals often contain errors due to noise and processing errors, which limits the overall accuracy of raw signal analysis. Our goal in this work is to detect and correct errors in raw signals to improve the accuracy of raw signal analyses. To this end, we propose CRANE, a mechanism that trains and utilizes a Hidden Markov Model (HMM) to accurately correct signal errors. Our extensive evaluation on various datasets shows that CRANE 1) consistently improves the overall accuracy of the underlying raw signal analysis tools, 2) minimizes the burden of optimizing analysis pipelines for newer nanopore technologies, and 3) does not introduce substantial computational overhead. We conclude that CRANE provides an effective mechanism to systematically identify and correct the errors in raw nanopore signals before further analysis, which can enable the development of a new class of error correction mechanisms purely designed for raw nanopore signals. Source Code: CRANE is available at https://github.com/STORMgroup/CRANE. We also provide the scripts to fully reproduce our results on our GitHub page.
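The abstract does not describe the structure of CRANE's HMM, so the following is only a generic sketch of how an HMM can smooth a noisy observation sequence, using textbook Viterbi decoding on a toy two-level signal. The state names, symbols, and probabilities are hypothetical and are not taken from CRANE.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a discrete observation sequence."""
    log = math.log
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = []  # back[t][s]: best predecessor of state s at time t + 1
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log(trans_p[p][s]))
            scores[s] = best[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][o])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy model (all values hypothetical): two signal levels with "sticky" transitions.
states = ("low", "high")
start_p = {"low": 0.5, "high": 0.5}
trans_p = {"low": {"low": 0.9, "high": 0.1}, "high": {"low": 0.1, "high": 0.9}}
emit_p = {"low": {"l": 0.8, "h": 0.2}, "high": {"l": 0.2, "h": 0.8}}

observed = ["l", "l", "h", "l", "l"]  # the lone "h" plays the role of a noise spike
decoded = viterbi(observed, states, start_p, trans_p, emit_p)
print(decoded)
```

In this toy setting the isolated "h" observation is decoded as part of the "low" run, because one unlikely emission costs less than two unlikely transitions.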

2026-03-20T18:41:07Z Simon Ambrozak Ulysse McConnell Bhargav Srinivasan Burak Ozkan Ernest Zhang Can Firtina http://arxiv.org/abs/2605.20692v1 Inferring infectiousness: a joint model of the within-host viral kinetics of SARS-CoV-2 2026-05-20T04:39:53Z

During an infectious disease outbreak, providing accurate answers to policy questions about transmission requires a detailed model of the natural history of infectiousness. Unfortunately, direct measures of infectiousness are generally unavailable. Instead, we often rely on indirect proxies, such as viral load measured by PCR or antigen tests, viral culture to detect replication-competent virus, or symptom onset, each of which reflects different aspects of viral dynamics or host response. However, these proxies vary in terms of the ease of collection, scalability, and their relationship to viral shedding and therefore underlying infectiousness. Here, we use data from five prospective, densely sampled cohorts with longitudinal data on multiple proxies of viral shedding for approximately 2,000 infections to develop a Bayesian joint model for the within-host viral kinetics of SARS-CoV-2 infection. Modeling the joint distribution allows us to infer the trajectory of infectious virus shedding -- the most direct correlate of infectiousness -- for individuals who contribute only PCR data, and to compute derived quantities that are inaccessible from any single proxy alone. These include the population-level probability and expected duration of ongoing infectiousness as a function of time since diagnosis, stratified by variant, vaccination status, and infection history; the residual risk of releasing an individual from isolation; and personalized, real-time estimates of infectiousness that are sequentially updated as new test results become available.

2026-05-20T04:39:53Z Christopher B. Boyer Stephen M. Kissler Seran Hakki Jakob Jonnerby Ajit Lalvani Marc Lipsitch http://arxiv.org/abs/2605.20523v1 Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models 2026-05-19T21:51:02Z

Advanced fibrosis is a major determinant of liver-related morbidity in metabolic dysfunction-associated steatotic liver disease (MASLD). FIB-4 is widely used as a first-line non-invasive test, but its fixed formula may underuse diagnostic information contained in age, aspartate aminotransferase, alanine aminotransferase, and platelet count. We evaluated whether machine-learning-enhanced non-invasive testing (MLE-NIT) can improve advanced fibrosis detection while preserving this FIB-4 variable space. We used three biopsy-confirmed MASLD cohorts from China, Malaysia, and India (n=784). The Chinese cohort was split into 486 training and 54 internal validation/tuning patients; final performance was reported only on the Malaysian and Indian external cohorts. Models used five variables: age, FIB-4, aspartate aminotransferase, platelet count, and alanine aminotransferase. We compared FIB-4 with a shallow-deep neural network (s-DNN), TabPFN, and gpt-4o-2024-08-06. FIB-4 achieved external ROC-AUCs of 0.75 and 0.60 in Malaysia and India, respectively. TabPFN achieved 0.69 and 0.66, fine-tuned GPT-4o achieved 0.75 and 0.63, and the s-DNN achieved 0.77 and 0.67, respectively. The s-DNN contained only 354 trainable parameters, compared with 7,244,554 for TabPFN, yet provided a more balanced external operating profile. Calibration showed s-DNN Brier scores of 0.18 and 0.22, and permutation importance identified AST and FIB-4 as dominant variables. Compact non-linear MLE-NITs may enhance FIB-4-based fibrosis assessment without increasing clinical data requirements.
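For reference, FIB-4 is conventionally computed as age × AST / (platelet count × √ALT), with age in years, AST and ALT in U/L, and platelet count in 10^9/L. A minimal sketch of that fixed formula follows; the function name and the sample values are illustrative and do not come from the study.

```python
import math

def fib4(age_years, ast_u_l, alt_u_l, platelets_10e9_l):
    """FIB-4 index = (age * AST) / (platelets * sqrt(ALT))."""
    if alt_u_l <= 0 or platelets_10e9_l <= 0:
        raise ValueError("ALT and platelet count must be positive")
    return (age_years * ast_u_l) / (platelets_10e9_l * math.sqrt(alt_u_l))

# Hypothetical patient values, chosen only to make the arithmetic easy to follow:
# (50 * 40) / (200 * sqrt(25)) = 2000 / 1000
score = fib4(age_years=50, ast_u_l=40, alt_u_l=25, platelets_10e9_l=200)
print(round(score, 2))
```

Because the formula fixes how the four inputs combine, a learned model that takes the same variables has room to weight them differently, which is the premise the abstract tests.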

2026-05-19T21:51:02Z 26 pages, 4 figures, 3 tables. Preprint Athanasios Angelakis Gabriele De Vito Eleni-Myrto Trifylli Filomena Ferrucci http://arxiv.org/abs/2605.20454v1 Sparse Contextual Coupling Reshapes Diffusion Geometry in Multilayer Hypergraphs 2026-05-19T20:06:54Z

Many complex systems combine dense background structure with sparse contextual information. We introduce a diffusion-based framework for analyzing how sparse condition-specific layers reshape diffusion geometry in multilayer hypergraphs. Each layer is represented as a weighted hypergraph, layers are coupled through shared entities, and random walks on the coupled system induce multiscale diffusion distances between nodes. We apply the framework to disease-conditioned gene networks by coupling a dense MSigDB functional gene-set layer to sparse disease-specific DGIdb drug-gene hypergraphs, with disease-associated drugs selected from DDDB and HumanNet-GSP used to define external gene weights. Across Bipolar Disorder, Schizophrenia, Leukemia, and Breast Cancer, the disease-specific layer contains less than 2 percent of genes in the coupled system, yet substantially changes diffusion distances and community structure. Centrality analysis suggests that this disproportionate effect arises because DGIdb-associated genes occupy influential positions in the MSigDB-derived functional network. The resulting diffusion-derived communities are stable under subsampling and show coherent post hoc functional enrichment, including signaling and neurotransmission categories in neuropsychiatric diseases and immune, translational, and metabolic categories in cancer-associated diseases. Community-level comparisons further reveal disease similarities not reducible to direct DGIdb gene overlap, including a Breast Cancer-Schizophrenia relationship consistent with recent biomedical evidence. These results show that sparse contextual layers can induce interpretable nonlocal changes in higher-order network geometry.

2026-05-19T20:06:54Z Hao Ding Sanjukta Krishnagopal http://arxiv.org/abs/2502.07860v2 Design of an Automated Ethanol Vapor Generating System for Alcohol Use Disorder(AUD) Animal Studies 2026-05-19T19:44:17Z

Alcohol Use Disorder (AUD) is a prevalent addictive disorder affecting an estimated 29.5 million Americans. It is characterized by impaired control over alcohol consumption despite negative consequences. The number of diagnostic criteria met by an individual typically determines the severity of AUD. Research into AUD focuses on understanding individual susceptibility differences and developing preventive strategies. Alcohol vapor inhalation has emerged as a promising method for pathophysiological investigations in animals, allowing researchers to control the dose and duration of alcohol exposure. This approach is crucial for studying the escalation of voluntary alcohol-drinking behavior. Current commercial systems for alcohol vapor generation have limitations, including combustion risks and the need to adjust multiple parameters. Other methods, like bubbling or blow-over evaporation, face challenges in maintaining equilibrium and avoiding aerosolization. To address these issues, a new type of ethanol vapor generating system is proposed that relies solely on temperature control, creating a vacuum into which ethanol evaporates under thermodynamic control. This approach eliminates the need to adjust multiple parameters and offers improved accuracy and precision in vapor dose delivery. In validation, the system performed as anticipated, achieving stable ethanol vapor output after a few priming cycles. Using a 1.2 L cylinder, we obtained approximately 3.6 L of saturated vapor/air mix in 1 minute. Gravimetric results showed that each cycle produced about 100 mg/L or ~10,000 ppm vapor-to-air mixture. The intended use of the ethanol vapor generator is to provide a concentrated ethanol vapor/air mixture to be further diluted before delivery to the animals.

2025-02-11T17:12:30Z 6 pages, 2 figures, 1 table Alexander Pozhitkov Douglas Ramsay Peter A Noble http://arxiv.org/abs/2504.05454v2 GraphPINE: Graph Importance Propagation for Interpretable Drug Response Prediction 2026-05-19T17:55:07Z

Explainability is necessary for many tasks in biomedical research. Recent explainability methods have focused on attention, gradients, and Shapley values. These do not handle data with strong associated prior knowledge and fail to constrain explainability results based on known relationships between predictive features. We propose GraphPINE, a graph neural network (GNN) architecture leveraging domain-specific prior knowledge to initialize node importance optimized during training for drug response prediction. Typically, a manual post-prediction step examines literature (i.e., prior knowledge) to understand returned predictive features. While node importance can be obtained for gradient and attention after prediction, node importance from these methods lacks complementary prior knowledge; GraphPINE seeks to overcome this limitation. GraphPINE differs from other GNN gating methods by utilizing an LSTM-like sequential format. We introduce an importance propagation layer that 1) unifies updates for the feature matrix and node importance and 2) uses GNN-based graph propagation of feature values. This initialization and updating mechanism allows for informed feature learning and improved graph representation. We apply GraphPINE to cancer drug response prediction using drug screening and gene data collected for over 5,000 gene nodes included in a gene-gene graph with a drug-target interaction (DTI) graph for initial importance. The gene-gene graph and DTIs were obtained from curated sources and weighted by article count discussing relationships between drugs and genes. GraphPINE achieves a PR-AUC of 0.894 and ROC-AUC of 0.796 across 952 drugs. Code is available at https://anonymous.4open.science/r/GraphPINE-40DE.

2025-04-07T19:42:12Z Yoshitaka Inoue Tianfan Fu Augustin Luna http://arxiv.org/abs/2605.19902v1 Hierarchical Contrastive Learning for Multi-Domain Protein-Ligand Binding 2026-05-19T14:33:02Z

Predicting protein-ligand binding affinity remains intractable for multi-domain proteins, where inter-domain dynamics govern molecular recognition. Existing geometric deep learning methods typically treat proteins as monolithic static graphs, suffering from rigid-body assumptions and aleatoric noise in flexible regions. To address this, we introduce HCLBind, a self-supervised framework that decouples geometric representation learning from affinity regression. HCLBind leverages a general-to-specific pre-training paradigm on the Q-BioLiP database to learn a robust physical grammar of binding. We propose a novel hierarchical decoy strategy: the model learns local physicochemical constraints through protein coordinate perturbation in single-domain proteins and global conformational geometry through inter-domain rotation in multi-domain complexes. Our hybrid architecture integrates a domain-gated graph attention network and cross-modal attention to explicitly prioritize domain interfaces. Furthermore, we employ LoRA on protein and ligand foundation models, ensuring efficient optimization while preserving evolutionary knowledge. Experiments on PDBBind demonstrate that HCLBind effectively learns discriminative interface features and provides robust uncertainty estimation, overcoming the limitations of standard supervised learning. The code is available at https://github.com/jiankliu/HCLBind.

2026-05-19T14:33:02Z Accepted by ISBRA2026 Shuo Zhang Rongqi Hong Huifeng Zhang Jian K. Liu http://arxiv.org/abs/2605.19677v1 Agentic Discovery of Cryomicroneedle Formulations 2026-05-19T11:09:46Z

Cryomicroneedles offer a route to minimally invasive intradermal delivery of living cells, but their cryogenic formulations must reconcile cell protection with constraints on toxicity and device fabrication. Here we report an AI-assisted, closed-loop workflow for cryomicroneedle cryoprotectant discovery that combines literature curation, Gaussian-process surrogate modelling, Bayesian optimization, and sequential wet-lab validation. A curated dataset of 198 mesenchymal stem-cell cryopreservation formulations from 42 studies was converted into 21 ingredient features and used to train an uncertainty-aware literature prior. This model captured moderate structure in the literature data but failed prospectively, motivating iterative wet-lab correction. Across ten validation iterations and 106 wet-lab observations, the model progressively adapted to cryomicroneedle-specific outcomes: batch RMSE decreased from 41.21 to 6.86 percentage points, later-stage rank correlations became consistently positive, and the cumulative wet-lab predicted-versus-measured summary reached $R^2 = 0.942$. The best validated formulation achieved 95.15\% post-thaw viability with low DMSO, ectoin, ethylene glycol, and fetal bovine serum. However, high viability alone did not ensure intact cryomicroneedle formation, highlighting the need for future multi-objective optimization. These results demonstrate that agent-assisted computational infrastructure can make data-efficient formulation discovery more accessible to labs with minimal data expertise in-house. Project code is available at https://github.com/baitmeister/ML-for-CryoMN.
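The abstract names Gaussian-process surrogate modelling and Bayesian optimization but does not say which acquisition function was used. As a hedged illustration only, the sketch below implements expected improvement, one common choice for maximization problems, from a surrogate's predicted mean and standard deviation. The candidate values are hypothetical and are not results from the paper.

```python
import math

def expected_improvement(mean, std, best_so_far):
    """Expected improvement over best_so_far for a Gaussian prediction (maximization)."""
    if std <= 0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mean - best_so_far) * cdf + std * pdf

# Hypothetical candidates: predicted post-thaw viability (%) and predictive std.
best_viability = 90.0
ei_confident = expected_improvement(mean=90.0, std=1.0, best_so_far=best_viability)
ei_uncertain = expected_improvement(mean=88.0, std=10.0, best_so_far=best_viability)
print(ei_confident, ei_uncertain)
```

Under this criterion the uncertain candidate scores higher despite its lower predicted mean, which is how an uncertainty-aware surrogate can steer the next wet-lab batch toward informative formulations.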

2026-05-19T11:09:46Z Hao Li Lifu Du Nurul Hameed Shemonti Saha Authai Zlata Stefanovic Chenjie Xu http://arxiv.org/abs/2603.06740v2 ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins 2026-05-19T11:09:15Z

Protein language models (pLMs) have shown strong potential for zero-shot prediction of missense variant effects, yet systematic benchmarking on viral proteins remains limited, a critical gap given the need for proactive tools that can anticipate emerging mutations ahead of experimental validation. Here we introduce ViroGym, a comprehensive benchmark evaluating pLMs across three tasks: 79 deep mutational scanning (DMS) assays covering eukaryotic viruses with 552,065 mutated sequences across 7 phenotypic readouts, 21 influenza neutralisation tasks, and a real-world pandemic prediction task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting, and find that the ProGen2 family consistently achieves the strongest performance across all three tasks. Crucially, DMS and neutralisation performance reliably identifies models that generalise to real-world emergence, even though the mutation sets they surface barely overlap, revealing that complementary in vitro benchmarks capture the evolutionary constraints needed for real-world mutation forecasting.

2026-03-06T08:38:43Z Yichen Zhou Jonathan Golob Amir Karimi Stefan Bauer Patrick Schwab http://arxiv.org/abs/2402.17086v4 Multicellular simulations with shape and volume constraints using optimal transport 2026-05-19T05:51:29Z

Many living and physical systems such as cell aggregates, tissues or bacterial colonies behave as unconventional systems of particles that are strongly constrained by volume exclusion and shape interactions. Understanding how these constraints lead to macroscopic self-organized structures is a fundamental question in e.g. developmental biology. To this end, various types of computational models have been developed. Here, we introduce a new framework based on optimal transport theory to model particle systems with arbitrary dynamical shapes and deformability properties. Our method builds upon the pioneering work of Brenier on incompressible fluids and its recent applications to materials science. It lets us specify the shapes and volumes of individual cells and supports a wide range of interaction mechanisms, while automatically taking care of the volume exclusion constraint at an affordable numerical cost. We showcase the versatility of this approach by reproducing several classical systems in computational biology. Our Python code is freely available at https://iceshot.readthedocs.io/.

2024-02-26T23:53:18Z Antoine Diez Jean Feydy http://arxiv.org/abs/2605.21522v1 Protein Thoughts: Interpretable Reasoning with Tree of Thoughts and Embedding-Space Flow Matching for Protein-Protein Interaction Discovery 2026-05-19T04:14:06Z

Protein-protein interactions (PPIs) govern nearly all cellular processes, yet computational methods for identifying binding partners typically produce ranked predictions without mechanistic justification. This creates a fundamental barrier to adoption because biologists cannot assess whether predictions reflect genuine biochemical insight or spurious correlations. We present Protein Thoughts, a framework that reformulates PPI discovery as an interpretable search problem with explicit reasoning. The system decomposes binding evidence into four biologically meaningful signals: sequence similarity reflecting evolutionary relationships, structural complementarity capturing geometric fit, interface balance, and chemical compatibility encoding residue-level interactions. Rather than collapsing these signals into an opaque score, we preserve their individual contributions through a transparent value function that enables both ranking and auditing. To navigate large candidate spaces efficiently, we introduce hypothesis-guided entropy-regularized Tree-of-Thoughts search. A fine-tuned language model generates search directives from embedding-derived features, classifying candidates as high-priority, exploratory, or skippable. These directives condition a Boltzmann policy that balances exploitation with entropy-driven exploration, while hypothesis-aware pruning prevents premature abandonment of promising candidates. For candidates exhibiting score disagreement, hypothesis-conditioned embedding-space flow matching transports protein embeddings toward the binder manifold. On the SHS148k benchmark, Protein Thoughts achieves mean best-binder rank of 11.2 versus 47.7 for an entropic tree search baseline, a 76% improvement, and for binding prediction the trained value function achieves $91.08 \pm 0.19$ Micro-F1, outperforming existing PPI methods on the same dataset.
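A Boltzmann policy selects candidate i with probability proportional to exp(v_i / T), so the temperature T sets the balance between exploiting the top-scored candidate and exploring the rest. The sketch below shows that generic mechanism and checks the rank-improvement arithmetic quoted in the abstract. The candidate scores and temperatures are hypothetical, and the paper's hypothesis conditioning is not modeled.

```python
import math

def boltzmann_policy(values, temperature=1.0):
    """Softmax over candidate values; lower temperature means greedier selection."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats; higher means more exploratory."""
    return -sum(p * math.log(p) for p in probs if p > 0)

scores = [2.0, 1.0, 0.5]                          # hypothetical value-function outputs
sharp = boltzmann_policy(scores, temperature=0.1)   # mostly exploitation
flat = boltzmann_policy(scores, temperature=10.0)   # mostly exploration
print(sharp, entropy(sharp))
print(flat, entropy(flat))

# Relative rank improvement reported in the abstract: (47.7 - 11.2) / 47.7
improvement = (47.7 - 11.2) / 47.7
print(round(improvement, 3))
```

The last line evaluates to about 0.765, consistent with the 76% improvement stated above.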

2026-05-19T04:14:06Z Kingsley Yeon Xuefeng Liu Promit Ghosal http://arxiv.org/abs/2201.10895v5 A novel sustainable role of compost as a universal protective substitute for fish, chicken, pig, and cattle, and its estimation by structural equation modeling 2026-05-19T02:32:41Z

Natural decomposition of organic matter is essential in food systems, and compost is used worldwide as an organic fermented fertilizer. However, as a feature of the ecosystem, its effects on the animals are poorly understood. Here we show that oral administration of compost and/or its derived thermophilic Bacillaceae, i.e., Caldibacillus hisashii and Weizmannia coagulans, can modulate the prophylactic activities of various industrial animals. The fecal omics analyses in the modulatory process showed an improving trend dependent upon animal species, environmental conditions, and administration. However, structural equation modeling (SEM) estimated the grouping candidates of bacteria and metabolites as standard key components beyond the animal species. In particular, the SEM model implied a strong relationship among partly digesting fecal amino acids, increasing genus Lactobacillus as inhabitant beneficial bacteria and 2-aminoisobutyric acid involved in lantibiotics. These results highlight the potential role of compost for sustainable protective control in agriculture, fishery, and livestock industries.

2022-01-26T12:19:50Z Hirokuni Miyamoto Wataru Suda Hiroaki Kodama Hideyuki Takahashi Yumiko Nakanishi Shigeharu Moriya Kana Adachi Nao Kiriyama Masaya Wada Daisuke Sudo Shunsuke Ito Shunsuke Ito Minami Shibata Shinji Wada Takako Murano Hitoshi Taguchi Chie Shindo Arisa Tsuboi Naoko Tsuji Makiko Matsuura Chitose Ishii Teruno Nakaguma Toshiyuki Ito Toru Okada Teruo Matsushita Takashi Satoh Tamotsu Kato Atsushi Kurotani Hideaki Shima Yudai Inabu Haruki Yamano Yukihiro Tashiro Kenji Sakai Kenichi Mori Takashi Satoh Kenta Suzuki Takeshi Miura Hidetoshi Morita Shinji Fukuda Jun Kikuchi Hisashi Miyamoto Masahira Hattori Naoki Yamamoto Hiroshi Ohno http://arxiv.org/abs/2605.19071v1 Informational blueprints reveal condition-dependent gene regulatory architectures 2026-05-18T19:53:33Z

While coding regions in the genome have a direct interpretation in terms of protein products, significant fractions are non-coding and yet control essential biological functions. Unlike the genetic code, there is no "lookup table" that identifies where regulatory proteins, known as transcription factors (TFs), bind. Here, we extract these binding sites by distilling sequences of nucleotide letters into collective coordinates (hyperletters) representing the binding sites that are active under specific environmental conditions. Going beyond local information footprints between individual bases and expression levels, our $\textit{information blueprint}$ algorithm compresses the global information by optimising filters that simultaneously scan an entire promoter sequence. Inspired by renormalisation-group techniques, we identify TF binding sites as coarse-grained variables combining groups of correlated mutations with the highest collective impact on gene expression. We validate our approach on experimental data for $\textit{E. coli}$ and discover novel regulatory elements illustrating its deployment at scale across growth conditions.

2026-05-18T19:53:33Z Doruk Efe Gökmen Rosalind Wenshan Pan Tom Röschinger Stephen Quake Hernan Garcia Rob Phillips Vincenzo Vitelli http://arxiv.org/abs/2605.19050v1 Generative Pseudo-Force Fields for Molecular Generation 2026-05-18T19:14:53Z

Generating stable molecular conformations typically forces a tradeoff between the physical realism of energy-based relaxation and the sampling efficiency of data-driven generative models. While machine learning force fields (MLFFs) can sample stable conformations by relaxing molecular geometries according to physical forces, they require costly ab-initio training data. Conversely, diffusion models (DMs) learn from equilibrium data alone but are dependent on noise schedules and time-step conditioning. In this work, we propose generative pseudo-force fields (GPFFs) to bridge these paradigms by training an MLFF on a quadratic pseudo-potential energy surface relative to reference equilibrium structures. Because no ab-initio calculations are required for the perturbed geometries, non-equilibrium training data can be generated on the fly by perturbing the equilibria with Gaussian noise. We show that GPFFs constitute a time-step-agnostic variant of variance-exploding DMs: the score comes from the predicted pseudo-forces, but because force magnitudes implicitly encode the noise level, no time-step conditioning is needed. Our GPFF can hence be used as a drop-in replacement in standard diffusion sampling (ancestral, Heun) but also facilitates more efficient, adaptive variants and an MLFF-inspired direct denoising scheme. Our proposed sampling algorithms support arbitrary structural priors and geometric constraints. On QM9, GPFF has 100% validity at 256 neural function evaluations (NFE) and over 50% at just 6 NFE, outperforming diffusion baselines across all samplers. Combined with custom priors, we showcase the fast and accurate generation process of our method in a molecular editor for a drug design setting, where a molecule is generated in real time.
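To make the link between pseudo-forces and noise level concrete, the sketch below assumes the simplest quadratic pseudo-potential, E(x) = ½‖x − x_ref‖², so the pseudo-force is F = −(x − x_ref). For x = x_ref + σε with standard Gaussian ε, the score of the perturbation kernel is F/σ², and the root-mean-square force component is approximately σ. The exact potential, any rotational or translational alignment, and the learned model are not specified in the abstract; everything here, including the function names, is an illustrative assumption.

```python
import math
import random

def perturb(x_ref, sigma, rng):
    """Add isotropic Gaussian noise of scale sigma to every coordinate."""
    return [[c + sigma * rng.gauss(0.0, 1.0) for c in atom] for atom in x_ref]

def pseudo_force(x, x_ref):
    """Negative gradient of the assumed potential 0.5 * ||x - x_ref||^2."""
    return [[-(c - r) for c, r in zip(atom, ref)] for atom, ref in zip(x, x_ref)]

def rms(force):
    """Root-mean-square force component; estimates sigma under the assumptions above."""
    comps = [c for atom in force for c in atom]
    return math.sqrt(sum(c * c for c in comps) / len(comps))

rng = random.Random(0)
# Hypothetical "equilibrium" coordinates: 2000 points in 3D, not a real molecule.
x_ref = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2000)]

estimates = {}
for sigma in (0.1, 1.0):
    force = pseudo_force(perturb(x_ref, sigma, rng), x_ref)
    estimates[sigma] = rms(force)
    print(sigma, estimates[sigma])
```

Since the force magnitude alone recovers σ in this toy setting, a model that predicts such forces carries the noise level implicitly, which is the intuition behind dropping time-step conditioning.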

2026-05-18T19:14:53Z Stefaan Simon Pierre Hessmann Khaled Kahouli Stefan Gugler Michael Plainer Frank Noé Klaus-Robert Müller Niklas Wolf Andreas Gebauer http://arxiv.org/abs/2602.04883v2 Protein Autoregressive Modeling via Multiscale Structure Generation 2026-05-18T18:23:50Z

We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Using the hierarchical nature of proteins, PAR generates structures in a way that mimics sculpting a statue, forming a coarse topology and refining structural details over scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the mismatch between the training and generation procedures, which substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions and produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.

2026-02-04T18:59:49Z ICML 2026 Spotlight; ByteDance Seed Tech Report; Page: https://par-protein.github.io/ Yanru Qu Cheng-Yen Hsieh Zaixiang Zheng Ge Liu Quanquan Gu