http://arxiv.org/abs/2606.11651v1 DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics 2026-06-10T04:28:51Z

Synthetic random heteropolymers (RHPs), consisting of a predefined set of monomers, offer an approach toward the design of protein-like materials. These RHPs, if designed appropriately, can mimic protein behavior and function. As such, there is a need for computational tools to efficiently guide RHP design. We address this need by developing DeepRHP, a modified variational autoencoder (VAE) model under a semi-supervised framework. By equipping a classical VAE with an additional feature-based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns. The method is therefore versatile: any relevant features can be incorporated in a hybrid manner. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins (e.g., Aquaporin Z) in non-native environments and cross-validating our predictions with published results. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds.

2026-06-10T04:28:51Z Oral presentation at AAAI 2023 Workshop on AI to Accelerate Science and Engineering Shuni Li Zhiyuan Ruan Andy Shen Ivan Jayapurna Ting Xu Haiyan Huang http://arxiv.org/abs/2606.11646v1 Tree-Structured Orthonormal Decomposition of the Aitchison Simplex 2026-06-10T04:21:29Z

Compositional data -- vectors encoding relative proportions -- arise across scientific domains, including ecology, geochemistry, and genomics. The features in these data often come with known hierarchical structure (e.g., taxonomies, phylogenies, ontologies), yet existing methods either ignore this structure, discard the intrinsic Aitchison geometry, are designed for binary trees, or yield incomplete coordinate systems. We describe PolyILR, a canonical orthonormal decomposition of the Aitchison tangent space aligned with any tree topology. Our construction defines a weighted local geometry at each internal node capturing full branching structure, then lifts these to a global orthonormal basis where every coordinate corresponds to a specific tree location. On microbiome and single-cell benchmarks, PolyILR yields stable, interpretable features and enables inference at multiscale tree resolution. We also establish a novel theoretical connection to softmax classifiers, suggesting possible applications to probabilistic modeling.
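
As background for the Aitchison geometry mentioned above, the following minimal sketch computes the centered log-ratio (CLR) transform and one classical isometric log-ratio (ILR) balance for a single two-way split. It is not the PolyILR construction, whose details are not given in the abstract; the function names and the four-part toy composition are assumptions made for illustration.

```python
import math


def closure(parts):
    """Rescale strictly positive parts so they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def clr(parts):
    """Centered log-ratio: each log part minus the mean of the log parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def balance(parts, left, right):
    """Classical ILR balance between two disjoint groups of part indices."""
    r, s = len(left), len(right)
    mean_log_left = sum(math.log(parts[i]) for i in left) / r
    mean_log_right = sum(math.log(parts[i]) for i in right) / s
    return math.sqrt(r * s / (r + s)) * (mean_log_left - mean_log_right)


counts = [10.0, 20.0, 30.0, 40.0]  # hypothetical abundances of four taxa
composition = closure(counts)
clr_coords = clr(composition)
# One coordinate for a binary split {0, 1} versus {2, 3} of the toy tree.
split_coord = balance(composition, left=[0, 1], right=[2, 3])
```

The balance depends only on ratios between parts, so it is unchanged when the raw counts are rescaled; that scale invariance is the property log-ratio coordinates are designed to preserve.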

2026-06-10T04:21:29Z Accepted at ICML 2026. To appear in PMLR vol. 306 Daisuke Yamada Qijun Zhang Travis Pence Barbara B. Bendlin Federico Rey Vikas Singh http://arxiv.org/abs/2605.00545v2 Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots 2026-06-10T02:19:11Z

Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning the underlying dynamics that integrates both stochastic and unbalanced effects and models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.
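
The microscopic picture described above, in which each cell diffuses and occasionally divides or dies, can be illustrated with a small forward simulation. This is only a sketch of a branching diffusion with constant, hypothetical rates in one dimension; it is not the USB training objective or solver, and every name and parameter value here is an assumption.

```python
import math
import random


def simulate_branching_diffusion(x0, steps, dt, sigma, birth_rate, death_rate, rng):
    """Euler scheme for cells that diffuse and undergo discrete birth-death jumps.

    Requires (birth_rate + death_rate) * dt <= 1 so the jump probabilities are valid.
    """
    cells = list(x0)
    history = [len(cells)]
    for _ in range(steps):
        next_cells = []
        for x in cells:
            x_new = x + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)  # Brownian step
            u = rng.random()
            if u < death_rate * dt:
                continue                      # death: the cell is removed
            next_cells.append(x_new)
            if u < (death_rate + birth_rate) * dt:
                next_cells.append(x_new)      # birth: a daughter starts at the same state
        cells = next_cells
        history.append(len(cells))
    return cells, history


cells, history = simulate_branching_diffusion(
    [0.0] * 100, steps=200, dt=0.01, sigma=1.0,
    birth_rate=1.0, death_rate=0.5, rng=random.Random(0),
)
frozen_cells, frozen_history = simulate_branching_diffusion(
    [0.0] * 100, steps=50, dt=0.01, sigma=1.0,
    birth_rate=0.0, death_rate=0.0, rng=random.Random(0),
)
```

With both rates set to zero the population size stays fixed and the simulation reduces to plain Brownian motion, which is the balanced special case.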

2026-05-01T09:57:13Z Junda Ying Yuxuan Wang Bowen Yang Peijie Zhou Lei Zhang http://arxiv.org/abs/2606.11510v1 Continuous biome representations from Earth observation embeddings 2026-06-09T23:14:00Z

Biotic communities vary continuously across space, yet biome maps impose categorical boundaries that compress this variation, particularly at ecotones where transitional communities are ecologically distinct. Could Earth observation (EO) foundation models, which encode spectral, spatial, and temporal information with dense embeddings, convert discrete biome maps into continuous representations that better capture ecological variation? Here, we fit a linear classifier on Clay v1.5 satellite image embeddings to predict biome labels from a categorical map. The softmax output yields a continuous probability vector whose dimensions correspond to named biome classes. We evaluate this approach using six Brazilian biomes, 1.3 million embeddings, and 10,015 withheld forest inventory plots spanning 4,672 plant species. The continuous biome representation outperforms discrete biome labels for predicting species occurrence (mean per-species AUC 0.618 vs. 0.570 across 10 spatial cross-validation folds). Decomposing this gain shows that continuity in the graded probability output, rather than label reassignment, accounts for the improvement; the pattern holds across all distances from biome boundaries. The raw 1024-dimensional embedding remains the strongest predictor we tested (mean AUC 0.646 vs. 0.618), but the continuous representation recovers most of the embedding's gain over discrete labels. This simple approach provides a probabilistic replacement for categorical map labels, preserving their meaning while encoding graded variation that discrete maps suppress.
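
The step from a linear classifier to a continuous biome vector can be sketched as follows. The three-dimensional embedding, the weights, and the class names are toy assumptions (the study uses 1024-dimensional Clay v1.5 embeddings and six biome classes); only the softmax mechanics are meant to carry over.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biome_probabilities(embedding, weights, biases):
    """Linear layer followed by softmax; one output probability per biome class."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


class_names = ["biome_a", "biome_b", "biome_c"]  # hypothetical labels
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
biases = [0.0, 0.0, 0.0]
embedding = [1.0, 0.5, -0.5]                     # toy stand-in for an image embedding

probs = biome_probabilities(embedding, weights, biases)
hard_label = class_names[probs.index(max(probs))]
```

The discrete map keeps only `hard_label`, whereas the continuous representation keeps the whole `probs` vector, so a location near an ecotone can carry substantial weight on two classes at once.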

2026-06-09T23:14:00Z 8 pages, 4 figures Maxwell B. Joseph Planet Labs PBC Flávia De Souza Mendes Planet Labs PBC Dieu My T. Nguyen Planet Labs PBC Camile Sothe Planet Labs PBC Christopher B. Anderson Planet Labs PBC http://arxiv.org/abs/2606.11508v1 Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction 2026-06-09T23:03:05Z

Accurate prediction of absorption, distribution, metabolism, and excretion (ADME) properties is critical to drug discovery, but remains challenging because ADME endpoints are noisy, interdependent, and often data-limited. We propose a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information machine learning (cMIM). Our method encodes molecular graphs into latent variables, reconstructs SMILES strings from the graph-derived latent codes, and augments the contrastive objective with domain-specific self-supervised chemistry tasks. Rather than treating these tasks as auxiliary regularizers with separately tuned loss weights, we formulate reconstruction, contrastive discrimination, and chemistry-specific supervision as unit-weighted log-probability factors in a single probabilistic latent-variable objective. For fine-tuning, we propose a multi-task GNN readout architecture with task-specific multilayer perceptron heads, preserving shared representation learning while mitigating negative transfer and improving the modeling of heterogeneous, nonlinear task relationships. Across Biogen, ExpansionRX, and ChEMBL-MT, the resulting Contrastive KERMT pretraining improves over the KERMT baseline by 7.6%, 9.9%, and 9.5% respectively (averaged over significantly-improved endpoints). Adding ADME-adjacent molecules to the pretraining corpus further improves transfer, and the contrastive component sharpens chemically meaningful latent neighborhoods.

2026-06-09T23:03:05Z Yifan Xue Srimukh Prasad Veccham Saee Paliwal Tyler Shimko Micha Livne http://arxiv.org/abs/2606.11426v1 Sharpness characterizes Hill functions 2026-06-09T20:23:20Z

While long treated as empirical fits, Hill functions have been postulated to be the universal Hopfield barrier for sharpness of input-output responses by Martinez-Corral, Nam, DePace, and Gunawardena. A Hopfield barrier is a fundamental limit on how well biological systems can process information without expending energy. Their case rested on numerical findings for Hill coefficients $4$ and $6$. We give a precise formulation and proof of this: measuring sharpness by the supremum of the derivative in semi-log scale, any rational function $r(x)=(\alpha_0+\alpha_1 x+ \cdots +\alpha_n x^n)/(\beta_0 + \beta_1 x+ \cdots + \beta_n x^n)$ with real coefficients $0\leq \alpha_i\leq \beta_i$ has sharpness at most $n/4$, with equality if and only if $r$ is a Hill function with Hill coefficient $n$.
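
The bound can be checked numerically on small cases. The sketch below assumes that "derivative in semi-log scale" means the derivative of $r$ with respect to $\ln x$, and it approximates the supremum by a maximum over a finite grid; the helper names and the non-Hill example are illustrative choices, not taken from the paper.

```python
import math


def hill(n):
    """Hill function with coefficient n and unit half-saturation constant."""
    return lambda x: x**n / (1.0 + x**n)


def sharpness(f, h=1e-5):
    """Largest slope of f against ln(x), via central differences on a log-spaced grid."""
    best = 0.0
    for k in range(1001):
        u = (k - 500) / 100.0  # ln(x) from -5 to 5, hitting u = 0 exactly
        slope = (f(math.exp(u + h)) - f(math.exp(u - h))) / (2.0 * h)
        best = max(best, slope)
    return best


def non_hill(x):
    # alpha = (0, 0, 1), beta = (1, 1, 1): satisfies 0 <= alpha_i <= beta_i with n = 2
    return x**2 / (1.0 + x + x**2)


sharp_hill_4 = sharpness(hill(4))    # approaches 4/4
sharp_hill_6 = sharpness(hill(6))    # approaches 6/4
sharp_non_hill = sharpness(non_hill) # stays strictly below 2/4
```

For the Hill function $x^n/(1+x^n)$ the semi-log derivative is $n x^n/(1+x^n)^2$, which peaks at $x=1$ with value $n/4$, so the grid maximum matches the stated bound.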

2026-06-09T20:23:20Z 10 pages, 2 figures Marc Stephan http://arxiv.org/abs/2606.11415v1 Spatially Masked Regression Reveals Local and Distributed Predictability in Electrophysiological Recordings 2026-06-09T20:05:44Z

Neural recordings are often interpreted as local measurements, yet the signal at any one sensor can also reflect structured activity distributed across the broader network. This raises a basic question: to what extent does an electrode's signal reflect local versus distributed information in the underlying system? More specifically, how much of an electrode's activity is carried by its immediate neighborhood, and how much is embedded more broadly across the array? We address this with a Spatially Masked Regression (SMR) framework that reconstructs each electrode's timeseries from the remaining electrodes while excluding a configurable neighborhood around the target. By progressively increasing this mask, spatial locality becomes an experimental control for quantifying how much predictive information survives after nearby channels are withheld. We apply SMR to intracranial EEG with heterogeneous electrode coverage and to scalp EEG with standardized montages over sensorimotor cortex. Using distance correlation between original and reconstructed signals, we find strong within-subject reconstruction in both modalities, substantial residual predictability even when local neighbors are excluded, and markedly stronger cross-subject transfer in EEG than in iEEG. Masking shows that nearby electrodes contribute strongly to reconstruction but do not account for all of it, indicating that individual channels reflect both local redundancy and broader distributed structure. Surrogates that preserve selected marginal or spectral properties while disrupting phase structure or temporal ordering substantially reduce performance, supporting the conclusion that SMR depends on structured temporal and cross-channel organization rather than on marginal statistics alone. These results position SMR as an interpretable framework for quantifying the balance between local and distributed information in recordings.
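
The masking procedure can be sketched with ordinary least squares on synthetic data. Several simplifications are assumptions here: electrodes sit on a one-dimensional line, the regressor is linear, the score is in-sample R² rather than the distance correlation used in the paper, and the synthetic signals mix a local spatial kernel with one array-wide component.

```python
import numpy as np


def masked_r2(signals, positions, target, radius):
    """In-sample R^2 for predicting one channel from channels farther away than `radius`."""
    dist = np.abs(positions - positions[target])
    keep = dist > radius  # drops the target itself (distance 0) and its masked neighbours
    design = np.column_stack([signals[:, keep], np.ones(signals.shape[0])])
    y = signals[:, target]
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - residual.var() / y.var()


rng = np.random.default_rng(0)
n_samples, n_channels = 500, 8
positions = np.arange(n_channels, dtype=float)
kernel = np.exp(-np.abs(positions[:, None] - positions[None, :]))  # local spatial mixing
shared = rng.normal(size=(n_samples, 1))                            # array-wide component
signals = rng.normal(size=(n_samples, n_channels)) @ kernel + shared
signals += 0.1 * rng.normal(size=signals.shape)

# Widen the mask around channel 4 and record how much predictability survives.
scores = [masked_r2(signals, positions, target=4, radius=r) for r in (0, 1, 2, 3)]
```

Because each larger mask uses a subset of the previous predictors, the in-sample score can only fall as the radius grows; whatever remains at the largest radius reflects the shared, distributed component rather than local redundancy.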

2026-06-09T20:05:44Z Maryam Ostadsharif Memar Nima Dehghani http://arxiv.org/abs/2606.11144v1 OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib 2026-06-09T17:33:24Z

Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.

2026-06-09T17:33:24Z 24 pages, 7 figures, 4 tables. Code, data, and trained model weights: https://github.com/span-ai-labs/oncotraj. Python package: pip install oncotraj. Dataset: https://huggingface.co/datasets/span-ai-labs/oncotraj-v1 Abhijoy Sarkar Aarchi Singh Thakur http://arxiv.org/abs/2606.10955v1 A kinetic model of shear-induced rupture of short dsDNA 2026-06-09T14:58:44Z

Force-induced dissociation of short double-stranded DNA (dsDNA) is central to single-molecule biophysics and DNA nanotechnology, yet a physically grounded kinetic description of shear-induced rupture for finite-length constructs remains lacking. Here we develop a master equation framework built on a force-dependent nucleation-zipper pathway with single-base transitions, enabling direct calculation of dissociation rates and transition state distances over a broad force range. Applied to a DNA-gold nanoparticle-DNA construct under constant shear force, the model accurately reproduces the experimental room-temperature data in the covered force regime and provides a unified interpretation of prior measurements on similarly sheared duplexes across all force regimes. A central result is that the three-dimensional helical geometry of dsDNA is essential for correctly defining the end to end distance under shear in the rod-like polymer model of short dsDNA. We further show that the extracted transition state distances are robust to variations in ssDNA polymer parameters within the experimentally relevant regime. Finally, we analyze the temperature dependence of the transition state distance and discuss how our framework captures globally-heated rupture while identifying the additional complications introduced by localized plasmonic heating in gold nanoparticle-coupled constructs. These results provide a predictive kinetic foundation for interpreting force-rupture experiments and for designing force- and temperature-actuated DNA nanostructures.
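
To illustrate how a master equation with single-step transitions yields a dissociation rate, the sketch below computes the mean first-passage time of a one-dimensional chain whose state counts broken base pairs. The uniform rates and the Bell-like exponential force dependence are assumptions chosen for illustration, in arbitrary units; they are not the paper's nucleation-zipper rates or its helical geometry.

```python
import math


def mean_rupture_time(open_rates, close_rates):
    """Mean first-passage time from state 0 (fully paired) to state N (ruptured).

    State k counts broken base pairs; open_rates[k] is the k -> k+1 rate and
    close_rates[k] is the k -> k-1 rate (close_rates[0] is never used).
    """
    n = len(open_rates)
    total = 0.0
    for k in range(n):
        for j in range(k + 1):
            term = 1.0 / open_rates[j]
            for i in range(j + 1, k + 1):
                term *= close_rates[i] / open_rates[i]
            total += term
    return total


def rates_under_force(n_pairs, force, k_open0=1.0, k_close=5.0, sensitivity=0.5):
    """Hypothetical Bell-like rule: opening speeds up exponentially with force."""
    open_rates = [k_open0 * math.exp(sensitivity * force)] * n_pairs
    close_rates = [k_close] * n_pairs
    return open_rates, close_rates


# Dissociation rate is taken as the inverse mean first-passage time.
rupture_times = [mean_rupture_time(*rates_under_force(10, f)) for f in (0.0, 2.0, 4.0)]
dissociation_rates = [1.0 / t for t in rupture_times]
```

For a two-state check, opening rates (2, 4) with closing rate 3 give 1/2 + 1/4 + 3/8 = 1.125 time units, which agrees with solving the two backward equations by hand.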

2026-06-09T14:58:44Z Supporting Information is provided at the end of the main text Ayman Hussein Ralf Bundschuh http://arxiv.org/abs/2606.10873v1 Spatial Model Selection and Uncertainty Quantification: Comparing Continuous and Discrete Wound Healing Models 2026-06-09T13:50:23Z

All data-driven modeling tasks (e.g., parameter estimation, uncertainty quantification, and data forecasting) require the selection of a mathematical model. An overlooked aspect of model selection is modality; for example, there are no guidelines on when to use a partial differential equation (PDE) model or an agent-based model (ABM) for spatial processes. To address this, we created a model selection pipeline that uses approximate Bayesian computation to perform parameter estimation, uncertainty quantification, and model selection (using both information criteria and out-of-sample forecasting). Applying the pipeline to artificial datasets (generated from ABMs) reveals that while both modalities yield comparable parameter estimation performance, the ABM estimates exhibit higher uncertainty, and the PDE models compute more than 1,000$\times$ faster. Surprisingly, the mean-field PDE is often selected over the true generative ABM using both information criteria and data forecasting. Applying the pipeline to public wound healing data indicates that a PDE model with cell pulling and a time delay is the most appropriate model for these data; however, this model has high levels of parametric uncertainty. This methodology establishes a preliminary framework for selecting the appropriate modeling modality for spatial biological data.
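
Approximate Bayesian computation in its simplest rejection form can be sketched as below. The simulator is a deliberately trivial stand-in (noisy observations around one parameter), not a PDE or agent-based wound healing model, and the uniform prior, the sample-mean summary statistic, and the tolerance are all assumptions.

```python
import random


def simulate(theta, n, rng):
    """Toy stochastic model: n noisy observations centred on the parameter theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(data):
    """Summary statistic used to compare simulated and observed data."""
    return sum(data) / len(data)


def abc_rejection(observed, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lands within `tolerance` of the data."""
    target = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, 10.0)  # draw from a uniform prior
        simulated = simulate(theta, len(observed), rng)
        if abs(summary(simulated) - target) < tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(1)
observed = simulate(3.0, 50, rng)  # synthetic data with a known parameter
posterior = abc_rejection(observed, n_draws=5000, tolerance=0.2, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
```

The spread of the accepted draws serves as the uncertainty estimate, and running the same loop with a second simulator lets acceptance rates or forecast errors be compared across model modalities.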

2026-06-09T13:50:23Z John T. Nardini Jana L. Gevertz http://arxiv.org/abs/2606.02386v2 AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design 2026-06-09T09:11:47Z

Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback or redirect generation when a candidate violates thermodynamic or structural constraints. We introduce AgentPLM, which addresses this by equipping a pre-trained PLM with i) Reasoning-Augmented Decoding (RAD), which interleaves autoregressive generation with tool calls (ESMFold, FoldX, AutoDock Vina), and ii) Contrastive Agent Policy Optimisation (CAPO), a trajectory-level extension of direct preference optimisation that trains the policy end-to-end to learn when oracle feedback is informative rather than merely imitating high-fitness sequences. We evaluate AgentPLM on benchmark tasks spanning de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero-shot fitness prediction with standardised oracle APIs and controlled sequence-identity splits. AgentPLM achieves state-of-the-art results with a gain in antibody top-10% hit rate over the strongest passive baseline, providing mechanistic evidence of online error correction without explicit backtracking.

2026-06-01T15:35:02Z Workshop on Generative and Agentic AI for Biology, 43rd International Conference on Machine Learning (ICML 2026) Sahil Rahman Maxx Richard Rahman http://arxiv.org/abs/2503.19158v3 Integrating Biological-Informed Recurrent Neural Networks for Glucose-Insulin Dynamics Modeling 2026-06-09T08:24:50Z

Type 1 Diabetes (T1D) management is a complex task due to many sources of variability. Artificial Pancreas (AP) systems have alleviated patient burden by automating insulin delivery through advanced control algorithms. However, the effectiveness of these systems depends on accurate modeling of glucose-insulin dynamics, which traditional mathematical models often fail to capture due to their inability to adapt to patient-specific variations. This study introduces a Biological-Informed Recurrent Neural Network (BIRNN) framework to address these limitations. The BIRNN leverages a Gated Recurrent Unit (GRU) architecture augmented with physics-informed loss functions that embed physiological constraints, ensuring a balance between predictive accuracy and consistency with biological principles. The framework is validated using the commercial UVA/Padova simulator, outperforming traditional linear models in glucose prediction accuracy and reconstruction of unmeasured states, even under circadian variations in insulin sensitivity. The results demonstrate the potential of BIRNN for personalized glucose regulation and future adaptive control strategies in AP systems.

2025-03-24T21:26:12Z Accepted for publication in the proceedings of the Engineering Diabetes Technologies (EDT 2025). 7 pages, 2 figures and 1 table IFAC-PapersOnLine, 59(2), 2025, pp. 91-96 Stefano De Carli Nicola Licini Davide Previtali Fabio Previdi Antonio Ferramosca 10.1016/j.ifacol.2025.06.016 http://arxiv.org/abs/2606.10543v1 Flexible Flows for Biological Sequence Design 2026-06-09T08:11:14Z

Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure. Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.

2026-06-09T08:11:14Z Yogesh Verma Dani Korpela Harri Lähdesmäki Vikas Garg http://arxiv.org/abs/2606.10410v1 A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection 2026-06-09T04:35:16Z

Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and ${\sim}$9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.
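
The core of inference-time augmentation is to score several perturbed views of one input and aggregate the results. The sketch below assumes mean aggregation, a toy stand-in classifier, and three simple augmentations with fixed parameters; the 13 methods, the Bayesian hyperparameter optimization, and the selective variant described above are not reproduced.

```python
import math
import random


def toy_classifier(signal):
    """Stand-in for a trained model: maps sample-to-sample irregularity to a probability."""
    diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    irregularity = sum(diffs) / len(diffs)
    return 1.0 / (1.0 + math.exp(-4.0 * (irregularity - 0.5)))


def amplitude_scale(factor):
    """Amplitude-domain augmentation: multiply every sample by a constant."""
    return lambda signal: [factor * v for v in signal]


def circular_shift(k):
    """Time-domain augmentation: rotate the window by k samples."""
    return lambda signal: signal[k:] + signal[:k]


def additive_noise(sigma, seed):
    """Artifact-style augmentation: add seeded Gaussian noise to every sample."""
    def apply(signal):
        rng = random.Random(seed)
        return [v + rng.gauss(0.0, sigma) for v in signal]
    return apply


def predict_with_ita(model, signal, augmentations):
    """Average the model's probability over the original signal and its augmented views."""
    views = [signal] + [augment(signal) for augment in augmentations]
    return sum(model(view) for view in views) / len(views)


signal = [math.sin(0.3 * t) for t in range(200)]  # synthetic stand-in for a PPG window
augmentations = [
    amplitude_scale(0.9),
    amplitude_scale(1.1),
    circular_shift(10),
    additive_noise(0.05, seed=0),
]
plain = toy_classifier(signal)
averaged = predict_with_ita(toy_classifier, signal, augmentations)
```

Because the model is only called, never retrained, the same wrapper applies to any classifier that maps a signal to a probability.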

2026-06-09T04:35:16Z 22 pages, 11 figures, 4 tables. Under review at Physiological Measurement Davood Fattahi Runze Yan Saurabh Kataria Zhaoliang Chen Xiao Hu http://arxiv.org/abs/2606.10407v1 Time-frequency localization of bird calls in dense soundscapes 2026-06-09T04:31:30Z

Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
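
The difference between IoU and an intersection-over-minimum score can be seen on a single pair of time-frequency boxes. The abstract does not spell out the formula, so the definition used here (intersection area divided by the smaller box's area) is an assumption, as are the box layout `(t_start, f_low, t_end, f_high)` and the example values.

```python
def area(box):
    """Area of a (t_start, f_low, t_end, f_high) box; zero if it is degenerate."""
    t0, f0, t1, f1 = box
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)


def intersection(a, b):
    """Overlap area of two boxes, zero when they are disjoint."""
    t0, f0 = max(a[0], b[0]), max(a[1], b[1])
    t1, f1 = min(a[2], b[2]), min(a[3], b[3])
    return area((t0, f0, t1, f1))


def iou(a, b):
    """Standard intersection over union."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def iomin(a, b):
    """Assumed definition: intersection over the area of the smaller box."""
    smaller = min(area(a), area(b))
    return intersection(a, b) / smaller if smaller > 0 else 0.0


loose_label = (0.0, 1000.0, 4.0, 5000.0)       # seconds and Hz; generously drawn annotation
tight_prediction = (1.0, 2000.0, 3.0, 4000.0)  # detection nested inside the label

iou_score = iou(loose_label, tight_prediction)
iomin_score = iomin(loose_label, tight_prediction)
```

A tight detection nested inside a loosely drawn annotation scores 0.25 under IoU but 1.0 under this IoMin, which is one way a metric can tolerate ambiguous call boundaries.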

2026-06-09T04:31:30Z Simen Hexeberg Fanghui Tong Hari Vishnu Mandar Chitre