http://arxiv.org/abs/2606.11651v1
DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics
2026-06-10T04:28:51Z
Synthetic random heteropolymers (RHPs), consisting of a predefined set of monomers, offer an approach toward the design of protein-like materials. These RHPs, if designed appropriately, can mimic protein behavior and function. As such, there is a need for computational tools to efficiently guide RHP design. We bridge this gap by developing DeepRHP, a modified variational autoencoder (VAE) model under a semi-supervised framework. By equipping a classical VAE with an additional feature-based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns. In this sense, our method is versatile by allowing any relevant features to be incorporated in a hybrid manner. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins (e.g. Aquaporin Z) in non-native environments and cross-validating our prediction with published results. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds.
2026-06-10T04:28:51Z
Oral presentation at AAAI 2023 Workshop on AI to Accelerate Science and Engineering
Shuni Li, Zhiyuan Ruan, Andy Shen, Ivan Jayapurna, Ting Xu, Haiyan Huang

http://arxiv.org/abs/2606.11646v1
Tree-Structured Orthonormal Decomposition of the Aitchison Simplex
2026-06-10T04:21:29Z
Compositional data -- vectors encoding relative proportions -- arise across scientific domains, including ecology, geochemistry, and genomics.
The features in these data often come with known hierarchical structure (e.g., taxonomies, phylogenies, ontologies), yet existing methods either ignore this structure, discard the intrinsic Aitchison geometry, are designed for binary trees, or yield incomplete coordinate systems. We describe PolyILR, a canonical orthonormal decomposition of the Aitchison tangent space aligned with any tree topology. Our construction defines a weighted local geometry at each internal node capturing full branching structure, then lifts these to a global orthonormal basis where every coordinate corresponds to a specific tree location. On microbiome and single-cell benchmarks, PolyILR yields stable, interpretable features and enables inference at multiscale tree resolution. We also establish a novel theoretical connection to softmax classifiers, suggesting possible applications to probabilistic modeling.
2026-06-10T04:21:29Z
Accepted at ICML 2026. To appear in PMLR vol. 306
Daisuke Yamada, Qijun Zhang, Travis Pence, Barbara B. Bendlin, Federico Rey, Vikas Singh

http://arxiv.org/abs/2605.00545v2
Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots
2026-06-10T02:19:11Z
Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that integrates both stochastic and unbalanced effects and also models the discrete, jump-like birth-death dynamics at single-cell resolution.
Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.
2026-05-01T09:57:13Z
Junda Ying, Yuxuan Wang, Bowen Yang, Peijie Zhou, Lei Zhang

http://arxiv.org/abs/2606.11510v1
Continuous biome representations from Earth observation embeddings
2026-06-09T23:14:00Z
Biotic communities vary continuously across space, yet biome maps impose categorical boundaries that compress this variation, particularly at ecotones where transitional communities are ecologically distinct. Could Earth observation (EO) foundation models, which encode spectral, spatial, and temporal information with dense embeddings, convert discrete biome maps into continuous representations that better capture ecological variation? Here, we fit a linear classifier on Clay v1.5 satellite image embeddings to predict biome labels from a categorical map. The softmax output yields a continuous probability vector whose dimensions correspond to named biome classes. We evaluate this approach using six Brazilian biomes, 1.3 million embeddings, and 10,015 withheld forest inventory plots spanning 4,672 plant species. The continuous biome representation outperforms discrete biome labels for predicting species occurrence (mean per-species AUC 0.618 vs. 0.570 across 10 spatial cross-validation folds).
Decomposing this gain shows that continuity in the graded probability output, rather than label reassignment, accounts for the improvement; the pattern holds across all distances from biome boundaries. The raw 1024-dimensional embedding remains the strongest predictor we tested (mean AUC 0.646 vs. 0.618), but the continuous representation recovers most of the embedding's gain over discrete labels. This simple approach provides a probabilistic replacement for categorical map labels, preserving their meaning while encoding graded variation that discrete maps suppress.
2026-06-09T23:14:00Z
8 pages, 4 figures
Maxwell B. Joseph (Planet Labs PBC), Flávia De Souza Mendes (Planet Labs PBC), Dieu My T. Nguyen (Planet Labs PBC), Camile Sothe (Planet Labs PBC), Christopher B. Anderson (Planet Labs PBC)

http://arxiv.org/abs/2606.11508v1
Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction
2026-06-09T23:03:05Z
Accurate prediction of absorption, distribution, metabolism, and excretion (ADME) properties is critical to drug discovery, but remains challenging because ADME endpoints are noisy, interdependent, and often data-limited. We propose a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information machine learning (cMIM). Our method encodes molecular graphs into latent variables, reconstructs SMILES strings from the graph-derived latent codes, and augments the contrastive objective with domain-specific self-supervised chemistry tasks. Rather than treating these tasks as auxiliary regularizers with separately tuned loss weights, we formulate reconstruction, contrastive discrimination, and chemistry-specific supervision as unit-weighted log-probability factors in a single probabilistic latent-variable objective.
For fine-tuning, we propose a multi-task GNN readout architecture with task-specific multilayer perceptron heads, preserving shared representation learning while mitigating negative transfer and improving the modeling of heterogeneous, nonlinear task relationships. Across Biogen, ExpansionRX, and ChEMBL-MT, the resulting Contrastive KERMT pretraining improves over the KERMT baseline by 7.6%, 9.9%, and 9.5% respectively (averaged over significantly improved endpoints). Adding ADME-adjacent molecules to the pretraining corpus further improves transfer, and the contrastive component sharpens chemically meaningful latent neighborhoods.
2026-06-09T23:03:05Z
Yifan Xue, Srimukh Prasad Veccham, Saee Paliwal, Tyler Shimko, Micha Livne

http://arxiv.org/abs/2606.11426v1
Sharpness characterizes Hill functions
2026-06-09T20:23:20Z
While long treated as empirical fits, Hill functions have been postulated to be the universal Hopfield barrier for sharpness of input-output responses by Martinez-Corral, Nam, DePace, and Gunawardena. A Hopfield barrier is a fundamental limit on how well biological systems can process information without expending energy. Their case rested on numerical findings for Hill coefficients $4$ and $6$. We give a precise formulation and proof of this: measuring sharpness by the supremum of the derivative in semi-log scale, any rational function $r(x)=(\alpha_0+\alpha_1 x+\cdots+\alpha_n x^n)/(\beta_0+\beta_1 x+\cdots+\beta_n x^n)$ with real coefficients $0\leq \alpha_i\leq \beta_i$ has sharpness at most $n/4$, with equality if and only if $r$ is a Hill function with Hill coefficient $n$.
2026-06-09T20:23:20Z
10 pages, 2 figures
Marc Stephan

http://arxiv.org/abs/2606.11415v1
Spatially Masked Regression Reveals Local and Distributed Predictability in Electrophysiological Recordings
2026-06-09T20:05:44Z
Neural recordings are often interpreted as local measurements, yet the signal at any one sensor can also reflect structured activity distributed across the broader network.
This raises a basic question: to what extent does an electrode's signal reflect local versus distributed information in the underlying system? More specifically, how much of an electrode's activity is carried by its immediate neighborhood, and how much is embedded more broadly across the array? We address this with a Spatially Masked Regression (SMR) framework that reconstructs each electrode's timeseries from the remaining electrodes while excluding a configurable neighborhood around the target. By progressively increasing this mask, spatial locality becomes an experimental control for quantifying how much predictive information survives after nearby channels are withheld. We apply SMR to intracranial EEG with heterogeneous electrode coverage and to scalp EEG with standardized montages over sensorimotor cortex. Using distance correlation between original and reconstructed signals, we find strong within-subject reconstruction in both modalities, substantial residual predictability even when local neighbors are excluded, and markedly stronger cross-subject transfer in EEG than in iEEG. Masking shows that nearby electrodes contribute strongly to reconstruction but do not account for all of it, indicating that individual channels reflect both local redundancy and broader distributed structure. Surrogates that preserve selected marginal or spectral properties while disrupting phase structure or temporal ordering substantially reduce performance, supporting the conclusion that SMR depends on structured temporal and cross-channel organization rather than on marginal statistics alone. 
These results position SMR as an interpretable framework for quantifying the balance between local and distributed information in recordings.
2026-06-09T20:05:44Z
Maryam Ostadsharif Memar, Nima Dehghani

http://arxiv.org/abs/2606.11144v1
OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
2026-06-09T17:33:24Z
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide.
OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
2026-06-09T17:33:24Z
24 pages, 7 figures, 4 tables. Code, data, and trained model weights: https://github.com/span-ai-labs/oncotraj. Python package: pip install oncotraj. Dataset: https://huggingface.co/datasets/span-ai-labs/oncotraj-v1
Abhijoy Sarkar, Aarchi Singh Thakur

http://arxiv.org/abs/2606.10955v1
A kinetic model of shear-induced rupture of short dsDNA
2026-06-09T14:58:44Z
Force-induced dissociation of short double-stranded DNA (dsDNA) is central to single-molecule biophysics and DNA nanotechnology, yet a physically grounded kinetic description of shear-induced rupture for finite-length constructs remains lacking. Here we develop a master equation framework built on a force-dependent nucleation-zipper pathway with single-base transitions, enabling direct calculation of dissociation rates and transition state distances over a broad force range. Applied to a DNA-gold nanoparticle-DNA construct under constant shear force, the model accurately reproduces the experimental room-temperature data in the covered force regime and provides a unified interpretation of prior measurements on similarly sheared duplexes across all force regimes. A central result is that the three-dimensional helical geometry of dsDNA is essential for correctly defining the end-to-end distance under shear in the rod-like polymer model of short dsDNA. We further show that the extracted transition state distances are robust to variations in ssDNA polymer parameters within the experimentally relevant regime. Finally, we analyze the temperature dependence of the transition state distance and discuss how our framework captures globally heated rupture while identifying the additional complications introduced by localized plasmonic heating in gold nanoparticle-coupled constructs.
These results provide a predictive kinetic foundation for interpreting force-rupture experiments and for designing force- and temperature-actuated DNA nanostructures.
2026-06-09T14:58:44Z
Supporting Information is provided at the end of the main text
Ayman Hussein, Ralf Bundschuh

http://arxiv.org/abs/2606.10873v1
Spatial Model Selection and Uncertainty Quantification: Comparing Continuous and Discrete Wound Healing Models
2026-06-09T13:50:23Z
All data-driven modeling tasks (e.g., parameter estimation, uncertainty quantification, and data forecasting) require the selection of a mathematical model. An overlooked aspect of model selection is modality; for example, there are no guidelines on when to use a partial differential equation (PDE) model or an agent-based model (ABM) for spatial processes. To address this, we created a model selection pipeline that uses approximate Bayesian computations to perform parameter estimation, uncertainty quantification, and model selection (using both information criteria and out-of-sample forecasting). Applying the pipeline to artificial datasets (generated from ABMs) reveals that while both modalities yield comparable parameter estimation performance, the ABM estimates exhibit higher uncertainty, and the PDE models compute more than 1,000$\times$ faster. Surprisingly, the mean-field PDE is often selected over the true generative ABM model using both information criteria and data forecasting. Applying the pipeline to public wound healing data indicates that a PDE model with cell pulling and a time delay is the most appropriate model for these data; however, this model has high levels of parametric uncertainty. This methodology establishes a preliminary framework for selecting the appropriate modeling modality for spatial biological data.
2026-06-09T13:50:23Z
John T. Nardini, Jana L.
Gevertz

http://arxiv.org/abs/2606.02386v2
AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design
2026-06-09T09:11:47Z
Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback or redirect generation when a candidate violates thermodynamic or structural constraints. We introduce AgentPLM, which addresses this by equipping a pre-trained PLM with i) Reasoning-Augmented Decoding (RAD), which interleaves autoregressive generation with tool calls (ESMFold, FoldX, AutoDock Vina), and ii) Contrastive Agent Policy Optimisation (CAPO), a trajectory-level extension of direct preference optimisation that trains the policy end-to-end to learn when oracle feedback is informative rather than merely imitating high-fitness sequences. We evaluate AgentPLM on benchmark tasks spanning de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero-shot fitness prediction with standardised oracle APIs and controlled sequence-identity splits. AgentPLM achieves state-of-the-art results with a gain in antibody top-10% hit rate over the strongest passive baseline, providing mechanistic evidence of online error correction without explicit backtracking.
2026-06-01T15:35:02Z
Workshop on Generative and Agentic AI for Biology, 43rd International Conference on Machine Learning (ICML 2026)
Sahil Rahman, Maxx Richard Rahman

http://arxiv.org/abs/2503.19158v3
Integrating Biological-Informed Recurrent Neural Networks for Glucose-Insulin Dynamics Modeling
2026-06-09T08:24:50Z
Type 1 Diabetes (T1D) management is a complex task due to many variability factors. Artificial Pancreas (AP) systems have alleviated patient burden by automating insulin delivery through advanced control algorithms.
However, the effectiveness of these systems depends on accurate modeling of glucose-insulin dynamics, which traditional mathematical models often fail to capture due to their inability to adapt to patient-specific variations. This study introduces a Biological-Informed Recurrent Neural Network (BIRNN) framework to address these limitations. The BIRNN leverages a Gated Recurrent Unit (GRU) architecture augmented with physics-informed loss functions that embed physiological constraints, ensuring a balance between predictive accuracy and consistency with biological principles. The framework is validated using the commercial UVA/Padova simulator, outperforming traditional linear models in glucose prediction accuracy and reconstruction of unmeasured states, even under circadian variations in insulin sensitivity. The results demonstrate the potential of BIRNN for personalized glucose regulation and future adaptive control strategies in AP systems.
2025-03-24T21:26:12Z
Accepted for publication in the proceedings of the Engineering Diabetes Technologies (EDT 2025). 7 pages, 2 figures and 1 table
IFAC-PapersOnLine, 59(2), 2025, pp. 91-96
Stefano De Carli, Nicola Licini, Davide Previtali, Fabio Previdi, Antonio Ferramosca
10.1016/j.ifacol.2025.06.016

http://arxiv.org/abs/2606.10543v1
Flexible Flows for Biological Sequence Design
2026-06-09T08:11:14Z
Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure.
Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.
2026-06-09T08:11:14Z
Yogesh Verma, Dani Korpela, Harri Lähdesmäki, Vikas Garg

http://arxiv.org/abs/2606.10410v1
A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection
2026-06-09T04:35:16Z
Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap.
Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and ${\sim}$9,800 hours of recording.
Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets.
Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.
2026-06-09T04:35:16Z
22 pages, 11 figures, 4 tables. Under review at Physiological Measurement
Davood Fattahi, Runze Yan, Saurabh Kataria, Zhaoliang Chen, Xiao Hu

http://arxiv.org/abs/2606.10407v1
Time-frequency localization of bird calls in dense soundscapes
2026-06-09T04:31:30Z
Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
2026-06-09T04:31:30Z
Simen Hexeberg, Fanghui Tong, Hari Vishnu, Mandar Chitre
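
The last abstract above contrasts Intersection over Minimum (IoMin) with standard IoU but does not define IoMin. The sketch below assumes the natural reading: the intersection area divided by the area of the smaller of the two boxes. The `Box` class, the function names, and the example coordinates are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned spectrogram box (hypothetical layout): seconds and Hz."""
    t_min: float
    t_max: float
    f_min: float
    f_max: float

    def area(self) -> float:
        # Degenerate or inverted boxes count as zero area.
        return max(0.0, self.t_max - self.t_min) * max(0.0, self.f_max - self.f_min)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes; 0.0 if they are disjoint."""
    dt = min(a.t_max, b.t_max) - max(a.t_min, b.t_min)
    df = min(a.f_max, b.f_max) - max(a.f_min, b.f_min)
    return max(0.0, dt) * max(0.0, df)


def iou(a: Box, b: Box) -> float:
    """Standard Intersection over Union."""
    inter = intersection_area(a, b)
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def iomin(a: Box, b: Box) -> float:
    """Assumed IoMin definition: intersection over the smaller box's area."""
    smaller = min(a.area(), b.area())
    return intersection_area(a, b) / smaller if smaller > 0 else 0.0


# A generous annotation and a tight prediction nested inside it.
annotation = Box(t_min=0.0, t_max=2.0, f_min=1000.0, f_max=3000.0)
prediction = Box(t_min=0.5, t_max=1.5, f_min=1500.0, f_max=2500.0)

print(iou(annotation, prediction))    # 0.25
print(iomin(annotation, prediction))  # 1.0
```

Under this assumed definition, a tight prediction nested inside a loosely drawn annotation fails an IoU threshold of 0.5 yet scores a perfect IoMin, which is one way a metric could tolerate ambiguous call boundaries. Reading "IoMin@50" as a 0.5 matching threshold is likewise an assumption.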