Multimodality Stacking with Blockwise missing values and application to the PIONeeR biomarkers study for prediction of resistance to immunotherapy

2026-05-24T12:48:38Z

Integrating multimodal datasets in clinical oncology is frequently hindered by high dimensionality and blockwise missingness, where entire data sources are unavailable for specific patient subsets. Standard survival models often struggle with these gaps, leading to biased results or patient exclusion. We introduce Multimodality Stacking with Blockwise missing values (MSB), a late-fusion framework for survival analysis that independently models modality-specific features before aggregating predictions via a cross-validated stacking meta-learner. MSB was validated on the PIONeeR study (n=443 patients, 378 biomarkers across eight heterogeneous sources) to predict progression-free survival in advanced non-small cell lung cancer patients receiving immunotherapy. MSB yielded higher predictive performance (C-index) than baseline algorithms. Improvements varied by baseline strength: linear models showed a 15.9% increase (p<0.001 for the Wilcoxon signed-rank test), random survival forests gained 5.4% (p=0.002), and gradient boosting methods improved by 2.1% (p=0.030). Beyond discrimination, MSB reduced the generalization gap (train-test difference under 5-fold cross-validation repeated 3 times: 0.055 vs 0.380 for linear models). Permutation importance analysis identified routine laboratory markers, clinical features, and PD-L1 expression as primary predictive drivers. Missing block indicators showed negligible importance, suggesting the model learned from biomarker values rather than data availability patterns. MSB provides a statistically validated framework for multimodal survival prediction with blockwise missingness. By enabling systematic biomarker evaluation without requiring complete data, MSB offers a practical tool for predictive modeling in biomedical research, pending external validation. Implementation is available at https://github.com/MohamedBoussena/MSB under Inria license.
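
The C-index used as the performance metric here can be illustrated with a minimal sketch of Harrell's concordance index. This is not taken from the MSB repository: the function name `harrell_c_index` and the toy data are assumptions for illustration, and pairs with tied event times are simply skipped, which is only one of several conventions.

```python
def harrell_c_index(times, events, risks):
    """Fraction of comparable patient pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] is truthy) strictly before time j. It is concordant when
    the model gave i the higher risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one censored.
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 0, 1]
perfect = harrell_c_index(toy_times, toy_events, [4, 3, 2, 1])
reversed_ranking = harrell_c_index(toy_times, toy_events, [1, 2, 3, 4])
uninformative = harrell_c_index(toy_times, toy_events, [1, 1, 1, 1])
```

A value of 0.5 corresponds to an uninformative ranking, so relative gains such as those quoted above are measured against baselines that already sit somewhere above that level.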

Explainable Multi-Task Retinal Imaging Reveals Microvascular Signals for Systemic Risk Stratification in Type 2 Diabetes: A Pilot Study

2026-05-24T07:32:58Z

Retinal imaging provides a non-invasive window into systemic microvascular health and has emerged as a potential biomarker for systemic diseases. However, whether retinal features encode biologically meaningful systemic signals that can be reliably interpreted using explainable artificial intelligence (XAI) remains unclear. An explainable multi-task deep learning framework was developed to investigate associations between retinal microvascular features and systemic abnormalities in Type 2 Diabetes Mellitus. A total of 11,011 fundus images from 2,719 individuals were analysed using a shared neural network with task-specific heads for glycaemic status, kidney abnormality, and multi-system involvement. Model interpretability was evaluated using Gradient-weighted Class Activation Mapping (Grad-CAM), anatomical masking, and vessel alignment analysis. The framework demonstrated task-dependent predictive performance, with the best discrimination observed for kidney abnormality (AUC up to 0.63), whereas glycaemic status prediction showed limited performance (AUC = 0.49-0.61). Explainability analyses consistently localized model attention to retinal vessels and peripapillary regions. Masking experiments showed that occlusion of vascular regions caused the greatest performance decline, indicating that retinal vessels were the primary predictive source. Different architectures exhibited heterogeneous attention patterns, suggesting multiple representational pathways for systemic signal encoding. This pilot study demonstrates that retinal microvascular features contain measurable signals associated with systemic abnormalities, particularly microvascular damage. By integrating multi-task learning with quantitative XAI validation, this framework advances retinal imaging toward interpretable digital biomarkers for systemic risk stratification in diabetes.

Querying structural and functional niches on spatial transcriptomics data

2026-05-24T01:58:39Z

Cells in multicellular organisms coordinate to form structural and functional niches. With spatial transcriptomics (ST) enabling gene expression profiling in spatial contexts, it has been revealed that spatial niches serve as cohesive and recurrent units in physiological and pathological processes. These observations suggest universal tissue organization principles encoded by conserved niche patterns, and call for a query-based niche analytical paradigm beyond current computational tools. In this work, we defined the niche-query task, which is to identify similar niches across ST samples given a niche of interest (NOI). We further developed QueST, a specialized method for solving this task. QueST models each niche as a subgraph, uses contrastive learning to learn discriminative niche embeddings, and incorporates adversarial training to mitigate batch effects. In simulations and benchmark datasets, QueST outperformed existing methods repurposed for niche querying, accurately capturing niche structures in heterogeneous environments and demonstrating strong generalizability across diverse sequencing platforms. Applied to tertiary lymphoid structures in renal and lung cancers, QueST revealed functionally distinct niches associated with patient prognosis and uncovered conserved and divergent spatial architectures across cancer types. Applied to a combinatorial spatial perturbation dataset, QueST demonstrated a complete de novo discovery-oriented workflow, characterizing previously unresolved tumor nodules through querying. These results demonstrate that QueST enables systematic, quantitative profiling of spatial niches across samples, providing a powerful tool to dissect spatial tissue architecture in health and disease.

GEESE: Genotype-aware End-to-End Spatio-temporal Embedding for Behavioral Phenotyping

2026-05-23T03:16:41Z

Behavioral phenotyping of genetic animal models currently requires labor-intensive manual feature engineering that limits reproducibility and scalability. We present GEESE, an end-to-end deep learning framework that learns behavioral representations directly from 3D pose dynamics without hand-crafted features. Using a pretrained time series foundation model, we encode movement sequences into a behavioral manifold that supports both behavior classification and genotype prediction. Evaluated across three autism-associated genetic models (CNTNAP2, CHD8, FMR1), our deep learning approach surpasses hand-crafted feature baselines in both tasks, revealing that learned representations capture genotype-specific behavioral signatures. The framework generalizes across genetic backgrounds, and an all-cohort model identifies both genetic background and genotype from movement patterns alone. We further provide HONK, an interactive intelligent tool enabling researchers without programming expertise to perform behavioral phenotyping from pose data through natural language interaction.

7 Tesla Quantitative MRI and Machine Learning for Exploratory Motor Subtype Stratification and Diagnosis in Parkinson's Disease

2026-05-22T20:03:20Z

Parkinson's disease (PD) is highly heterogeneous, including in which motor symptoms dominate. Imaging biomarkers that support subtype stratification could also improve biological understanding and study design, and enable personalized treatment strategies. This study evaluates whether deep-learning-based automatic brain segmentation, in addition to quantitative maps from 7 Tesla MRI, can highlight differences between Healthy Controls (HC), Postural Instability and Gait Difficulty (PIGD) and Tremor Dominant (TD), and subsequently be used for objective PD stratification. The performance of machine learning classifiers may be improved with feature selection. Twenty-one HC and 24 people with PD (PwP) were included. U-Net training was assessed with the Dice similarity coefficient (DSC). Two classification approaches using 5-fold cross-validation were defined across three tasks: (1) HC vs PwP; (2) PIGD vs TD; (3) multiclass, HC vs PIGD vs TD. Approach A used all extracted features. Approach B found the optimal subset of features for the classification tasks. The U-Net achieved a mean DSC of 0.86 for all ROIs during training. Approach A: Task 1 best accuracy of 0.69 and best AUC of 0.73. Task 2 accuracy 0.69, AUC 0.90. Task 3 accuracy 0.62, AUC 0.66. Approach B: Task 1 accuracy of 0.82 and AUC of 0.93. Task 2 accuracy 1.00, AUC 1.00. Task 3 accuracy 0.73, AUC 0.91. DL-based segmentation combined with qMRI feature selection improved classification relative to using all features, supporting the potential of interpretable, low-dimensional imaging signatures for PD diagnosis support and phenotype stratification. Larger, multi-site studies are warranted to assess generalizability and stability.
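
The DSC used to assess the segmentation network is the standard overlap score 2|A ∩ B| / (|A| + |B|). A minimal sketch follows, assuming binary masks flattened to lists; the function name and the convention of returning 1.0 for two empty masks are illustrative choices, not details from the study.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * overlap / total


# Toy masks (hypothetical): one voxel shared, two foreground voxels each.
dsc_partial = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
dsc_identical = dice_coefficient([0, 1, 1, 0], [0, 1, 1, 0])
```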

Learning dynamical systems with biochemically informed neural ordinary differential equations

2026-05-22T19:43:41Z

Ordinary differential equation models of biochemical reactions are often formulated as stoichiometric systems in which the dynamics arise from a collection of interacting processes. A central challenge is that the functional form of each process is rarely known a priori and may be difficult to infer from data. We propose biochemically informed neural ordinary differential equations (BINODEs), a neural-ODE framework that retains the stoichiometric structure of mechanistic models while representing individual processes by neural networks. In BINODEs, the outputs of neural network processes (NNPs) are mapped to state derivatives through a linear layer analogous to a stoichiometric matrix. This architecture allows biological side information, such as process-specific inputs, sign constraints, and monotonicity assumptions, to be built directly into the model. We characterize the approximation properties of NNPs for several standard biochemical rate laws and show that the proposed framework recovers both trajectories and process-level structure in Monod, Lotka--Volterra, pharmacokinetic, and ultradian endocrine models. These results suggest that BINODEs offer a useful compromise between mechanistic interpretability and data-driven flexibility for modeling partially known biochemical or biological dynamical systems.
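
The stoichiometric structure described here, dx/dt = S v(x), can be sketched in a few lines. In the sketch below the process functions are plain closed-form stand-ins for the neural network processes, and the matrix `S_LV`, the rate constants, and the function names are all assumptions chosen to reproduce a Lotka-Volterra system; nothing here is taken from the authors' implementation.

```python
def binode_rhs(state, stoich, processes):
    """Right-hand side dx/dt = S @ v(x).

    `processes` is a list of callables, one per process; in a BINODE each
    would be a small neural network, here they are hand-written rate laws.
    `stoich` has one row per state variable and one column per process.
    """
    rates = [process(state) for process in processes]
    return [sum(s * v for s, v in zip(row, rates)) for row in stoich]


def euler_step(state, stoich, processes, dt):
    """One explicit Euler step of the ODE (for illustration only)."""
    derivs = binode_rhs(state, stoich, processes)
    return [x + dt * dx for x, dx in zip(state, derivs)]


# Lotka-Volterra as three processes acting on (prey, predator).
lv_processes = [
    lambda s: 1.0 * s[0],         # prey growth
    lambda s: 0.5 * s[0] * s[1],  # predation
    lambda s: 1.0 * s[1],         # predator death
]
S_LV = [
    [1, -1, 0],   # prey: gained by growth, lost to predation
    [0, 1, -1],   # predator: gained by predation, lost to death
]

derivs_at_start = binode_rhs([2.0, 1.0], S_LV, lv_processes)
next_state = euler_step([2.0, 1.0], S_LV, lv_processes, dt=0.1)
```

Because each process only sees the state and contributes through one column of the matrix, side information such as a restricted input set or a non-negativity constraint can be imposed per process without touching the rest of the model.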

Using timescale as a state coordinate reveals the metastable geometry of behavior

2026-05-22T18:52:57Z

Animal behavior unfolds across many timescales, from fast movement patterns to slow changes in internal states such as hunger, arousal, and circadian phase. These slow variables are rarely measured directly and must instead be inferred from their effects on the faster movements that can be observed. Here we propose treating timescale itself as an explicit coordinate of the state representation, constructing a time-frequency state space where fast movements and slow modulations appear simultaneously. We find that slow modes emerge as linear arms radiating from a stationary-weighted hub in the leading non-trivial eigenvectors of the transfer operator, with one arm per metastable basin across three systems of increasing complexity. In a synthetic system, the framework recovers a hidden bistable driver across nearly three decades of dwell time, while a fixed-timescale analysis of the same trajectory finds no separable slow modes. In nematode locomotion, it reproduces the canonical run-pirouette organization. In freely moving fruit flies, where fast leg kinematics are orders of magnitude faster than the behavioral states they compose, the multi-timescale operator identifies four metastable behavioral basins directly from the postural time series, without first decomposing into a sequence of stereotyped actions. We further find that these basins exhibit a broad, heavy-tailed distribution of residence times. Treating timescale as a state coordinate thus exposes a predictable geometric form for the slow organization of behavior, providing a general route for extracting collective modes from partially observed biological time series without first organizing the dynamics into discrete events.
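
The link between a non-trivial transfer-operator eigenvalue and a slow timescale can be seen in the smallest possible case, a two-state Markov chain. This is a textbook toy rather than the multi-timescale operator of the paper; the function name, the per-step leaving probabilities, and the returned dictionary keys are assumptions for illustration.

```python
import math


def two_state_slow_mode(p_leave_a, p_leave_b, lag=1.0):
    """Slow mode of the chain P = [[1-a, a], [b, 1-b]] sampled every `lag`.

    The eigenvalues of P are 1 (stationary) and 1 - a - b (the single
    non-trivial mode). The implied relaxation timescale is -lag / ln(lambda2).
    """
    lambda2 = 1.0 - p_leave_a - p_leave_b
    if not 0.0 < lambda2 < 1.0:
        raise ValueError("sketch assumes a slowly mixing chain: 0 < 1-a-b < 1")
    total = p_leave_a + p_leave_b
    return {
        "lambda2": lambda2,
        "timescale": -lag / math.log(lambda2),
        # Stationary weights of basins A and B.
        "stationary": (p_leave_b / total, p_leave_a / total),
        # Mean residence time in each basin (geometric dwell times).
        "dwell_times": (lag / p_leave_a, lag / p_leave_b),
    }


slow = two_state_slow_mode(0.01, 0.01)
```

With a 1% chance of leaving either basin per step, the mean dwell time is 100 steps while the relaxation timescale is roughly half of that, since it reflects the combined rate of switching in both directions.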

Particle Image Velocimetry of 3D printed vascular fluidic phantom devices

2026-05-22T17:34:29Z

Altered hemodynamics play a key role in cerebrovascular diseases such as aneurysms and stenosis. However, in vivo imaging lacks the spatial resolution required to resolve flow dynamics in small vessels. This study presents an experimental framework to investigate microscale hemodynamics using transparent 3D printed vascular models and particle image velocimetry (PIV). Optically transparent microfluidic models with straight and pathological (aneurysmal and stenotic) geometries were fabricated via additive manufacturing down to a minimum diameter of 500 microns and characterized using optical microscopy. Flow experiments were conducted under steady laminar conditions, and local velocity fields and wall shear stress (WSS) were measured using microPIV. Measured velocities were compared with analytical Hagen-Poiseuille predictions, yielding mean relative errors of 5 to 17 percent. The platform reliably captured key flow features and spatial variations in velocity. Overall, the results demonstrate that transparent 3D printed vascular models combined with microPIV provide a robust experimental approach for studying microscale cerebrovascular hemodynamics.
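
The analytical reference for such a comparison is the Hagen-Poiseuille solution for steady laminar flow in a straight circular tube: u(r) = u_max (1 - r²/R²) with u_max = 2Q/(πR²), and wall shear stress 4μQ/(πR³). The sketch below evaluates these formulas and a mean relative error; the function names, the water-like viscosity, and the flow rate are assumptions for illustration, not values from the study.

```python
import math


def poiseuille_velocity(r, radius, flow_rate):
    """Axial velocity (m/s) at radial position r for volumetric flow Q (m^3/s)."""
    u_max = 2.0 * flow_rate / (math.pi * radius ** 2)
    return u_max * (1.0 - (r / radius) ** 2)


def poiseuille_wss(radius, flow_rate, viscosity):
    """Wall shear stress (Pa) for dynamic viscosity in Pa*s."""
    return 4.0 * viscosity * flow_rate / (math.pi * radius ** 3)


def mean_relative_error(measured, predicted):
    """Mean of |measured - predicted| / |predicted| over paired samples."""
    errors = [abs(m - p) / abs(p) for m, p in zip(measured, predicted)]
    return sum(errors) / len(errors)


# Hypothetical channel: 500 micron diameter, 1 cm/s mean velocity, water.
R = 250e-6
Q = math.pi * R ** 2 * 0.01
centerline = poiseuille_velocity(0.0, R, Q)
wall = poiseuille_velocity(R, R, Q)
wss = poiseuille_wss(R, Q, viscosity=1e-3)
```

The centerline velocity is twice the mean velocity and the profile vanishes at the wall, so a measured profile can be checked point by point with `mean_relative_error`.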

On the Design of an Analog-Dyadic Converter CRN

2026-05-22T15:20:03Z

Chemical Reaction Networks (CRNs) interpreted through the differential semantics, even when restricted to elementary reactions with mass action law kinetics, form a Turing-complete language. This means that any computable real function can be programmed, and in fact compiled, in an abstract CRN that will compute it with an arbitrarily high precision. In this computational framework, the information carriers are the molecular concentrations, the required precision is given as input, and the output concentration is guaranteed to satisfy the required precision. On the other hand, one can be interested in estimating the derivative of an unknown input signal or in reading the concentration value of an input molecular species. By nature, such problems can only be approximated with a finite precision. Hence, the computation framework proposed previously cannot be applied and we need to design and analyze custom CRNs to perform these tasks. In this paper, we present an analog-dyadic converter CRN which takes as input one molecular concentration (in [0, 1] but not necessarily computable), and produces as output a sequence of "on" and "off" spikes corresponding to some extent to the sequence of bits in the dyadic representation of the input concentration. We provide a detailed analysis of the sources of error and their behavior when varying the reaction rate constants. We conclude by sketching a possible design for a reader module that takes as input an arbitrary concentration and a desired precision and outputs a dyadic encoding approximating the value of the concentration with the desired precision. We leave proving the correctness of our construction as an open question.
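
For reference, the dyadic representation of a value in [0, 1] is its binary expansion after the point, which can be read off by repeated doubling. The sketch below is an idealized numerical version of that target encoding, not a model of the CRN; the function names are assumptions for illustration.

```python
def dyadic_bits(x, n_bits):
    """First n_bits of the dyadic (binary) expansion of x in [0, 1].

    Each doubling shifts the expansion left by one position; the integer
    part that appears is the next bit. x = 1.0 yields all ones, matching
    the expansion 0.111...
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    bits = []
    for _ in range(n_bits):
        x *= 2.0
        if x >= 1.0:
            bits.append(1)
            x -= 1.0
        else:
            bits.append(0)
    return bits


def from_dyadic_bits(bits):
    """Value encoded by a finite bit sequence: sum of b_k * 2**-k."""
    return sum(b * 2.0 ** -(k + 1) for k, b in enumerate(bits))


bits_0625 = dyadic_bits(0.625, 3)            # 0.625 = 1/2 + 1/8
approx_03 = from_dyadic_bits(dyadic_bits(0.3, 10))
```

Truncating after n bits leaves an error of at most 2^-n, which is the sense in which a reader module can only approximate an arbitrary concentration to a requested precision.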

ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation

2026-05-22T14:01:23Z

Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.

Abstract relational structures in models of biology

2026-05-22T02:26:34Z

The mathematical formalisms used to model biological systems induce both latent and ambiguous assumptions that can limit or distort their representational capabilities. Developing formalisms that can represent systems more precisely is fundamental to comprehending their intricacies and complexities. Here we introduce the systems hypergraph, a general and extendable formalism for representing abstract relational systems. A systems hypergraph combines a hypergraph, representing multidimensional relations among objects, with a hierarchical system of attributes representing system properties and their interdependencies. The attribute structure ensures that dependencies between system properties are patent and unambiguous, thereby clarifying assumptions and avoiding redundancy in data association. As an application we consider two formalisms widely used in systems biology - chemical reaction networks and stochastic Petri nets - and study their natural representation as systems hypergraphs. This allows us to relate the two formalisms rigorously, demonstrating in particular that stochastic Petri nets are strictly more general than chemical reaction networks in contrast to their commonly assumed equivalence. More broadly our work demonstrates the power of abstraction, and in particular its role in mediating between objects and relations in mathematical representations of biological complexity.

A Mathematical Reconstruction of Endothelial Cell Networks

2026-05-21T20:17:19Z

Endothelial cells form the linchpin of vascular and lymphatic systems, creating intricate networks that are pivotal for angiogenesis, controlling vessel permeability, and maintaining tissue homeostasis. Despite their critical roles, there is no rigorous mathematical framework to represent the connectivity structure of endothelial networks. Here, we develop a pioneering mathematical formalism called $π$-graphs to model the multi-type junction connectivity of endothelial networks. We define $π$-graphs as abstract objects consisting of endothelial cells and their junction sets, and introduce the key notion of $π$-isomorphism that captures when two $π$-graphs have the same connectivity structure. We prove several propositions relating the $π$-graph representation to traditional graph-theoretic representations, showing that $π$-isomorphism implies isomorphism of the corresponding unnested endothelial graphs, but not vice versa. We also introduce a temporal dimension to the $π$-graph formalism and explore the evolution of topological invariants in spatial embeddings of $π$-graphs. Finally, we outline a topological framework to represent the spatial embedding of $π$-graphs into geometric spaces. The $π$-graph formalism provides a novel tool for quantitative analysis of endothelial network connectivity and its relation to function, with the potential to yield new insights into vascular physiology and pathophysiology.

Molecular Lead Optimization via Agentic Tool Planning

2026-05-21T19:12:19Z

Drug discovery is a lengthy and resource-intensive process composed of multiple stages. Among these stages, lead optimization plays a critical role in transforming early hit compounds into viable drug candidates. This stage requires improving ADMET-related properties through subtle structural refinement while preserving key molecular substructures responsible for binding affinity to disease targets. Recent advances in artificial intelligence have shown promise in accelerating various aspects of drug discovery; however, most existing approaches to lead optimization rely on one-step molecular optimization, which fails to account for the long-term consequences of sequential design decisions. To address this limitation, we propose TRACE, a trajectory-aware, LLM-reasoning agent for molecular lead optimization that formulates tool selection as a sequential decision-making problem over action trajectories. Given a lead molecule and an optimization objective, TRACE makes trajectory-aware decisions over molecular optimization tools, enabling forward-looking refinement under structural constraints. Experiments on multiple ADMET optimization tasks show that our agent achieves higher optimization success, larger property improvements, and higher validity, while preserving molecular similarity compared to baseline models.

Uncertainty-aware classification and triage of structural heart disease using electrocardiography and echocardiography metrics

2026-05-21T18:55:12Z

Machine learning methods provide a methodological innovation that can help screen for cardiovascular disease through noninvasive and readily available measurement modalities. Recent investments in using electrocardiogram (ECG) data to screen for structural heart disease (SHD) are one example, where ECGs provide a low-cost, available modality for screening. This has led to the EchoNext dataset, a paired ECG-echocardiogram data repository for testing new methods of SHD detection. However, relatively few studies have investigated how more probabilistic classification through Bayesian inference may improve uncertainty quantification in this setting. Moreover, few studies have considered how triage systems can be developed to alleviate healthcare bottlenecks, such as the review of data from underserved, rural clinics by expert sonographers for SHD assessment. In this study, we leverage existing ECG-echocardiogram data to compare frequentist and Bayesian neural network classifiers. We show that the Bayesian approach is comparable to or better than frequentist methods in SHD classification, and that it provides more robust uncertainty quantification. We provide an example of how this uncertainty-aware classification scheme can be used for screening SHD, providing a proof-of-concept for how machine learning can help with triage in getting individuals expert sonographer input when SHD is highly likely or measurements are highly uncertain.
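
The triage rule described in the last sentence, refer when SHD is highly likely or when the prediction is highly uncertain, can be sketched from a set of posterior predictive samples. The thresholds, the use of the sample standard deviation as the uncertainty measure, and the function name are all assumptions made for illustration; the study may define these differently.

```python
from statistics import mean, pstdev


def triage(prob_samples, p_refer=0.8, sd_refer=0.15):
    """Route one patient given sampled SHD probabilities from a Bayesian model.

    prob_samples: predicted probabilities from repeated posterior draws.
    p_refer, sd_refer: hypothetical cut-offs on the mean and the spread.
    """
    p_mean = mean(prob_samples)
    p_spread = pstdev(prob_samples)
    if p_mean >= p_refer:
        return "refer: SHD likely"
    if p_spread >= sd_refer:
        return "refer: uncertain"
    return "routine"


likely = triage([0.90, 0.95, 0.92])
uncertain = triage([0.10, 0.90, 0.50])
routine = triage([0.05, 0.06, 0.04])
```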

FederatedRSF: Federated Random Survival Forests for Partially Overlapping Medical Data

2026-05-21T18:32:25Z

Multi-center survival prediction can improve robustness and generalizability, yet privacy regulations and institutional governance often prevent pooling patient-level clinical and genomic data across institutions. In practice, deployment is further complicated by feature-space heterogeneity, in which sites collect different covariates or use different sequencing panels, resulting in only partially overlapping feature sets. We present FederatedRSF, a Python package that implements federated random survival forests, aggregating locally trained survival trees and redistributing only feature-compatible trees to each site, enabling inference with partial overlap without sharing raw data. We evaluate FederatedRSF on the GBSG2 breast cancer cohort distributed with the scikit-survival package, simulating feature heterogeneity across clients by withholding subsets of features, and assessing discrimination using Harrell's concordance index (C-Index) under repeated cross-validation and site-splits. The results demonstrated that the federated model can achieve performance comparable to that of the centralized training setting.
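
The redistribution step can be pictured as a filter over the pooled forest: a site only receives trees whose split features it actually collects. The sketch below assumes that "feature-compatible" means the tree's feature set is a subset of the site's feature set; the dictionary layout, the function name, and the feature names are hypothetical and do not reflect the FederatedRSF API.

```python
def compatible_trees(pooled_trees, site_features):
    """Return the trees a site can evaluate with the covariates it has.

    pooled_trees: list of dicts, each with an "id" and the set of
        "features" the tree splits on (hypothetical representation).
    site_features: iterable of covariate names available at the site.
    """
    available = set(site_features)
    return [tree for tree in pooled_trees if set(tree["features"]) <= available]


# Toy pooled forest; feature names are illustrative only.
pooled = [
    {"id": "t1", "features": {"age", "tsize"}},
    {"id": "t2", "features": {"age", "pnodes", "progrec"}},
    {"id": "t3", "features": {"tgrade"}},
]

# A site that does not record progrec cannot use tree t2.
site_a_ids = [t["id"] for t in compatible_trees(pooled, ["age", "tsize", "tgrade", "pnodes"])]
site_b_ids = [t["id"] for t in compatible_trees(pooled, ["age", "pnodes", "progrec"])]
```

Only tree structures cross site boundaries in this picture, which is what allows inference under partial overlap without exchanging patient-level records.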