https://arxiv.org/api/YSnUAb+DxqK5VTh171qnmOGVVFk
2026-06-21T16:13:50Z

http://arxiv.org/abs/2504.10564v3
FLOWR: Flow Matching for Structure-Aware De Novo, Interaction- and Fragment-Based Ligand Generation
2026-05-29T09:52:15Z
We introduce FLOWR, a novel structure-based framework for the generation and optimization of three-dimensional ligands. FLOWR integrates continuous and categorical flow matching with equivariant optimal transport, enhanced by an efficient protein pocket conditioning. Alongside FLOWR, we present SPINDR, a thoroughly curated dataset comprising ligand-pocket co-crystal complexes specifically designed to address existing data quality issues. Empirical evaluations demonstrate that FLOWR surpasses current state-of-the-art diffusion- and flow-based methods in terms of PoseBusters-validity, pose accuracy, and interaction recovery, while offering a significant inference speedup, achieving up to 70-fold faster performance. In addition, we introduce FLOWR:multi, a highly accurate multi-purpose model allowing for the targeted sampling of novel ligands that adhere to predefined interaction profiles and chemical substructures for fragment-based design without the need of re-training or any re-sampling strategies.
2025-04-14T17:18:09Z
Julian Cremer, Ross Irwin, Alessandro Tibo, Jon Paul Janet, Simon Olsson, Djork-Arné Clevert

http://arxiv.org/abs/2605.30831v1
The Geometry of Activity Cliffs: Representation Dependence and Multi-Scale Characterization of Activity Landscapes
2026-05-29T04:37:00Z
Activity cliffs, structurally similar compounds with large potency differences, are widely treated as intrinsic features of chemical datasets. We argue that apart from target biology, much of our cliff understanding is a consequence of the geometry induced by the chosen molecular representation, not a property of a molecule pair itself.
We designed a six-step pipeline to systematically test this hypothesis. The pipeline assesses pairwise distance geometry, cliff enrichment, activity gradient distribution, and persistent homology of the cliff subspace; runs predictive benchmarking for a chosen embedding-metric pair; and finally analyzes matched molecular pairs and stereoisomers. We applied the pipeline to fifteen embedding-metric configurations to build a benchmark across three distinct datasets known for their activity-cliff challenges.
No representation excels on all criteria: Morgan Tanimoto provides the strongest cliff enrichment and cross-scaffold generalization; MolFormer cosine provides the only meaningful stereochemical sensitivity; MACCS and RDKit Dice fingerprints are most sensitive to matched-molecular-pair transformations; ChemBERTa fails uniformly due to embedding collapse.
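The representation dependence described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the fingerprints are hypothetical sets of "on" bit indices, and the similarity and potency thresholds (`sim_min`, `delta_min`) are assumed values chosen only for the example. The same compound pair counts as a cliff under one toy representation and not under the other.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def find_cliffs(fingerprints, potencies, sim_min=0.7, delta_min=2.0):
    """Return index pairs that are similar (>= sim_min) yet differ in potency (>= delta_min)."""
    cliffs = []
    for i, j in combinations(range(len(fingerprints)), 2):
        similar = tanimoto(fingerprints[i], fingerprints[j]) >= sim_min
        potency_gap = abs(potencies[i] - potencies[j]) >= delta_min
        if similar and potency_gap:
            cliffs.append((i, j))
    return cliffs


# Hypothetical toy data: three compounds, potencies on a log scale.
potencies = [5.0, 8.1, 8.0]

# Representation A: compounds 0 and 1 share 7 of 9 distinct bits.
rep_a = [frozenset(range(8)), frozenset(list(range(7)) + [8]), frozenset({20, 21, 22})]
# Representation B: the same compounds 0 and 1 share only 2 of 6 distinct bits.
rep_b = [frozenset({0, 1, 2, 3}), frozenset({0, 1, 4, 5}), frozenset({9})]

cliffs_a = find_cliffs(rep_a, potencies)
cliffs_b = find_cliffs(rep_b, potencies)
print(cliffs_a, cliffs_b)
```

Under representation A the pair (0, 1) is flagged as a cliff; under representation B no pair is, although the potencies are unchanged.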
These findings are not a ranking. They reflect the fact that different representations encode different aspects of molecular recognition, and that choosing one implicitly defines what an activity cliff actually is.
2026-05-29T04:37:00Z
Pawel Dabrowski-Tumanski, Bartosz Topolski, Dariusz Plewczynski, Tomasz Jetka

http://arxiv.org/abs/2605.30591v1
Obesity and Sociodemographic Factors in Luminal Breast Cancer
2026-05-28T21:38:06Z
Luminal breast cancers represent the most prevalent molecular subtype of breast carcinoma, with Luminal A tumors generally associated with more favorable clinical outcomes than Luminal B tumors. Obesity-related inflammation and prolonged exposure to exogenous steroids have been implicated in the progression of luminal malignancies. This study evaluated 1,928 patients with Luminal A breast cancer and 1,610 patients with Luminal B breast cancer to examine associations among body mass index (BMI), age, ethnic background, menopausal status, and receptor expression, including estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). Patients with Luminal B tumors demonstrated a significantly greater mean BMI compared with those with Luminal A tumors. In addition, Luminal B tumors were more frequently observed among patients of African ancestry relative to White and Hispanic populations. Multivariable analyses revealed that elevated BMI and African ancestry were independently associated with increased odds of Luminal B carcinoma, whereas postmenopausal status was associated with lower risk. Mediation analysis further indicated that BMI partially explained the association between ancestry and Luminal B disease. These findings suggest that obesity and population-specific factors may contribute to the development of more aggressive luminal breast cancer phenotypes.
2026-05-28T21:38:06Z
33 pages, 7 figures
Vacanti Anderson, Paramahansa Pramanik, Haley K. Robinson

http://arxiv.org/abs/2605.30518v1
Gaussian Mixture Model-Based Focused Refinement for Enhanced Flexible Structure Determination in CryoEM and CryoET
2026-05-28T19:55:17Z
Dynamic conformational changes of proteins are crucial for their cellular functions. Here we present a unified refinement pipeline for flexible protein structures in both CryoEM and in situ CryoET. Using a Gaussian mixture model-based focused alignment procedure, we improve resolution of small domains in highly dynamic proteins and reveal intricate conformational changes. The method corrects the per-subunit motion of TRPV1 and captures the rotary dynamics of ATP synthase within mitochondria.
2026-05-28T19:55:17Z
18 pages, 5 figures
Muyuan Chen

http://arxiv.org/abs/2605.30275v1
Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories
2026-05-28T17:32:29Z
Earlier detection of pancreatic cancer is key to enabling wider access to curative treatment and reducing cancer deaths; however, screening is presently not viable. Latent indicators of pathology are evident in an individual's disease and blood test trajectories and may predict the development of pancreatic cancer. Longitudinal sequences of coded diagnoses and blood test values accrued by patients throughout their clinical interactions were used to train a custom Transformer-based neural network with a multi-head attention mechanism to predict risk of pancreatic cancer with a multi-year lead time and risk-stratify populations for targeted screening. The cohort comprised 6,017 adults with pancreatic cancer and 177,081 controls (overall median age 75, 45% female) with median 12 years (interquartile range 6.9-16.2) of medical history prior to pancreatic cancer diagnosis.
External validation via leave-one-site-out, out-of-sample testing predicting pancreatic cancer 1, 2, and 3 years prior to diagnosis demonstrated mean area under the receiver operating characteristic curve of 0.837 (95% confidence interval 0.827-0.848), 0.797 (95% confidence interval 0.782-0.813), and 0.760 (95% confidence interval 0.745-0.776), respectively. Estimated pancreatic cancer risks were well-calibrated (calibration plot slope 1.08, intercept of -0.077; Brier score 0.025), and a Bayesian population pancreatic cancer prevalence update allows estimated cancer risk outputs to be transportable across settings. At testing, a screening threshold of >3.3% risk of pancreatic cancer within 1 year offered a diagnostic odds ratio of 18.2. Our work therefore lays the foundation for a first population-level digital enrichment tool to widen access to curative-intent management of pancreatic cancer.
2026-05-28T17:32:29Z
Chris Varghese, Leo Y. Li-Han, Richa Bisht, Ellen Larson, Frank Lee, Ryan M. Carr, Tanios S. Bekaii-Saab, Shounak Majumder, John D. Halamka, Mark Truty, Ajit H. Goenka, Hojjat Salehinejad, Cornelius A. Thiels

http://arxiv.org/abs/2605.30109v1
Training Ecosystems: A Computational Approach to Uncovering Learning Behavior in Unconventional Contexts
2026-05-28T15:47:20Z
Recent progress in diverse intelligence has shown simple learning capacities below the organism level, in single cells and even molecular networks. However, there are still many knowledge gaps around learning capacity above the organism level, and about memory implemented purely by dynamical interactions without explicit memory media. We demonstrate that minimal ecological dynamics (in silico) are sufficient for several kinds of learning, assayed as changes in both response magnitude and recovery time.
Systematic exploration of over 220,000 parameter combinations in a simulated classic predator-prey model revealed that, when perturbed by stimuli, recovery time exhibits habituation, sensitization, and a form of discrete number learning in a scale-invariant manner. Robustness analysis revealed that habituation and sensitization persist under stochastic perturbations, while discrete number learning is disrupted even at low noise levels. Dimensionality reduction revealed that the incidence of learning capacity is primarily determined by ecological interaction strengths. Clear, unique clustering patterns in parameter space allow high prediction accuracy for novel parameter combinations that enable learning. Response magnitude revealed a striking asymmetry: 90.6% of parameter combinations exhibited recovery time sensitization paired with habituation of response magnitude, while the opposite pattern was extremely rare. These findings highlight a set of phenomena at the intersection of ecology, basal cognition, and mathematics with many implications for a wide range of systems describable by similar kinds of equations. These properties offer numerous efforts in biology and engineering a substrate with a considerable, pre-patterned propensity for learning, which ultimately arises from mathematics and does not depend on the details of physics or biology.
2026-05-28T15:47:20Z
26 pages, 14 figures
Adrita Samanta, Hananel Hazan, Michael Levin

http://arxiv.org/abs/2605.30399v1
A Novel Computer Vision Approach for Assessing Fish Responses to Intrusive Objects in Aquaculture
2026-05-28T14:46:51Z
The aquaculture industry needs to address several challenges to secure sustainable seafood production that can serve an increasing global demand. One major challenge is to ensure good fish health and acceptable welfare during production since the improvement of fish welfare is of vital importance in current and future production systems.
In this study, this is addressed by developing and implementing methods to identify fish behaviors in response to intrusive objects both on individual and on a group basis. A novel approach for detecting, tracking, and estimating the 3D position of individual fish has thus been developed, and specifically designed to track the caudal fins of farmed fish in industrial sea cages. The tracking data was subjected to a novel stereo-vision method adapted to estimate fish positions, velocities, accelerations, and turning and pitch angles. Datasets obtained from industrial-scale fish farms were then analyzed to identify the impact of structures of varying shapes, sizes, and colors on fish behavior.
The method was trained using manually labeled caudal fins, and used YOLOv8 with ByteTrack as an object detector and tracker, SuperGlue for matching detections in the left and right frames, and triangulation to reconstruct the 3D positions of the fish. Different image pre-processing and augmentation methods for enhancing object detection accuracy were tested and their performance compared, while RAFT-Stereo was tested for depth estimation purposes. The obtained results both validate the method's performance against previous research efforts, and demonstrate the novelty and potential of this method in providing more insight into behavioral dynamics in sea cages.
2026-05-28T14:46:51Z
Hanne-Grete Alvheim, Stian Mjelde Jakobsen, Martin Føre, Eleni Kelasidi

http://arxiv.org/abs/2605.29907v1
Stochastic network epidemic model and particle filter: General framework and application to influenza in Japan
2026-05-28T13:25:31Z
Parameter inference and state estimation in stochastic and partially observed biological systems remain major problems in mathematical biology. In this work, we introduce a two-dimensional lattice graph model for the spread of infectious diseases. Estimating states and parameters in graph-based stochastic epidemic systems is particularly challenging because of randomness and incomplete observations. To address these issues, we propose a particle-filter-based data assimilation framework for the sequential estimation of both model states and unknown parameters. Two methodologies are developed: one based on the number of infected agents and another based on partial information about the spatial locations of infected agents on a two-dimensional lattice. The performance of the two methods is first analyzed and validated using synthetic data, and the first method is then applied to influenza data collected from different prefectures in Japan between July 2024 and December 2025. One-week-ahead forecasting simulations are also performed using current weekly data.
The findings highlight the effectiveness of the proposed PF framework for real-time epidemic monitoring, forecasting, and adaptive public health decision-making.
2026-05-28T13:25:31Z
Ihtisham Ul Haq, Serge Richard

http://arxiv.org/abs/2605.29587v1
FPLIER: Federated Pathway-Level Information Extractor
2026-05-28T08:30:23Z
In transcriptomics, gene-set-aware factorization methods such as the Pathway Level Information Extractor (PLIER) are most effective when trained on large, heterogeneous expression compendia. Yet, many clinically relevant cohorts cannot be pooled into a single dataset due to privacy and governance constraints. We present FPLIER, a federated extension of PLIER that enables distributed training across multiple data holders while incorporating publicly available datasets. Through secure aggregation, FPLIER produces training updates algebraically equivalent to those of a centralized pooled-data approach while keeping expression data local. We evaluate FPLIER across multiple scenarios in two simulated consortia (from the K-CLIER and MultiPLIER studies) and demonstrate stable convergence. We further conduct a systematic analysis of membership inference attacks targeting both intermediate training statistics and the released model. Our results show that privacy risk is governed by the rank of the training expression matrix.
Incorporating public data or reducing data dimensionality increases this rank, moving the system toward a full-rank regime in which training and non-training samples become indistinguishable to the attacker, and membership-inference performance approaches random guessing.
2026-05-28T08:30:23Z
Accepted for publication at the ACM BCB '26 conference
Daniele Malpetti, Christian Berchtold, Francesco Gualdi, Marco Scutari, Laura Azzimonti, Francesca Mangili
10.1145/3807503.3819364

http://arxiv.org/abs/2605.29329v1
Mixing Vector Model for Copolymer Inference via Mixed Integer Linear Programming
2026-05-28T04:05:30Z
A novel two-phase molecule inference framework, mol-infer, has recently been developed to infer chemical graphs with prescribed abstract structures and desired property values through mixed integer linear programming (MILP) under the two-layered model, with guaranteed optimality and exactness relative to the given learned prediction function and structural constraints. In this study, we extend this framework to copolymers by introducing a simple feature representation, called the mixing vector (MV) model. In the proposed model, a copolymer feature vector is represented as a convex combination of MILP-tractable monomer descriptors weighted by the mixing ratio of the constituent monomers. This representation does not require explicit sequence-class information and is therefore naturally compatible with MILP-based inverse design. Under this model, we construct prediction functions for several copolymer property datasets using artificial neural networks, reduced quadratic multiple linear regression, and random forests. The proposed representation achieves practically useful predictive performance across multiple physicochemical property datasets; in particular, the best test R^2 score exceeds 0.7 for nine of the ten datasets and exceeds 0.9 for six datasets.
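The convex-combination idea behind the mixing vector can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' implementation: the function name `mixing_vector`, the two-component descriptors, and the mixing ratios are all hypothetical, and the real monomer descriptors are those defined in the paper.

```python
def mixing_vector(descriptors, ratios, tol=1e-9):
    """Convex combination of monomer descriptor vectors weighted by mixing ratios.

    descriptors: list of equal-length numeric lists, one per monomer (hypothetical features).
    ratios: non-negative mixing ratios, one per monomer, summing to 1.
    """
    if len(descriptors) != len(ratios):
        raise ValueError("need exactly one ratio per monomer")
    if any(r < 0 for r in ratios) or abs(sum(ratios) - 1.0) > tol:
        raise ValueError("ratios must be non-negative and sum to 1")
    dim = len(descriptors[0])
    if any(len(d) != dim for d in descriptors):
        raise ValueError("all descriptors must have the same length")
    # Component k of the copolymer vector is the ratio-weighted mean of component k.
    return [sum(r * d[k] for r, d in zip(ratios, descriptors)) for k in range(dim)]


# Hypothetical two-monomer copolymer mixed 25:75.
copolymer = mixing_vector([[1.0, 0.0], [3.0, 2.0]], [0.25, 0.75])
print(copolymer)
```

Because the result is linear in the monomer descriptors once the ratios are fixed, it can be written as linear constraints, which is consistent with the abstract's point about compatibility with MILP.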
We also formulate a multi-monomer inverse-design problem under the MV representation with a prescribed mixing ratio and show that the resulting MILP instances remain tractable, even for three-monomer settings. Finally, we perform an external consistency check by re-evaluating the inferred candidates and comparing the re-computed property values with those predicted by the learned model. Overall, the proposed framework gives a tractable first step toward model-level exact inverse design of copolymers under the two-layered model.
2026-05-28T04:05:30Z
Jianshen Zhu, Raveena Rai, Taiyo Sohkawa, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu

http://arxiv.org/abs/2605.30382v1
On the Connection Between Differential Population Growth Rate and Epidemic Reproduction Numbers
2026-05-28T02:08:57Z
During pandemics, public health agencies need to rapidly assess whether a new viral variant is more transmissible than existing lineages. For co-circulating variants, relative fitness can be expressed as a selective coefficient, as the differential population growth rate (DPGR) estimated from genomic surveillance, or, with additional assumptions, as a contrast in epidemic reproduction numbers $R_t$. We show that DPGR estimates a pairwise growth-rate difference. Under a specified generation-interval model, this difference can be transformed into reproduction-number space; in the equal-generation-time SIR special case, it reduces to a scaled difference in variant-specific $R_t$. Related growth-rate contrasts also appear in multinomial logistic and growth-advantage random-walk models, although those methods differ from DPGR in likelihood, smoothing, priors, and data inputs. We evaluate the theory across five SARS-CoV-2 and influenza analyses totaling more than 2,200 matched data points.
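The equal-generation-time SIR special case mentioned above can be written out as a short derivation. This is a textbook-style sketch, not the paper's own notation: the symbols $\beta_i$ (transmission rate of variant $i$), $\gamma$ (shared recovery rate), $s(t)$ (susceptible fraction), $I_i$ (infections with variant $i$), and $T_g = 1/\gamma$ (mean generation time) are assumptions introduced here.

```latex
\begin{align}
  \frac{dI_i}{dt} &= \bigl(\beta_i\, s(t) - \gamma\bigr) I_i
    && \text{(SIR dynamics for variant } i \in \{A, B\}\text{)} \\
  r_i &= \beta_i\, s(t) - \gamma = \gamma \bigl(R_{t,i} - 1\bigr),
    \qquad R_{t,i} = \frac{\beta_i\, s(t)}{\gamma} \\
  \mathrm{DPGR}_{A,B} &= \frac{d}{dt} \ln \frac{I_A}{I_B} = r_A - r_B
    = \gamma \bigl(R_{t,A} - R_{t,B}\bigr) \\
  R_{t,A} - R_{t,B} &= \frac{\mathrm{DPGR}_{A,B}}{\gamma} = T_g \,\mathrm{DPGR}_{A,B}
\end{align}
```

Under these assumptions the pairwise growth-rate difference equals the reproduction-number difference scaled by $\gamma$, so a positive DPGR for variant $A$ over $B$ corresponds to $R_{t,A} > R_{t,B}$.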
SIR simulation recovers the expected mapping when the true $R_t$ is known, and retrospective SARS-CoV-2 analyses show sustained DPGR signals 43 to 65 days before variant dominance, with 95% sign accuracy in our analysis. DPGR is approximately transitive across lineage triplets, near zero for selected functionally similar sublineages, and directionally consistent across countries. These results connect sequence-count-based fitness estimates to reproduction-number contrasts through an assumption-explicit growth-rate bridge.
2026-05-28T02:08:57Z
23 pages, 5 figures
Hong Qin

http://arxiv.org/abs/2512.00283v3
BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models
2026-05-27T21:38:15Z
Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying "grammars" inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies.
This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.
2025-11-29T02:36:54Z
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Yi Fang, Haoran Xu, Jiaxin Han, Sirui Ding, Yizhi Wang, Yue Wang, Xuan Wang

http://arxiv.org/abs/2601.10912v5
Graph Neural Network Reveals the Cortical Morphology of Local Brain Aging in Normal Cognition and Alzheimer's Disease
2026-05-27T18:31:30Z
Estimating brain age (BA) from T1-weighted magnetic resonance images (MRIs) provides a powerful framework for quantifying anatomical brain aging. Whereas global BA (GBA) summarizes overall brain health, local BA (LBA) provides cortically specific patterns of aging at the subject level. Although previous studies have examined anatomical contributors to GBA, to our knowledge, no framework has been established to estimate LBA using cortical morphology. To address this gap, we introduce a graph neural network (GNN) that uses morphometric features, namely cortical thickness, surface area, curvature, gray/white matter intensity ratio (GWR), and sulcal depth, to estimate LBA across the cortical surface at high spatial resolution (mean inter-vertex distance = 1.37 mm).
Trained on cortical surface meshes extracted from the MRIs of cognitively normal (CN) adults (N = 14,423), our model achieves lower mean absolute error (MAE) than the existing state-of-the-art while identifying more biologically plausible patterns of aging in Alzheimer's disease (AD) on the ADNI dataset. Association cortices emerge as primary sites of morphometric aging in CNs, whereas mild cognitive impairment is characterized by widespread aging that is pronounced in the parahippocampal gyrus. AD subjects demonstrate significant aging across the entire cortex, particularly within medial temporal regions and associated cortical networks. Feature ablation highlights curvature and GWR as preferentially sensitive to AD pathology. Regional LBA gaps are significantly associated with neuropsychological measures of AD-related cognitive impairment, linking cortical aging patterns to clinical outcomes. These results demonstrate that GNN-based modeling of cortical morphometry enables biologically interpretable mapping of local brain aging with greater interpretability than prior work.
2026-01-16T00:06:39Z
Code and supplementary tables are available at https://github.com/irimia-laboratory/Graph_UNet
Samuel D. Anderson, Jordan Jomsky, Nikhil N. Chaudhari, Nahian F. Chowdhury, Xiaoyu Rayne Zheng, Andrei Irimia, Alzheimers Disease Neuroimaging Initiative

http://arxiv.org/abs/2605.28739v1
BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks
2026-05-27T16:59:01Z
Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2-literal clauses.
We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design. First, the architecture is sparse by construction: at most $2/d$ of the weights in each BIR layer are active, where $d$ is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to $96\times$ fewer active parameters than an architecture-matched dense MLP. First-layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage-defining co-expression modules, and immune-infiltration markers. Data and code are available at: https://github.com/MAHI-Group/BIRDNet.
2026-05-27T16:59:01Z
5 pages; 1 figure, 4 tables
Tirtharaj Dash

http://arxiv.org/abs/2605.28652v1
Widespread quasi-steady state assumption in biological interaction modeling mischaracterizes system transitions
2026-05-27T15:53:40Z
From molecular, cellular, to ecological systems, the modeling of biological processes often stands on the assumption that fast components immediately reach the equilibrium at each moment (quasi-steady state) and only slow components govern the relevant system dynamics. This quasi-steady state approximation (QSSA) simplifies the modeling but discards the effects of the relaxation towards each quasi-steady state. Unclear is the QSSA's suitability around the transition point, a specific condition where the system changes to a qualitatively different state.
In this regard, we here derived a theoretical framework for the near-transition dynamics of biological systems, explicitly considering the relaxation processes overlooked by the QSSA. Numerical simulations verify our predictions for cellular decision-making, metabolic oscillations, and ecological cycles. Despite the extreme slowdown near the transition point, the QSSA alone misestimates the duration of the transition from one state to another. Moreover, the QSSA erroneously predicts the transition point itself for the onset of oscillations, while the relaxation dynamics facilitates or suppresses the oscillation onset with a counterintuitive time-delay effect. Common feedback interactions between biological components are pivotal to those relaxation effects. Our study provides an analytical foundation to understand the rich transient or rhythmic dynamics of interacting biological components near the transitions.
2026-05-27T15:53:40Z
Main manuscript and supplementary information provided
Pan-Jun Kim