BioModelsRAG: A Biological Modeling Assistant Using RAG (Retrieval Augmented Generation)

2026-01-30T07:58:57Z

The BioModels database is one of the premier databases for computational models in systems biology. It contains over 1000 curated models and an even larger number of non-curated models. All models are stored in the machine-readable SBML format. Although SBML can be translated into the human-readable Antimony format, analyzing the models can still be time-consuming. To bridge this gap, an LLM (large language model) assistant was created to analyze BioModels entries and let the user interact with a model in natural language, so that the salient points of a given model can be extracted quickly. Our analysis workflow involved 'chunking' BioModels and converting them to plain text using llama3, and then embedding them in a ChromaDB database. The user-provided query was also embedded, and a similarity search was performed between the query and the BioModels in ChromaDB to extract the most relevant BioModels. The retrieved BioModels were then used as context for the LLM's response in the chat with the user. This approach greatly reduced the chance of hallucination and kept the LLM focused on the problem at hand.
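
The retrieval step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for ChromaDB, and the chunk texts, function names, and prompt wording are all hypothetical rather than taken from the BioModelsRAG implementation.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, context):
    """Assemble an LLM prompt that restricts the answer to retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

# Hypothetical plain-text chunks, as might be produced from model files.
chunks = [
    "glucose is phosphorylated by hexokinase in the glycolysis model",
    "the repressilator model has three genes that repress each other",
    "calcium oscillations are driven by IP3 receptor dynamics",
]
query = "which genes repress each other in the repressilator"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
```

Passing only the retrieved chunks to the LLM is what keeps the answer tied to the selected model text.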

Computational investigation of single herbal drugs in Ayurveda for diabetes and obesity using knowledge graph and network pharmacology

2026-01-29T12:45:43Z

Metabolic diseases such as type 2 diabetes and obesity represent a rapidly escalating global health burden, yet current therapeutic strategies largely target isolated symptoms or single molecular pathways. To address this, we developed an integrated computational pipeline leveraging a knowledge graph, pathway analysis and network pharmacology to elucidate the multi-target mechanisms of Ayurvedic Single Herbal Drugs (SHDs). SHDs associated with diabetes and obesity were curated from the Ayurvedic Pharmacopoeia of India, followed by phytochemical identification using the IMPPAT database, yielding a shortlist of 11 SHDs and their 188 phytochemicals after drug-likeness and bioavailability filtering. Subsequently, the molecular targets of the phytochemicals in SHDs, disease-associated genes and therapeutic targets of FDA-approved drugs were curated by integrating data from several databases. Pathway enrichment analysis revealed significant functional overlap between SHD-associated and disease-associated pathways. All curated data were embedded into a Neo4j-based knowledge graph, enabling SHD-disease intersection analysis that prioritized key disease-relevant targets, including PTPN1, GLP1R, and DPP4. In addition, the SHD-Target-FDA-approved drug profile elucidated the molecular and mechanistic aspects of the SHDs as a phytochemical cocktail, and it aligns with clinically studied synergistic FDA-approved drug combinations. Network-pharmacology-based protein-protein interaction analysis identified PPARG as another central regulator. Using a quantitative framework, we identified phytochemical pairs within SHDs that were structurally dissimilar and target-wise distinct, yet acted on shared or different disease-associated pathways, indicating complementary and potentially synergistic interactions. Molecular docking analysis of two selected druggable targets identified putative lead phytochemicals.

Control systems for synthetic biology and a case-study in cell fate reprogramming

2026-01-27T23:53:33Z

This paper gives an overview of the use of control systems engineering in synthetic biology, motivated by applications such as cell therapy and cell fate reprogramming for regenerative medicine. A ubiquitous problem in these and other applications is the ability to control the concentration of specific regulatory factors in the cell accurately despite environmental uncertainty and perturbations. The paper describes the origin of these perturbations and how they affect the dynamics of the biomolecular "plant" to be controlled. A variety of biomolecular control implementations are then introduced to achieve robustness of the plant's output to perturbations and are grouped into feedback and feedforward control architectures. Although sophisticated control laws can be implemented in a computer today, they cannot necessarily be implemented inside the cell via biomolecular processes. This fact constrains the set of feasible control laws to those realizable through biomolecular processes that can be engineered with synthetic biology. After reviewing biomolecular feedback and feedforward control implementations, mostly focusing on the author's own work, the paper illustrates the application of such control strategies to cell fate reprogramming. Within this context, a master regulatory factor needs to be controlled at a specific level inside the cell in order to reprogram skin cells to pluripotent stem cells. The article closes by highlighting ongoing challenges and directions of future research for biomolecular control design.

TwinPurify: Purifying gene expression data to reveal tumor-intrinsic transcriptional programs via self-supervised learning

2026-01-27T15:04:22Z

Advances in single-cell and spatial transcriptomic technologies have transformed tumor ecosystem profiling at cellular resolution. However, large scale studies on patient cohorts continue to rely on bulk transcriptomic data, where variation in tumor purity obscures tumor-intrinsic transcriptional signals and constrains downstream discovery. Many deconvolution methods report strong performance on synthetic bulk mixtures but fail to generalize to real patient cohorts because of unmodeled biological and technical variation. Here, we introduce TwinPurify, a representation learning framework that adapts the Barlow Twins self-supervised objective, representing a fundamental departure from the deconvolution paradigm. Rather than resolving the bulk mixture into discrete cell-type fractions, TwinPurify instead learns continuous, high-dimensional tumor embeddings by leveraging adjacent-normal profiles within the same cohort as "background" guidance, enabling the disentanglement of tumor-specific signals without relying on any external reference. Benchmarked against multiple large cancer cohorts across RNA-seq and microarray platforms, TwinPurify outperforms conventional representation learning baselines like auto-encoders in recovering tumor-intrinsic and immune signals. The purified embeddings improve molecular subtype and grade classification, enhance survival model concordance, and uncover biologically meaningful pathway activities compared to raw bulk profiles. By providing a transferable framework for decontaminating bulk transcriptomics, TwinPurify extends the utility of existing clinical datasets for molecular discovery.
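
For readers unfamiliar with the Barlow Twins objective mentioned above, the following is a minimal sketch of the standard loss only. How TwinPurify adapts it (for example, how adjacent-normal profiles enter as "background" guidance) is not specified in the abstract and is not modeled here. The function names and the off-diagonal weight `lam` are illustrative assumptions.

```python
import math

def standardize(z):
    """Center and scale each embedding dimension across the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for i in range(n):
            out[i][j] = (col[i] - mean) / std if std > 0 else 0.0
    return out

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Standard Barlow Twins loss for two batches of embeddings (rows = samples).

    The cross-correlation matrix C between the two views is pushed toward the
    identity: diagonal terms (1 - C_ii)^2 enforce invariance, off-diagonal
    terms lam * C_ij^2 reduce redundancy between dimensions.
    """
    a, b = standardize(z1), standardize(z2)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c = sum(a[k][i] * b[k][j] for k in range(n)) / n
            loss += (1 - c) ** 2 if i == j else lam * c ** 2
    return loss

# Two identical views with uncorrelated dimensions give zero loss.
z = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_same = barlow_twins_loss(z, z)
```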

Multi-omics network reconstruction with collaborative graphical lasso

2026-01-27T14:36:37Z

Motivation: In recent years, the availability of multi-omics data has increased substantially. Multi-omics data integration methods mainly aim to leverage different molecular layers to gain a complete molecular description of biological processes. An attractive integration approach is the reconstruction of multi-omics networks. However, the development of effective multi-omics network reconstruction strategies lags behind. Results: In this study, we introduce collaborative graphical lasso, a novel approach that extends graphical lasso by incorporating collaboration between omics layers, thereby improving multi-omics data integration and enhancing network inference. Our method leverages a collaborative penalty term, which harmonizes the contribution of the omics layers to the reconstruction of the network structure. This promotes a cohesive integration of information across modalities, and it is introduced alongside a dual regularization scheme that separately controls sparsity within and between layers. To address the challenge of model selection in this framework, we propose XStARS, a stability-based criterion for multi-dimensional hyperparameter tuning. We assess the performance of collaborative graphical lasso and the corresponding model selection procedure through simulations, and we apply them to publicly available multi-omics data. This application demonstrated that collaborative graphical lasso recovers established biological interactions while suggesting novel, biologically coherent connections. Availability and implementation: We implemented collaborative graphical lasso as an R package, available on CRAN as coglasso. The results of the manuscript can be reproduced by running the code available at https://github.com/DrQuestion/coglasso_reproducible_code

Largest connected component in duplication-divergence growing graphs with symmetric coupled divergence

2026-01-26T15:17:56Z

The largest connected component in duplication-divergence growing graphs with symmetric coupled divergence is studied. Finite-size scaling reveals a phase transition occurring at a divergence rate $δ_c$. The estimated $δ_c$ lies near the zero locus of the Euler characteristic for finite-size graphs, which is known to be indicative of the largest connected component transition. The role of non-interacting vertices in shaping this transition, through their presence ($d=0$) or absence ($d=1$) in duplication, is also discussed. This suggests a particular transformation of the time variable which, applied to the model with $d=0$, yields a singularity locus in the natural logarithm of the absolute value of the Euler characteristic of finite-size graphs near that obtained with $d=1$. The findings may suggest implications for bond percolation in these growing graph models.

Crossing the Functional Desert: Cascade-Driven Assembly and Feasibility Transitions in Early Life

2026-01-24T19:50:52Z

The origin of life poses a problem of combinatorial feasibility: How can temporally supported functional organization arise in exponentially branching assembly spaces when unguided exploration behaves as a memoryless random walk? We show that nonlinear threshold-cascade dynamics in connected interaction networks provide a minimal, substrate-agnostic mechanism that can soften this obstruction. Below a critical connectivity threshold, cascades die out locally and structured input-output response mappings remain sparse and transient: a "functional desert" in which accumulation is dynamically unsupported. Near the critical percolation threshold, system-spanning cascades emerge, enabling discriminative functional responses. We illustrate this transition using a minimal toy model and generalize the argument to arbitrary networked systems. Also near criticality, cascades introduce finite-timescale structural and functional coherence, directional bias, and weak dynamical path-dependence into otherwise memoryless exploration, allowing biased accumulation. This connectivity-driven transition (functional percolation) requires only generic ingredients: interacting units, nonlinear thresholds, influence transmission, and non-zero coherence times. The mechanism does not explain specific biochemical pathways, but it identifies a necessary dynamical regime in which structured functional organization can emerge and be temporarily supported, providing a physical foundation for how combinatorial feasibility barriers can be crossed through network dynamics alone.
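
The connectivity-driven transition can be illustrated with a small simulation. This is a sketch under stated assumptions and not the toy model from the paper: it uses an Erdos-Renyi random graph, an absolute activation threshold (one active neighbour by default), and hypothetical function names. With that threshold the largest cascade coincides with the largest connected component, so cascades stay local below mean degree 1 and span most of the system well above it.

```python
import random

def random_graph(n, mean_degree, rng):
    """Erdos-Renyi graph as adjacency lists; each pair is linked with
    probability mean_degree / (n - 1)."""
    p = mean_degree / (n - 1)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def cascade_size(adj, seed_node, threshold=1):
    """Activate seed_node, then activate any unit that has accumulated at
    least `threshold` active neighbours. Returns the number of active units."""
    active = {seed_node}
    counts = [0] * len(adj)  # active-neighbour count per unit
    frontier = [seed_node]
    while frontier:
        node = frontier.pop()
        for nb in adj[node]:
            if nb in active:
                continue
            counts[nb] += 1
            if counts[nb] >= threshold:
                active.add(nb)
                frontier.append(nb)
    return len(active)

def largest_cascade_fraction(n, mean_degree, seed=0):
    """Largest cascade over all single-unit seeds, as a fraction of n."""
    rng = random.Random(seed)
    adj = random_graph(n, mean_degree, rng)
    return max(cascade_size(adj, s) for s in range(n)) / n

below = largest_cascade_fraction(600, 0.5)  # sparse: cascades die out locally
above = largest_cascade_fraction(600, 3.0)  # dense: system-spanning cascades
```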

A mathematical framework to study organising principles in graphical representations of biochemical processes

2026-01-23T18:38:08Z

The complexity of molecular and cellular processes forces experimental studies to focus on subsystems. To study the functioning of biological systems across levels of structural and functional organisation, we require tools to compose and organise networks with different levels of detail and abstraction. Systems Biology Graphical Notation (SBGN) is a standardised notational system that visualises biochemical processes as networks. Despite their widespread adoption, SBGN languages remain purely visual and lack an underlying mathematical framework, limiting their compositional analysis, abstraction, and integration with formal modelling approaches. SBGN comprises three complementary visual languages, Process Description (SBGN-PD), Activity Flow (SBGN-AF), and Entity Relationship (SBGN-ER), each operating at a different level of abstraction. In this manuscript, we introduce a category-theoretic formalism for SBGN-PD, a visual language to describe biochemical processes as biochemical reaction networks. Using the theory of structured cospans, we construct a symmetric monoidal double category whose horizontal 1-morphisms correspond to SBGN-PD diagrams. We also analyse how a designated subnetwork influences the surrounding network and how external entities, in turn, affect the internal reactions of the subnetwork. Our work addresses a key gap between biological visualisation and mathematical structure. It provides precise organising principles for SBGN-PD, including compositionality, enabling the construction of large biochemical reaction networks from smaller ones, and zooming out, allowing the abstraction of detailed biochemical mechanisms while preserving their functional interfaces. Throughout the paper, the proposed framework is illustrated using standard SBGN-PD examples, demonstrating its applicability to large-scale biochemical reaction networks.

Optimizing information transmission in optogenetic Wnt signaling

2026-01-23T18:34:58Z

Populations of cells regulate gene expression in response to external signals, but their ability to make reliable collective decisions is limited by both intrinsic noise in molecular signaling and variability between individual cells. In this work, we use optogenetic control of the canonical Wnt pathway as an example to study how reliably information about an external signal is transmitted to a population of cells, and determine an optimal encoding strategy to maximize information transmission from Wnt signals to gene expression. We find that it is possible to reach an information capacity beyond 1 bit only through an appropriate, discrete encoding of signals: using either no Wnt, a short Wnt pulse, or a sustained Wnt signal. By averaging over an increasing number of outputs, we systematically vary the effective noise in the pathway. As the effective noise decreases, the optimal encoding comprises more discrete input signals. These signals do not need to be fine-tuned to achieve near-optimal information transmission. The optimal code transitions into a continuous code in the small-noise limit, which can be shown to be consistent with the Jeffreys prior. We visualize the performance of different signal encodings using decoding maps. Our results suggest optogenetic Wnt signaling allows for regulatory control beyond a simple binary switch, and provides a framework to apply ideas from information processing to single-cell in vitro experiments.
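
Information capacity and the optimal input distribution of a discrete channel are commonly computed with the Blahut-Arimoto algorithm. The sketch below shows that standard algorithm. Whether the authors used it is an assumption, and the channel matrices are toy examples rather than measured Wnt response distributions.

```python
import math

def blahut_arimoto(channel, tol=1e-10, max_iter=10000):
    """Capacity of a discrete memoryless channel.

    channel[x][y] = P(y | x). Returns (capacity in bits, optimal input
    distribution). Each iteration reweights inputs by how distinguishable
    their output distribution is from the current output marginal.
    """
    nx, ny = len(channel), len(channel[0])
    p = [1.0 / nx] * nx
    lower = 0.0
    for _ in range(max_iter):
        # Output marginal under the current input distribution.
        q = [sum(p[x] * channel[x][y] for x in range(nx)) for y in range(ny)]
        # KL divergence (bits) between each input's outputs and the marginal.
        d = [sum(channel[x][y] * math.log2(channel[x][y] / q[y])
                 for y in range(ny) if channel[x][y] > 0)
             for x in range(nx)]
        w = [p[x] * 2.0 ** d[x] for x in range(nx)]
        total = sum(w)
        lower, upper = math.log2(total), max(d)  # bounds on the capacity
        p = [v / total for v in w]
        if upper - lower < tol:
            break
    return lower, p

# Toy example: three perfectly distinguishable inputs (e.g. no signal, a
# short pulse, a sustained signal) carry log2(3) bits, i.e. more than 1 bit.
noiseless_three = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
cap3, p3 = blahut_arimoto(noiseless_three)
```

Adding noise to the rows of `channel` lowers the capacity and changes which inputs receive weight in the optimal distribution.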

Latent Causal Diffusions for Single-Cell Perturbation Modeling

2026-01-20T16:15:38Z

Perturbation screens hold the potential to systematically map regulatory processes at single-cell resolution, yet modeling and predicting transcriptome-wide responses to perturbations remains a major computational challenge. Existing methods often underperform simple baselines, fail to disentangle measurement noise from biological signal, and provide limited insight into the causal structure governing cellular responses. Here, we present the latent causal diffusion (LCD), a generative model that frames single-cell gene expression as a stationary diffusion process observed under measurement noise. LCD outperforms established approaches in predicting the distributional shifts of unseen perturbation combinations in single-cell RNA-sequencing screens while simultaneously learning a mechanistic dynamical system of gene regulation. To interpret these learned dynamics, we develop an approach we call causal linearization via perturbation responses (CLIPR), which yields an approximation of the direct causal effects between all genes modeled by the diffusion. CLIPR provably identifies causal effects under a linear drift assumption and recovers causal structure in both simulated systems and a genome-wide perturbation screen, where it clusters genes into coherent functional modules and resolves causal relationships that standard differential expression analysis cannot. The LCD-CLIPR framework bridges generative modeling with causal inference to predict unseen perturbation effects and map the underlying regulatory mechanisms of the transcriptome.

A generalized work theorem for stopped stochastic chemical reaction networks

2026-01-19T09:09:52Z

We establish a generalized work theorem for stochastic chemical reaction networks (CRNs). By using a compensated Poisson jump process, we identify a martingale structure in a generalized entropy defined relative to an auxiliary backward process and extend nonequilibrium work relations to processes stopped at bounded arbitrary times. Our results apply to discrete, mesoscopic chemical reaction networks and remain valid for singular initial conditions and state-dependent termination events. We show how martingale properties emerge directly from the structure of reaction propensities without assuming detailed balance. Stochastic simulations of a simple chemical kinetic proofreading network are used to explore the dependence of the exponentiated entropy production on initial conditions and model parameters, validating our new work theorem relationships. Our results provide new quantitative tools for analyzing biological circuits ranging from metabolic to gene regulation pathways.

Information Transmission and Processing in G-Protein-Coupled-Receptor Complexes

2026-01-18T14:48:39Z

G-protein-coupled receptors (GPCRs) are central to cellular information processing, yet the physical principles governing their switching behavior remain incompletely understood. We present a first-principles theoretical framework, grounded in nonequilibrium thermodynamics, to describe GPCR switching as observed in light-controlled impedance assays. The model identifies two fundamental control parameters: (1) ATP/GTP-driven chemical flux through the receptor complex, and (2) the free-energy difference between phosphorylated and dephosphorylated switch states. Together, these parameters define the switch configuration. The model predicts that GPCRs can occupy one of three quasi-stable configurations, each corresponding to a local maximum in information transmission. Active states support chemical flux and exist in an on or off switch configuration, whereas inactive states lack flux, introducing a distinction absent in conventional phosphorylation models. The model takes two ligand-derived inputs: fixed structural features and inducible conformations (e.g. cis or trans). It shows that phosphatase activity, modeled as an energy barrier, chiefly governs on/off occupancy, whereas the kinase sustains flux without directly determining the switch configuration. Comparison with experimental data confirms the predicted existence of multiple quasi-stable states modulated by ligand conformation. Importantly, this framework generalizes beyond GPCRs to encompass a wider class of biological switching systems driven by nonequilibrium chemical flux.

Fluctuation Theorems from a Continuous-Time Markov Model of Information-Thermodynamic Capacity in Biochemical Signal Cascades

2026-01-17T07:31:39Z

Biochemical signaling cascades transmit intracellular information while dissipating energy under nonequilibrium conditions. We model a cascade as a code string and apply information-entropy ideas to quantify an optimal transmission rate. A time-normalized entropy functional is maximized to define a capacity-like quantity governed by a conserved multiplier. To place the theory on a rigorous stochastic-thermodynamic footing, we formulate stepwise signaling as a continuous-time Markov jump process with forward and reverse competing rates. The embedded jump chain yields well-defined transition probabilities that justify time-scale-based expressions. Under local detailed balance, the log ratio of forward and reverse rates can be interpreted as entropy production per event, enabling a trajectory-level derivation of detailed and integral fluctuation theorems. We further connect the information-theoretic capacity to the mean dissipation rate and outline finite-time fluctuation structure via the scaled cumulant generating function (SCGF) and Gallavotti--Cohen symmetry, including a worked example using MAPK/ERK timescales.
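
The link between competing rates, the embedded jump chain, and a fluctuation theorem can be checked numerically in the simplest case. The sketch below assumes a homogeneous cascade in which every step has the same forward rate `k_fwd` and reverse rate `k_rev` (hypothetical values, not MAPK/ERK parameters), and it measures entropy production in units of the Boltzmann constant. It then evaluates the integral fluctuation theorem average exactly over all jump sequences of fixed length.

```python
import math

def jump_probabilities(k_fwd, k_rev):
    """Embedded jump chain of a two-way competing process.

    Returns (P(forward jump), P(reverse jump), mean dwell time).
    """
    total = k_fwd + k_rev
    return k_fwd / total, k_rev / total, 1.0 / total

def entropy_per_event(k_fwd, k_rev):
    """Entropy produced by one forward jump under local detailed balance."""
    return math.log(k_fwd / k_rev)

def integral_ft_average(k_fwd, k_rev, n_jumps):
    """Exact <exp(-sigma)> over all sequences of n_jumps jumps, where
    sigma = (n_forward - n_reverse) * ln(k_fwd / k_rev)."""
    p_f, p_r, _ = jump_probabilities(k_fwd, k_rev)
    ds = entropy_per_event(k_fwd, k_rev)
    avg = 0.0
    for n_f in range(n_jumps + 1):
        n_r = n_jumps - n_f
        weight = math.comb(n_jumps, n_f) * p_f ** n_f * p_r ** n_r
        avg += weight * math.exp(-(n_f - n_r) * ds)
    return avg

k_fwd, k_rev = 2.0, 0.5  # hypothetical rates, forward-biased
p_f, p_r, dwell = jump_probabilities(k_fwd, k_rev)
mean_sigma_per_jump = (p_f - p_r) * entropy_per_event(k_fwd, k_rev)
ift = integral_ft_average(k_fwd, k_rev, n_jumps=10)
```

The mean entropy production per jump is positive, while the exponential average equals one up to rounding error, as the integral fluctuation theorem requires.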

The Protective Effects of the Ethyl Acetate Part of Er Miao San on Adjuvant Arthritis Rats by Regulating the Function of Bone Marrow-Derived Dendritic Cells

2026-01-16T09:35:35Z

Aims. The aim of this study was to evaluate the protective effects of Er Miao San (EMS) and the regulative function of bone marrow-derived dendritic cells (BMDCs) on adjuvant arthritis (AA) in rats. Methods. The ethyl acetate part of EMS (3 g/kg, 1.5 g/kg, and 0.75 g/kg) was orally administered from day 15 after immunization to day 29. The polyarthritis index and paw swelling were measured, the ankle joint pathological changes were observed using hematoxylin-eosin (HE) staining, and the spleen and thymus indices were determined. Moreover, T and B cell proliferation was determined using the CCK-8 assay. The expression of BMDC surface costimulatory molecules and of inflammatory factors was determined using flow cytometry and ELISA kits, respectively. Results. Compared with the AA model rats, the ethyl acetate fraction of EMS markedly reduced paw swelling (from 1.0 to 0.7) and the polyarthritis index (from 12 to 9) (P < 0.01) and reduced the severity of histopathology (P < 0.01). Treatment with the ethyl acetate fraction of EMS significantly reduced the spleen and thymus indices (P < 0.01) and inhibited T and B cell proliferation (P < 0.01). Moreover, EMS significantly modulated the expression of surface costimulatory molecules in BMDCs, including CD40, CD80, CD86, and major histocompatibility complex class II (MHC-II) (P < 0.01). The results also showed that the ethyl acetate part of EMS significantly inhibited the levels of the proinflammatory cytokines interleukin-23 (IL-23) and tumor necrosis factor-α (TNF-α) and of the inflammatory factor prostaglandin E2 (PGE2) in the supernatant of BMDCs. However, the level of the anti-inflammatory cytokine IL-10 was significantly increased (P < 0.01). Conclusion. These results suggest that the ethyl acetate part of EMS has better protective effects on AA rats, probably by regulating the function of BMDCs and modulating the balance of cytokines.

Markovian Promoter Models: A Mechanistic Alternative to Hill Functions in Gene Regulatory Networks

2026-01-14T22:54:59Z

Gene regulatory networks are typically modeled using ordinary differential equations (ODEs) with phenomenological Hill functions to represent transcriptional regulation. While computationally efficient, Hill functions lack mechanistic grounding and cannot capture stochastic promoter dynamics. We present a hybrid Markovian-ODE framework that explicitly models discrete promoter states while maintaining computational tractability. Uniquely, we parameterize this model using fractional dwell times derived from ChEC-seq data, enabling the inference of in vivo kinetic rates from steady-state chromatin profiling. Our approach tracks individual transcription factor binding events as a continuous-time Markov chain, linked to deterministic molecular dynamics. We validate this framework on seven gene regulatory systems spanning basic to advanced complexity: the GAL system, repressilator, Goodwin oscillator, toggle switch, incoherent feed-forward loop, p53-Mdm2 oscillator, and NF-$κ$B pathway. Comparison with stochastic simulation algorithm (SSA) ground truth demonstrates that Markovian promoter models achieve similar accuracy to full stochastic simulations while being 10-100$\times$ faster. Our framework provides a mechanistic foundation for gene regulation modeling and enables investigation of promoter-level stochasticity in complex regulatory networks.
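
The contrast between a Hill function and an explicit promoter model can be made concrete with a two-state example. The sketch below is an illustration under stated assumptions, not the authors' framework: a single promoter switches OFF to ON at rate `k_on * tf` and ON to OFF at rate `k_off`, and protein follows dp/dt = beta * state - gamma * p, solved exactly between promoter jumps. All parameter names and values are hypothetical. For this one-site promoter, the long-run fraction of time spent ON (the fractional dwell time) equals a Hill function with coefficient 1 and K = k_off / k_on.

```python
import math
import random

def hill(tf, K, n=1):
    """Phenomenological Hill activation function."""
    return tf ** n / (K ** n + tf ** n)

def simulate_promoter(tf, k_on, k_off, beta, gamma, n_switches, seed=0):
    """Hybrid simulation: Markov promoter state plus deterministic protein ODE.

    Returns (fraction of time the promoter is ON, final protein level).
    """
    rng = random.Random(seed)
    state, p = 0, 0.0  # promoter starts OFF with no protein
    t_on = t_total = 0.0
    for _ in range(n_switches):
        rate = k_off if state else k_on * tf
        dwell = rng.expovariate(rate)  # exponential waiting time in this state
        # Exact solution of the linear ODE over the dwell interval.
        target = beta * state / gamma
        p = target + (p - target) * math.exp(-gamma * dwell)
        t_on += dwell * state
        t_total += dwell
        state = 1 - state  # switch promoter state
    return t_on / t_total, p

frac_on, protein = simulate_promoter(
    tf=2.0, k_on=1.0, k_off=1.0, beta=10.0, gamma=1.0, n_switches=40000
)
expected = hill(2.0, K=1.0)  # k_off / k_on = 1
```

Unlike the Hill function, the simulated trajectory also carries the promoter-level fluctuations in protein level, which is the information a purely deterministic ODE model discards.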