http://arxiv.org/abs/2510.09804v1
Rapid Development of Omics Data Analysis Applications through Vibe Coding
2025-10-10T19:06:27Z
Building custom data analysis platforms traditionally requires extensive software engineering expertise, limiting accessibility for many researchers. Here, I demonstrate that modern large language models (LLMs) and autonomous coding agents can dramatically lower this barrier through a process called 'vibe coding', an iterative, conversational style of software creation where users describe goals in natural language and AI agents generate, test, and refine executable code in real time. As a proof of concept, I used vibe coding to create a fully functional proteomics data analysis website capable of performing standard tasks, including data normalization, differential expression testing, and volcano plot visualization. The entire application, including user interface, backend logic, and data upload pipeline, was developed in less than ten minutes using only four natural-language prompts, without any manual coding, at a cost of under $2. Previous works in this area typically require tens of thousands of dollars in research effort from highly trained programmers. I detail the step-by-step generation process and evaluate the resulting code's functionality. This demonstration highlights how vibe coding enables domain experts to rapidly prototype sophisticated analytical tools, transforming the pace and accessibility of computational biology software development.
2025-10-10T19:06:27Z
Jesse G. Meyer

http://arxiv.org/abs/2510.09757v1
A path towards AI-scale, interoperable biological data
2025-10-10T18:04:19Z
Biology is at the precipice of a new era where AI accelerates and amplifies the ability to study how cells operate, organize, and work as systems, revealing why disease happens and how to correct it.
Organizations globally are prioritizing AI to accelerate basic research, drug discovery, personalized medicine, and synthetic biology. However, despite these opportunities, scientific data have proven a bottleneck, and progress has been slow and fragmented. Unless the scientific community takes a technology-led, community-focused approach to scaling and harnessing data, we will fail to capture this opportunity to drive new insights and biological discovery. The data bottleneck presents a unique paradox. It is increasingly simple to generate huge data volumes, thanks to expanding imaging datasets and plummeting sequencing costs, but scientists lack standards and tooling for large biological datasets, preventing integration into a multimodal foundational dataset that unlocks generalizable models of cellular and tissue function. This contradiction highlights two interrelated problems: abundant data that's difficult to manage, and a lack of data resources with necessary quality and utility to realize AI's potential in biology. Science must forge a collective approach enabling distributed contributions to combine into cohesive, powerful datasets transcending individual purposes. Here, we present a technological and data generation roadmap for scaling scientific impact. We outline AI's opportunity, mechanisms to scale data generation, the need for multi-modal measurements, and means to pool resources, standardize approaches, and collectively build the foundation enabling AI's full potential in biological discovery.
2025-10-10T18:04:19Z
8 pages, 2 images
Brian Aevermann, Andrea Califano, Chi-Li Chiu, Nathan Clack, William M. Clemons, Jonah Cool, Florence D. D'Orazi, Joseph L. DeRisi, Joshua E. Elias, Elizabeth Fahsbender, Scott E. Fraser, Carlos G. Gonzalez, Matthias Haury, Theofanis Karaletsos, Shana O. Kelley, Aly A. Khan, Alan R. Lowe, Emma Lundberg, Ryan A. McClure, Stephani Otte, Evan O. Paull, Loïc A. Royer, Dana Sadgat, Sandra L. Schmid, Samantha Scovanner, Cathy Stolitzka, Jason R.
Swedlow, Joan Wong, Garabet Yeretssian, Patricia Brennan, Ambrose J. Carr

http://arxiv.org/abs/2510.07357v1
Allelopathic effects of Rumex azoricus on lettuce: impacts on seed germination and early growth
2025-10-08T15:45:52Z
Members of the Rumex genus possess allelochemical compounds that vary depending on the plant part and extract concentrations. Therefore, this study aimed to investigate the allelopathic effects of extracts from the roots, stems, and leaves of Rumex azoricus at concentrations of 0%, 25%, 50%, and 100% on the seed germination of a lettuce plant in a laboratory setting. The results indicated that stem extract was the most effective at enhancing germination percentage (68.67%), germination speed (5.12 seeds/time interval), and subsequent traits, compared with the germination percentage (50%), germination speed (3.6 seeds/time interval), and subsequent traits of control seeds. The 25% extract concentration improved germination percentage (68%) and germination speed (5.05 seeds/time interval), along with subsequent traits compared to control (0%), which exhibited the lowest germination percentage (50%), germination speed (3.6 seeds/time interval), and related traits. The combined results also demonstrated that 25% stem extract significantly increased germination percentage (80%), speed (5.85 seeds/time interval), root length (1.2 cm), root fresh weight (0.032 mg), shoot length (2.2 cm), and shoot fresh weight (0.06 mg) in contrast to control seeds, which showed the minimum germination percentage (50%), speed (3.6 seeds/time interval), root length (0.17 cm), root fresh weight (0.006 mg), shoot length (0.95 cm), and shoot fresh weight (0.03 mg). The allelopathic effects of R.
azoricus extract varied depending on the plant part and concentration; both stem and leaf extracts at low concentrations were the most effective, whereas root extracts at all concentrations produced results similar to those of control seeds.
2025-10-08T15:45:52Z
Revista Brasileira de Agropecuária Sustentável (RBAS); v. 15; n. 1; p. 45-54; Agosto; 2025
Abdulrahman Ibrahim, Mariana Casari Parreira, Aram Akram Mohammed, Hawar Halshoy

http://arxiv.org/abs/2510.02205v2
Charting dissipation across the microbial world
2025-10-08T15:29:40Z
The energy dissipated by a living organism is commonly identified with heat generation. However, as cells exchange metabolites with their environment they also dissipate energy in the form of chemical entropy. How dissipation is distributed between exchanges of heat and chemical entropy is largely unexplored. Here, we analyze an extensive experimental database recently created [1] to investigate how microbes partition dissipation between thermal and chemical entropy during growth. We find that aerobic respiration exchanges little chemical entropy and dissipation is primarily due to heat production, as commonly assumed. However, we also find several types of anaerobic metabolism that produce as much chemical entropy as heat. Counterintuitively, instances of anaerobic metabolisms such as acetotrophic methanogenesis and sulfur respiration are endothermic.
We conclude that, because of their metabolic versatility, microbes are able to exploit all combinations of heat and chemical entropy exchanges that result in a net production of entropy.
2025-10-02T16:52:41Z
Tommaso Cossetto, Jonathan Rodenfels, Pablo Sartori

http://arxiv.org/abs/2510.06781v1
The Epigenetic Tapestry: A Review of DNA Methylation and Non-Coding RNA's Interplay with Genetic Threads, Weaving a Network Impacting Gene Expression and Disease Manifestations
2025-10-08T09:06:10Z
The emerging field of epigenetics has recently unveiled a dynamic landscape in which gene expression is not determined solely by genetic sequences but also by intricate regulatory mechanisms. This review examines the interactions between these regulatory mechanisms, including DNA methylation and non-coding RNAs (ncRNAs), that orchestrate gene expression fine-tuning for cellular homeostasis and the pathogenesis of a multitude of diseases. We explore long non-coding RNAs (lncRNAs) such as telomeric repeat-containing RNA (TERRA) and Fendrr, highlighting their role in protein regulation to ensure proper gene activation or silencing. Additionally, we explain the therapeutic potential of brain-derived neurotrophic factor (BDNF)-related microRNA 132, which has shown promise in treating chronic illnesses by restoring BDNF levels. Finally, this review covers the role of DNA methyltransferases and ncRNAs in cancer, focusing on how lncRNAs contribute to X chromosome inactivation and interact with chromatin-modifying complexes and DNA methyltransferase inhibitors to reduce cancer cell aggressiveness.
By amalgamating the wide array of research in this field, we aim to provide glimpses into the complex entangling of genetics and environment as they control gene expression.
2025-10-08T09:06:10Z
31 pages, unaffiliated review article
Yu-Li He, Youshin Loh

http://arxiv.org/abs/2510.05775v1
Disentangling peri-urban river hypoxia
2025-10-07T10:44:31Z
Episodes of low dissolved oxygen concentration (hypoxia) threaten the functioning of and the services provided by aquatic ecosystems, particularly those of urban rivers. Here, we disentangle oxygen-related processes in the highly modified Elbe River flowing through the major German city of Hamburg, where low oxygen levels are frequently observed. We use a process-based biochemical model that describes particulate and dissolved organic matter, micro-algae, their pathogens, and the key reactions that produce or consume oxygen: photosynthesis, re-aeration, respiration, mineralization, and nitrification. The model analysis reveals pronounced spatial variability in the relative importance of these processes. Photosynthesis and respiration are more prominent upstream of the city, while mineralization, nitrification, and re-aeration prevail downstream. The city, characterized by rapid changes in bathymetry, marks a transitional area: pathogen-related micro-algal lysis may increase organic material, explaining the shift towards heterotrophic processes downstream. The model analysis identifies a differential temperature sensitivity of biochemical rates as the primary driver of seasonal changes.
These results may be extrapolated to other urban rivers, and also provide valuable information for estuarine water quality management.
2025-10-07T10:44:31Z
23 pages, 12 figures, includes supplementary material
Ovidio García-Oliva, Carsten Lemmen, Xiangyu Li, Kai Wirtz

http://arxiv.org/abs/2510.05705v1
The Software Observatory: aggregating and analysing software metadata for trend computation and FAIR assessment
2025-10-07T09:15:02Z
In the ever-changing realm of research software development, it is crucial for the scientific community to grasp current trends to identify gaps that can potentially hinder scientific progress. The adherence to the FAIR (Findable, Accessible, Interoperable, Reusable) principles can serve as a proxy to understand those trends and provide a mechanism to propose specific actions.
The Software Observatory at OpenEBench (https://openebench.bsc.es/observatory) is a novel web portal that consolidates software metadata from various sources, offering comprehensive insights into critical research software aspects. Our platform enables users to analyse trends, identify patterns and advancements within the Life Sciences research software ecosystem, and understand its evolution over time. It also evaluates research software according to FAIR principles for research software, providing scores for different indicators.
Users have the ability to visualise this metadata at different levels of granularity, ranging from the entire software landscape to specific communities to individual software entries through the FAIRsoft Evaluator. Indeed, the FAIRsoft Evaluator component streamlines the assessment process, helping developers efficiently evaluate and obtain guidance to improve their software's FAIRness.
The Software Observatory represents a valuable resource for researchers and software developers, as well as stakeholders, promoting better software development practices and adherence to FAIR principles for research software.
2025-10-07T09:15:02Z
Eva Martín del Pico, Josep Lluís Gelpí, Salvador Capella-Gutiérrez

http://arxiv.org/abs/2511.14771v1
Efficient Constraining of Transcoding in DNA-Based Image Storage
2025-10-07T08:22:32Z
DNA has emerged as a promising alternative for long-term data storage due to its high capacity, durability, and low-energy potential. However, storing data in DNA presents several challenges. First, it requires complex and costly biochemical processes, making efficient compression crucial to reducing DNA synthesis time and cost. Second, these processes are prone to errors that must be avoided and/or corrected. In particular, homopolymers (repetitions of the same nucleotide) are a well-known source of errors during the sequencing step. Avoiding such repetitions helps mitigate errors but introduces a constraint that may increase the data compression rate. In this paper, we propose two transcoding methods that address these two key challenges: reducing data rate and minimizing errors. The first method strictly enforces the error-minimization constraint by eliminating homopolymers of a certain length, at the cost of an increased data rate. In contrast, the second method accepts a slight increase in homopolymers. However, we show that these increases remain limited (2.14% increase in compression rate for the first method and 0.39% homopolymer rate for the second).
These two approaches demonstrate that it is possible to efficiently constrain transcoding while balancing error minimization and compression performance.
2025-10-07T08:22:32Z
2025 IEEE International Conference on Image Processing (ICIP), Sep 2025, Rennes, France
Sara Al Sayyed, Aline Roumy, Thomas Maugey

http://arxiv.org/abs/2510.01302v1
Hybrid Predictive Modeling of Malaria Incidence in the Amhara Region, Ethiopia: Integrating Multi-Output Regression and Time-Series Forecasting
2025-10-01T16:16:47Z
Malaria remains a major public health concern in Ethiopia, particularly in the Amhara Region, where seasonal and unpredictable transmission patterns make prevention and control challenging. Accurately forecasting malaria outbreaks is essential for effective resource allocation and timely interventions. This study proposes a hybrid predictive modeling framework that combines time-series forecasting, multi-output regression, and conventional regression-based prediction to forecast the incidence of malaria. Environmental variables, past malaria case data, and demographic information from Amhara Region health centers were used to train and validate the models. The multi-output regression approach enables the simultaneous prediction of multiple outcomes, including Plasmodium species-specific cases, temporal trends, and spatial variations, whereas the hybrid framework captures both seasonal patterns and correlations among predictors. The proposed model exhibits higher prediction accuracy than single-method approaches, exposing hidden patterns and providing valuable information to public health authorities.
This study provides a valid and repeatable malaria incidence prediction framework that can support evidence-based decision-making, targeted interventions, and resource optimization in endemic areas.
2025-10-01T16:16:47Z
Kassahun Azezew, Amsalu Tesema, Bitew Mekuria, Ayenew Kassie, Animut Embiale, Ayodeji Olalekan Salau, Tsega Asresa

http://arxiv.org/abs/2510.00195v1
Identification of post-COVID-19 symptoms using brain structural MRI features: a machine learning approach
2025-09-30T19:12:16Z
Identifying long COVID symptoms is a challenging task, primarily due to the reliance on patient reports and the lack of disease-specific biomarkers. The objective of this study is to identify individual long COVID symptoms, post-COVID-19 condition (PCC) participants, and participants' sex, and to identify the associated brain regions by developing an explainable machine learning algorithm using brain MRI features. This study implements secondary analysis using an anonymized, publicly accessible dataset that categorizes participants into three groups: the PCC group, the Unimpaired Post-COVID-19 group (UPC), and the Healthy Non-COVID group (HNC), each with corresponding symptoms, demographics, and brain structural MRI features. The aim is to develop and cross-validate a support vector classifier (SVC) algorithm to identify the occurrence of various target labels from the dataset. The SVC classifier identified the occurrence of long COVID symptoms, with performance varying across target labels. The model performance and influential areas are identified and discussed in light of previous research.
The demonstrated approach offers an alternative modality for determining the occurrence of long COVID symptoms based on neuroimaging biomarkers.
2025-09-30T19:12:16Z
Abdi Reza

http://arxiv.org/abs/2509.25591v1
Building the EHR Foundation Model via Next Event Prediction
2025-09-29T23:27:51Z
Electronic Health Records (EHRs) contain rich temporal dynamics that conventional encoding approaches fail to adequately capture. While Large Language Models (LLMs) show promise for EHR modeling, they struggle to reason about sequential clinical events and temporal dependencies. We propose Next Event Prediction (NEP), a framework that enhances LLMs' temporal reasoning through autoregressive fine-tuning on clinical event sequences. By reformulating EHRs as timestamped event chains and predicting future medical events, NEP explicitly models disease progression patterns and causal relationships. Extensive evaluations across oncology survival prediction and clinical diagnosis tasks demonstrate NEP's superiority, outperforming specialized EHR models by 4.6% AUROC and general-purpose LLMs by 7.2% C-index in temporal reasoning tasks. Our analyses reveal dual benefits: state-of-the-art prediction accuracy combined with clinically interpretable attention patterns that align with known disease pathways.
2025-09-29T23:27:51Z
Zekai Chen, Arda Pekis, Kevin Brown

http://arxiv.org/abs/2509.22424v1
Desiderata for a biomedical knowledge network: opportunities, challenges and future Directions
2025-09-26T14:44:45Z
Knowledge graphs, collectively as a knowledge network, have become critical tools for knowledge discovery in computable and explainable knowledge systems. Due to the semantic and structural complexities of biomedical data, these knowledge graphs need to enable dynamic reasoning over large evolving graphs and support fit-for-purpose abstraction, while establishing standards, preserving provenance and enforcing policy constraints for actionable discovery.
A recent meeting of leading scientists discussed the opportunities, challenges and future directions of a biomedical knowledge network. Here we present six desiderata inspired by the meeting: (1) inference and reasoning in biomedical knowledge graphs need domain-centric approaches; (2) harmonized and accessible standards are required for knowledge graph representation and metadata; (3) robust validation of biomedical knowledge graphs needs multi-layered, context-aware approaches that are both rigorous and scalable; (4) the evolving and synergistic relationship between knowledge graphs and large language models is essential in empowering AI-driven biomedical discovery; (5) integrated development environments, public repositories, and governance frameworks are essential for secure and reproducible knowledge graph sharing; and (6) robust validation, provenance, and ethical governance are critical for trustworthy biomedical knowledge graphs. Addressing these key issues will be essential to realize the promises of a biomedical knowledge network in advancing biomedicine.
2025-09-26T14:44:45Z
6 pages, 2 figures
Chunlei Wu, Hongfang Liu, Jason Flannick, Mark A. Musen, Andrew I. Su, Lawrence Hunter, Thomas M. Powers, Cathy H. Wu

http://arxiv.org/abs/2503.23494v3
Interpretable structural-semantic decoding reveals language-like organisation of regulatory information in DNA
2025-09-26T10:17:40Z
Decoding how linear DNA encodes regulatory information remains a central challenge. Existing decoding approaches lack interpretability and struggle to reveal the underlying coding principles. Here, we present the interpretability-first, structural artificial intelligence (AI) framework for DNA (ISAF4DNA), which uses state-aware symbolic encoding and couples structural unit discovery with semantic validation to form a closed-loop structural-semantic decoder.
When applied to N6-methyladenine (6mA) datasets from 63 species, ISAF4DNA reveals a language-like organization of regulatory information: (i) a conserved motif-derivation pathway AT -> GAT/ATC -> GATC; (ii) two forms of redundant syntax: anchor-type structures with a conserved core and selective flanks, and fuzzy-type clusters composed of distributed units with positional tolerance; and (iii) differential deployment trends between prokaryotes and multicellular eukaryotes. Together, these observations motivate the development of a testable framework, EpigenoLinguistics, that treats motifs as lexical units, redundancy as syntax, and deployment as pragmatics. This framework advances the "DNA as language" concept from a metaphor to a falsifiable framework with supporting evidence, thereby bridging biology and computational linguistics. ISAF4DNA advances the application of AI techniques in biology from black-box predictions to mechanism-level signals, augments database annotations, and guides regulatory-element design, with principles extensible to other modifications.
2025-03-30T16:03:02Z
Li Yang, Dongbo Wang

http://arxiv.org/abs/2509.19206v1
A decentralized future for the open-science databases
2025-09-23T16:28:21Z
Continuous and reliable access to curated biological data repositories is indispensable for accelerating rigorous scientific inquiry and fostering reproducible research. Centralized repositories, though widely used, are vulnerable to single points of failure arising from cyberattacks, technical faults, natural disasters, or funding and political uncertainties. This can lead to widespread data unavailability, data loss, integrity compromises, and substantial delays in critical research, ultimately impeding scientific progress. Centralizing essential scientific resources in a single geopolitical or institutional hub is inherently dangerous, as any disruption can paralyze diverse ongoing research.
The rapid acceleration of data generation, combined with an increasingly volatile global landscape, necessitates a critical re-evaluation of the sustainability of centralized models. Implementing federated and decentralized architectures presents a compelling and future-oriented pathway to substantially strengthen the resilience of scientific data infrastructures, thereby mitigating vulnerabilities and ensuring the long-term integrity of data. Here, we examine the structural limitations of centralized repositories, evaluate federated and decentralized models, and propose a hybrid framework for resilient, FAIR, and sustainable scientific data stewardship. Such an approach offers a significant reduction in exposure to governance instability, infrastructural fragility, and funding volatility, and also fosters fairness and global accessibility. The future of open science depends on integrating these complementary approaches to establish a globally distributed, economically sustainable, and institutionally robust infrastructure that safeguards scientific data as a public good, further ensuring continued accessibility, interoperability, and preservation for generations to come.
2025-09-23T16:28:21Z
21 Pages, 2 figures
Gaurav Sharma, Viorel Munteanu, Nika Mansouri Ghiasi, Jineta Banerjee, Susheel Varma, Luca Foschini, Kyle Ellrott, Onur Mutlu, Dumitru Ciorbă, Roel A. Ophoff, Viorel Bostan, Christopher E Mason, Jason H. Moore, Despoina Sousoni, Arunkumar Krishnan, Christopher E. Mason, Mihai Dimian, Gustavo Stolovitzky, Fabio G. Liberante, Taras K. Oleksyk, Serghei Mangul

http://arxiv.org/abs/2509.19393v1
MetaQuestion: A web application for expert knowledge elicitation addressing plant health and applied plant ecology
2025-09-22T22:39:20Z
1. Expert knowledge elicitation provides information to characterize ecological systems and management options. Linking expert knowledge elicitation with a curated question catalog supports a community of practice for ongoing improvement of question quality.
2. The MetaQuestion web app we introduce here draws on the PlantQuest catalog of questions addressing applied plant ecology and management options, making the catalog available in a flexible form for organizers of expert knowledge elicitation. Organizers can select among questions in the catalog, modify them as needed, and generate an instrument customized to their elicitation project. MetaQuestion makes available PlantQuest questions specialized for the study of invasive species such as pathogens and arthropod pests, including geographic analyses of prevalence and network analysis of the movement of plant materials.
3. Experts answer questions in the customized instrument and their responses are compiled. For settings where internet access may be sporadic, there are options to download the instrument for experts' work and then upload responses later. MetaQuestion provides the resulting dataset in a CSV file for analysis in users' choice of software.
4. Development of the PlantQuest catalog and the MetaQuestion app is ongoing, incorporating lessons learned from applications of the app. The MetaQuestion app could also be adapted to address questions from other subject areas.
2025-09-22T22:39:20Z
Robert Fontan, Christopher M. Perez, Ashish Adhikari, Romaric A. Mouafo-Tchinda, Aaron I. Plex Sulá, Jacobo Robledo, Berea A. Etherton, Manoj Choudhary, Muhammad Aqeel Sarwar, Zunaira Afzal Naveed, Karen A. Garrett