http://arxiv.org/abs/2510.09804v1 Rapid Development of Omics Data Analysis Applications through Vibe Coding 2025-10-10T19:06:27Z

Building custom data analysis platforms traditionally requires extensive software engineering expertise, limiting accessibility for many researchers. Here, I demonstrate that modern large language models (LLMs) and autonomous coding agents can dramatically lower this barrier through a process called 'vibe coding', an iterative, conversational style of software creation where users describe goals in natural language and AI agents generate, test, and refine executable code in real time. As a proof of concept, I used vibe coding to create a fully functional proteomics data analysis website capable of performing standard tasks, including data normalization, differential expression testing, and volcano plot visualization. The entire application, including user interface, backend logic, and data upload pipeline, was developed in less than ten minutes using only four natural-language prompts, without any manual coding, at a cost of under $2. Previous work in this area typically requires tens of thousands of dollars in research effort from highly trained programmers. I detail the step-by-step generation process and evaluate the resulting code's functionality. This demonstration highlights how vibe coding enables domain experts to rapidly prototype sophisticated analytical tools, transforming the pace and accessibility of computational biology software development.
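The three standard tasks named in the abstract (normalization, differential expression testing, volcano plot coordinates) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code generated in the paper: the function names are hypothetical, median centering stands in for whatever normalization the application uses, and an exact permutation test is swapped in for a t-test so that the sketch needs only the standard library.

```python
import math
from itertools import combinations
from statistics import mean, median


def median_normalize(samples):
    """Log2-transform each sample and subtract its median (assumed normalization)."""
    normalized = []
    for intensities in samples:
        logged = [math.log2(v) for v in intensities]
        center = median(logged)
        normalized.append([v - center for v in logged])
    return normalized


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance so relabelings that tie with the observed split are counted.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


def volcano_points(control, treated):
    """Return (log2 fold change, -log10 p) per protein.

    control / treated: lists of samples, each a list of normalized log2
    values with one entry per protein, in the same protein order.
    """
    points = []
    for p in range(len(control[0])):
        a = [sample[p] for sample in control]
        b = [sample[p] for sample in treated]
        log2_fold_change = mean(b) - mean(a)
        points.append((log2_fold_change, -math.log10(permutation_p_value(a, b))))
    return points
```

The resulting pairs can be passed to any plotting library as x and y coordinates. Note that with three samples per group the smallest attainable permutation p-value is 2/20 = 0.1, so real analyses need more replicates or a parametric test.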

2025-10-10T19:06:27Z Jesse G. Meyer http://arxiv.org/abs/2510.09757v1 A path towards AI-scale, interoperable biological data 2025-10-10T18:04:19Z

Biology is at the precipice of a new era where AI accelerates and amplifies the ability to study how cells operate, organize, and work as systems, revealing why disease happens and how to correct it. Organizations globally are prioritizing AI to accelerate basic research, drug discovery, personalized medicine, and synthetic biology. However, despite these opportunities, scientific data have proven a bottleneck, and progress has been slow and fragmented. Unless the scientific community takes a technology-led, community-focused approach to scaling and harnessing data, we will fail to capture this opportunity to drive new insights and biological discovery. The data bottleneck presents a unique paradox. It is increasingly simple to generate huge data volumes, thanks to expanding imaging datasets and plummeting sequencing costs, but scientists lack standards and tooling for large biological datasets, preventing integration into a multimodal foundational dataset that unlocks generalizable models of cellular and tissue function. This contradiction highlights two interrelated problems: abundant data that's difficult to manage, and a lack of data resources with necessary quality and utility to realize AI's potential in biology. Science must forge a collective approach enabling distributed contributions to combine into cohesive, powerful datasets transcending individual purposes. Here, we present a technological and data generation roadmap for scaling scientific impact. We outline AI's opportunity, mechanisms to scale data generation, the need for multi-modal measurements, and means to pool resources, standardize approaches, and collectively build the foundation enabling AI's full potential in biological discovery.

2025-10-10T18:04:19Z 8 pages, 2 images Brian Aevermann Andrea Califano Chi-Li Chiu Nathan Clack William M. Clemons Jonah Cool Florence D. D'Orazi Joseph L. DeRisi Joshua E. Elias Elizabeth Fahsbender Scott E. Fraser Carlos G. Gonzalez Matthias Haury Theofanis Karaletsos Shana O. Kelley Aly A. Khan Alan R. Lowe Emma Lundberg Ryan A. McClure Stephani Otte Evan O. Paull Loïc A. Royer Dana Sadgat Sandra L. Schmid Samantha Scovanner Cathy Stolitzka Jason R. Swedlow Joan Wong Garabet Yeretssian Patricia Brennan Ambrose J. Carr http://arxiv.org/abs/2510.07357v1 Allelopathic effects of Rumex azoricus on lettuce: impacts on seed germination and early growth 2025-10-08T15:45:52Z

Members of the Rumex genus possess allelochemical compounds that vary depending on the plant part and extract concentrations. Therefore, this study aimed to investigate the allelopathic effects of extracts from the roots, stems, and leaves of Rumex azoricus at concentrations of 0%, 25%, 50%, and 100% on the seed germination of lettuce in a laboratory setting. The results indicated that stem extract was the most effective, enhancing germination percentage (68.67%), germination speed (5.12 seeds/time interval), and subsequent traits relative to control seeds (germination percentage 50%, germination speed 3.6 seeds/time interval). The 25% extract concentration improved germination percentage (68%) and germination speed (5.05 seeds/time interval), along with subsequent traits compared to control (0%), which exhibited the lowest germination percentage (50%), germination speed (3.6 seeds/time interval), and related traits. The combined results also demonstrated that 25% stem extract significantly increased germination percentage (80%), speed (5.85 seeds/time interval), root length (1.2 cm), root fresh weight (0.032 mg), shoot length (2.2 cm), and shoot fresh weight (0.06 mg) in contrast to control seeds, which showed the minimum germination percentage (50%), speed (3.6 seeds/time interval), root length (0.17 cm), root fresh weight (0.006 mg), shoot length (0.95 cm), and shoot fresh weight (0.03 mg). The allelopathic effects of R. azoricus extract varied depending on the plant part and concentration; both stem and leaf extracts at low concentrations were the most effective, whereas root extracts at all concentrations produced results similar to those of control seeds.
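For readers unfamiliar with the two headline metrics, the sketch below shows how germination percentage and a germination speed in seeds per time interval are commonly computed. The abstract does not state which speed formula the study used; the sum of newly germinated seeds divided by the count number (often called Maguire's index) is assumed here, and the function names and the counts in the usage note are illustrative, not data from the study.

```python
def germination_percentage(germinated, sown):
    """Share of sown seeds that germinated, in percent."""
    return 100.0 * germinated / sown


def germination_speed(new_germinations):
    """Assumed speed index: sum of n_i / t_i over successive counts.

    new_germinations[i] is the number of seeds that newly germinated at
    count i + 1 (counts are taken at equal time intervals).
    """
    return sum(n / t for t, n in enumerate(new_germinations, start=1))
```

With toy counts, `germination_percentage(40, 50)` gives 80.0 and `germination_speed([4, 2, 0, 2])` gives 4/1 + 2/2 + 0/3 + 2/4 = 5.5 seeds per time interval.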

2025-10-08T15:45:52Z Revista Brasileira de Agropecuária Sustentável (RBAS); v. 15; n. 1; p. 45-54; Agosto; 2025 Abdulrahman Ibrahim Mariana Casari Parreira Aram Akram Mohammed Hawar Halshoy http://arxiv.org/abs/2510.02205v2 Charting dissipation across the microbial world 2025-10-08T15:29:40Z

The energy dissipated by a living organism is commonly identified with heat generation. However, as cells exchange metabolites with their environment they also dissipate energy in the form of chemical entropy. How dissipation is distributed between exchanges of heat and chemical entropy is largely unexplored. Here, we analyze an extensive experimental database recently created [1] to investigate how microbes partition dissipation between thermal and chemical entropy during growth. We find that aerobic respiration exchanges little chemical entropy and dissipation is primarily due to heat production, as commonly assumed. However, we also find several types of anaerobic metabolism that produce as much chemical entropy as heat. Counterintuitively, instances of anaerobic metabolisms such as acetotrophic methanogenesis and sulfur respiration are endothermic. We conclude that, because of their metabolic versatility, microbes are able to exploit all combinations of heat and chemical entropy exchanges that result in a net production of entropy.
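The partition described above can be written compactly with standard thermodynamics at constant temperature and pressure. This is a textbook sketch rather than the authors' derivation, and the symbols are assumptions, not the paper's notation: $\Delta H$ and $\Delta S$ are the enthalpy and entropy changes of the chemical species per unit of growth reaction, and $T$ is the temperature.

```latex
\begin{align}
  T\,\Delta S_{\mathrm{tot}}
    &= -\Delta G
     = \underbrace{-\Delta H}_{\text{heat released}}
     + \underbrace{T\,\Delta S}_{\text{chemical entropy exchanged}}
     \;\ge\; 0 \\
  \Delta H > 0 \ (\text{endothermic})
    &\;\Longrightarrow\; T\,\Delta S \ge \Delta H > 0
\end{align}
```

Under these assumptions, an endothermic metabolism remains compatible with the second law only if its chemical entropy production outweighs the heat it absorbs.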

2025-10-02T16:52:41Z Tommaso Cossetto Jonathan Rodenfels Pablo Sartori http://arxiv.org/abs/2510.06781v1 The Epigenetic Tapestry: A Review of DNA Methylation and Non-Coding RNA's Interplay with Genetic Threads, Weaving a Network Impacting Gene Expression and Disease Manifestations 2025-10-08T09:06:10Z

The emerging field of epigenetics has recently unveiled a dynamic landscape in which gene expression is not determined solely by genetic sequences but also by intricate regulatory mechanisms. This review examines the interactions between these regulatory mechanisms, including DNA methylation and non-coding RNAs (ncRNAs), that orchestrate gene expression fine-tuning for cellular homeostasis and the pathogenesis of a multitude of diseases. We explore long non-coding RNAs (lncRNAs) such as telomeric repeat-containing RNA (TERRA) and Fendrr, highlighting their role in protein regulation to ensure proper gene activation or silencing. Additionally, we explain the therapeutic potential of brain-derived neurotrophic factor (BDNF)-related microRNA 132, which has shown promise in treating chronic illnesses by restoring BDNF levels. Finally, this review covers the role of DNA methyltransferases and ncRNAs in cancer, focusing on how lncRNAs contribute to X chromosome inactivation and interact with chromatin-modifying complexes and DNA methyltransferase inhibitors to reduce cancer cell aggressiveness. By amalgamating the wide array of research in this field, we aim to provide glimpses into the complex entanglement of genetics and environment as they control gene expression.

2025-10-08T09:06:10Z 31 pages, unaffiliated review article Yu-Li He Youshin Loh http://arxiv.org/abs/2510.05775v1 Disentangling peri-urban river hypoxia 2025-10-07T10:44:31Z

Episodes of low dissolved oxygen concentration--hypoxia--threaten the functioning of and the services provided by aquatic ecosystems, particularly those of urban rivers. Here, we disentangle oxygen-related processes in the highly modified Elbe River flowing through the major German city of Hamburg, where low oxygen levels are frequently observed. We use a process-based biochemical model that describes particulate and dissolved organic matter, micro-algae, their pathogens, and the key reactions that produce or consume oxygen: photosynthesis, re-aeration, respiration, mineralization, and nitrification. The model analysis reveals pronounced spatial variability in the relative importance of these processes. Photosynthesis and respiration are more prominent upstream of the city, while mineralization, nitrification, and re-aeration prevail downstream. The city, characterized by rapid changes in bathymetry, marks a transitional area: pathogen-related micro-algal lysis may increase organic material, explaining the shift towards heterotrophic processes downstream. The model analysis identifies the differential temperature sensitivity of biochemical rates as the primary driver of seasonal changes. These results may be extrapolated to other urban rivers, and also provide valuable information for estuarine water quality management.

2025-10-07T10:44:31Z 23 pages, 12 figures, includes supplementary material Ovidio García-Oliva Carsten Lemmen Xiangyu Li Kai Wirtz http://arxiv.org/abs/2510.05705v1 The Software Observatory: aggregating and analysing software metadata for trend computation and FAIR assessment 2025-10-07T09:15:02Z

In the ever-changing realm of research software development, it is crucial for the scientific community to grasp current trends to identify gaps that can potentially hinder scientific progress. The adherence to the FAIR (Findable, Accessible, Interoperable, Reusable) principles can serve as a proxy to understand those trends and provide a mechanism to propose specific actions. The Software Observatory at OpenEBench (https://openebench.bsc.es/observatory) is a novel web portal that consolidates software metadata from various sources, offering comprehensive insights into critical research software aspects. Our platform enables users to analyse trends, identify patterns and advancements within the Life Sciences research software ecosystem, and understand its evolution over time. It also evaluates research software according to FAIR principles for research software, providing scores for different indicators. Users have the ability to visualise this metadata at different levels of granularity, ranging from the entire software landscape to specific communities to individual software entries through the FAIRsoft Evaluator. Indeed, the FAIRsoft Evaluator component streamlines the assessment process, helping developers efficiently evaluate and obtain guidance to improve their software's FAIRness. The Software Observatory represents a valuable resource for researchers and software developers, as well as stakeholders, promoting better software development practices and adherence to FAIR principles for research software.

2025-10-07T09:15:02Z Eva Martín del Pico Josep Lluís Gelpí Salvador Capella-Gutiérrez http://arxiv.org/abs/2511.14771v1 Efficient Constraining of Transcoding in DNA-Based Image Storage 2025-10-07T08:22:32Z

DNA has emerged as a promising alternative for long-term data storage due to its high capacity, durability, and low-energy potential. However, storing data in DNA presents several challenges. First, it requires complex and costly biochemical processes, making efficient compression crucial to reducing DNA synthesis time and cost. Second, these processes are prone to errors that must be avoided and/or corrected. In particular, homopolymers (repetitions of the same nucleotide) are a well-known source of errors during the sequencing step. Avoiding such repetitions helps mitigate errors but introduces a constraint that may increase the data compression rate. In this paper, we propose two transcoding methods that address these two key challenges: reducing data rate and minimizing errors. The first method strictly enforces the error-minimization constraint by eliminating homopolymers of a certain length, at the cost of an increased data rate. In contrast, the second method accepts a slight increase in homopolymers. However, we show that these increases remain limited (2.14% increase in compression rate for the first method and 0.39% homopolymer rate for the second). These two approaches demonstrate that it is possible to efficiently constrain transcoding while balancing error minimization and compression performance.
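To make the homopolymer constraint and its rate cost concrete, here is a minimal sketch of a rotating ternary code of the kind used in early DNA storage work. It is not either of the paper's two transcoding methods, and the function names are hypothetical. Each base-3 digit selects one of the three nucleotides that differ from the previous one, so no nucleotide is ever repeated, and each nucleotide then carries log2(3) ≈ 1.58 bits instead of 2.

```python
NUCLEOTIDES = "ACGT"


def longest_homopolymer(seq):
    """Length of the longest run of identical nucleotides in seq."""
    longest = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def encode_trits(trits, start="A"):
    """Map base-3 digits to nucleotides, never repeating the previous one.

    `start` is a virtual previous nucleotide that is not emitted.
    """
    prev = start
    out = []
    for trit in trits:
        choices = [n for n in NUCLEOTIDES if n != prev]
        prev = choices[trit]
        out.append(prev)
    return "".join(out)


def decode_trits(seq, start="A"):
    """Invert encode_trits by replaying the same rotation."""
    prev = start
    trits = []
    for base in seq:
        choices = [n for n in NUCLEOTIDES if n != prev]
        trits.append(choices.index(base))
        prev = base
    return trits
```

For example, `encode_trits([0, 0, 0, 2, 2, 1, 0])` yields `"CACTGCA"`, which contains no repeated nucleotide and decodes back to the original digits. Forbidding only runs above a chosen length, as the abstract describes, would sacrifice less of the data rate than this strict variant.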

2025-10-07T08:22:32Z 2025 IEEE International Conference on Image Processing (ICIP), Sep 2025, Rennes, France Sara Al Sayyed Aline Roumy Thomas Maugey http://arxiv.org/abs/2510.01302v1 Hybrid Predictive Modeling of Malaria Incidence in the Amhara Region, Ethiopia: Integrating Multi-Output Regression and Time-Series Forecasting 2025-10-01T16:16:47Z

Malaria remains a major public health concern in Ethiopia, particularly in the Amhara Region, where seasonal and unpredictable transmission patterns make prevention and control challenging. Accurately forecasting malaria outbreaks is essential for effective resource allocation and timely interventions. This study proposes a hybrid predictive modeling framework that combines time-series forecasting, multi-output regression, and conventional regression-based prediction to forecast the incidence of malaria. Environmental variables, past malaria case data, and demographic information from Amhara Region health centers were used to train and validate the models. The multi-output regression approach enables the simultaneous prediction of multiple outcomes, including Plasmodium species-specific cases, temporal trends, and spatial variations, whereas the hybrid framework captures both seasonal patterns and correlations among predictors. The proposed model exhibits higher prediction accuracy than single-method approaches, exposing hidden patterns and providing valuable information to public health authorities. This study provides a valid and repeatable malaria incidence prediction framework that can support evidence-based decision-making, targeted interventions, and resource optimization in endemic areas.

2025-10-01T16:16:47Z Kassahun Azezew Amsalu Tesema Bitew Mekuria Ayenew Kassie Animut Embiale Ayodeji Olalekan Salau Tsega Asresa http://arxiv.org/abs/2510.00195v1 Identification of post-COVID-19 symptoms using brain structural MRI features: a machine learning approach 2025-09-30T19:12:16Z

Identifying long COVID symptoms is a challenging task, primarily due to the reliance on patient reports and the lack of disease-specific biomarkers. The objective of this study is to identify individual long COVID symptoms, post-COVID-19 condition (PCC) participants, and participants' sex, and to identify the associated brain regions by developing an explainable machine learning algorithm using brain MRI features. This study implements secondary analysis using an anonymized, publicly accessible dataset that categorizes participants into three groups: the PCC group, the Unimpaired Post COVID 19 group (UPC), and the Healthy Non COVID group (HNC), each with corresponding symptoms, demographics, and brain structural MRI features. The aim is to develop and cross-validate a support vector classifier (SVC) algorithm to identify the occurrence of various target labels from the dataset. The SVC identified the occurrence of long COVID symptoms with performance that varied across target labels. Model performance and the influential brain areas are identified and discussed in light of previous research. The demonstrated approach offers an alternative modality for determining the occurrence of long COVID symptoms based on neuroimaging biomarkers.

2025-09-30T19:12:16Z Abdi Reza http://arxiv.org/abs/2509.25591v1 Building the EHR Foundation Model via Next Event Prediction 2025-09-29T23:27:51Z

Electronic Health Records (EHRs) contain rich temporal dynamics that conventional encoding approaches fail to adequately capture. While Large Language Models (LLMs) show promise for EHR modeling, they struggle to reason about sequential clinical events and temporal dependencies. We propose Next Event Prediction (NEP), a framework that enhances LLMs' temporal reasoning through autoregressive fine-tuning on clinical event sequences. By reformulating EHRs as timestamped event chains and predicting future medical events, NEP explicitly models disease progression patterns and causal relationships. Extensive evaluations across oncology survival prediction and clinical diagnosis tasks demonstrate NEP's superiority, outperforming specialized EHR models by 4.6% AUROC and general-purpose LLMs by 7.2% C-index in temporal reasoning tasks. Our analyses reveal dual benefits: state-of-the-art prediction accuracy combined with clinically interpretable attention patterns that align with known disease pathways.
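The reformulation of a record as a timestamped event chain with next-event targets can be sketched as follows. This is an illustration of the general idea only, not NEP's actual data format or training pipeline: the function names, the `[day N]` serialization, and the toy events in the usage note are all assumptions.

```python
from datetime import datetime


def to_event_chain(events):
    """Order (ISO timestamp, description) pairs and tag each with a day offset."""
    if not events:
        return []
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e[0]))
    first = datetime.fromisoformat(ordered[0][0])
    chain = []
    for timestamp, description in ordered:
        days = (datetime.fromisoformat(timestamp) - first).days
        chain.append(f"[day {days}] {description}")
    return chain


def next_event_pairs(chain):
    """Build (context, target) pairs: every prefix predicts the event after it."""
    return [(" ".join(chain[:i]), chain[i]) for i in range(1, len(chain))]
```

Given three made-up events dated 2024-01-05 (biopsy), 2024-02-01 (CT scan), and 2024-03-10 (chemotherapy started), the chain carries offsets of 0, 27, and 65 days, and the last pair asks a model to predict the chemotherapy event from the two earlier ones. Pairs of this shape are what an autoregressive fine-tuning objective would consume.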

2025-09-29T23:27:51Z Zekai Chen Arda Pekis Kevin Brown http://arxiv.org/abs/2509.22424v1 Desiderata for a biomedical knowledge network: opportunities, challenges and future Directions 2025-09-26T14:44:45Z

Knowledge graphs, collectively as a knowledge network, have become critical tools for knowledge discovery in computable and explainable knowledge systems. Due to the semantic and structural complexities of biomedical data, these knowledge graphs need to enable dynamic reasoning over large evolving graphs and support fit-for-purpose abstraction, while establishing standards, preserving provenance and enforcing policy constraints for actionable discovery. A recent meeting of leading scientists discussed the opportunities, challenges and future directions of a biomedical knowledge network. Here we present six desiderata inspired by the meeting: (1) inference and reasoning in biomedical knowledge graphs need domain-centric approaches; (2) harmonized and accessible standards are required for knowledge graph representation and metadata; (3) robust validation of biomedical knowledge graphs needs multi-layered, context-aware approaches that are both rigorous and scalable; (4) the evolving and synergistic relationship between knowledge graphs and large language models is essential in empowering AI-driven biomedical discovery; (5) integrated development environments, public repositories, and governance frameworks are essential for secure and reproducible knowledge graph sharing; and (6) robust validation, provenance, and ethical governance are critical for trustworthy biomedical knowledge graphs. Addressing these key issues will be essential to realize the promises of a biomedical knowledge network in advancing biomedicine.

2025-09-26T14:44:45Z 6 pages, 2 figures Chunlei Wu Hongfang Liu Jason Flannick Mark A. Musen Andrew I. Su Lawrence Hunter Thomas M. Powers Cathy H. Wu http://arxiv.org/abs/2503.23494v3 Interpretable structural-semantic decoding reveals language-like organisation of regulatory information in DNA 2025-09-26T10:17:40Z

Decoding how linear DNA encodes regulatory information remains a central challenge. Existing decoding approaches lack interpretability and struggle to reveal the underlying coding principles. Here, we present the interpretability-first, structural artificial intelligence (AI) framework for DNA (ISAF4DNA), which uses state-aware symbolic encoding and couples structural unit discovery with semantic validation to form a closed-loop structural-semantic decoder. When applied to N6-methyladenine (6mA) datasets from 63 species, ISAF4DNA reveals a language-like organization of regulatory information: (i) a conserved motif-derivation pathway AT -> GAT/ATC -> GATC; (ii) two forms of redundant syntax: anchor-type structures with a conserved core and selective flanks, and fuzzy-type clusters composed of distributed units with positional tolerance; and (iii) differential deployment trends between prokaryotes and multicellular eukaryotes. Together, these observations motivate the development of a testable framework, EpigenoLinguistics, that treats motifs as lexical units, redundancy as syntax, and deployment as pragmatics. This framework advances the "DNA as language" concept from a metaphor to a falsifiable framework with supporting evidence, thereby bridging biology and computational linguistics. ISAF4DNA advances the application of AI techniques in biology from black-box predictions to mechanism-level signals, augments database annotations, and guides regulatory-element design, with principles extensible to other modifications.

2025-03-30T16:03:02Z Li Yang Dongbo Wang http://arxiv.org/abs/2509.19206v1 A decentralized future for the open-science databases 2025-09-23T16:28:21Z

Continuous and reliable access to curated biological data repositories is indispensable for accelerating rigorous scientific inquiry and fostering reproducible research. Centralized repositories, though widely used, are vulnerable to single points of failure arising from cyberattacks, technical faults, natural disasters, or funding and political uncertainties. This can lead to widespread data unavailability, data loss, integrity compromises, and substantial delays in critical research, ultimately impeding scientific progress. Centralizing essential scientific resources in a single geopolitical or institutional hub is inherently dangerous, as any disruption can paralyze diverse ongoing research. The rapid acceleration of data generation, combined with an increasingly volatile global landscape, necessitates a critical re-evaluation of the sustainability of centralized models. Implementing federated and decentralized architectures presents a compelling and future-oriented pathway to substantially strengthen the resilience of scientific data infrastructures, thereby mitigating vulnerabilities and ensuring the long-term integrity of data. Here, we examine the structural limitations of centralized repositories, evaluate federated and decentralized models, and propose a hybrid framework for resilient, FAIR, and sustainable scientific data stewardship. Such an approach offers a significant reduction in exposure to governance instability, infrastructural fragility, and funding volatility, and also fosters fairness and global accessibility. The future of open science depends on integrating these complementary approaches to establish a globally distributed, economically sustainable, and institutionally robust infrastructure that safeguards scientific data as a public good, further ensuring continued accessibility, interoperability, and preservation for generations to come.

2025-09-23T16:28:21Z 21 Pages, 2 figures Gaurav Sharma Viorel Munteanu Nika Mansouri Ghiasi Jineta Banerjee Susheel Varma Luca Foschini Kyle Ellrott Onur Mutlu Dumitru Ciorbă Roel A. Ophoff Viorel Bostan Christopher E Mason Jason H. Moore Despoina Sousoni Arunkumar Krishnan Christopher E. Mason Mihai Dimian Gustavo Stolovitzky Fabio G. Liberante Taras K. Oleksyk Serghei Mangul http://arxiv.org/abs/2509.19393v1 MetaQuestion: A web application for expert knowledge elicitation addressing plant health and applied plant ecology 2025-09-22T22:39:20Z

1. Expert knowledge elicitation provides information to characterize ecological systems and management options. Linking expert knowledge elicitation with a curated question catalog supports a community of practice for ongoing improvement of question quality. 2. The MetaQuestion web app we introduce here draws on the PlantQuest catalog of questions addressing applied plant ecology and management options, making the catalog available in a flexible form for organizers of expert knowledge elicitation. Organizers can select among questions in the catalog, modify them as needed, and generate an instrument customized to their elicitation project. MetaQuestion makes available PlantQuest questions specialized for the study of invasive species such as pathogens and arthropod pests, including geographic analyses of prevalence and network analysis of the movement of plant materials. 3. Experts answer questions in the customized instrument and their responses are compiled. For settings where internet access may be sporadic, there are options to download the instrument for experts' work and then upload responses later. MetaQuestion provides the resulting dataset in a CSV file for analysis in users' choice of software. 4. Development of the PlantQuest catalog and the MetaQuestion app is ongoing, incorporating lessons learned from applications of the app. The MetaQuestion app could also be adapted to address questions from other subject areas.

2025-09-22T22:39:20Z Robert Fontan Christopher M. Perez Ashish Adhikari Romaric A. Mouafo-Tchinda Aaron I. Plex Sulá Jacobo Robledo Berea A. Etherton Manoj Choudhary Muhammad Aqeel Sarwar Zunaira Afzal Naveed Karen A. Garrett