http://arxiv.org/abs/2601.09456v1 Towards a Metadata Schema for Energy Research Software 2026-01-14T13:03:18Z

Domain-specific metadata schemas are essential to improve the findability and reusability of research software and to follow the FAIR4RS principles. However, many domains, including energy research, lack established metadata schemas. To address this gap, we developed a metadata schema for energy research software based on a requirement analysis and evaluated it through user testing. Our results show that the schema balances the need for formalization and interoperability, while also meeting the specific needs of energy researchers. Meanwhile, the testing showed that a good presentation of the required information is key to enable researchers to create the required metadata. This paper provides insights into the challenges and opportunities of designing a metadata schema for energy research software.

2026-01-14T13:03:18Z Stephan Ferenz Oliver Werth Astrid Nieße http://arxiv.org/abs/2601.09338v1 A Deep Dive into OpenStreetMap Research Since its Inception (2008-2024): Contributors, Topics, and Future Trends 2026-01-14T10:13:48Z

OpenStreetMap (OSM) has transitioned from a pioneering volunteered geographic information (VGI) project into a global, multi-disciplinary research nexus. This study presents a bibliometric and systematic analysis of the OSM research landscape, examining its development trajectory and key driving forces. By evaluating 1,926 publications from the Web of Science (WoS) Core Collection and 782 State of the Map (SotM) presentations up to June 2024, we quantify publication growth, collaboration patterns, and thematic evolution. Results demonstrate simultaneous consolidation and diversification within the field. While a stable core of contributors continues to anchor OSM research, themes have shifted from initial concerns over data production and quality toward advanced analytical and applied uses. Comparative analysis of OSM-related research in WoS and SotM reveals distinct but complementary agendas between scholars and the OSM community. Building on these findings, we identify six emerging research directions and discuss how evolving partnerships among academia, the OSM community, and industry are poised to shape the future of OSM research. This study establishes a structured reference for understanding the state of OSM studies and offers strategic pathways for navigating its future trajectory. The data and code are available at https://github.com/ya0-sun/OSMbib.

2026-01-14T10:13:48Z Yao Sun Liqiu Meng Andres Camero Stefan Auer Xiao Xiang Zhu http://arxiv.org/abs/2407.06972v2 Cloud-based digitization workflow with rich metadata acquisition for cultural heritage objects 2026-01-13T19:50:31Z

In response to several cultural heritage initiatives at the Jagiellonian University, we developed a new digitization workflow in collaboration with the Jagiellonian Library (JL). The solution is based on easy-to-access technological solutions -- Microsoft 365 cloud with MS Excel files as metadata acquisition interfaces, Office Script for validation, and MS Sharepoint for storage -- that allows metadata acquisition by domain experts regardless of their experience with information systems. The ultimate goal is to create a knowledge graph that describes the analyzed collections, linked to general knowledge bases, as well as to other cultural heritage collections, so careful attention is paid to the high accuracy of metadata and proper links to external sources. The workflow was evaluated in two pilot studies and in two workshops, which allowed for its refinement and confirmation of its correctness and usability for JL. The knowledge graph created as a result of these pilot studies was made available in a public git repository. As the proposed workflow does not interfere with existing systems or domain guidelines regarding digitization and basic metadata collection in a given institution, but extends them in order to enable rich metadata collection, not previously possible, we believe that it could be of interest to all GLAMs.

2024-07-09T15:49:47Z 24 pages, 4 figures; submitted to The Journal of Academic Librarianship Krzysztof Kutt Jagiellonian University Luiz do Valle Miranda Jagiellonian University Jakub Gomułka AGH University of Krakow Grzegorz J. Nalepa Jagiellonian University 10.1016/j.acalib.2026.103212 http://arxiv.org/abs/2601.08340v1 Pursuing transparency: How research performing organizations in Germany collect data on publication costs 2026-01-13T08:59:33Z

This article presents the results of a survey conducted in 2024 among research performing organizations (RPOs) in Germany on how they collect data on publication costs. Of the 583 invitees, 258 (44.3%) completed the questionnaire. This survey is the first comprehensive study on the recording of publication costs at RPOs in Germany. The results show that the majority of surveyed RPOs recorded publication costs at least in part. However, procedures in this regard were often non-binding. Respondents' ratings of the reliability of the collection of data on publication costs varied by the source of publication funding. Eighty percent of respondents rated the contribution of collecting data on publication costs to shaping the open access transformation as "very important" or "important." Yet, these data were used as a basis for strategic decisions in only 59% of the surveyed RPOs. Moreover, most respondents considered the implementation of an information budget at their institutions by 2025 unlikely. We discuss the implications of these findings for the open access transformation.

2026-01-13T08:59:33Z Dorothea Strecker Heinz Pampel Jonas Höfting http://arxiv.org/abs/2601.07563v1 The Issue with Special Issues: when Guest Editors Publish in Support of Self 2026-01-12T14:18:29Z

The recent exceptional growth in the number of special issues has led to the largest delegation of editorial power in the history of scientific publishing. Has this power been used responsibly? In this article we provide the first systematic analysis of a particular form of abuse of power by guest editors: endogeny, the practice of publishing articles in one's own special issue. While moderate levels of endogeny are common in special issues, excessive endogeny is a blatant case of scientific misconduct. We define special issues containing more than 33% endogeny as Published in Support of Self (PISS). We build a dataset of over 100,000 special issues published between 2015 and 2025 by five leading publishers. The large majority of guest editors engage in endogeny responsibly, if at all. Nonetheless, despite endogeny policies by publishers and indexers, PISS is comparable in magnitude to scientific fraud. All journals heavily relying on special issues host PISS, and more than 1,000 PISS special issues are published each year, hosting tens of thousands of endogenous articles. Extreme PISS abuses are rare, as the majority of PISS occurs at moderate levels of endogeny. Since the scientific literature is a common pool resource, this is not good news, as it reflects a widespread normalisation of guest editor misconduct. Fortunately, PISS can be solved by setting easily enforceable commonsense policies. We provide the data and analyses needed for indexers and academic regulators to act.
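
The 33% threshold in the abstract above can be illustrated with a minimal sketch. It assumes endogeny is measured as the share of a special issue's articles that have at least one of its guest editors as an author; the abstract does not spell out the counting rule, so that rule, the `SpecialIssue` record, and its field names are hypothetical and not taken from the paper's data or code.

```python
from dataclasses import dataclass

# "More than 33% endogeny" is the cut-off stated in the abstract.
PISS_THRESHOLD = 0.33


@dataclass(frozen=True)
class SpecialIssue:
    """Hypothetical record; field names are illustrative only."""
    total_articles: int
    # Assumed counting rule: articles with at least one guest editor as author.
    guest_editor_articles: int


def endogeny_share(issue: SpecialIssue) -> float:
    """Return the fraction of articles authored by the issue's own guest editors."""
    if issue.total_articles <= 0:
        raise ValueError("total_articles must be positive")
    if not 0 <= issue.guest_editor_articles <= issue.total_articles:
        raise ValueError("guest_editor_articles must lie in [0, total_articles]")
    return issue.guest_editor_articles / issue.total_articles


def is_piss(issue: SpecialIssue, threshold: float = PISS_THRESHOLD) -> bool:
    """Flag an issue whose endogeny share strictly exceeds the threshold."""
    return endogeny_share(issue) > threshold
```

Under these assumptions, an issue with 4 endogenous articles out of 10 would be flagged, while one with 3 out of 10 would not, because the comparison is strict ("more than 33%").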

2026-01-12T14:18:29Z 11 pages plus references, 2 figures, 5 tables, supplementary files available via FigShare Paolo Crosetto Pablo Gómez Barreiro Mark Austin Hanson http://arxiv.org/abs/2601.07533v1 Loci Similes: A Benchmark for Extracting Intertextualities in Latin Literature 2026-01-12T13:34:49Z

Tracing connections between historical texts is an important part of intertextual research, enabling scholars to reconstruct the virtual library of a writer and identify the sources influencing their creative process. These intertextual links manifest in diverse forms, ranging from direct verbatim quotations to subtle allusions and paraphrases disguised by morphological variation. Language models offer a promising path forward due to their capability of capturing semantic similarity beyond lexical overlap. However, the development of new methods for this task is held back by the scarcity of standardized benchmarks and easy-to-use datasets. We address this gap by introducing Loci Similes, a benchmark for Latin intertextuality detection comprising a curated dataset of ~172k text segments containing 545 expert-verified parallels linking Late Antique authors to a corpus of classical authors. Using this data, we establish baselines for retrieval and classification of intertextualities with state-of-the-art LLMs.

2026-01-12T13:34:49Z Julian Schelb Michael Wittweiler Marie Revellio Barbara Feichtinger Andreas Spitz http://arxiv.org/abs/2601.07451v1 Building Faculty Expertise Ontology using Protege: Enhancing Academic Library Research Services 2026-01-12T11:45:48Z

Academic libraries struggle to find and access faculty expertise across disciplines. This research proposes a faculty expertise ontology with a hierarchical structure, built with Protégé, to enhance library services and knowledge organisation. The ontology classifies relationships between departments, subject areas, faculty members, and contact data into layers, including Top, Middle, and Bottom levels. This tiered structure follows the academic structure and enables discovery of expertise within departments. The ontology answers competency questions generated by subject matter experts, covering real-world questions such as which faculty are in specific areas, how to collaborate with other disciplines, and how to find contact information. Competency questions act as design and test instruments to show that the ontology will fulfil the information needs of researchers, librarians and administrators. The ontology is able to cope with semantically enhanced queries, as shown by SPARQL implementations. The model works effectively in initiating referrals to an expert, aligning research with the strengths of a department and allowing academics to form partnerships. The ontology delivers a scalable platform that adapts to institutional change. In the future, we intend to integrate with institutional databases and library systems for automatic updates via APIs, as well as develop user interfaces and visualisations.

2026-01-12T11:45:48Z Snehasish Paul http://arxiv.org/abs/2601.04015v2 Examining persistence of European open repository infrastructure and its diffusion in the scholarly record 2026-01-12T10:07:38Z

This article seeks to determine the extent to which the principle of persistence is observed by repositories and the organizations that operate them. We also evaluate the impact that negative repository persistence levels may be having on the scholarly record. We do this by interrogating and combining data about European repositories from several repository registries and web scraped sources, including the Internet Archive's Wayback Machine, thereby creating a unique dataset of historic repository locations and their OAI-PMH endpoints. We then use this data as the basis for text mining CORE, a vast corpus of scholarly outputs, to determine the extent to which impersistent European repository content has permeated the scholarly literature. Our findings indicate over a fifth of European repositories (> 20%) could be classified as 'dead', with an even greater proportion (> 40%) of the machine interfaces associated with these repositories similarly dead. Problematically, our analysis indicates that circa 12,000 unique scholarly works cite, refer to, or actively use this repository content, amounting to circa 19,000 unique repository locations, all of which are now unretrievable from their stated resource location. Partly owing to limitations in available repository registry data and the existence of 'zombie' repositories, there are reasons to conclude that the total number of scholarly works referring to dead repository content is far higher. We also find evidence of dead repository content entering the current scholarly record, a phenomenon we describe as 'dead on arrival' referencing. We consider the implications of these observations, proffer explanations, and propose possible policy interventions to address the issue of repository persistence. Our dataset also enables us to make several observations about the nature of impersistent repositories, their profile, and their decay rate.

2026-01-07T15:29:50Z 31 pages George Macgregor Joy Davidson http://arxiv.org/abs/2511.13742v2 Review of Passenger Flow Modelling Approaches Based on a Bibliometric Analysis 2026-01-12T09:43:42Z

This paper presents a bibliometric analysis of the field of short-term passenger flow forecasting within local public transit, covering 814 publications that span from 1984 to 2024. In addition to common bibliometric analysis tools, a variant of a citation network was developed, and topic modelling was conducted. The analysis reveals that research activity exhibited sporadic patterns prior to 2008, followed by a marked acceleration, characterised by a shift from conventional statistical and machine learning methodologies (e.g., ARIMA, SVM, and basic neural networks) to specialised deep learning architectures. Based on this insight, a connection to more general fields such as machine learning and time series modelling was established. In addition to modelling, spatial, linguistic, and modal biases were identified and findings from existing secondary literature were validated and quantified. This revealed existing gaps, such as constrained data fusion, open (multivariate) data, and underappreciated challenges related to model interpretability, cost-efficiency, and a balance between algorithmic performance and practical deployment considerations. In connection with the superordinate fields, the growth in relevance of foundation models is also noteworthy.

2025-11-12T07:13:18Z Jonathan Hecht Weilian Li Ziyue Li Youness Dehbi http://arxiv.org/abs/2601.02598v2 LongDA: Benchmarking LLM Agents for Long-Document Data Analysis 2026-01-11T22:21:22Z

We introduce LongDA, a data analysis benchmark for evaluating LLM-based agents under documentation-intensive analytical workflows. In contrast to existing benchmarks that assume well-specified schemas and inputs, LongDA targets real-world settings in which navigating long documentation and complex data is the primary bottleneck. To this end, we manually curate raw data files, long and heterogeneous documentation, and expert-written publications from 17 publicly available U.S. national surveys, from which we extract 505 analytical queries grounded in real analytical practice. Solving these queries requires agents to first retrieve and integrate key information from multiple unstructured documents, before performing multi-step computations and writing executable code, which remains challenging for existing data analysis agents. To support the systematic evaluation under this setting, we develop LongTA, a tool-augmented agent framework that enables document access, retrieval, and code execution, and evaluate a range of proprietary and open-source models. Our experiments reveal substantial performance gaps even among state-of-the-art models, highlighting the challenges researchers should consider before applying LLM agents for decision support in real-world, high-stakes analytical settings.

2026-01-05T23:23:16Z Yiyang Li Zheyuan Zhang Tianyi Ma Zehong Wang Keerthiram Murugesan Chuxu Zhang Yanfang Ye http://arxiv.org/abs/2502.06472v2 KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment 2026-01-11T02:16:42Z

Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution, that iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1% LLM-verified correctness and reducing conflict edges by 18.6% through multi-layer assessments.

2025-02-10T13:51:36Z 24 pages, 3 figures, 2 tables Spotlight paper of NeurIPS 2025 Yuxing Lu Wei Wu Xukai Zhao Rui Peng Jinzhuo Wang http://arxiv.org/abs/2601.00840v2 A Global Atlas of Digital Dermatology to Map Innovation and Disparities 2026-01-09T20:23:53Z

The adoption of artificial intelligence in dermatology promises democratized access to healthcare, but model reliability depends on the quality and comprehensiveness of the data fueling these models. Despite rapid growth in publicly available dermatology images, the field lacks quantitative key performance indicators to measure whether new datasets expand clinical coverage or merely replicate what is already known. Here we present SkinMap, a multi-modal framework for the first comprehensive audit of the field's entire data basis. We unify the publicly available dermatology datasets into a single, queryable semantic atlas comprising more than 1.1 million images of skin conditions and quantify (i) informational novelty over time, (ii) dataset redundancy, and (iii) representation gaps across demographics and diagnoses. Despite exponential growth in dataset sizes, informational novelty across time has somewhat plateaued: Some clusters, such as common neoplasms on fair skin, are densely populated, while underrepresented skin types and many rare diseases remain unaddressed. We further identify structural gaps in coverage: Darker skin tones (Fitzpatrick V-VI) constitute only 5.8% of images and pediatric patients only 3.0%, while many rare diseases and phenotype combinations remain sparsely represented. SkinMap provides infrastructure to measure blind spots and steer strategic data acquisition toward undercovered regions of clinical space.

2025-12-27T09:22:36Z Fabian Gröger Simone Lionetti Philippe Gottfrois Alvaro Gonzalez-Jimenez Lea Habermacher Labelling Consortium Ludovic Amruthalingam Matthew Groh Marc Pouly Alexander A. Navarini http://arxiv.org/abs/2603.19237v1 Prompt engineering for bibliographic web-scraping 2026-01-09T10:00:10Z

Bibliographic catalogues store millions of records. Computer techniques such as web-scraping allow data to be extracted efficiently and accurately. The recent emergence of ChatGPT is facilitating the development of suitable prompts that allow the configuration of scraping to identify and extract information from databases. The aim of this article is to define how to use prompt engineering efficiently to elaborate a suitable data entry model, able to generate, in a single interaction with ChatGPT-4o, a fully functional web-scraper, programmed in the PHP language and adapted to the case of bibliographic catalogues. As a demonstration example, the bibliographic catalogue of the National Library of Spain, with a dataset of thousands of records, is used. The findings present an effective model for developing web-scraping programs, assisted by AI and with the minimum possible interaction. The results obtained with the model indicate that the use of prompts with large language models (LLMs) can improve the quality of scraping by understanding specific contexts and patterns, adapting to different formats and styles of presentation of bibliographic information.

2026-01-09T10:00:10Z 26 pages, 7 Tables, 2 Figures Scientometrics, 2025, 130(7), 3433-3453 Manuel Blázquez-Ochando Juan José Prieto-Gutiérrez María Antonia Ovalle-Perandones 10.1007/s11192-025-05372-5 http://arxiv.org/abs/2503.18526v2 SciClaims: An End-to-End Generative System for Biomedical Claim Analysis 2026-01-08T17:04:29Z

We present SciClaims, an interactive web-based system for end-to-end scientific claim analysis in the biomedical domain. Designed for high-stakes use cases such as systematic literature reviews and patent validation, SciClaims extracts claims from text, retrieves relevant evidence from PubMed, and verifies their veracity. The system features a user-friendly interface where users can input scientific text and view extracted claims, predictions, supporting or refuting evidence, and justifications in natural language. Unlike prior approaches, SciClaims seamlessly integrates the entire scientific claim analysis process using a single large language model, without requiring additional fine-tuning. SciClaims is optimized to run efficiently on a single GPU and is publicly available for live interaction.

2025-03-24T10:31:31Z In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations Raúl Ortega José Manuel Gómez-Pérez 10.18653/v1/2025.emnlp-demos.11 http://arxiv.org/abs/2601.05103v1 Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content 2026-01-08T16:48:36Z

Understanding the role of citations is essential for research assessment and citation-aware digital libraries. However, existing citation classification frameworks often conflate citation intent (why a work is cited) with cited content type (what part is cited), limiting their effectiveness in automatic classification due to a dilemma between fine-grained type distinctions and practical classification reliability. We introduce SOFT, a Semantically Orthogonal Framework with Two dimensions that explicitly separates citation intent from cited content type, drawing inspiration from semantic role theory. We systematically re-annotate the ACL-ARC dataset using SOFT and release a cross-disciplinary test set sampled from ACT2. Evaluation with both zero-shot and fine-tuned Large Language Models demonstrates that SOFT enables higher agreement between human annotators and LLMs, and supports stronger classification performance and robust cross-domain generalization compared to ACL-ARC and SciCite annotation frameworks. These results confirm SOFT's value as a clear, reusable annotation standard, improving clarity, consistency, and generalizability for digital libraries and scholarly communication infrastructures. All code and data are publicly available on GitHub at https://github.com/zhiyintan/SOFT.

2026-01-08T16:48:36Z Accepted at the 29th International Conference on Theory and Practice of Digital Libraries (TPDL 2025) Changxu Duan Zhiyin Tan 10.1007/978-3-032-05409-8_12