http://arxiv.org/abs/2507.18159v1 SMECS: A Software Metadata Extraction and Curation Software 2025-07-24T07:53:46Z

Metadata play a crucial role in adopting the FAIR principles for research software and enable findability and reusability. However, creating high-quality metadata can be resource-intensive for researchers and research software engineers. To address this challenge, we developed the Software Metadata Extraction and Curation Software (SMECS), which integrates the extraction of metadata from existing sources with a user-friendly interface for metadata curation. SMECS extracts metadata from online repositories such as GitHub and presents it to researchers through an interactive interface for further curation and export as a CodeMeta file. The usability of SMECS was evaluated through usability experiments, which confirmed that SMECS provides a satisfactory user experience. SMECS supports the FAIRification of research software by simplifying metadata creation.
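The extract-then-export flow described above can be illustrated with a minimal sketch. It is not the SMECS implementation: the function name `to_codemeta`, the choice of fields, and the sample record are assumptions made for illustration. The input is assumed to be shaped like a repository record from the GitHub REST API, and the output uses a handful of CodeMeta 2.0 terms.

```python
import json


def to_codemeta(repo: dict) -> dict:
    """Map a GitHub-style repository record to a minimal CodeMeta document.

    Hypothetical helper for illustration; SMECS itself may map more fields
    and apply different rules.
    """
    license_info = repo.get("license") or {}
    spdx_id = license_info.get("spdx_id")
    doc = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": repo.get("name"),
        "description": repo.get("description"),
        "codeRepository": repo.get("html_url"),
        "programmingLanguage": repo.get("language"),
        "license": f"https://spdx.org/licenses/{spdx_id}" if spdx_id else None,
    }
    # Drop empty fields so a curator only sees what was actually extracted.
    return {key: value for key, value in doc.items() if value is not None}


# Made-up record; a real one would come from the repository hosting service.
example_repo = {
    "name": "example-tool",
    "description": "An example research software package",
    "html_url": "https://github.com/example-org/example-tool",
    "language": "Python",
    "license": {"spdx_id": "MIT"},
}

codemeta = to_codemeta(example_repo)
codemeta_json = json.dumps(codemeta, indent=2)  # text to write to codemeta.json
```

In a curation tool, the dictionary returned here would be the starting point that the researcher reviews and corrects before export, rather than the final file.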

2025-07-24T07:53:46Z Electronic Communications of the EASST Vol. 85 (2025) Stephan Ferenz Aida Jafarbigloo Oliver Werth Astrid Nieße 10.14279/eceasst.v85.2708 http://arxiv.org/abs/2507.17127v1 Do male leading authors retract more articles than female leading authors? 2025-07-23T02:04:21Z

Scientific retractions reflect issues within the scientific record, arising from human error or misconduct. Although gender differences in retraction rates have been previously observed in various contexts, no comprehensive study has explored this issue across all fields of science. This study examines gender disparities in scientific misconduct or errors, specifically focusing on differences in retraction rates between male and female first authors in relation to their research productivity. Using a dataset comprising 11,622 retracted articles and 19,475,437 non-retracted articles from the Web of Science and Retraction Watch, we investigate gender differences in retraction rates from the perspectives of retraction reasons, subject fields, and countries. Our findings indicate that male first authors have higher retraction rates, particularly for scientific misconduct such as plagiarism, authorship disputes, ethical issues, duplication, and fabrication/falsification. No significant gender differences were found in retractions attributed to mistakes. Furthermore, male first authors experience significantly higher retraction rates in biomedical and health sciences, as well as in life and earth sciences, whereas female first authors have higher retraction rates in mathematics and computer science. Similar patterns are observed for corresponding authors. Understanding these gendered patterns of retraction may contribute to strategies aimed at reducing their prevalence.

2025-07-23T02:04:21Z Journal of Informetrics (2025), 19(3), 101682 Er-Te Zheng Hui-Zhen Fu Mike Thelwall Zhichao Fang 10.1016/j.joi.2025.101682 http://arxiv.org/abs/2507.17114v1 Social media uptake of scientific journals: A comparison between X and WeChat 2025-07-23T01:33:30Z

This study examines the social media uptake of scientific journals on two different platforms - X and WeChat - by comparing the adoption of X among journals indexed in the Science Citation Index-Expanded (SCIE) with the adoption of WeChat among journals indexed in the Chinese Science Citation Database (CSCD). The findings reveal substantial differences in platform adoption and user engagement, shaped by local contexts. While only 22.7% of SCIE journals maintain an X account, 84.4% of CSCD journals have a WeChat official account. Journals in Life Sciences & Biomedicine lead in uptake on both platforms, whereas those in Technology and Physical Sciences show high WeChat uptake but comparatively lower presence on X. User engagement on both platforms is dominated by low-effort interactions rather than more conversational behaviors. Correlation analyses indicate weak-to-moderate relationships between bibliometric indicators and social media metrics, confirming that online engagement reflects a distinct dimension of journal impact, whether on an international or a local platform. These findings underscore the need for broader social media metric frameworks that incorporate locally dominant platforms, thereby offering a more comprehensive understanding of science communication practices across diverse social media and contexts.

2025-07-23T01:33:30Z This is the preprint of a paper accepted for publication in the Journal of Information Science (in press) Ting Cong Er-Te Zheng Zekun Han Zhichao Fang Rodrigo Costas 10.1177/01655515251359759 http://arxiv.org/abs/2507.15590v1 Drafting the Landscape of Computational Musicology Tools: a Survey-Based Approach 2025-07-21T13:13:31Z

Since the 1960s, musicology has been increasingly impacted by computational tools in various ways, from systematic analysis approaches to modeling of creativity. This article presents a comprehensive assessment of the current state of Computational Musicology tools based on survey data collected from practitioners in the field. We gathered information on tool usage patterns, common analytical tasks, user satisfaction levels, data characteristics, and prioritized features across four distinct domains: symbolic music, music-related imagery, audio, and text. Our findings reveal significant gaps between current tooling capabilities and user needs, highlighting some limitations of these tools across all domains. This assessment contributes to the ongoing dialogue between tool developers and music scholars, aiming to enhance the effectiveness and accessibility of computational methods in musicological research.

2025-07-21T13:13:31Z 8 pages, 7 figures, to be published in Digital Libraries for Musicology Jorge Junior Morgado Vega Sachin Sharma Federico Simonetta 10.1145/3748336.3748340 http://arxiv.org/abs/2504.10424v2 Lowering the Cost of Diamond Open Access Journals 2025-07-21T07:04:19Z

Many scholarly societies face challenges in adapting their publishing to an open access model where neither authors nor readers pay any fees. Some have argued that one of the main barriers is the actual cost of publishing. The goal of this paper is to show that the actual costs can be extremely low while still maintaining scholarly quality. We accomplish this by building a journal publishing workflow that minimizes the amount of required human labor. We recently built a software system for this and launched a journal using the system, and we estimate that our cost to publish this journal is approximately \$705 per year, plus \$1 per article and about 10 minutes of volunteer labor per article. We benefited from two factors, namely the fact that authors in our discipline use LaTeX to prepare their manuscripts, and we had volunteer labor to develop software and run the journal. We have made most of this software open source in the hopes that it can help others.

2025-04-14T17:13:45Z Joppe Bos Kevin S. McCurley http://arxiv.org/abs/2507.14752v1 Longitudinal Sampling of URLs From the Wayback Machine 2025-07-19T21:01:38Z

We document strategies and lessons learned from sampling the web by collecting 27.3 million URLs with 3.8 billion archived pages spanning 26 years (1996-2021) from the Internet Archive's (IA) Wayback Machine. Our goal is to revisit fundamental questions regarding the size, nature, and prevalence of the publicly archivable web, in particular, to reconsider the question: "How long does a web page last?" Addressing this question requires obtaining a sample of the web. We proposed several dimensions to sample URLs from the Wayback Machine's holdings: time of first archive, HTML vs. other MIME types, URL depth (top-level pages vs. deep links), and top-level domain (TLD). We sampled 285 million URLs from IA's ZipNum index file, which contains every 6000th line of the CDX index. These indexes also include URLs of embedded resources such as images, CSS, and JavaScript. To limit our sample to "web pages" (i.e., pages intended for human interaction), we filtered for likely HTML pages based on filename extension. We then queried IA's CDX API to determine the time of first capture and MIME type of each URL. We grouped 92 million text/html URLs based on year of first capture. Archiving speed and capacity have increased over time, so we found more URLs archived in later years. To counter this, we extracted top-level URLs from deep links to upsample earlier years. Our target was 1 million URLs per year, but due to sparseness during 1996-2000, we clustered those years, collecting 1.2 million URLs for that range. Popular domains like Yahoo and Twitter were over-represented, so we performed logarithmic-scale downsampling. Our final dataset contains TimeMaps of 27.3 million URLs, comprising 3.8 billion archived pages. We convey lessons learned from sampling the archived web to inform future studies.
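Three of the sampling steps named above (measuring URL depth, reducing a deep link to its top-level URL, and logarithmic-scale downsampling of over-represented domains) can be sketched as follows. The abstract does not give the exact rules, so everything here is an assumption for illustration: the helper names are hypothetical, hosts stand in for "domains", and the quota of 1 + floor(log2(n)) URLs per host is one plausible reading of "logarithmic-scale", not the authors' formula.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Count non-empty path segments; 0 means a top-level page."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def top_level_url(url: str) -> str:
    """Reduce a deep link to the root page of its host."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def log_downsample(urls, seed=0):
    """Keep 1 + floor(log2(n)) randomly chosen URLs for a host with n URLs.

    The quota rule is an assumption; int.bit_length() computes it exactly
    for n >= 1 without floating-point rounding.
    """
    rng = random.Random(seed)
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)
    kept = []
    for group in by_host.values():
        quota = len(group).bit_length()
        kept.extend(rng.sample(group, quota))
    return kept


# Made-up input: one host with 8 URLs and one host with a single URL.
sample = [f"http://big.example/page{i}.html" for i in range(8)]
sample.append("http://small.example/")
reduced = log_downsample(sample)  # 4 URLs from big.example, 1 from small.example
```

Under this rule a host with a million sampled URLs would contribute only 20, while a host with a single URL keeps it, which is the flattening effect the downsampling step is meant to achieve.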

2025-07-19T21:01:38Z Kritika Garg Sawood Alam Dietrich Ayala Mark Graham Michele C. Weigle Michael L. Nelson http://arxiv.org/abs/2508.00871v1 Patents as Knowledge Artifacts: An Information Science Perspective on Global Innovation 2025-07-19T16:33:39Z

In an age of fast-paced technological change, patents have evolved into not only legal mechanisms of intellectual property, but also structured storage containers of knowledge full of metadata, categories, and formal innovation. This chapter proposes to reframe patents in the context of information science, by focusing on patents as knowledge artifacts, and by seeing patents as fundamentally tied to the global movement of scientific and technological knowledge. With a focus on three areas (inventions by AI, biotech patents, and international competition over patents), this work considers how new technologies are challenging traditional notions of inventorship, access, and moral accountability. The chapter provides a critical analysis of AI's implications for patent authorship and prior art searches, of ownership issues and ethical dilemmas arising from proprietary claims in biotechnology, and of the problem of using patents for strategic advantage in a global context of innovation competition. In this analysis, the chapter identifies the importance of organizing information, creating metadata standards about originality, implementing retrieval systems to access previous works, and ethical contemplation about patenting unseen relationships in innovation ecosystems. Ultimately, the chapter calls for a collaborative, transparent, and ethically based approach to managing knowledge in the patenting environment, highlighting the role of information professionals and policy in contributing to access equity in innovation.

2025-07-19T16:33:39Z Comments: 8 pages. This is a preprint version of the paper titled "Patents as Knowledge Artifacts: An Information Science Perspective on Global Innovation" Not peer-reviewed. Feedback welcome M. S. Rajeevan B. Mini Devi http://arxiv.org/abs/2507.14614v1 Knowing when to stop: insights from ecology for building catalogues, collections, and corpora 2025-07-19T13:25:08Z

A major locus of musicological activity, increasingly in the digital domain, is the cataloguing of sources, which requires large-scale and long-lasting research collaborations. Yet, the databases aiming at covering and representing musical repertoires are never quite complete, and scholars must contend with the question: how much are we still missing? This question structurally resembles the 'unseen species' problem in ecology, where the true number of species must be estimated from limited observations. In this case study, we apply for the first time the common Chao1 estimator to music, specifically to Gregorian chant. We find that, overall, upper bounds for repertoire coverage of the major chant genres range between 50% and 80%. As expected, we find that Mass Propers are covered better than the Divine Office, though not overwhelmingly so. However, the accumulation curve suggests that those bounds are not tight: a stable ~5% of chants in sources indexed between 1993 and 2020 was new, so diminishing returns in terms of repertoire diversity are not yet to be expected. Our study demonstrates that these questions can be addressed empirically to inform musicological data-gathering, showing the potential of unseen species models in musicology.
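For readers unfamiliar with the estimator, the following minimal sketch shows the standard Chao1 formula and how it yields an upper bound on coverage. The input format (one count per distinct chant, giving the number of sources in which it appears), the function names, and the toy numbers are assumptions for illustration and are not taken from the study. Chao1 estimates a lower bound on the true number of distinct items as S_obs + f1^2 / (2 * f2), where f1 and f2 are the numbers of items seen exactly once and exactly twice; when f2 is zero, the bias-corrected form S_obs + f1 * (f1 - 1) / 2 is used instead.

```python
from collections import Counter


def chao1(abundances):
    """Chao1 lower-bound estimate of total richness from per-item counts."""
    counts = [count for count in abundances if count > 0]
    s_obs = len(counts)                # distinct items actually observed
    frequency = Counter(counts)
    f1, f2 = frequency[1], frequency[2]  # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when no item was seen exactly twice.
    return s_obs + f1 * (f1 - 1) / 2


def coverage_upper_bound(abundances):
    """Observed richness divided by the Chao1 estimate.

    Because Chao1 is a lower bound on true richness, this ratio is an
    upper bound on how much of the repertoire has been catalogued.
    """
    s_obs = sum(1 for count in abundances if count > 0)
    return s_obs / chao1(abundances)


# Toy data: 8 distinct chants, 4 seen once, 2 seen twice, 2 seen more often.
toy_counts = [1, 1, 1, 1, 2, 2, 3, 5]
estimate = chao1(toy_counts)                 # 8 + 4*4 / (2*2) = 12.0
coverage = coverage_upper_bound(toy_counts)  # 8 / 12, about 0.67
```

Many singletons relative to doubletons push the estimate up and the coverage bound down, which is why a steady stream of newly indexed chants signals that a catalogue is still far from complete.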

2025-07-19T13:25:08Z 12th International Conference on Digital Libraries for Musicology, Sogang University, Seoul, South Korea, 26 September 2025 Jan Hajič Fabian Moss 10.1145/3748336.3748347 http://arxiv.org/abs/2508.00867v1 Better Recommendations: Validating AI-generated Subject Terms Through LOC Linked Data Service 2025-07-18T18:55:57Z

This article explores the integration of AI-generated subject terms into library cataloging, focusing on validation through the Library of Congress Linked Data Service. It examines the challenges of traditional subject cataloging under the Library of Congress Subject Headings system, including inefficiencies and cataloging backlogs. While generative AI shows promise in expediting cataloging workflows, studies reveal significant limitations in the accuracy of AI-assigned subject headings. The article proposes a hybrid approach combining AI technology with human validation through LOC Linked Data Service, aiming to enhance the precision, efficiency, and overall quality of metadata creation in library cataloging practices.

2025-07-18T18:55:57Z Kwok Leong Tang Yi Jiang http://arxiv.org/abs/2406.15154v2 Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, PubMed and Semantic Scholar 2025-07-18T10:47:40Z

The assignment of document and publication types in scholarly databases plays an important role in bibliometrics, for example in decision-making or university rankings. However, scholarly databases apply different curation strategies and taxonomies when classifying documents, which makes it difficult to compare results from different database providers. In this study, the bibliometric databases OpenAlex, Web of Science, Scopus, PubMed and Semantic Scholar are used to analyse the extent of data variation and compare different approaches to taxonomy and data curation. Using a shared corpus of 9,575,603 publications from 2012 to 2022, we found large differences in the classification of document types such as research articles and editorials in these databases. We also show that many of the records that lack a publication type in OpenAlex are classified as conference proceedings in Scopus and Semantic Scholar.

2024-06-21T14:00:53Z Nick Haupka Jack H. Culbert Alexander Schniedermann Najko Jahn Philipp Mayr 10.1162/QSS.a.406 http://arxiv.org/abs/2506.03221v3 ExtracTable: Human-in-the-Loop Transformation of Scientific Corpora into Structured Knowledge 2025-07-18T09:35:18Z

As the volume of scientific literature grows, efficient knowledge organization becomes increasingly challenging. Traditional approaches to structuring scientific content are time-consuming and require significant domain expertise, highlighting the need for tool support. We present ExtracTable, a Human-in-the-Loop (HITL) workflow and framework that assists researchers in transforming unstructured publications into structured representations. The workflow combines large language models (LLMs) with user-defined schemas and is designed for downstream integration into knowledge graphs (KGs). Developed and evaluated in the context of the Open Research Knowledge Graph (ORKG), ExtracTable automates key steps such as document preprocessing and data extraction while ensuring user oversight through validation. In an evaluation with ORKG community participants following the Quality Improvement Paradigm (QIP), ExtracTable demonstrated high usability and practical value. Participants gave it an average System Usability Scale (SUS) score of 84.17 (A+, the highest rating). The time to progress from a research interest to literature-based insights was reduced from between 4 hours and 2 weeks to an average of 24:40 minutes. By streamlining corpus creation and structured data extraction for knowledge graph integration, ExtracTable leverages LLMs and user models to accelerate literature reviews. However, human validation remains essential to ensure quality, and future work will address improving extraction accuracy and entity linking to existing knowledge resources.
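The System Usability Scale score reported above comes from a fixed scoring rule applied to each participant's ten questionnaire answers, which can be sketched as follows. The function name and the sample answers are illustrative assumptions; the individual responses from the study are not given in the abstract. In the standard SUS scoring, each answer is on a 1 to 5 scale, odd-numbered items contribute (answer - 1), even-numbered items contribute (5 - answer), and the sum is multiplied by 2.5 to give a score from 0 to 100.

```python
def sus_score(responses):
    """Compute one participant's SUS score from ten answers on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each response must be between 1 and 5")
        if item_number % 2 == 1:
            total += answer - 1   # odd items are positively worded
        else:
            total += 5 - answer   # even items are negatively worded
    return total * 2.5


# Made-up answers from three hypothetical participants.
participants = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],  # best possible answers
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # neutral throughout
]
scores = [sus_score(answers) for answers in participants]  # [100.0, 75.0, 50.0]
mean_score = sum(scores) / len(scores)                     # 75.0
```

A study-level figure such as the one reported is the mean of these per-participant scores.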

2025-06-03T09:12:52Z Lena John Ahmed Malek Ghanmi Tim Wittenborg Sören Auer Oliver Karras 10.1007/978-3-032-05409-8_27 http://arxiv.org/abs/2504.12897v3 OntoPortal-Astro, a Semantic Artefact Catalogue for Astronomy 2025-07-18T08:05:18Z

The astronomy communities are widely recognised as mature communities for their open science practices. However, while their data ecosystems are rather advanced and permit efficient data interoperability, there are still gaps between these ecosystems. Semantic artefacts (SAs) -- e.g., ontologies, thesauri, vocabularies or metadata schemas -- are a means to bridge those gaps, as they allow the data to be semantically described and the underlying concepts to be mapped. The increasing use of SAs in astronomy presents challenges in description, selection, evaluation, trust, and mappings. The landscape remains fragmented, with SAs scattered across various registries in diverse formats and structures -- not yet fully developed or encoded with rich semantic web standards like OWL or SKOS -- and often with overlapping scopes. Enhancing data semantic interoperability requires common platforms to catalog, align, and facilitate the sharing of FAIR (Findable, Accessible, Interoperable and Reusable) SAs. In the frame of the FAIR-IMPACT project, we prototyped a SA catalogue for astronomy, heliophysics and planetary sciences. This exercise resulted in improved vocabulary and ontology management in the communities, and is now paving the way for better interdisciplinary data discovery and reuse. This article presents current practices in our discipline, reviews candidate SAs for such a catalogue, and presents driving use cases and the perspective of a real production service for the astronomy community based on the OntoPortal technology, which will be called OntoPortal-Astro.

2025-04-17T12:38:38Z Accepted by Astronomy & Computing Baptiste Cecconi Laura Debisschop Sébastien Derrière Mireille Louys Carmen Corre Nina Grau Clément Jonquet 10.1016/j.ascom.2025.100991 http://arxiv.org/abs/2507.13143v1 Managing Comprehensive Research Instrument Descriptions within a Scholarly Knowledge Graph 2025-07-17T14:08:12Z

In research, measuring instruments play a crucial role in producing the data that underpin scientific discoveries. Information about instruments is essential in data interpretation and, thus, knowledge production. However, if at all available and accessible, such information is scattered across numerous data sources. Relating the relevant details, e.g. instrument specifications or calibrations, with associated research assets (data, but also operating infrastructures) is challenging. Moreover, understanding the (possible) use of instruments is essential for researchers in experiment design and execution. To address these challenges, we propose a Knowledge Graph (KG) based approach for representing, publishing, and using information, extracted from various data sources, about instruments and associated scholarly artefacts. The resulting KG serves as a foundation for exploring and gaining a deeper understanding of the use and role of instruments in research, discovering relations between instruments and associated artefacts (articles and datasets), and opens the possibility to quantify the impact of instruments in research.

2025-07-17T14:08:12Z Muhammad Haris Sören Auer Markus Stocker http://arxiv.org/abs/2508.00862v1 A survey on proximity monitoring and warning in construction 2025-07-17T09:58:08Z

Various technologies have been applied to monitor the proximity between two construction entities, preventing struck-by accidents and thereby enhancing onsite safety. This study comprehensively reviews related efforts dedicated to proximity monitoring and warning (PMW) based on 97 relevant articles published between 2010 and 2024. The bibliometric analysis reveals the technical roadmap over time, as well as the five most influential leaders and the two largest research networks they have established. The qualitative review is then conducted from four perspectives: influencing factor study, hazard level definition and determination, proximity perception, and alarm issuing and receiving. Finally, the limitations and challenges of current proximity perception are discussed, along with corresponding future research directions, including end-to-end three-dimensional (3D) object detection, real-time 3D reconstruction and updating for dynamic construction scenes, and multimodal fusion. This review presents the current research status, limitations, and future directions of PMW, guiding the future development of PMW systems.

2025-07-17T09:58:08Z Yuexiong Ding Qiong Liu Ankang Ji Xiaowei Luo Wen Yi Albert P. C. Chan http://arxiv.org/abs/2507.12255v2 Freshness, Persistence and Success of Scientific Teams 2025-07-17T07:00:36Z

Team science dominates scientific knowledge production, but what makes academic teams successful? Using temporal data on 25.2 million publications and 31.8 million authors, we propose a novel network-driven approach to identify and study the success of persistent teams. Challenging the idea that persistence alone drives success, we find that team freshness - new collaborations built on prior experience - is key to success. High impact research tends to emerge early in a team's lifespan. Analyzing complex team overlap, we find that teams open to new collaborative ties consistently produce better science. Specifically, team re-combinations that introduce new freshness impulses sustain success, while persistence impulses from experienced teams are linked to earlier impact. Together, freshness and persistence shape team success across collaboration stages.

2025-07-16T14:00:51Z Author name correction in arXiv metadata Hanjo D. Boekhout Eelke M. Heemskerk Niccolò Pisani Frank W. Takes