https://arxiv.org/api/08xP76K1tyoTNH5EB1yh4OKQktk
2026-06-14T12:38:53Z

http://arxiv.org/abs/2507.18159v1
SMECS: A Software Metadata Extraction and Curation Software
2025-07-24T07:53:46Z
Metadata play a crucial role in adopting the FAIR principles for research software and enable findability and reusability. However, creating high-quality metadata can be resource-intensive for researchers and research software engineers. To address this challenge, we developed the Software Metadata Extraction and Curation Software (SMECS) which integrates the extraction of metadata from existing sources together with a user-friendly interface for metadata curation. SMECS extracts metadata from online repositories such as GitHub and presents it to researchers through an interactive interface for further curation and export as a CodeMeta file. The usability of SMECS was evaluated through usability experiments which confirmed that SMECS provides a satisfactory user experience. SMECS supports the FAIRification of research software by simplifying metadata creation.
2025-07-24T07:53:46Z
Electronic Communications of the EASST Vol. 85 (2025)
Stephan Ferenz, Aida Jafarbigloo, Oliver Werth, Astrid Nieße
10.14279/eceasst.v85.2708

http://arxiv.org/abs/2507.17127v1
Do male leading authors retract more articles than female leading authors?
2025-07-23T02:04:21Z
Scientific retractions reflect issues within the scientific record, arising from human error or misconduct. Although gender differences in retraction rates have been previously observed in various contexts, no comprehensive study has explored this issue across all fields of science. This study examines gender disparities in scientific misconduct or errors, specifically focusing on differences in retraction rates between male and female first authors in relation to their research productivity.
Using a dataset comprising 11,622 retracted articles and 19,475,437 non-retracted articles from the Web of Science and Retraction Watch, we investigate gender differences in retraction rates from the perspectives of retraction reasons, subject fields, and countries. Our findings indicate that male first authors have higher retraction rates, particularly for scientific misconduct such as plagiarism, authorship disputes, ethical issues, duplication, and fabrication/falsification. No significant gender differences were found in retractions attributed to mistakes. Furthermore, male first authors experience significantly higher retraction rates in biomedical and health sciences, as well as in life and earth sciences, whereas female first authors have higher retraction rates in mathematics and computer science. Similar patterns are observed for corresponding authors. Understanding these gendered patterns of retraction may contribute to strategies aimed at reducing their prevalence.
2025-07-23T02:04:21Z
Journal of Informetrics (2025), 19(3), 101682
Er-Te Zheng, Hui-Zhen Fu, Mike Thelwall, Zhichao Fang
10.1016/j.joi.2025.101682

http://arxiv.org/abs/2507.17114v1
Social media uptake of scientific journals: A comparison between X and WeChat
2025-07-23T01:33:30Z
This study examines the social media uptake of scientific journals on two different platforms - X and WeChat - by comparing the adoption of X among journals indexed in the Science Citation Index-Expanded (SCIE) with the adoption of WeChat among journals indexed in the Chinese Science Citation Database (CSCD). The findings reveal substantial differences in platform adoption and user engagement, shaped by local contexts. While only 22.7% of SCIE journals maintain an X account, 84.4% of CSCD journals have a WeChat official account. Journals in Life Sciences & Biomedicine lead in uptake on both platforms, whereas those in Technology and Physical Sciences show high WeChat uptake but comparatively lower presence on X.
User engagement on both platforms is dominated by low-effort interactions rather than more conversational behaviors. Correlation analyses indicate weak-to-moderate relationships between bibliometric indicators and social media metrics, confirming that online engagement reflects a distinct dimension of journal impact, whether on an international or a local platform. These findings underscore the need for broader social media metric frameworks that incorporate locally dominant platforms, thereby offering a more comprehensive understanding of science communication practices across diverse social media and contexts.
2025-07-23T01:33:30Z
This is the preprint of a paper accepted for publication in the Journal of Information Science (in press)
Ting Cong, Er-Te Zheng, Zekun Han, Zhichao Fang, Rodrigo Costas
10.1177/01655515251359759

http://arxiv.org/abs/2507.15590v1
Drafting the Landscape of Computational Musicology Tools: a Survey-Based Approach
2025-07-21T13:13:31Z
Since the 60s, musicology has been increasingly impacted by computational tools in various ways, from systematic analysis approaches to modeling of creativity. This article presents a comprehensive assessment of the current state of Computational Musicology tools based on survey data collected from practitioners in the field. We gathered information on tool usage patterns, common analytical tasks, user satisfaction levels, data characteristics, and prioritized features across four distinct domains: symbolic music, music-related imagery, audio, and text. Our findings reveal significant gaps between current tooling capabilities and user needs, highlighting some limitations of these tools across all domains.
This assessment contributes to the ongoing dialogue between tool developers and music scholars, aiming to enhance the effectiveness and accessibility of computational methods in musicological research.
2025-07-21T13:13:31Z
8 pages, 7 figures, to be published in Digital Libraries for Musicology
Jorge Junior Morgado Vega, Sachin Sharma, Federico Simonetta
10.1145/3748336.3748340

http://arxiv.org/abs/2504.10424v2
Lowering the Cost of Diamond Open Access Journals
2025-07-21T07:04:19Z
Many scholarly societies face challenges in adapting their publishing to an open access model where neither authors nor readers pay any fees. Some have argued that one of the main barriers is the actual cost of publishing. The goal of this paper is to show that the actual costs can be extremely low while still maintaining scholarly quality. We accomplish this by building a journal publishing workflow that minimizes the amount of required human labor. We recently built a software system for this and launched a journal using the system, and we estimate our cost to publish this journal is approximately \$705 per year, plus \$1 per article and about 10 minutes of volunteer labor per article. We benefited from two factors, namely the fact that authors in our discipline use LaTeX to prepare their manuscripts, and we had volunteer labor to develop software and run the journal. We have made most of this software open source in the hopes that it can help others.
2025-04-14T17:13:45Z
Joppe Bos, Kevin S. McCurley

http://arxiv.org/abs/2507.14752v1
Longitudinal Sampling of URLs From the Wayback Machine
2025-07-19T21:01:38Z
We document strategies and lessons learned from sampling the web by collecting 27.3 million URLs with 3.8 billion archived pages spanning 26 years (1996-2021) from the Internet Archive's (IA) Wayback Machine.
Our goal is to revisit fundamental questions regarding the size, nature, and prevalence of the publicly archivable web, in particular, to reconsider the question: "How long does a web page last?" Addressing this question requires obtaining a sample of the web. We proposed several dimensions to sample URLs from the Wayback Machine's holdings: time of first archive, HTML vs. other MIME types, URL depth (top-level pages vs. deep links), and top-level domain (TLD). We sampled 285 million URLs from IA's ZipNum index file, which contains every 6000th line of the CDX index. These indexes also include URLs of embedded resources such as images, CSS, and JavaScript. To limit our sample to "web pages" (i.e., pages intended for human interaction), we filtered for likely HTML pages based on filename extension. We then queried IA's CDX API to determine the time of first capture and MIME type of each URL. We grouped 92 million text/html URLs based on year of first capture. Archiving speed and capacity have increased over time, so we found more URLs archived in later years. To counter this, we extracted top-level URLs from deep links to upsample earlier years. Our target was 1 million URLs per year, but due to sparseness during 1996-2000, we clustered those years, collecting 1.2 million URLs for that range. Popular domains like Yahoo and Twitter were over-represented, so we performed logarithmic-scale downsampling. Our final dataset contains TimeMaps of 27.3 million URLs, comprising 3.8 billion archived pages. We convey lessons learned from sampling the archived web to inform future studies.
2025-07-19T21:01:38Z
Kritika Garg, Sawood Alam, Dietrich Ayala, Mark Graham, Michele C. Weigle, Michael L. Nelson

http://arxiv.org/abs/2508.00871v1
Patents as Knowledge Artifacts: An Information Science Perspective on Global Innovation
2025-07-19T16:33:39Z
In an age of fast-paced technological change, patents have evolved into not only legal mechanisms of intellectual property, but also structured storage containers of knowledge full of metadata, categories, and formal innovation. This chapter proposes to reframe patents in the context of information science, by focusing on patents as knowledge artifacts, and by seeing patents as fundamentally tied to the global movement of scientific and technological knowledge. With a focus on three areas, the inventions of AIs, biotech patents, and international competition with patents, this work considers how new technologies are challenging traditional notions of inventorship, access, and moral accountability. The chapter provides a critical analysis of AI's implications for patent authorship and prior art searches, ownership issues arising from proprietary claims in biotechnology to ethical dilemmas, and the problem of using patents for strategic advantage in a global context of innovation competition. In this analysis, the chapter identifies the importance of organizing information, creating metadata standards about originality, implementing retrieval systems to access previous works, and ethical contemplation about patenting unseen relationships in innovation ecosystems. Ultimately, the chapter calls for a collaborative, transparent, and ethically based approach in managing knowledge in the patenting environment, highlighting the role for information professionals and policy to contribute to access equity in innovation.
2025-07-19T16:33:39Z
Comments: 8 pages. This is a preprint version of the paper titled "Patents as Knowledge Artifacts: An Information Science Perspective on Global Innovation" Not peer-reviewed. Feedback welcome
M. S. Rajeevan, B. Mini Devi

http://arxiv.org/abs/2507.14614v1
Knowing when to stop: insights from ecology for building catalogues, collections, and corpora
2025-07-19T13:25:08Z
A major locus of musicological activity, increasingly in the digital domain, is the cataloguing of sources, which requires large-scale and long-lasting research collaborations. Yet, the databases aiming at covering and representing musical repertoires are never quite complete, and scholars must contend with the question: how much are we still missing? This question structurally resembles the 'unseen species' problem in ecology, where the true number of species must be estimated from limited observations. In this case study, we apply for the first time the common Chao1 estimator to music, specifically to Gregorian chant. We find that, overall, upper bounds for repertoire coverage of the major chant genres range between 50 and 80%. As expected, we find that Mass Propers are covered better than the Divine Office, though not overwhelmingly so. However, the accumulation curve suggests that those bounds are not tight: a stable ~5% of chants in sources indexed between 1993 and 2020 was new, so diminishing returns in terms of repertoire diversity are not yet to be expected. Our study demonstrates that these questions can be addressed empirically to inform musicological data-gathering, showing the potential of unseen species models in musicology.
2025-07-19T13:25:08Z
12th International Conference on Digital Libraries for Musicology, Sogang University, Seoul, South Korea, 26 September 2025
Jan Hajič, Fabian Moss
10.1145/3748336.3748347

http://arxiv.org/abs/2508.00867v1
Better Recommendations: Validating AI-generated Subject Terms Through LOC Linked Data Service
2025-07-18T18:55:57Z
This article explores the integration of AI-generated subject terms into library cataloging, focusing on validation through the Library of Congress Linked Data Service.
It examines the challenges of traditional subject cataloging under the Library of Congress Subject Headings system, including inefficiencies and cataloging backlogs. While generative AI shows promise in expediting cataloging workflows, studies reveal significant limitations in the accuracy of AI-assigned subject headings. The article proposes a hybrid approach combining AI technology with human validation through LOC Linked Data Service, aiming to enhance the precision, efficiency, and overall quality of metadata creation in library cataloging practices.
2025-07-18T18:55:57Z
Kwok Leong Tang, Yi Jiang

http://arxiv.org/abs/2406.15154v2
Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, PubMed and Semantic Scholar
2025-07-18T10:47:40Z
The assignment of document and publication types in scholarly databases plays an important role in bibliometrics, for example in decision-making or university rankings. However, scholarly databases apply different curation strategies and taxonomies when classifying documents which makes it difficult to compare results from different database providers. In this study, the bibliometric databases OpenAlex, Web of Science, Scopus, PubMed and Semantic Scholar are used to analyse the extent of data variation and compare different approaches to taxonomy and data curation. Using a shared corpus of 9,575,603 publications from 2012 to 2022, we found large differences in the classification of document types such as research articles and editorials in these databases. We can also show that many of the records that lack a publication type in OpenAlex are classified as conference proceedings in Scopus and Semantic Scholar.
2024-06-21T14:00:53Z
Nick Haupka, Jack H. Culbert, Alexander Schniedermann, Najko Jahn, Philipp Mayr
10.1162/QSS.a.406

http://arxiv.org/abs/2506.03221v3
ExtracTable: Human-in-the-Loop Transformation of Scientific Corpora into Structured Knowledge
2025-07-18T09:35:18Z
As the volume of scientific literature grows, efficient knowledge organization becomes increasingly challenging. Traditional approaches to structuring scientific content are time-consuming and require significant domain expertise, highlighting the need for tool support. We present ExtracTable, a Human-in-the-Loop (HITL) workflow and framework that assists researchers in transforming unstructured publications into structured representations. The workflow combines large language models (LLMs) with user-defined schemas and is designed for downstream integration into knowledge graphs (KGs). Developed and evaluated in the context of the Open Research Knowledge Graph (ORKG), ExtracTable automates key steps such as document preprocessing and data extraction while ensuring user oversight through validation. In an evaluation with ORKG community participants following the Quality Improvement Paradigm (QIP), ExtracTable demonstrated high usability and practical value. Participants gave it an average System Usability Scale (SUS) score of 84.17 (A+, the highest rating). The time to progress from a research interest to literature-based insights was reduced from between 4 hours and 2 weeks to an average of 24:40 minutes. By streamlining corpus creation and structured data extraction for knowledge graph integration, ExtracTable leverages LLMs and user models to accelerate literature reviews.
However, human validation remains essential to ensure quality, and future work will address improving extraction accuracy and entity linking to existing knowledge resources.
2025-06-03T09:12:52Z
Lena John, Ahmed Malek Ghanmi, Tim Wittenborg, Sören Auer, Oliver Karras
10.1007/978-3-032-05409-8_27

http://arxiv.org/abs/2504.12897v3
OntoPortal-Astro, a Semantic Artefact Catalogue for Astronomy
2025-07-18T08:05:18Z
The astronomy communities are widely recognised as mature communities for their open science practices. However, while their data ecosystems are rather advanced and permit efficient data interoperability, there are still gaps between these ecosystems. Semantic artefacts (SAs) -- e.g., ontologies, thesauri, vocabularies or metadata schemas -- are a means to bridge that gap as they allow the data to be semantically described and the underlying concepts to be mapped. The increasing use of SAs in astronomy presents challenges in description, selection, evaluation, trust, and mappings. The landscape remains fragmented, with SAs scattered across various registries in diverse formats and structures -- not yet fully developed or encoded with rich semantic web standards like OWL or SKOS -- and often with overlapping scopes. Enhancing data semantic interoperability requires common platforms to catalog, align, and facilitate the sharing of FAIR (Findable, Accessible, Interoperable and Reusable) SAs. In the frame of the FAIR-IMPACT project, we prototyped a SA catalogue for astronomy, heliophysics and planetary sciences. This exercise resulted in improved vocabulary and ontology management in the communities, and is now paving the way for better interdisciplinary data discovery and reuse.
This article presents current practices in our discipline, reviews candidate SAs for such a catalogue, presents driving use cases and the perspective of a real production service for the astronomy community based on the OntoPortal technology, which will be called OntoPortal-Astro.
2025-04-17T12:38:38Z
Accepted by Astronomy & Computing
Baptiste Cecconi, Laura Debisschop, Sébastien Derrière, Mireille Louys, Carmen Corre, Nina Grau, Clément Jonquet
10.1016/j.ascom.2025.100991

http://arxiv.org/abs/2507.13143v1
Managing Comprehensive Research Instrument Descriptions within a Scholarly Knowledge Graph
2025-07-17T14:08:12Z
In research, measuring instruments play a crucial role in producing the data that underpin scientific discoveries. Information about instruments is essential in data interpretation and, thus, knowledge production. However, if at all available and accessible, such information is scattered across numerous data sources. Relating the relevant details, e.g. instrument specifications or calibrations, with associated research assets (data, but also operating infrastructures) is challenging. Moreover, understanding the (possible) use of instruments is essential for researchers in experiment design and execution. To address these challenges, we propose a Knowledge Graph (KG) based approach for representing, publishing, and using information, extracted from various data sources, about instruments and associated scholarly artefacts.
The resulting KG serves as a foundation for exploring and gaining a deeper understanding of the use and role of instruments in research, discovering relations between instruments and associated artefacts (articles and datasets), and opens the possibility to quantify the impact of instruments in research.
2025-07-17T14:08:12Z
Muhammad Haris, Sören Auer, Markus Stocker

http://arxiv.org/abs/2508.00862v1
A survey on proximity monitoring and warning in construction
2025-07-17T09:58:08Z
Various technologies have been applied to monitor the proximity between two construction entities, preventing struck-by accidents and thereby enhancing onsite safety. This study comprehensively reviews related efforts dedicated to proximity monitoring and warning (PMW) based on 97 relevant articles published between 2010 and 2024. The bibliometric analysis reveals the technical roadmap over time, as well as the five most influential leaders and the two largest research networks they have established. The qualitative review is then conducted from four perspectives: influencing factor study, hazard level definition and determination, proximity perception, and alarm issuing and receiving. Finally, the limitations and challenges of current proximity perception are discussed, along with corresponding future research directions, including end-to-end three-dimensional (3D) object detection, real-time 3D reconstruction and updating for dynamic construction scenes, and multimodal fusion. This review presents the current research status, limitations, and future directions of PMW, guiding the future development of PMW systems.
2025-07-17T09:58:08Z
Yuexiong Ding, Qiong Liu, Ankang Ji, Xiaowei Luo, Wen Yi, Albert P. C. Chan

http://arxiv.org/abs/2507.12255v2
Freshness, Persistence and Success of Scientific Teams
2025-07-17T07:00:36Z
Team science dominates scientific knowledge production, but what makes academic teams successful?
Using temporal data on 25.2 million publications and 31.8 million authors, we propose a novel network-driven approach to identify and study the success of persistent teams. Challenging the idea that persistence alone drives success, we find that team freshness - new collaborations built on prior experience - is key to success. High impact research tends to emerge early in a team's lifespan. Analyzing complex team overlap, we find that teams open to new collaborative ties consistently produce better science. Specifically, team re-combinations that introduce new freshness impulses sustain success, while persistence impulses from experienced teams are linked to earlier impact. Together, freshness and persistence shape team success across collaboration stages.
2025-07-16T14:00:51Z
Author name correction in arXiv metadata
Hanjo D. Boekhout, Eelke M. Heemskerk, Niccolò Pisani, Frank W. Takes