http://arxiv.org/abs/2507.05903v2 AI-Reporter: A Path to a New Genre of Scientific Communication 2025-07-31T06:23:02Z

The AI-Reporter represents a paradigmatic shift in scientific publication practice. This document demonstrates through a concrete case study how our system transforms academic presentations into publication-ready chapters -- in less than three minutes. Using Arno Simons' lecture on Large Language Models from the ``Large Language Models for the History, Philosophy, and Sociology of Science'' workshop (NEPI) as an example, we show how technological innovation bridges the gap between ephemeral presentation and permanent scientific documentation.

2025-07-08T11:41:37Z Gerd Graßhoff http://arxiv.org/abs/2507.22871v1 Tracking research software outputs in the UK 2025-07-30T17:46:47Z

Research software is crucial to the research process, and the growth of Open Science underscores the importance of access to research artifacts such as data and code, which raises challenges in tracing links among outputs. While it is a clear principle that research code, along with other essential outputs, should be recognised as an artifact of the research process, how this principle is applied in practice remains variable. This study examines where UK academic institutions store and register software as a unique research output, searching the UKRI's Gateway to Research (GtR) metadata for publicly funded research software in the UK. The quantity of software reported as research outcomes remains low in proportion to other categories. Artifact sharing appears low, with one-quarter of the reported software having no links and 45% having either a missing or erroneous URL. Of the valid URLs, we find the single largest category is Public Commercial Code Repository, with GitHub being the host of 18% of all publicly funded research software listed. These observations are contrasted with past findings from 2023. Finally, we discuss the lack of artifact sharing in UK research and its implications for the maintenance and evolution of research software. Without dissemination, research software risks demotion to a transient artifact, useful only to meet short-term research demands but ultimately lost to the broader enterprise of science.

2025-07-30T17:46:47Z Domhnall Carlin Austen Rainer http://arxiv.org/abs/2507.22479v1 Presenting a classifier to detect research contributions in OpenAlex 2025-07-30T08:30:23Z

This paper introduces a document type classifier designed to optimise the distinction between research and non-research journal publications in OpenAlex. Based on open metadata, the classifier can detect non-research or editorial content within a set of classified articles and reviews (e.g. paratexts, abstracts, editorials, letters). The classifier achieves an F1-score of 0.95, indicating a potential improvement in the data quality of bibliometric research in OpenAlex when the classifier is applied to real data. In total, 4,589,967 out of 42,701,863 articles and reviews could be reclassified as non-research contributions by the classifier, representing a share of 10.75%.

2025-07-30T08:30:23Z Nick Haupka 10.1007/s11192-025-05524-7 http://arxiv.org/abs/2507.22391v1 Knowledge engineering for open science: Building and deploying knowledge bases for metadata standards 2025-07-30T05:26:49Z

Scientists strive to make their datasets available in open repositories, with the goal that they be findable, accessible, interoperable, and reusable (FAIR). Although it is hard for most investigators to remember all the guiding principles associated with FAIR data, there is one overarching requirement: The data need to be annotated with rich, discipline-specific, standardized metadata. The Center for Expanded Data Annotation and Retrieval (CEDAR) builds technology that enables scientists to encode metadata standards as templates that enumerate the attributes of different kinds of experiments. These metadata templates capture preferences regarding how data should be described and what a third party needs to know to make sense of the datasets. CEDAR templates describing community metadata preferences have been used to standardize metadata for a variety of scientific consortia. They have been used as the basis for data-annotation systems that acquire metadata through Web forms or through spreadsheets, and they can help correct metadata to ensure adherence to standards. Like the declarative knowledge bases that underpinned intelligent systems decades ago, CEDAR templates capture the knowledge in symbolic form, and they allow that knowledge to be applied in a variety of settings. They provide a mechanism for scientific communities to create shared metadata standards and to encode their preferences for the application of those standards, and for deploying those standards in a range of intelligent systems to promote open science.

2025-07-30T05:26:49Z 22 pages, 7 figures Mark A. Musen Martin J. O'Connor Josef Hardi Marcos Martinez-Romero http://arxiv.org/abs/2507.22019v1 Not Here, Go There: Analyzing Redirection Patterns on the Web 2025-07-29T17:12:16Z

URI redirections are integral to web management, supporting structural changes, SEO optimization, and security. However, their complexities affect usability, SEO performance, and digital preservation. This study analyzed 11 million unique redirecting URIs, following redirections up to 10 hops per URI, to uncover patterns and implications of redirection practices. Our findings revealed that 50% of the URIs terminated successfully, while 50% resulted in errors, including 0.06% exceeding 10 hops. Canonical redirects, such as HTTP to HTTPS transitions, were prevalent, reflecting adherence to SEO best practices. Non-canonical redirects, often involving domain or path changes, highlighted significant web migrations, rebranding, and security risks. Notable patterns included "sink" URIs, where multiple redirects converged, ranging from traffic consolidation by global websites to deliberate "Rickrolling." The study also identified 62,000 custom 404 URIs, almost half being soft 404s, which could compromise SEO and user experience. These findings underscore the critical role of URI redirects in shaping the web while exposing challenges such as outdated URIs, server instability, and improper error handling. This research offers a detailed analysis of URI redirection practices, providing insights into their prevalence, types, and outcomes. By examining a large dataset, we highlight inefficiencies in redirection chains and examine patterns such as the use of "sink" URIs and custom error pages. This information can help webmasters, researchers, and digital archivists improve web usability, optimize resource allocation, and safeguard valuable online content.
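The hop-limited traversal described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' crawler: the function name `follow_redirects`, the in-memory `redirects` map, the `live` set, and the outcome labels are all hypothetical, and no network requests are made. Only the 10-hop limit and the idea of "sink" URIs come from the abstract.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

MAX_HOPS = 10  # hop limit stated in the abstract


def follow_redirects(
    start: str,
    redirects: Dict[str, str],
    live: Set[str],
    max_hops: int = MAX_HOPS,
) -> Tuple[str, List[str]]:
    """Follow a redirect chain in a simulated (in-memory) redirect map.

    Returns an outcome label and the list of URIs visited.
    Labels (hypothetical): "ok", "error", "loop", "too_many_hops".
    """
    chain = [start]
    current = start
    hops = 0
    while current in redirects:
        if hops == max_hops:
            # The chain would need more than max_hops redirections.
            return "too_many_hops", chain
        current = redirects[current]
        hops += 1
        if current in chain:
            # Revisiting a URI means the chain can never terminate.
            chain.append(current)
            return "loop", chain
        chain.append(current)
    # The chain ended: it succeeds only if the final URI resolves.
    return ("ok" if current in live else "error"), chain


def count_sinks(starts: List[str], redirects: Dict[str, str], live: Set[str]) -> Counter:
    """Count how many successful chains converge on each final URI."""
    sinks: Counter = Counter()
    for uri in starts:
        outcome, chain = follow_redirects(uri, redirects, live)
        if outcome == "ok":
            sinks[chain[-1]] += 1
    return sinks


if __name__ == "__main__":
    # Toy data: a canonical HTTP-to-HTTPS redirect and a path change.
    demo_redirects = {
        "http://a.example/": "https://a.example/",
        "https://a.example/": "https://a.example/home",
        "http://old.example/": "https://a.example/home",
    }
    demo_live = {"https://a.example/home"}
    print(follow_redirects("http://a.example/", demo_redirects, demo_live))
    print(count_sinks(list(demo_redirects), demo_redirects, demo_live))
```

A final URI that collects many chains in `count_sinks` corresponds to what the abstract calls a "sink"; a real study would also need to record HTTP status codes, which this sketch reduces to membership in `live`.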

2025-07-29T17:12:16Z Extended version of the paper accepted at the 2025 ACM Web Science Conference (WebSci 2025) Kritika Garg Sawood Alam Dietrich Ayala Michele C. Weigle Michael L. Nelson 10.1145/3717867.3717925 http://arxiv.org/abs/2508.00906v1 Exploring the Role of Gamification in Enhancing Academic Library Services: A Survey of Library Leaders in India 2025-07-29T08:51:41Z

This study explores the role of gamification in enhancing academic library services in India by surveying library leaders across various institutions. Using game-like elements in non-game contexts, gamification can boost user engagement and improve services such as information literacy and research consultations. Findings reveal moderate awareness and generally positive perceptions of gamification's effectiveness. However, challenges like insufficient staff expertise, infrastructure, and limited funding hinder implementation. The study emphasises the need for additional resources, including staff training and technological upgrades, to unlock the full potential of gamification in academic libraries.

2025-07-29T08:51:41Z The final published version will appear in College & Research Libraries, March 2026 Subaveerapandiyan A Pragya Lohia Dattatraya Kalbande Naved Ahmad Kailash Chand Sharma http://arxiv.org/abs/2507.19302v2 Understanding discrepancies in the coverage of OpenAlex: the case of China 2025-07-28T02:15:10Z

Citation indexes play a crucial role in understanding how science is produced, disseminated, and used. However, these databases often face a critical trade-off: those offering extensive and high-quality coverage are typically proprietary, whereas publicly accessible datasets frequently exhibit fragmented coverage and inconsistent data quality. OpenAlex was developed to address this challenge, providing a freely available database with broad open coverage, with a particular emphasis on non-English speaking countries. Yet, few studies have assessed the quality of the OpenAlex dataset. This paper assesses OpenAlex's coverage of China's papers, which shows an abnormal trend, and compares it with that of other countries that do not have English as their main language. Our analysis reveals that while OpenAlex increases the coverage of China's publications, primarily those disseminated by a national database, this coverage is incomplete and discontinuous when compared to other countries' records in the database. We observe similar issues in other non-English-speaking countries, with coverage varying across regions. These findings indicate that although OpenAlex expands coverage of research outputs, continuity issues persist and disproportionately affect certain countries. We emphasize the need for researchers to use OpenAlex data cautiously, being mindful of its potential limitations in cross-national analyses.

2025-07-25T14:18:24Z Mengxue Zheng Lili Miao Yi Bu Vincent Larivière http://arxiv.org/abs/2305.08477v3 Representing provenance and track changes of cultural heritage metadata in RDF: a survey of existing approaches 2025-07-27T13:58:57Z

In the realm of Digital Humanities, the management of cultural heritage metadata is pivotal for ensuring data trustworthiness. Provenance information - contextual metadata detailing the origin and history of data - plays a crucial role in this process. However, tracking provenance and changes in metadata using the Resource Description Framework (RDF) presents significant challenges due to the limitations of foundational Semantic Web technologies. This article offers a comprehensive review of existing models and approaches for representing provenance and tracking changes in RDF, with a specific focus on cultural heritage metadata. It examines W3C standard proposals such as RDF Reification and n-ary relations, along with various alternative systems. Through an in-depth analysis, the study identifies Named Graphs, RDF*, the Provenance Ontology (PROV-O), Dublin Core (DC), Conjectural Graphs, and the OpenCitations Data Model (OCDM) as the most effective solutions. These models are evaluated based on their compliance with RDF standards, scalability, and applicability across different domains. The findings underscore the importance of selecting the appropriate model to ensure robust and reliable management of provenance in RDF datasets, thereby contributing to the ongoing discourse on provenance representation in the Digital Humanities.

2023-05-15T09:28:22Z 23 pages, 4 figures, accepted for publication in Digital Scholarship in the Humanities Arcangelo Massari Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy Digital Humanities Advanced Research Centre Silvio Peroni Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy Digital Humanities Advanced Research Centre Francesca Tomasi Digital Humanities Advanced Research Centre Ivan Heibi Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy Digital Humanities Advanced Research Centre 10.1093/llc/fqaf076 http://arxiv.org/abs/2507.20131v1 Permanent Data Encoding (PDE): A Visual Language for Semantic Compression and Knowledge Preservation in 3-Character Units 2025-07-27T05:11:00Z

Permanent Data Encoding (PDE) is a visual language framework designed for long-term, human-readable, and electrically independent knowledge preservation. By encoding semantic content into compact 2-3 character alphanumeric codes, paired with public dictionaries and rule-based expansion structures, PDE enables information to be visually interpreted and logically reconstructed without reliance on digital systems. Unlike QR codes or binary data, PDE offers a transparent and self-contained method of encoding meaning. This paper outlines the PDE syntax, dictionary protocol, use cases in disaster resilience and AI integration, and its implications as a cross-generational semantic infrastructure.

2025-07-27T05:11:00Z 23 pages, 6 figures. This work introduces a visual language called PDE (Permanent Data Encoding) for semantic compression and post-digital knowledge preservation. Submitted to arXiv for open access and community feedback Yoshiharu Tsuyuki Xianqi Li Yuji Kurihara Kenji Mitsudo http://arxiv.org/abs/2507.20000v1 Matching Game Preferences Through Dialogical Large Language Models: A Perspective 2025-07-26T16:40:17Z

This perspective paper explores the future potential of "conversational intelligence" by examining how Large Language Models (LLMs) could be combined with GRAPHYP's network system to better understand human conversations and preferences. Using recent research and case studies, we propose a conceptual framework that could make AI reasoning transparent and traceable, allowing humans to see and understand how AI reaches its conclusions. We present the conceptual perspective of "Matching Game Preferences through Dialogical Large Language Models (D-LLMs)," a proposed system that would allow multiple users to share their different preferences through structured conversations. This approach envisions personalizing LLMs by embedding individual user preferences directly into how the model makes decisions. The proposed D-LLM framework would require three main components: (1) reasoning processes that could analyze different search experiences and guide performance, (2) classification systems that would identify user preference patterns, and (3) dialogue approaches that could help humans resolve conflicting information. This perspective framework aims to create an interpretable AI system where users could examine, understand, and combine the different human preferences that influence AI responses, detected through GRAPHYP's search experience networks. The goal of this perspective is to envision AI systems that would not only provide answers but also show users how those answers were reached, making artificial intelligence more transparent and trustworthy for human decision-making.

2025-07-26T16:40:17Z 28 pages, 1 figure. Published in Applied Sciences Applied Sciences, 2025, 15(15), 8307 Renaud Fabre Daniel Egret Patrice Bellot 10.3390/app15158307 http://arxiv.org/abs/2507.19092v1 Comparing OCR Pipelines for Folkloristic Text Digitization 2025-07-25T09:22:41Z

The digitization of historical folkloristic materials presents unique challenges due to diverse text layouts, varying print and handwriting styles, and linguistic variations. This study explores different optical character recognition (OCR) approaches for Slovene folkloristic and historical text digitization, integrating both traditional methods and large language models (LLMs) to improve text transcription accuracy while maintaining linguistic and structural integrity. We compare single-stage OCR techniques with multi-stage pipelines that incorporate machine learning-driven post-processing for text normalization and layout reconstruction. While LLM-enhanced methods show promise in refining recognition outputs and improving readability, they also introduce challenges related to unintended modifications, particularly in the preservation of dialectal expressions and historical structures. Our findings provide insights into selecting optimal digitization strategies for large-scale folklore archives and outline recommendations for developing robust OCR pipelines that balance automation with the need for textual authenticity in digital humanities research.

2025-07-25T09:22:41Z 4th edition of DigitalHeritage World Congress and Expo 2025 Octavian M. Machidon Alina L. Machidon http://arxiv.org/abs/2502.14561v3 Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs 2025-07-25T02:46:55Z

This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. Unlike traditional approaches relying on domain-specific pre-trained models like SciBERT, we demonstrate that general-purpose LLMs can be adapted to this task with minimal task-specific data. We evaluate twelve model variations across five prominent open LLM families using zero-, one-, few-, and many-shot prompting. Our experimental study identifies the top-performing model and prompting parameters through extensive in-context learning experiments. We then demonstrate the significant impact of task-specific adaptation by fine-tuning this model, achieving a relative F1-score improvement of 8% on the SciCite dataset and 4.3% on the ACL-ARC dataset compared to the instruction-tuned baseline. These findings provide valuable insights for model selection and prompt engineering. Additionally, we make our end-to-end evaluation framework and models openly available for future use.

2025-02-20T13:45:42Z Accepted for publication on TPDL 2025 Paris Koloveas Serafeim Chatzopoulos Thanasis Vergoulis Christos Tryfonopoulos 10.1007/978-3-032-05409-8_13 http://arxiv.org/abs/2507.18741v1 KuiSCIMA v2.0: Improved Baselines, Calibration, and Cross-Notation Generalization for Historical Chinese Music Notations in Jiang Kui's Baishidaoren Gequ 2025-07-24T18:40:38Z

Optical Music Recognition (OMR) for historical Chinese musical notations, such as suzipu and lülüpu, presents unique challenges due to high class imbalance and limited training data. This paper introduces significant advancements in OMR for Jiang Kui's influential collection Baishidaoren Gequ from 1202. In this work, we develop and evaluate a character recognition model for scarce imbalanced data. We improve upon previous baselines by reducing the Character Error Rate (CER) from 10.4% to 7.1% for suzipu, despite working with 77 highly imbalanced classes, and achieve a remarkable CER of 0.9% for lülüpu. Our models outperform human transcribers, with an average human CER of 15.9% and a best-case CER of 7.6%. We employ temperature scaling to achieve a well-calibrated model with an Expected Calibration Error (ECE) below 0.0162. Using a leave-one-edition-out cross-validation approach, we ensure robust performance across five historical editions. Additionally, we extend the KuiSCIMA dataset to include all 109 pieces from Baishidaoren Gequ, encompassing suzipu, lülüpu, and jianzipu notations. Our findings advance the digitization and accessibility of historical Chinese music, promoting cultural diversity in OMR and expanding its applicability to underrepresented music traditions.
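The two calibration notions named in this abstract, temperature scaling and Expected Calibration Error (ECE), can be illustrated with a short sketch. It is not the authors' code: the function names, the choice of 10 equal-width bins, and the toy inputs are assumptions made here for illustration, and the binning uses half-open intervals, which is one common variant.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature.

    A temperature above 1 flattens the distribution (lower confidence);
    a temperature of 1 leaves the model's output unchanged.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not taken from the paper
) -> float:
    """Weighted average gap between accuracy and mean confidence per bin."""
    n = len(confidences)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Toy logits: raising the temperature lowers the top-class confidence.
    print(max(softmax([2.0, 0.0], temperature=1.0)))
    print(max(softmax([2.0, 0.0], temperature=2.0)))
```

In practice the temperature is a single scalar fitted on held-out data, typically by minimising negative log-likelihood, and ECE is then reported on the rescaled confidences; the fitting step is omitted here.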

2025-07-24T18:40:38Z International Conference on Document Analysis and Recognition. This preprint has not undergone any post-submission improvements or corrections. The Version of Record of this contribution is published in "19th International Conference on Document Analysis and Recognition (ICDAR 2025), Wuhan, China, September 16-21, 2025, Proceedings", and is available online at the External DOI field below Tristan Repolusk Eduardo Veas http://arxiv.org/abs/2507.18201v1 Integrating an ISO 30401-compliant Knowledge Management System with the processes of an Integrated Management System 2025-07-24T08:56:11Z

With the evolution of process approaches within organizations, the increasing importance of quality management systems (like ISO 9001) and the recent introduction of ISO 30401 for knowledge management, we examine how these different elements converge within the framework of an Integrated Management System. The article specifically demonstrates how an ISO30401-compliant knowledge management system can be implemented by deploying the mechanisms of the SECI model through the steps of the PDCA cycle as applied in the processes of the integrated management system.

2025-07-24T08:56:11Z Conférence nationale sur les Applications Pratiques de l'Intelligence Artificielle (APIA), AFIA, Jun 2025, DIJON, France Patrick Prieur Aline Belloni http://arxiv.org/abs/2507.18197v1 Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization 2025-07-24T08:54:19Z

Business process modeling is used by most organizations as an essential framework for ensuring the efficiency and effectiveness of the work and workflows performed by their employees and for ensuring the alignment of such work with their strategic goals. For organizations that are compliant or near-compliant with ISO 9001, this approach involves the detailed mapping of processes, sub-processes, activities, and tasks. ISO30401 is a Management System Standard, introduced in 2018, that establishes universal requirements for setting up a Knowledge Management System in an organization. As ``ISO30401 implementers'' we regularly face the challenge of explaining to our clients how the knowledge development, transformation, and conveyance activities described in ISO30401 integrate with existing operational processes. This article recaps process modelling principles in the context of ISO9001 and explores, based on our experience, how an ISO30401-compliant Knowledge Management System (KMS) intertwines with all other processes of an Integrated Management System and, in particular, how it can be implemented by deploying the mechanisms of the SECI model through the steps of PDCA cycles.

2025-07-24T08:54:19Z in French language. AGeCSO2025 : 18ème Colloque International de l'Association pour la Gestion des Connaissances dans la Société et les Organisations, Association pour la Gestion des Connaissances dans la Société et les Organisations (AGECSO), Jun 2025, TROYES, France Aline Belloni Patrick Prieur