https://arxiv.org/api/9uI6WWHnztxsrLvndr0RtpHmhl0
Feed updated: 2026-06-14T11:43:17Z

http://arxiv.org/abs/2507.05903v2
Title: AI-Reporter: A Path to a New Genre of Scientific Communication
Updated: 2025-07-31T06:23:02Z
Abstract: The AI-Reporter represents a paradigmatic shift in scientific publication practice. This document demonstrates through a concrete case study how our system transforms academic presentations into publication-ready chapters -- in less than three minutes. Using Arno Simons' lecture on Large Language Models from the "Large Language Models for the History, Philosophy, and Sociology of Science" workshop (NEPI) as an example, we show how technological innovation bridges the gap between ephemeral presentation and permanent scientific documentation.
Published: 2025-07-08T11:41:37Z
Authors: Gerd Graßhoff

http://arxiv.org/abs/2507.22871v1
Title: Tracking research software outputs in the UK
Updated: 2025-07-30T17:46:47Z
Abstract: Research software is crucial in the research process and the growth of Open Science underscores the importance of accessing research artifacts, like data and code, raising traceability challenges among outputs. While it is a clear principle that research code, along with other essential outputs, should be recognised as artifacts of the research process, the how of this principle remains variable. This study examines where UK academic institutions store and register software as a unique research output, searching the UKRI's Gateway to Research (GtR) metadata for publicly funded research software in the UK. The quantity of software reported as research outcomes remains low in proportion to other categories. Artifact sharing appears low, with one-quarter of the reported software having no links and 45% having either a missing or erroneous URL. Of the valid URLs, we find the single largest category is Public Commercial Code Repository, with GitHub being the host of 18% of all publicly funded research software listed.
These observations are contrasted with past findings from 2023 and finally, we discuss the lack of artifact sharing in UK research, with resulting implications for the maintenance and evolution of research software. Without dissemination, research software risks demotion to a transient artifact, useful only to meet short term research demands but ultimately lost to the broader enterprise of science.
Published: 2025-07-30T17:46:47Z
Authors: Domhnall Carlin, Austen Rainer

http://arxiv.org/abs/2507.22479v1
Title: Presenting a classifier to detect research contributions in OpenAlex
Updated: 2025-07-30T08:30:23Z
Abstract: This paper introduces a document type classifier with the purpose to optimise the distinction between research and non-research journal publications in OpenAlex. Based on open metadata, the classifier can detect non-research or editorial content within a set of classified articles and reviews (e.g. paratexts, abstracts, editorials, letters). The classifier achieves an F1-score of 0.95, indicating a potential improvement in the data quality of bibliometric research in OpenAlex when applying the classifier on real data. In total, 4,589,967 out of 42,701,863 articles and reviews could be reclassified as non-research contributions by the classifier, representing a share of 10.75%.
Published: 2025-07-30T08:30:23Z
Authors: Nick Haupka
DOI: 10.1007/s11192-025-05524-7

http://arxiv.org/abs/2507.22391v1
Title: Knowledge engineering for open science: Building and deploying knowledge bases for metadata standards
Updated: 2025-07-30T05:26:49Z
Abstract: Scientists strive to make their datasets available in open repositories, with the goal that they be findable, accessible, interoperable, and reusable (FAIR). Although it is hard for most investigators to remember all the guiding principles associated with FAIR data, there is one overarching requirement: The data need to be annotated with rich, discipline-specific, standardized metadata.
The Center for Expanded Data Annotation and Retrieval (CEDAR) builds technology that enables scientists to encode metadata standards as templates that enumerate the attributes of different kinds of experiments. These metadata templates capture preferences regarding how data should be described and what a third party needs to know to make sense of the datasets. CEDAR templates describing community metadata preferences have been used to standardize metadata for a variety of scientific consortia. They have been used as the basis for data-annotation systems that acquire metadata through Web forms or through spreadsheets, and they can help correct metadata to ensure adherence to standards. Like the declarative knowledge bases that underpinned intelligent systems decades ago, CEDAR templates capture the knowledge in symbolic form, and they allow that knowledge to be applied in a variety of settings. They provide a mechanism for scientific communities to create shared metadata standards and to encode their preferences for the application of those standards, and for deploying those standards in a range of intelligent systems to promote open science.
Published: 2025-07-30T05:26:49Z
Comment: 22 pages, 7 figures
Authors: Mark A. Musen, Martin J. O'Connor, Josef Hardi, Marcos Martinez-Romero

http://arxiv.org/abs/2507.22019v1
Title: Not Here, Go There: Analyzing Redirection Patterns on the Web
Updated: 2025-07-29T17:12:16Z
Abstract: URI redirections are integral to web management, supporting structural changes, SEO optimization, and security. However, their complexities affect usability, SEO performance, and digital preservation. This study analyzed 11 million unique redirecting URIs, following redirections up to 10 hops per URI, to uncover patterns and implications of redirection practices. Our findings revealed that 50% of the URIs terminated successfully, while 50% resulted in errors, including 0.06% exceeding 10 hops. Canonical redirects, such as HTTP to HTTPS transitions, were prevalent, reflecting adherence to SEO best practices.
Non-canonical redirects, often involving domain or path changes, highlighted significant web migrations, rebranding, and security risks. Notable patterns included "sink" URIs, where multiple redirects converged, ranging from traffic consolidation by global websites to deliberate "Rickrolling." The study also identified 62,000 custom 404 URIs, almost half being soft 404s, which could compromise SEO and user experience. These findings underscore the critical role of URI redirects in shaping the web while exposing challenges such as outdated URIs, server instability, and improper error handling. This research offers a detailed analysis of URI redirection practices, providing insights into their prevalence, types, and outcomes. By examining a large dataset, we highlight inefficiencies in redirection chains and examine patterns such as the use of "sink" URIs and custom error pages. This information can help webmasters, researchers, and digital archivists improve web usability, optimize resource allocation, and safeguard valuable online content.
Published: 2025-07-29T17:12:16Z
Comment: Extended version of the paper accepted at the 2025 ACM Web Science Conference (WebSci 2025)
Authors: Kritika Garg, Sawood Alam, Dietrich Ayala, Michele C. Weigle, Michael L. Nelson
DOI: 10.1145/3717867.3717925

http://arxiv.org/abs/2508.00906v1
Title: Exploring the Role of Gamification in Enhancing Academic Library Services: A Survey of Library Leaders in India
Updated: 2025-07-29T08:51:41Z
Abstract: This study explores the role of gamification in enhancing academic library services in India by surveying library leaders across various institutions. Using game-like elements in non-game contexts, gamification can boost user engagement and improve services such as information literacy and research consultations. Findings reveal moderate awareness and generally positive perceptions of gamification's effectiveness. However, challenges like insufficient staff expertise, infrastructure, and limited funding hinder implementation.
The study emphasises the need for additional resources, including staff training and technological upgrades, to unlock the full potential of gamification in academic libraries.
Published: 2025-07-29T08:51:41Z
Comment: The final published version will appear in College & Research Libraries, March 2026
Authors: Subaveerapandiyan A, Pragya Lohia, Dattatraya Kalbande, Naved Ahmad, Kailash Chand Sharma

http://arxiv.org/abs/2507.19302v2
Title: Understanding discrepancies in the coverage of OpenAlex: the case of China
Updated: 2025-07-28T02:15:10Z
Abstract: Citation indexes play a crucial role for understanding how science is produced, disseminated, and used. However, these databases often face a critical trade-off: those offering extensive and high-quality coverage are typically proprietary, whereas publicly accessible datasets frequently exhibit fragmented coverage and inconsistent data quality. OpenAlex was developed to address this challenge, providing a freely available database with broad open coverage, with a particular emphasis on non-English speaking countries. Yet, few studies have assessed the quality of the OpenAlex dataset. This paper assesses the coverage, by OpenAlex, of China's papers, which shows an abnormal trend, and compares it with other countries that do not have English as their main language. Our analysis reveals that while OpenAlex increases the coverage of China's publications, primarily those disseminated by a national database, this coverage is incomplete and discontinuous when compared to other countries' records in the database. We observe similar issues in other non-English-speaking countries, with coverage varying across regions. These findings indicate that although OpenAlex expands coverage of research outputs, continuity issues persist and disproportionately affect certain countries.
We emphasize the need for researchers to use OpenAlex data cautiously, being mindful of its potential limitations in cross-national analyses.
Published: 2025-07-25T14:18:24Z
Authors: Mengxue Zheng, Lili Miao, Yi Bu, Vincent Larivière

http://arxiv.org/abs/2305.08477v3
Title: Representing provenance and track changes of cultural heritage metadata in RDF: a survey of existing approaches
Updated: 2025-07-27T13:58:57Z
Abstract: In the realm of Digital Humanities, the management of cultural heritage metadata is pivotal for ensuring data trustworthiness. Provenance information - contextual metadata detailing the origin and history of data - plays a crucial role in this process. However, tracking provenance and changes in metadata using the Resource Description Framework (RDF) presents significant challenges due to the limitations of foundational Semantic Web technologies. This article offers a comprehensive review of existing models and approaches for representing provenance and tracking changes in RDF, with a specific focus on cultural heritage metadata. It examines W3C standard proposals such as RDF Reification and n-ary relations, along with various alternative systems. Through an in-depth analysis, the study identifies Named Graphs, RDF*, the Provenance Ontology (PROV-O), Dublin Core (DC), Conjectural Graphs, and the OpenCitations Data Model (OCDM) as the most effective solutions. These models are evaluated based on their compliance with RDF standards, scalability, and applicability across different domains.
The findings underscore the importance of selecting the appropriate model to ensure robust and reliable management of provenance in RDF datasets, thereby contributing to the ongoing discourse on provenance representation in the Digital Humanities.
Published: 2023-05-15T09:28:22Z
Comment: 23 pages, 4 figures, accepted for publication in Digital Scholarship in the Humanities
Authors:
  Arcangelo Massari (Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy; Digital Humanities Advanced Research Centre)
  Silvio Peroni (Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy; Digital Humanities Advanced Research Centre)
  Francesca Tomasi (Digital Humanities Advanced Research Centre)
  Ivan Heibi (Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy; Digital Humanities Advanced Research Centre)
DOI: 10.1093/llc/fqaf076

http://arxiv.org/abs/2507.20131v1
Title: Permanent Data Encoding (PDE): A Visual Language for Semantic Compression and Knowledge Preservation in 3-Character Units
Updated: 2025-07-27T05:11:00Z
Abstract: Permanent Data Encoding (PDE) is a visual language framework designed for long-term, human-readable, and electrically independent knowledge preservation. By encoding semantic content into compact 2-3 character alphanumeric codes, paired with public dictionaries and rule-based expansion structures, PDE enables information to be visually interpreted and logically reconstructed without reliance on digital systems. Unlike QR codes or binary data, PDE offers a transparent and self-contained method of encoding meaning. This paper outlines the PDE syntax, dictionary protocol, use cases in disaster resilience and AI integration, and its implications as a cross-generational semantic infrastructure.
Published: 2025-07-27T05:11:00Z
Comment: 23 pages, 6 figures.
This work introduces a visual language called PDE (Permanent Data Encoding) for semantic compression and post-digital knowledge preservation. Submitted to arXiv for open access and community feedback
Authors: Yoshiharu Tsuyuki, Xianqi Li, Yuji Kurihara, Kenji Mitsudo

http://arxiv.org/abs/2507.20000v1
Title: Matching Game Preferences Through Dialogical Large Language Models: A Perspective
Updated: 2025-07-26T16:40:17Z
Abstract: This perspective paper explores the future potential of "conversational intelligence" by examining how Large Language Models (LLMs) could be combined with GRAPHYP's network system to better understand human conversations and preferences. Using recent research and case studies, we propose a conceptual framework that could make AI reasoning transparent and traceable, allowing humans to see and understand how AI reaches its conclusions. We present the conceptual perspective of "Matching Game Preferences through Dialogical Large Language Models (D-LLMs)," a proposed system that would allow multiple users to share their different preferences through structured conversations. This approach envisions personalizing LLMs by embedding individual user preferences directly into how the model makes decisions. The proposed D-LLM framework would require three main components: (1) reasoning processes that could analyze different search experiences and guide performance, (2) classification systems that would identify user preference patterns, and (3) dialogue approaches that could help humans resolve conflicting information. This perspective framework aims to create an interpretable AI system where users could examine, understand, and combine the different human preferences that influence AI responses, detected through GRAPHYP's search experience networks.
The goal of this perspective is to envision AI systems that would not only provide answers but also show users how those answers were reached, making artificial intelligence more transparent and trustworthy for human decision-making.
Published: 2025-07-26T16:40:17Z
Comment: 28 pages, 1 figure. Published in Applied Sciences
Journal reference: Applied Sciences, 2025, 15(15), 8307
Authors: Renaud Fabre, Daniel Egret, Patrice Bellot
DOI: 10.3390/app15158307

http://arxiv.org/abs/2507.19092v1
Title: Comparing OCR Pipelines for Folkloristic Text Digitization
Updated: 2025-07-25T09:22:41Z
Abstract: The digitization of historical folkloristic materials presents unique challenges due to diverse text layouts, varying print and handwriting styles, and linguistic variations. This study explores different optical character recognition (OCR) approaches for Slovene folkloristic and historical text digitization, integrating both traditional methods and large language models (LLMs) to improve text transcription accuracy while maintaining linguistic and structural integrity. We compare single-stage OCR techniques with multi-stage pipelines that incorporate machine learning-driven post-processing for text normalization and layout reconstruction. While LLM-enhanced methods show promise in refining recognition outputs and improving readability, they also introduce challenges related to unintended modifications, particularly in the preservation of dialectal expressions and historical structures. Our findings provide insights into selecting optimal digitization strategies for large-scale folklore archives and outline recommendations for developing robust OCR pipelines that balance automation with the need for textual authenticity in digital humanities research.
Published: 2025-07-25T09:22:41Z
Comment: 4th edition of DigitalHeritage World Congress and Expo 2025
Authors: Octavian M. Machidon, Alina L. Machidon

http://arxiv.org/abs/2502.14561v3
Title: Can LLMs Predict Citation Intent?
An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs
Updated: 2025-07-25T02:46:55Z
Abstract: This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. Unlike traditional approaches relying on domain-specific pre-trained models like SciBERT, we demonstrate that general-purpose LLMs can be adapted to this task with minimal task-specific data. We evaluate twelve model variations across five prominent open LLM families using zero-, one-, few-, and many-shot prompting. Our experimental study identifies the top-performing model and prompting parameters through extensive in-context learning experiments. We then demonstrate the significant impact of task-specific adaptation by fine-tuning this model, achieving a relative F1-score improvement of 8% on the SciCite dataset and 4.3% on the ACL-ARC dataset compared to the instruction-tuned baseline. These findings provide valuable insights for model selection and prompt engineering. Additionally, we make our end-to-end evaluation framework and models openly available for future use.
Published: 2025-02-20T13:45:42Z
Comment: Accepted for publication on TPDL 2025
Authors: Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos
DOI: 10.1007/978-3-032-05409-8_13

http://arxiv.org/abs/2507.18741v1
Title: KuiSCIMA v2.0: Improved Baselines, Calibration, and Cross-Notation Generalization for Historical Chinese Music Notations in Jiang Kui's Baishidaoren Gequ
Updated: 2025-07-24T18:40:38Z
Abstract: Optical Music Recognition (OMR) for historical Chinese musical notations, such as suzipu and lülüpu, presents unique challenges due to high class imbalance and limited training data. This paper introduces significant advancements in OMR for Jiang Kui's influential collection Baishidaoren Gequ from 1202. In this work, we develop and evaluate a character recognition model for scarce imbalanced data.
We improve upon previous baselines by reducing the Character Error Rate (CER) from 10.4% to 7.1% for suzipu, despite working with 77 highly imbalanced classes, and achieve a remarkable CER of 0.9% for lülüpu. Our models outperform human transcribers, with an average human CER of 15.9% and a best-case CER of 7.6%. We employ temperature scaling to achieve a well-calibrated model with an Expected Calibration Error (ECE) below 0.0162. Using a leave-one-edition-out cross-validation approach, we ensure robust performance across five historical editions. Additionally, we extend the KuiSCIMA dataset to include all 109 pieces from Baishidaoren Gequ, encompassing suzipu, lülüpu, and jianzipu notations. Our findings advance the digitization and accessibility of historical Chinese music, promoting cultural diversity in OMR and expanding its applicability to underrepresented music traditions.
Published: 2025-07-24T18:40:38Z
Comment: International Conference on Document Analysis and Recognition. This preprint has not undergone any post-submission improvements or corrections. The Version of Record of this contribution is published in "19th International Conference on Document Analysis and Recognition (ICDAR 2025), Wuhan, China, September 16-21, 2025, Proceedings", and is available online at the External DOI field below
Authors: Tristan Repolusk, Eduardo Veas

http://arxiv.org/abs/2507.18201v1
Title: Integrating an ISO 30401-compliant Knowledge Management System with the processes of an Integrated Management System
Updated: 2025-07-24T08:56:11Z
Abstract: With the evolution of process approaches within organizations, the increasing importance of quality management systems (like ISO 9001) and the recent introduction of ISO 30401 for knowledge management, we examine how these different elements converge within the framework of an Integrated Management System.
The article specifically demonstrates how an ISO30401-compliant knowledge management system can be implemented by deploying the mechanisms of the SECI model through the steps of the PDCA cycle as applied in the processes of the integrated management system.
Published: 2025-07-24T08:56:11Z
Comment: Conférence nationale sur les Applications Pratiques de l'Intelligence Artificielle (APIA), AFIA, Jun 2025, DIJON, France
Authors: Patrick Prieur, Aline Belloni

http://arxiv.org/abs/2507.18197v1
Title: Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization
Updated: 2025-07-24T08:54:19Z
Abstract: Business process modeling is used by most organizations as an essential framework for ensuring efficiency and effectiveness of the work and workflow performed by its employees and for ensuring the alignment of such work with its strategic goals. For organizations that are compliant or near-compliant with ISO 9001, this approach involves the detailed mapping of processes, sub-processes, activities, and tasks. ISO30401 is a Management System Standard, introduced in 2018, establishing universal requirements for the set up of a Knowledge Management System in an organization. As "ISO30401 implementers" we regularly face the challenge of explaining to our clients how the knowledge development, transformation and conveyance activities depicted in ISO30401 integrate with existing operational processes. This article recaps process modelling principles in the context of ISO9001 and explores, based on our experience, how an ISO30401-compliant Knowledge Management System (KMS) entwines with all other processes of an Integrated Management System and in particular how it can be implemented by deploying the mechanisms of the SECI model through the steps of PDCA cycles.
Published: 2025-07-24T08:54:19Z
Comment: In French.
AGeCSO2025: 18ème Colloque International de l'Association pour la Gestion des Connaissances dans la Société et les Organisations, Association pour la Gestion des Connaissances dans la Société et les Organisations (AGECSO), Jun 2025, TROYES, France
Authors: Aline Belloni, Patrick Prieur