http://arxiv.org/abs/2411.05409v3 Web Archives Metadata Generation with GPT-4o: Challenges and Insights 2025-06-19T16:56:00Z

Current metadata creation for web archives is time-consuming and costly due to reliance on human effort. This paper explores the use of GPT-4o for metadata generation within the Web Archive Singapore, focusing on scalability, efficiency, and cost-effectiveness. We processed 112 Web ARChive (WARC) files using data reduction techniques, achieving a notable 99.9% reduction in metadata generation costs. Through prompt engineering, we generated titles and abstracts, which were evaluated both intrinsically using Levenshtein Distance and BERTScore, and extrinsically with human cataloguers using McNemar's test. Results indicate that while our method offers significant cost savings and efficiency gains, human-curated metadata maintains an edge in quality. The study identifies key challenges including content inaccuracies, hallucinations, and translation issues, suggesting that Large Language Models (LLMs) should serve as complements rather than replacements for human cataloguers. Future work will focus on refining prompts, improving content filtering, and addressing privacy concerns through experimentation with smaller models. This research advances the integration of LLMs in web archiving, offering valuable insights into their current capabilities and outlining directions for future enhancements. The code is available at https://github.com/masamune-prog/warc2summary for further development and use by institutions facing similar challenges.
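
As a rough illustration of the intrinsic evaluation mentioned above, the sketch below computes the Levenshtein distance between a generated title and a reference title. It is not taken from the warc2summary repository; the function names and the length-based normalization are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

A character-level distance like this rewards surface overlap only, which is presumably why the abstract pairs it with an embedding-based measure such as BERTScore.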

2024-11-08T08:59:40Z Published in Information Technology and Libraries, Vol. 44, No. 2, June 2025 Ashwin Nair Zhen Rong Goh Tianrui Liu Abigail Yongping Huang 10.5860/ital.v44i2.17305 http://arxiv.org/abs/2505.06448v4 Gaming the Metrics? Bibliometric Anomalies and the Integrity Crisis in Global University Rankings 2025-06-19T12:39:22Z

Global university rankings have reshaped how academic success is defined, incentivizing metrics such as publication counts and citation rates at the expense of scholarly integrity. This study examines 18 universities in India, Lebanon, Saudi Arabia, and the United Arab Emirates, selected from among the world's 1,000 most-publishing institutions for their extraordinary research growth and sharp declines in first and corresponding authorship. These institutions exhibit bibliometric patterns consistent with strategic metric optimization, including publication surges of up to 965%, a proliferation of hyper-prolific authors, dense reciprocal co-authorship and citation networks, elevated shares of output in delisted journals, and rising retraction rates. These patterns are analyzed in light of Goodhart's Law and institutional isomorphism, illustrating how performance pressures can reshape academic behavior. To systematically assess and monitor such risks, the study introduces the Research Integrity Risk Index (RI2), a composite indicator based on retraction rates and reliance on delisted journals. RI2 effectively identifies institutions with bibliometric profiles that diverge from global norms and may warrant closer examination. The findings highlight the urgent need for integrity-sensitive reforms in how rankings, funders, and institutions assess scholarly performance.

2025-05-09T21:34:54Z Lokman I. Meho http://arxiv.org/abs/2506.16051v1 From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience 2025-06-19T06:09:01Z

Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.

2025-06-19T06:09:01Z Zhiwei Li Carl Kesselman Tran Huy Nguyen Benjamin Yixing Xu Kyle Bolo Kimberley Yu http://arxiv.org/abs/2410.07969v2 PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science 2025-06-18T09:08:47Z

Papers, patents, and clinical trials are essential scientific resources in biomedicine, crucial for knowledge sharing and dissemination. However, these documents are often stored in disparate databases with varying management standards and data formats, making it challenging to form systematic and fine-grained connections among them. To address this issue, we construct PKG 2.0, a comprehensive knowledge graph dataset encompassing over 36 million papers, 1.3 million patents, and 0.48 million clinical trials in the biomedical field. PKG 2.0 integrates these dispersed resources through 482 million biomedical entity linkages, 19 million citation linkages, and 7 million project linkages. The construction of PKG 2.0 wove together fine-grained biomedical entity extraction, high-performance author name disambiguation, multi-source citation integration, and high-quality project data from the NIH Exporter. Data validation demonstrates that PKG 2.0 excels in key tasks such as author disambiguation and biomedical entity recognition. This dataset provides valuable resources for biomedical researchers, bibliometric scholars, and those engaged in literature mining.

2024-10-10T14:28:10Z 32 pages, 6 figures, 28 tables Sci Data 12, 1018 (2025) Jian Xu Chao Yu Jiawei Xu Vetle I. Torvik Jaewoo Kang Mujeen Sung Min Song Yi Bu Ying Ding 10.1038/s41597-025-05343-8 http://arxiv.org/abs/2506.14715v1 Procedural Knowledge Libraries: Towards Executable (Research) Memory 2025-06-17T16:52:12Z

Procedural Knowledge Libraries (PKLs) are frameworks for capturing the full arc of scientific inquiry, not just its outcomes. Whereas traditional libraries store static end products, PKLs preserve the process that leads to those results, including hypotheses, failures, decisions, and iterations. By addressing the loss of tacit knowledge -- typically buried in notebooks, emails, or memory -- PKLs lay a foundation for reproducible, collaborative, and adaptive research. PKLs provide executable, version-controlled records that contextualize each step of a research process. For example, a researcher using Jupyter notebooks could share not just final outputs, but also the reasoning, discarded approaches, and intermediate analyses that informed them. This work proposes a framework for implementing PKLs within the Jupyter ecosystem, supported by a lens-based transformation model and procedural storage schema.

2025-06-17T16:52:12Z 24 pages Hamidah Oderinwale http://arxiv.org/abs/2506.14503v1 An Open Research Dataset of the 1932 Cairo Congress of Arab Music 2025-06-17T13:28:27Z

This paper introduces ORD-CC32, an open research dataset derived from the 1932 Cairo Congress of Arab Music recordings, a historically significant collection representing diverse Arab musical traditions. The dataset includes structured metadata, melodic and rhythmic mode tags (maqam and iqa), manually labeled tonic information, and acoustic features extracted using state-of-the-art pitch detection methods. These resources support computational studies of tuning, temperament, and regional variations in Arab music. A case study using pitch histograms demonstrates the potential for data-driven analysis of microtonal differences across regions. By making this dataset openly available, we aim to enable interdisciplinary research in computational ethnomusicology, music information retrieval (MIR), cultural studies, and digital heritage preservation. ORD-CC32 is shared on Zenodo with tools for feature extraction and metadata retrieval.

2025-06-17T13:28:27Z 14 pages, 4 figures, 4 tables Baris Bozkurt College of Interdisciplinary Studies, Zayed University, Dubai, United Arab Emirates http://arxiv.org/abs/2506.14430v1 Works-magnet: Accelerating Metadata Curation for Open Science 2025-06-17T11:46:32Z

The transition to Open Science necessitates robust and reliable metadata. While national initiatives, such as the French Open Science Monitor, aim to track this evolution using open data, reliance on proprietary databases persists in many places. Open platforms like OpenAlex still require significant human intervention for data accuracy. This paper introduces Works-magnet, a project by the French Ministry of Higher Education and Research (MESR) Data Science & Engineering Team. Works-magnet is designed to accelerate the curation of bibliographic and research data metadata, particularly affiliations, by making automated AI calculations visible and correctable. It addresses challenges related to metadata heterogeneity, complex processing chains, and the need for human curation in a diverse research landscape. The paper details Works-magnet's concepts and observed limitations, while outlining future directions for enhancing open metadata quality and reusability. The Works-magnet app is open source on GitHub: https://github.com/dataesr/works-magnet

2025-06-17T11:46:32Z Eric Jeangirard http://arxiv.org/abs/2505.11633v2 Chatting with Papers: A Hybrid Approach Using LLMs and Knowledge Graphs 2025-06-17T10:48:44Z

This demo paper reports on a new workflow, GhostWriter, that combines Large Language Models and Knowledge Graphs (semantic artifacts) to support navigation through collections. Situated in the research area of Retrieval Augmented Generation, this workflow concerns the creation of local and adaptable chatbots. Based on the tool-suite EverythingData at the backend, GhostWriter provides an interface that enables querying and "chatting" with a collection. Applied iteratively, the workflow supports the information needs of researchers interacting with a collection of papers, whether to gain an overview or to learn more about a specific concept and its context, and ultimately helps the researcher refine their research question in a controlled way. We demonstrate the workflow for a collection of articles from the method data analysis journal published by GESIS -- Leibniz-Institute for the Social Sciences. We also point to further application areas.

2025-05-16T18:51:51Z 10 pages, 3 figures, Accepted at Joint Workshop of the 5th AI + Informetrics (AII) and the 6th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE) Vyacheslav Tykhonov Han Yang Philipp Mayr Jetze Touber Andrea Scharnhorst http://arxiv.org/abs/2506.13525v1 Implicit and Explicit Research Quality Score Probabilities from ChatGPT 2025-06-16T14:18:23Z

The large language model (LLM) ChatGPT's quality scores for journal articles correlate more strongly with human judgements than some citation-based indicators in most fields. Averaging multiple ChatGPT scores improves the results, apparently leveraging its internal probability model. To leverage these probabilities, this article tests two novel strategies: requesting percentage likelihoods for scores and extracting the probabilities of alternative tokens in the responses. The probability estimates were then used to calculate weighted average scores. Both strategies were evaluated with five iterations of ChatGPT 4o-mini on 96,800 articles submitted to the UK Research Excellence Framework (REF) 2021, using departmental average REF2021 quality scores as a proxy for article quality. The data was analysed separately for each of the 34 field-based REF Units of Assessment. For the first strategy, explicit requests for tables of score percentage likelihoods substantially decreased the value of the scores (lower correlation with the proxy quality indicator). In contrast, weighted averages of score token probabilities slightly increased the correlation with the quality proxy indicator and these probabilities reasonably accurately reflected ChatGPT's outputs. The token probability approach is therefore the most accurate method for ranking articles by research quality as well as being cheaper than comparable ChatGPT strategies.
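
The weighted average described above can be sketched as follows. This is an illustration rather than the authors' code: the function name, the input format (a mapping from candidate score tokens to probabilities), the allowed score set "1" to "4", and the renormalization over score tokens are all assumptions made for the example.

```python
def weighted_score(token_probs, allowed=("1", "2", "3", "4")):
    """Probability-weighted mean over candidate score tokens.

    token_probs maps each alternative token offered at the score position to
    its probability. Tokens outside `allowed` (an assumed score scale) are
    ignored, and the remaining probabilities are renormalized to sum to 1.
    """
    kept = {}
    for token, prob in token_probs.items():
        token = token.strip()
        if token in allowed:
            kept[token] = kept.get(token, 0.0) + prob
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no score tokens found among the alternatives")
    return sum(int(token) * prob for token, prob in kept.items()) / total


# Hypothetical probabilities for one article: the model mostly favours "3".
example = {"3": 0.6, "4": 0.3, "2": 0.1}
print(weighted_score(example))
```

With the hypothetical probabilities shown, the weighted score works out to 3 × 0.6 + 4 × 0.3 + 2 × 0.1 = 3.2, a finer-grained value than the single most likely token ("3") would give.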

2025-06-16T14:18:23Z Mike Thelwall Yunhan Yang http://arxiv.org/abs/2506.13256v1 Accessibility Barriers in Multi-Terabyte Public Datasets: The Gap Between Promise and Practice 2025-06-16T08:52:58Z

The promise of "free and open" multi-terabyte datasets often collides with harsh realities. While these datasets may be technically accessible, practical barriers -- from processing complexity to hidden costs -- create a system that primarily serves well-funded institutions. This study examines accessibility challenges across web crawls, satellite imagery, scientific data, and collaborative projects, revealing a consistent two-tier system where theoretical openness masks practical exclusivity. Our analysis demonstrates that datasets marketed as "publicly accessible" typically require minimum investments of $1,000+ for meaningful analysis, with complex processing pipelines demanding $10,000-100,000+ in infrastructure costs. The infrastructure requirements -- distributed computing knowledge, domain expertise, and substantial budgets -- effectively gatekeep these datasets despite their "open" status, limiting practical accessibility to those with institutional support or substantial resources.

2025-06-16T08:52:58Z 5 pages, 28 references. Analysis of practical barriers to accessing multi-terabyte public datasets Marc Bara http://arxiv.org/abs/2506.12440v1 Style-based Composer Identification and Attribution of Symbolic Music Scores: a Systematic Survey 2025-06-14T10:34:07Z

This paper presents the first comprehensive systematic review of literature on style-based composer identification and authorship attribution in symbolic music scores. Addressing the critical need for improved reliability and reproducibility in this field, the review rigorously analyzes 58 peer-reviewed papers published across various historical periods, with the search adapted to evolving terminology. The analysis critically assesses prevailing repertoires, computational approaches, and evaluation methodologies, highlighting significant challenges. It reveals that a substantial portion of existing research suffers from inadequate validation protocols and an over-reliance on simple accuracy metrics for often imbalanced datasets, which can undermine the credibility of attribution claims. The crucial role of robust metrics like Balanced Accuracy and rigorous cross-validation in ensuring trustworthy results is emphasized. The survey also details diverse feature representations and the evolution of machine learning models employed. Notable real-world authorship attribution cases, such as those involving works attributed to Bach, Josquin Desprez, and Lennon-McCartney, are specifically discussed, illustrating the opportunities and pitfalls of applying computational techniques to resolve disputed musical provenance. Based on these insights, a set of actionable guidelines for future research is proposed. These recommendations are designed to significantly enhance the reliability, reproducibility, and musicological validity of composer identification and authorship attribution studies, fostering more robust and interpretable computational stylistic analysis.
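
To illustrate why plain accuracy can mislead on imbalanced data, the sketch below compares it with Balanced Accuracy, defined here as the mean of per-class recall. The labels are toy data invented for the example, not results from any surveyed paper.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy corpus: nine pieces by composer "A", one by composer "B".
y_true = ["A"] * 9 + ["B"]
# A degenerate classifier that always answers "A".
y_pred = ["A"] * 10

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-"A" classifier looks strong at 0.9 accuracy, yet its balanced accuracy of 0.5 shows it never recognises composer "B", which is the kind of failure the survey warns that simple accuracy can hide.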

2025-06-14T10:34:07Z Accepted at the TISMIR 2025 Transactions of the International Society for Music Information Retrieval 8 p. 213-235 Federico Simonetta 10.5334/tismir.240 http://arxiv.org/abs/2401.16845v4 Chronicling Germany: An Annotated Historical Newspaper Dataset 2025-06-13T18:04:57Z

The correct detection of dense article layout and the recognition of characters in historical newspaper pages remain a challenging requirement for Natural Language Processing (NLP) and machine learning applications on historical newspapers in the field of digital history. Digital newspaper portals for historic Germany typically provide Optical Character Recognition (OCR) text, albeit of varying quality. Unfortunately, layout information is often missing, limiting this rich source's scope. Our dataset is designed to enable the training of layout and OCR models for historic German-language newspapers. The Chronicling Germany dataset contains 693 annotated historical newspaper pages from the time period between 1852 and 1924. The paper presents a processing pipeline and establishes baseline results on in- and out-of-domain test data using this pipeline. Both our dataset and the corresponding baseline code are freely available online. This work creates a starting point for future research in the field of digital history and historic German-language newspaper processing. Furthermore, it provides the opportunity to study a low-resource task in computer vision.

2024-01-30T09:39:04Z Dataset available at: https://gitlab.uni-bonn.de/digital-history/Chronicling-Germany-Dataset . Baseline code: https://github.com/Digital-History-Bonn/Chronicling-Germany-Code Christian Schultze High-Performance Computing and Analytics Niklas Kerkfeld High-Performance Computing and Analytics Kara Kuebart Institut für Geschichtswissenschaft Universität Bonn Princilia Weber Institut für Geschichtswissenschaft Universität Bonn Moritz Wolter High-Performance Computing and Analytics Felix Selgert Institut für Geschichtswissenschaft Universität Bonn http://arxiv.org/abs/2506.10488v2 Sheet Music Benchmark: Standardized Optical Music Recognition Evaluation 2025-06-13T07:32:56Z

In this work, we introduce the Sheet Music Benchmark (SMB), a dataset of six hundred and eighty-five pages specifically designed to benchmark Optical Music Recognition (OMR) research. SMB encompasses a diverse array of musical textures, including monophony, pianoform, quartet, and others, all encoded in Common Western Modern Notation using the Humdrum **kern format. Alongside SMB, we introduce the OMR Normalized Edit Distance (OMR-NED), a new metric tailored explicitly for evaluating OMR performance. OMR-NED builds upon the widely-used Symbol Error Rate (SER), offering a fine-grained and detailed error analysis that covers individual musical elements such as note heads, beams, pitches, accidentals, and other critical notation features. The resulting numeric score provided by OMR-NED facilitates clear comparisons, enabling researchers and end-users alike to identify optimal OMR approaches. Our work thus addresses a long-standing gap in OMR evaluation, and we support our contributions with baseline experiments using standardized SMB dataset splits for training and assessing state-of-the-art methods.

2025-06-12T08:42:19Z Juan C. Martinez-Sevilla Joan Cerveto-Serrano Noelia Luna Greg Chapman Craig Sapp David Rizo Jorge Calvo-Zaragoza http://arxiv.org/abs/2506.10942v1 Building a Media Ecosystem Observatory from Scratch: Infrastructure, Methodology, and Insights 2025-06-12T17:46:54Z

Understanding the flow of information across today's fragmented digital media landscape requires scalable, cross-platform infrastructure. In this paper, we present the Canadian Media Ecosystem Observatory, a national-scale infrastructure designed to monitor political and media discourse across platforms in near real time. The Media Ecosystem Observatory (MEO) data infrastructure features custom crawlers for major platforms, a unified indexing pipeline, and a normalization layer that harmonizes heterogeneous schemas into a common data model. Semantic embeddings are computed for each post to enable similarity search and vector-based analyses such as topic modeling and clustering. Processed and raw data are made accessible through an API, dashboards, and a website, supporting both automated and ad hoc research workflows. We illustrate the utility of the observatory through example analyses of major Canadian political events, including Meta's 2023 news ban and the recent federal elections. As a whole, the system offers a model for digital trace infrastructure and an evolving research platform for studying the dynamics of modern media ecosystems.
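
The embedding-based similarity search mentioned above might look roughly like the following. This is a generic sketch, not the observatory's implementation: the abstract does not name the similarity measure, so cosine similarity is assumed here, and the function names, post identifiers, and two-dimensional vectors are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, index, k=3):
    """Return the k (post_id, similarity) pairs closest to the query vector."""
    scored = [(post_id, cosine_similarity(query, vec)) for post_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical index: post identifiers mapped to tiny embedding vectors.
index = {"post-a": [1.0, 0.0], "post-b": [0.0, 1.0], "post-c": [1.0, 1.0]}
print(most_similar([1.0, 0.0], index, k=2))
```

A brute-force scan like this is only practical for small collections; at national scale a dedicated vector index would presumably be needed, though the abstract does not say which one is used.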

2025-06-12T17:46:54Z Zeynep Pehlivan Saewon Park Alexei Sisulu Abrahams Mika Desblancs-Patel Benjamin David Steel Aengus Bridgman http://arxiv.org/abs/2503.14519v3 Content ARCs: Decentralized Content Rights in the Age of Generative AI 2025-06-12T16:32:35Z

The rise of Generative AI (GenAI) has sparked significant debate over balancing the interests of creative rightsholders and AI developers. As GenAI models are trained on vast datasets that often include copyrighted material, questions around fair compensation and proper attribution have become increasingly urgent. To address these challenges, this paper proposes a framework called Content ARCs (Authenticity, Rights, Compensation). By combining open standards for provenance and dynamic licensing with data attribution and decentralized technologies, Content ARCs create a mechanism for managing rights and compensating creators for the use of their work in AI training. We characterize several nascent works in the AI data licensing space within Content ARCs and identify where challenges remain to fully implement the end-to-end framework.

2025-03-14T11:57:08Z Accepted to IEEE International Conference on AI and the Digital Economy (CADE 2025) Kar Balan Andrew Gilbert John Collomosse