https://arxiv.org/api/MSBtTA9vLhNaKdRv0DvQwDw+hOE 2026-06-10T04:27:19Z 6061 105 15 http://arxiv.org/abs/2605.03537v1 A Skill-Based AI Agentic Pipeline for Library of Congress Subject Indexing 2026-05-05T09:11:45Z

This paper presents a modular AI agentic skill pipeline for automating subject indexing with Library of Congress Subject Headings (LCSH). Subject indexing - the process of analyzing a work's aboutness, selecting controlled vocabulary terms, and encoding them as MARC21 subject access fields - is one of the most time-consuming components of library cataloging. The system decomposes this process into four discrete, sequentially executed agent skills: conceptual analysis, quantitative filtering, authority validation, and MARC field synthesis. Each skill encodes domain knowledge drawn directly from the Library of Congress Subject Headings Manual (SHM) instruction sheets and from subject analysis theory. The pipeline was evaluated against a corpus of ten titles whose existing subject headings were captured from the Harvard Library bibliographic dataset (a snapshot of their Alma ILS). Results demonstrate strong conceptual alignment with professional subject indexing practice, with notable differences in specificity, subdivision practice, and the agent's adherence to the 2026 LC policy discontinuing form subdivisions in favor of LCGFT 655 fields.

2026-05-05T09:11:45Z Eric H. C. Chow http://arxiv.org/abs/2512.17795v2 Intelligent Knowledge Mining Framework: Bridging AI Analysis and Trustworthy Preservation 2026-05-05T07:02:02Z

The unprecedented proliferation of digital data presents significant challenges in access, integration, and value creation across all data-intensive sectors. Valuable information is frequently encapsulated within disparate systems, unstructured documents, and heterogeneous formats, creating silos that impede efficient utilization and collaborative decision-making. This paper introduces the Intelligent Knowledge Mining Framework (IKMF), a comprehensive conceptual model designed to bridge the critical gap between dynamic AI-driven analysis and trustworthy long-term preservation. The framework proposes a dual-stream architecture: a horizontal Mining Process that systematically transforms raw data into semantically rich, machine-actionable knowledge, and a parallel Trustworthy Archiving Stream that ensures the integrity, provenance, and computational reproducibility of these assets. By defining a blueprint for this symbiotic relationship, the paper provides a foundational model for transforming static repositories into living ecosystems that facilitate the flow of actionable intelligence from producers to consumers. This paper outlines the motivation, problem statement, and key research questions guiding the research and development of the framework, presents the underlying scientific methodology, and details its conceptual design and modeling.

2025-12-19T17:01:03Z Binh Vu http://arxiv.org/abs/2605.02128v1 Liberata -- Graph Scientometrics for a Share Based System of Academic Publishing 2026-05-04T01:17:17Z

Contemporary scientometric indicators remain anchored in paradigms and axioms from when academic research was conducted in small scholarly communities. With the global proliferation of scientific research, academia is now organized in large communities with high rates of information incompleteness regarding work impact and individual contributions. This has significant implications for how research output is measured and quality controlled, especially as the rate of academic publishing continues to rise. Exploits of complex systems are typically found at discrete transition points where rules turn on or off, and academia is not immune to this pattern. Exploitative career boosting strategies are a growing problem, largely enabled by misaligned incentives and traditional metrics that force discretization of credit to authors and prior works despite their fundamentally continuous nature. This article introduces Liberata's scientometrics, a share based framework for academic publishing and quality control. In this system, authorship positions are replaced with contribution shares that sum to unity and encode both ordinality and relative contribution distances. These shares can be traded on Liberata's academic marketplaces for quality control services such as peer review and replication, rewarding contributors based on the long term success of the work. Citations are weighted to guard against frivolous referencing and credit inflation, and modular correction factors allow multiple measures of impact. Liberata's metrics are formalized through two fundamental graphs, Shares and References, from which the system constructs academic capital and derives scientometrics capturing impact, risk, collaboration, collusion, value of quality control, and diversification. These metrics represent academic contributions and extend naturally to institutions, regions, time periods, and research fields.

2026-05-04T01:17:17Z Han Zhang Anshuman Sabath Timothy W. Dunn L. Catherine Brinson http://arxiv.org/abs/2605.01941v1 HERITRACE: a domain-agnostic framework for SHACL-driven RDF curation with provenance and change tracking 2026-05-03T15:56:56Z

HERITRACE is an open-source web application that enables users without Semantic Web expertise to curate RDF data through form-based interfaces with automatic provenance documentation and change tracking in RDF. It uses SHACL for data model definition and form generation, connects to existing SPARQL-accessible stores without data migration, and records every modification as a provenance snapshot that can be browsed and restored. HERITRACE is domain-agnostic: adapting it to a new collection requires only SHACL shapes and YAML display rules, without code changes. This paper describes the software architecture and provides the first empirical evaluation. HERITRACE is deployed in production for the ParaText project, where classical philologists curate bibliographic data about ancient Greek exegetical traditions, and is planned as the editing interface for OpenCitations and as the curation layer for the Social Sciences and Humanities Citation Index within the GRAPHIA Horizon Europe project. Since it operates on any SPARQL-accessible store without data migration, its adoption potential extends to any domain maintaining RDF data. HERITRACE is publicly available on GitHub under the ISC license, archived on Zenodo and Software Heritage Archive, and documented for deployment with a pre-built Docker image.

2026-05-03T15:56:56Z 19 pages, 5 figures. Submitted to ISWC 2026 Arcangelo Massari Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy Silvio Peroni Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy http://arxiv.org/abs/2605.01337v1 Comparison of OpenAlex and Scopus coverage of German institutions' publications in top-tier journals 2026-05-02T09:21:24Z

OpenAlex has recently emerged as a leading alternative to proprietary bibliometric sources. However, concerns remain regarding the quality of its metadata, especially the institutional profiles which are crucial for evaluating organizations. This study assesses the quality of affiliation data in OpenAlex using German research institutions. Publications from top-tier journals were analyzed and institutional publication counts in OpenAlex were systematically compared with counts in Scopus. The results show that OpenAlex generally contains more publications at the journal level, reflecting its broader coverage. However, institutional publication counts in OpenAlex are consistently lower, indicating missing or incorrectly assigned affiliations. Nevertheless, the correlations between institutional outputs in both databases are very high, suggesting that relative institutional rankings remain stable. These findings suggest that OpenAlex is suitable for comparative institutional analyses in academic research but requires further improvement in affiliation metadata before it can be used for evaluation contexts that rely on absolute publication counts.

2026-05-02T09:21:24Z Andrey Lovakov Ivan Sterligov http://arxiv.org/abs/2504.20605v2 TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models 2026-05-02T05:59:29Z

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We present TF1-EN-3M, to our knowledge the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A fully reproducible evaluation pipeline employs a panel of open-weight LLM judges from distinct model families, scoring grammar, creativity, moral clarity, and template adherence, complemented by reference-free diversity and readability metrics. Among ten open-weight generator candidates, an 8B-parameter Llama-3 variant delivers the best quality-cost trade-off, producing high-scoring fables on consumer hardware at approximately $0.135 per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI -- demonstrating that large-scale moral storytelling requires neither proprietary giant models nor proprietary evaluation infrastructure.
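The six-slot scaffold described above can be illustrated with a minimal sketch of a combinatorial prompt builder. The slot values, the prompt wording, and the names `SLOTS` and `build_prompts` are hypothetical, chosen only for illustration; the abstract does not give the dataset's actual vocabularies or templates.

```python
from itertools import product

# Hypothetical slot vocabularies (assumption: the real lists are much larger
# and are not given in the abstract).
SLOTS = {
    "character": ["fox", "tortoise"],
    "trait": ["greedy", "patient"],
    "setting": ["forest"],
    "conflict": ["a drought"],
    "resolution": ["sharing water"],
    "moral": ["generosity is rewarded"],
}

# Hypothetical prompt template covering all six slots.
TEMPLATE = (
    "Write a short fable about a {trait} {character} in a {setting}. "
    "The conflict is {conflict}, resolved by {resolution}. "
    "End with the moral: {moral}."
)


def build_prompts(slots):
    """Yield one prompt per combination of slot values (Cartesian product)."""
    names = list(slots)
    for combo in product(*(slots[name] for name in names)):
        yield TEMPLATE.format(**dict(zip(names, combo)))


prompts = list(build_prompts(SLOTS))
for prompt in prompts:
    print(prompt)
```

With these toy vocabularies the number of prompts is the product of the slot sizes, which is how a small set of slot values can cover a broad thematic space.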

2025-04-29T10:15:28Z 18 pages, 6 tables, 1 figure. v2: revised evaluation with open-weight LLM judge panel, expanded citations Mihai Nadas Laura Diosan Andrei Piscoran Andreea Tomescu http://arxiv.org/abs/2602.23452v3 CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era 2026-05-01T20:05:28Z

Scientific research relies on citation integrity, yet large language models (LLMs) have introduced a critical risk: fabricated references that appear plausible but correspond to no real publications. As manual verification becomes infeasible and existing automated tools remain fragile, we introduce CiteAudit, a comprehensive benchmark and detection framework for hallucinated citations. We design a multi-agent verification pipeline that decomposes citation checking into metadata extraction, memory lookup, web-based retrieval, and final judgment. To evaluate this, we construct a large-scale, human-validated dataset spanning diverse domains and hallucination types. Experiments demonstrate that our framework achieves superior verification performance over state-of-the-art LLMs and commercial baselines. Our work provides the necessary infrastructure to audit citations at scale and safeguard the trustworthiness of scholarly discourse. Code is available at https://github.com/shiiiikw/CiteAudit.

2026-02-26T19:17:39Z We have further refined the benchmark construction and reference verification pipeline to improve clarity and consistency. The revised version includes updated results and additional details to better align the evaluation with the intended setup. These changes provide a more precise presentation of the experimental findings, with conclusions and contributions remaining unchanged Kaiwen Shi Weixiang Sun Zheyuan Zhang Lichao Sun Nitesh V. Chawla Yanfang Ye http://arxiv.org/abs/2604.28061v1 Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results 2026-04-30T16:07:43Z

Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices, even though it is more important to understand the 'downstream' effects or impacts of open science. PLOS and DataSeer have developed a new LLM-based indicator to measure an important effect of open science: the reuse of research data. Our results show a data reuse rate of 43%, which is higher than the rates found using established bibliometric techniques. We show that data reuse can be measured at scale using LLMs and generative artificial intelligence. The positive effects of research data sharing and reuse may currently be underestimated.

2026-04-30T16:07:43Z 12 pages. Submitted to 30th Annual International Conference on Science and Technology Indicators Lauren Cadwallader Iain Hrynaszkiewicz parth sarin Tim Vines http://arxiv.org/abs/2601.18616v2 Goals and Strategies for the Indexing of Publication Types and Study Designs 2026-04-30T12:23:38Z

Objectives. Major research and implementation efforts have been devoted to indexing articles according to the major topics discussed, but much less effort to indexing their publication types and study designs (collectively, PTs). In this Perspective, we discuss how indexing PTs differs from topical MeSH indexing and requires a different approach. Materials and Methods. Rather than focus on the technical aspects of machine learning-based indexing models, we emphasize the goals and purposes for which biomedical articles are indexed, and the surprisingly thorny question of how indexing systems should be evaluated. Results. Topical Medical Subject Heading (MeSH) terms are assigned to articles that cover the major topics discussed; when more than one term is applicable, only the most specific term is assigned. In contrast, PTs are assigned to articles that have a given structure or use a particular design. To meet the needs of end-users, particularly groups involved in evidence syntheses, PT indexing needs to be comprehensive and employ probabilistic goodness-of-fit prediction scores. Whereas existing NLM hierarchies place publication types and study design-related terms on separate trees from each other, we have created a unified hierarchy that permits more appropriate retrieval via automatic expansion. Discussion. Automated PT indexing systems should allow users to input article records or full-text PDFs and receive scores in real time. This will offer consistent indexing across bibliographic databases, as well as preprints and unpublished manuscripts. Conclusions. Automated PT indexing systems, properly designed and implemented, hold the promise of greatly improving the retrieval of biomedical articles, saving substantial effort when writing evidence syntheses and benefiting other users as well.

2026-01-26T15:56:05Z 14 pages, no figures, 1 table Neil R. Smalheiser Joe D. Menke Arthur W. Holt Halil Kilicoglu Jodi Schneider http://arxiv.org/abs/2604.27315v1 Cross-lingual Comparison of Research Funding Projects with Multilingual Sentence-BERT: Evidence from KAKENHI, NIH, NSF, and UKRI 2026-04-30T01:57:42Z

Cross-national comparison of research funding projects is increasingly important for science policy and strategic planning, but language differences remain a major obstacle. In particular, KAKENHI project descriptions are written primarily in Japanese, whereas projects from major overseas funding agencies, such as NSF, NIH, and UKRI, are documented in English. This study investigates whether multilingual sentence embeddings can support meaningful cross-lingual comparison of research funding projects, with particular attention to the semantic effects of translating Japanese texts into English. For each KAKENHI project, we construct two representations: the original Japanese text and its machine-translated English version, both embedded in a shared semantic space using a multilingual Sentence-BERT model. We then compare their distances and nearest-neighbor relationships with respect to projects from English-language funding agencies. The results show that the Japanese and translated English representations of the same KAKENHI project are, on average, located closer to one another than to native English projects, indicating substantial cross-lingual alignment. However, the overlap of nearest neighbors between the two representations is limited, averaging 2.9 out of 10. This suggests that multilingual embeddings capture semantic similarity across languages to a meaningful extent, while language differences and translation still affect the local structure of the embedding space. These findings suggest that multilingual embeddings provide a useful basis for large-scale exploratory comparison of funding projects across countries and agencies. At the same time, they offer an empirical reference for assessing semantic drift when Japanese research project data are translated into English for international analysis.
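The nearest-neighbor overlap reported above (on average 2.9 out of 10) can be sketched as follows. This is a minimal illustration with hand-made two-dimensional vectors standing in for sentence embeddings. The function names, the toy vectors, and the use of cosine similarity are assumptions, not details confirmed by the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, candidates, k):
    """Names of the k candidates most similar to the query vector."""
    ranked = sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)
    return set(ranked[:k])


def neighbor_overlap(query_a, query_b, candidates, k=10):
    """Number of shared items in the two top-k neighbor sets."""
    return len(top_k(query_a, candidates, k) & top_k(query_b, candidates, k))


# Toy stand-ins for English-language project embeddings (hypothetical values).
english_projects = {
    "p1": (1.0, 0.0),
    "p2": (0.9, 0.1),
    "p3": (0.0, 1.0),
    "p4": (0.1, 0.9),
}
japanese_vec = (1.0, 0.2)    # embedding of the original Japanese text (toy)
translated_vec = (0.5, 0.5)  # embedding of its English translation (toy)

overlap = neighbor_overlap(japanese_vec, translated_vec, english_projects, k=2)
print(overlap)
```

Averaging this count over all KAKENHI projects with k=10 would give a statistic of the same form as the one reported in the abstract.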

2026-04-30T01:57:42Z 8 pages, 4 figures Miki Kimura-Ida http://arxiv.org/abs/2604.26835v1 HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists 2026-04-29T16:01:42Z

We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies have transformed the academic writing process, including citation recommendation, they have also led to the emergence of hallucinated citations that do not correspond to any existing work. Such citations not only undermine the credibility of scientific papers but also impose an additional burden on reviewers and authors, who must manually verify their validity during the review process. In this study, we formalize hallucinated citation detection as an NLP task and provide a corresponding toolkit as a practical foundation for addressing this problem. Our package is lightweight and can perform verification in seconds on a standard laptop. It can also be executed entirely offline and runs efficiently using only CPUs. We hope that HalluCiteChecker will help reduce reviewer workload and support organizers by enabling systematic pre-review and publication checks. Our code is released under the Apache 2.0 license on GitHub and is distributed as an installable package via PyPI. A demonstration video is available on YouTube.

2026-04-29T16:01:42Z Work In Progress Yusuke Sakai Hidetaka Kamigaito Taro Watanabe http://arxiv.org/abs/2602.12537v3 News Harvesting from Google News combining Web Scraping, LLM Metadata Extraction and SCImago Media Rankings enrichment: a case study of IFMIF-DONES 2026-04-29T02:58:58Z

This study develops and evaluates a systematic methodology for constructing news datasets from Google News, combining automated web scraping, large language model (LLM)-based metadata extraction, and SCImago Media Rankings enrichment. Using the IFMIF-DONES fusion energy project as a case study, we implemented a five-stage data collection pipeline across 81 region-language combinations, yielding 1,482 validated records after a 56% noise reduction. Results are compared against two licensed press databases: MyNews (2,280 records) and ProQuest Newsstream Collection (148 records). Overlap analysis reveals high complementarity, with 76% of Google News records exclusive to this platform. The dataset captures content types absent from proprietary databases, including specialized outlets, institutional communications, and social media posts. However, significant methodological challenges emerge: temporal instability requiring synchronic collection, a 100-result cap per query demanding multi-stage strategies, and unexpected noise including academic PDFs, false positives, and pornographic content infiltrating results through black hat SEO techniques. LLM-assisted extraction proved effective for structured articles but exhibited systematic hallucination patterns requiring validation protocols. We conclude that Google News offers valuable complementary coverage for communication research but demands substantial methodological investment, multi-source triangulation, and robust filtering mechanisms to ensure dataset integrity.

2026-02-13T02:34:26Z 24 pages, 7 figures Victor Herrero-Solana http://arxiv.org/abs/2604.26236v1 Do E-Scooter Speed Governance Policies Reduce Harsh Acceleration and Deceleration? Evidence from 19.5 Million Trips Around a Regulatory Ban 2026-04-29T02:34:55Z

Do e-scooter speed governance policies yield behavioral safety gains beyond the mechanical cap they impose? A firmware ceiling mechanically prevents speeding, but whether the same riders also generate fewer harsh accelerations and harsh decelerations when the ungoverned mode is withdrawn remains open. We analyze 19.5 million GPS-instrumented trips from 52 South Korean cities (February to November 2023). Our two-stage predict-then-validate design targets two trip-level binary outcomes, any harsh-acceleration event and any harsh-deceleration event. In Phase I, we predict each outcome's within-user reduction under an ungoverned-to-governed substitution, using a rider-heterogeneous random-parameters binary logit on the pre-ban period. In Phase II, we validate these predictions using a difference-in-differences specification that exploits the operator's system-wide December 2023 removal of the ungoverned mode. The causal estimates confirm the Phase I predictions in sign and order of magnitude on both outcomes, are Bonferroni-significant, and satisfy a 3-month pre-ban parallel-trends test. A within-user composition check finds no behavioral offsetting, indicating that firmware removal of an ungoverned mode lowers both harsh-event margins through a purely mechanical channel. These results imply that speed governance policies can deliver measurable safety gains on unconstrained behavioral margins.
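For readers unfamiliar with the difference-in-differences idea used in the second phase, a minimal two-group, two-period sketch is given below. It is the textbook 2x2 estimator on group means, not the authors' specification. The group definitions and all data values are hypothetical.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Textbook 2x2 difference-in-differences on group means.

    Returns (change in treated group) minus (change in control group).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical trip-level indicators: 1 = trip had a harsh-acceleration event.
# Assumption: "treated" riders lost access to the ungoverned mode after the ban,
# while "control" riders were already riding in governed mode.
treated_pre = [1, 1, 0, 1]
treated_post = [0, 1, 0, 0]
control_pre = [0, 1, 0, 0]
control_post = [0, 0, 1, 0]

estimate = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(estimate)
```

A negative estimate means the harsh-event rate fell more in the treated group than in the control group. The estimator is only meaningful under a parallel-trends assumption, which is why the abstract reports a pre-ban parallel-trends test.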

2026-04-29T02:34:55Z Seongjin Choi Sunbin Yoo Sugie Lee http://arxiv.org/abs/2604.25665v1 LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation 2026-04-28T14:00:09Z

Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.

2026-04-28T14:00:09Z 15 pages, 3 figures, 5 tables Huyen Nguyen Haoxuan Zhang Yang Zhang Junhua Ding Haihua Chen http://arxiv.org/abs/2604.25597v1 Generating Synthetic Citation Networks with Communities 2026-04-28T13:03:21Z

Generating realistic synthetic citation, patent, or component dependency networks is essential for benchmarking community detection, graph visualisation, and network data mining algorithms. We present the first systematic comparison of generators of directed graphs that are nearly acyclic and have a ground-truth community structure. We evaluate 12 methods across 7 real citation networks and 26 metrics. We propose the practice of reversing directions of edges in static generators to break cycles and induce a citation-like flow, which significantly improves the performance of a degree-corrected Stochastic Block Model. Our novel methodological approach to evaluating community detection benchmarks distinguishes between endogenous and exogenous mesoscopic similarities, with the latter proving more important. This distinction reveals that high-parameter models suffer from overfitting: they memorise planted community statistics and consequently fail to produce realistic networks. Finally, we introduce the Citation Seeder (CS) algorithm, an iterative generator grounded in the Price-Pareto model of citation networks, with interpretable parameters and O(N+E) runtime. CS achieves competitive results against the best-performing baselines while using up to four orders of magnitude fewer parameters and providing a clean framework for explaining and predicting a network's future growth.
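The edge-reversal idea mentioned above can be illustrated with a small sketch. It assumes that node ids encode publication order (a higher id means a newer paper) and reorients every edge to point from the newer node to the older one. This orientation rule and the function names are assumptions made for illustration; the abstract does not state the exact rule the authors use.

```python
def orient_as_citations(edges):
    """Reorient edges so each points from the higher id (newer) to the lower id (older).

    Self-loops are dropped and duplicate edges are merged.
    """
    oriented = set()
    for u, v in edges:
        if u != v:
            oriented.add((max(u, v), min(u, v)))
    return sorted(oriented)


def is_acyclic(edges):
    """Check acyclicity with Kahn's algorithm (repeatedly remove zero in-degree nodes)."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    stack = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while stack:
        node = stack.pop()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                stack.append(nxt)
    return removed == len(nodes)


# Toy output of a static generator: contains the cycle 0 -> 1 -> 2 -> 0 and a self-loop.
raw_edges = [(0, 1), (1, 2), (2, 0), (3, 3)]
citation_edges = orient_as_citations(raw_edges)
print(citation_edges)
print(is_acyclic(raw_edges), is_acyclic(citation_edges))
```

Because every reoriented edge goes from a larger id to a strictly smaller one, no directed cycle can survive, which gives the citation-like "newer cites older" flow.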

2026-04-28T13:03:21Z Łukasz Brzozowski Marek Gagolewski Grzegorz Siudem