http://arxiv.org/abs/2512.10165v3 BookReconciler: An Open-Source Tool for Metadata Enrichment and Work-Level Clustering 2026-05-25T06:33:53Z

We present BookReconciler, an open-source tool for enhancing and clustering book data. BookReconciler allows users to take spreadsheets with minimal metadata, such as book title and author, and automatically (1) add authoritative, persistent identifiers like ISBNs and (2) cluster related Expressions and Manifestations of the same Work, e.g., different translations or editions. This enhancement makes it easier to combine related collections and analyze books at scale. The tool is currently designed as an extension for OpenRefine -- a popular software application -- and connects to major bibliographic services including the Library of Congress, VIAF, OCLC, HathiTrust, Google Books, and Wikidata. Our approach prioritizes human judgment. Through an interactive interface, users can manually evaluate matches and define the contours of a Work (e.g., to include translations or not). We evaluate reconciliation performance on datasets of U.S. prize-winning books and contemporary world fiction. BookReconciler achieves near-perfect accuracy for U.S. works but lower performance for global texts, reflecting structural weaknesses in bibliographic infrastructures for non-English and global literature. Overall, BookReconciler supports the reuse of bibliographic data across domains and applications, contributing to ongoing work in digital libraries and digital humanities.

2025-12-10T23:51:55Z Published in the proceedings of the Joint Conference on Digital Libraries (JCDL) 2025, Resources Joint Conference on Digital Libraries (JCDL), 2025 Matt Miller Dan Sinykin Melanie Walsh http://arxiv.org/abs/2606.07554v1 RetraLytix: An Integrated Analytics Dashboard for Mapping Global Trends in Scientific Retractions 2026-05-25T02:55:03Z

A retraction corrects the scientific literature when a published work contains a major flaw, fraud, or a breach of ethical practice. As research output has grown, the number of retracted studies has also increased, raising concerns about research ethics and transparency. Moreover, retraction data are spread across several platforms and databases, which limits how well retraction trends can be tracked over time. To address this, we propose RetraLytix, a web-based integrated platform for easy analysis of distributed retraction data. It automatically integrates retraction data from major databases such as Crossref, Retraction Watch, and OpenAlex and visualizes them in an interactive, centralized platform. It offers a real-time dashboard, comparative analysis, and benchmarking of entities such as countries, institutions, authors, journals, and main research areas. RetraLytix helps users detect trends and retraction patterns and assess the research environment to make data-driven decisions. The system has the potential to become a research integrity tracking and governance tool for researchers, administrators, and policymakers.

2026-05-25T02:55:03Z Chahat Singh Sejal Gupta Krishna Mundra Kiran Sharma http://arxiv.org/abs/2604.08501v2 sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing 2026-05-24T17:12:44Z

Scientific papers make claims about prior work backed by citations. Verifying those citations at scale (that each cited paper exists, says what the citation claims, and is itself reliable) is structurally beyond what human review can deliver: a typical paper has dozens of citations, and a careful reviewer reads at most a handful end-to-end. AI-assisted writing makes this gap even more urgent: LLMs hallucinate references and may fill in plausible details from titles or abstracts of papers they never read, a problem that is worse for the smaller local-weights models that privacy-aware researchers must use. sciwrite-lint applies the linting paradigm from software engineering to citation verification: it runs entirely on the researcher's machine (free public databases, a single consumer GPU, and open-weights models), is fast enough to re-lint between revisions so authors catch problems at the source while drafting, and serves journals and reviewers as an automated first pass. The pipeline checks reference existence, metadata accuracy, retraction status, and claim support, traverses one level into cited papers' bibliographies, and produces per-reference reliability scores. We evaluate on 30 unseen papers (arXiv and bioRxiv) with error injection and LLM-adjudicated false-positive analysis. The same linting workflow extends to internal consistency: numbers in text vs. tables, abstract vs. body, figure captions vs. content, statistical results vs. their verbal interpretation, plus structural cross-references (dangling cites, orphan references). As a separate experimental contribution we also propose SciLint Score: citation-chain integrity combined with a contribution component operationalizing five philosophy-of-science frameworks (Popper, Lakatos, Kitcher, Laudan, Mayo).

2026-04-09T17:46:44Z Code: https://github.com/authentic-research-partners/sciwrite-lint Sergey V Samsonau http://arxiv.org/abs/2605.25172v1 Rejoinder: The ICML 2023 Ranking Experiment: Examining Author Self-Assessment in ML/AI Peer Review 2026-05-24T17:06:47Z

This article is the rejoinder to "The ICML 2023 Ranking Experiment: Examining Author Self-Assessment in ML/AI Peer Review," to appear in the Journal of the American Statistical Association with discussion. To address the practical and theoretical points raised by the discussants, we organize our response around four core themes: (i) formulating peer review as a statistical estimation problem; (ii) mitigating equity and strategic concerns in the deployment of the Isotonic Mechanism; (iii) incorporating complementary signals such as reviewer rankings and structured metadata; and (iv) exploring a human-centered framework for peer review in the era of generative AI.

2026-05-24T17:06:47Z Rejoinder to the JASA Discussion of "The ICML 2023 Ranking Experiment: Examining Author Self-Assessment in ML/AI Peer Review" (arXiv:2408.13430) Buxin Su Jiayao Zhang Natalie Collina Yuling Yan Didong Li Kyunghyun Cho Jianqing Fan Aaron Roth Weijie Su http://arxiv.org/abs/2605.24180v1 Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment 2026-05-22T20:06:17Z

Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, and unequally distributed. Here we test whether large language models (LLMs) can contribute to this hidden but vital practice and reallocate scientific feedback, an essential yet scarce resource for knowledge production. In a global large-scale randomized field experiment, we delivered customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions. Relative to controls, authors who received feedback had a significantly higher likelihood of revising their manuscripts, corresponding to a 12.55% relative increase over the baseline revision rate. Exposure to AI feedback also increased authors' subsequent use of LLM tools in their future papers, suggesting longer-run shifts in scientific practice. These effects were strongest among authors from non-English-dominant research regions, manuscripts less embedded in the scholarly literature, and teams with lower h-indexes and earlier career stages, consistent with the idea that AI feedback may provide the greatest benefit where access to timely critique is otherwise limited. Together, these findings provide causal evidence that structured AI-based interventions can transform access to scientific feedback from a largely private advantage into a more widely distributed resource, with broader implications for productivity, equity, and capacity across the global research system.

2026-05-22T20:06:17Z Binglu Wang Weixin Liang Jiahui Xue Yuhui Zhang Hancheng Cao Dashun Wang Yian Yin http://arxiv.org/abs/2605.23586v1 Tracking a Decade of Research at the University of Nigeria, Nsukka: A Scientometric Analysis (2014-2023) 2026-05-22T12:57:08Z

This study employs scientometric methods to assess the research output and performance of the University of Nigeria from 2014 to 2023. By analyzing publication trends, citation patterns, and collaboration networks, the research aims to comprehensively evaluate the university's research productivity, impact, and disciplinary focus. These research endeavors are characterized by innovation, interdisciplinary collaboration, and commitment to excellence, making the University of Nigeria a significant hub for cutting-edge research in Nigeria and beyond. The present study has been undertaken to determine the impact of the university's research and publication trends from 2014 to 2023. The study focuses on year-wise research output, citation impact at local and global levels, prominent authors and their total output, top journals, collaborating countries, and the most contributing departments of the University of Nigeria. The university's ten years of publication data indicate that 6,353 papers were published from 2014 to 2023, receiving 86,202 citations with an h-index of 39. In addition, a visual mapping of the data is presented through graphs produced with the VOSviewer software. The findings of this study will contribute to understanding the university's research strengths, weaknesses, and potential areas for improvement. Additionally, the results will inform evidence-based decision-making for enhancing research strategies and policies at the University of Nigeria.
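As a reading aid for the figures quoted above, the sketch below shows the standard definition of the h-index (the largest h such that h papers have at least h citations each) and the mean citations per paper implied by the reported totals. The function name and the sample citation list are hypothetical; the abstract does not describe how the authors computed their indicators.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h


# Totals reported in the abstract.
PAPERS = 6_353
CITATIONS = 86_202
mean_citations_per_paper = CITATIONS / PAPERS

# Hypothetical citation counts, used only to illustrate the definition.
sample = [10, 8, 5, 4, 3]
print(h_index(sample))
print(round(mean_citations_per_paper, 2))
```

The mean is derived from the abstract's totals only; the h-index of 39 cannot be recomputed without the per-paper citation counts.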

2026-05-22T12:57:08Z 16 pages, 4 figures, Research Article The University of Arusha Academic Journal (UoAAJ); Volume 4 Issue 2; 2026 Muneer Ahmad Joseph U Igligli http://arxiv.org/abs/2604.27580v3 Thinking like a business: Reconfiguring relationships to sustain open data infrastructures 2026-05-22T07:29:05Z

Sustaining open data infrastructures over time is a complex puzzle, involving dynamic funding models and relationships with customers, collaborators, and competitors. Despite their importance, these mechanisms are often hidden from view, limiting their applicability to other infrastructures. In this article, we examine how Dryad, a well-known open data infrastructure, has worked toward financial sustainability by reconfiguring relationships with other actors and by strategically implementing a new business model and process of assetization. We identify four types of relationship reconfigurations with customers, collaborators, and competitors critical to Dryad's financial evolution: reinforcing, forging, positioning, and excluding. We then analyze how Dryad's strategic efforts to develop a new fee structure have changed its interpretations of value(s), community, and governance, factors important in an infrastructure's longevity. We conclude by highlighting emerging tensions that provide insight for other open infrastructures working to become financially sustainable. As a whole, our analysis focuses not just on financial mechanisms for funding open data infrastructures (although those emerge) but on the relationships which enable them.

2026-04-30T08:31:26Z Kathleen Gregory Dorothea Strecker http://arxiv.org/abs/2605.21673v1 From Licensing to Open Access: Designing a Sustainable Transition in Operational Weather Data 2026-05-20T19:30:02Z

This translational article documents the transition of the European Centre for Medium-Range Weather Forecasts (ECMWF) from a restricted data licensing model to open access under CC BY 4.0, completed in October 2025. The policy context included EU open data requirements and alignment with international data exchange frameworks. The transition was implemented through a tiered service model that kept core forecast data open while offering operationally supported delivery as a cost-recovered service. Between 2020 and 2025, ECMWF executed an iterative planning cycle: setting an annual target for revenue reduction, specifying additions to the open tier under that target, provisioning infrastructure, and assessing outcomes to update assumptions. Drawing on internal administrative records (2014-2025), we describe design choices, operational constraints, and early outcomes. In the six months following the end of the transition, more than 93% of previously paying organisations retained a Service Agreement, while open endpoint download volumes increased substantially. We discuss trade-offs in defining the open tier (resolution, parameters, schedule), the reduction of compliance overheads formerly associated with redistribution restrictions, and the scalability implications of global distribution. We note an emerging sustainability question as AI-based forecast products become freely available. The early evidence is consistent with the view that a tiered service model can be designed to reconcile open-access obligations with operational sustainability, subject to monitoring over longer contract renewal cycles (typically annual).

2026-05-20T19:30:02Z Emma Pidduck Umberto Modigliani Victoria L. Bennett Fabio Venuti Florian Pappenberger Florence Rabier http://arxiv.org/abs/2605.17534v2 The Curious Case of Max Planck retracted papers. When past scientific practices meet contemporary publishing norms 2026-05-20T19:20:18Z

This article examines the case of two papers published in Naturwissenschaften by the physicist Max Planck that were retrospectively marked as retracted on Springer's digital platform. Rather than originating in scientific fraud, these withdrawals appear to result from contemporary digitization and copyright-management procedures applied anachronistically to historical publications. Through an investigation of the circulation history of Planck's 1940 and 1942 philosophical essays, the article shows that republication across multiple formats was a common and legitimate practice within the scientific publishing culture of the early 20th century. Such practices only became problematic with the later transformation of the scientific article into a countable and proprietary unit within systems of bibliometric evaluation and commercial academic publishing. This article argues that contemporary notions such as duplicate publication and self-plagiarism are historically situated categories that cannot be applied retrospectively without distorting the historical record. More broadly, the Planck case reveals how digital scholarly infrastructures controlled by large commercial publishers can limit the accessibility of the scientific past. Ironically, the original papers remain accessible today through the nonprofit digital platform Internet Archive rather than through the publisher that originally issued the journal.

2026-05-17T16:38:32Z 12 pages, 2 figures Yves Gingras Mahdi Khelfaoui http://arxiv.org/abs/2603.28103v2 Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models 2026-05-20T13:17:12Z

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.

2026-03-30T07:06:49Z to be published in: ParlaCLARIN V: Interoperability, Multilinguality, and Multimodality in Parliamentary Corpora, organized within the 15th Language Resource and Evaluation Conference (2026) Luigi Curini Alfio Ferrara Giovanni Pagano Sergio Picascia http://arxiv.org/abs/2605.17657v2 General Science Ranking (GSR): An Open-Source, Citation-Normalized Journal and Conference Classification System for Computer Science and Medicine 2026-05-19T20:00:00Z

The academic journal zoning system is central to evaluating research talent, funding, and institutions. The CAS journal partition system, one of East Asia's most widely used tools, will cease operation in March 2026, creating a policy gap. Existing alternatives have major limitations: JCR depends on paid databases and excludes conferences; Scimago/CiteScore relies on Elsevier proprietary data; expert-based rankings such as CCF and CORE lack quantitative foundations and update slowly. This paper proposes the General Science Ranking (GSR), a multidimensional bibliometric framework built entirely on open-source data. GSR covers 500 computer science venues (397 journals and 103 conferences) and 500 medical journals using OpenAlex and Semantic Scholar. Scores combine four indicators: field-weighted citation impact (FWCI), two-year impact factor (IF2), five-year h-index (h5), and citation CAGR. For CS conferences lacking citation time-series data, IF2-approx was estimated from calibration on 1.41 million OpenAlex journal papers. Rankings adopt fixed quotas: Q1 (1-50), Q2 (51-100), Q3 (101-200), and Q4 (201+). All code and data are open source. In CS rankings, conferences and journals each occupy 25 of the top 50 Q1 positions. The leading conferences are NeurIPS, ICCV, ICLR, and CVPR. In medicine, CA: A Cancer Journal for Clinicians ranks first, followed by New England Journal of Medicine and The Lancet. Agreement with JCR Q1 reaches 84 percent in medicine and 71 percent in CS. Sensitivity analysis shows only 1.7 percent to 2.5 percent of CS conferences change partitions, indicating robustness. GSR provides a free, reproducible, field-normalized ranking system covering both journals and conferences, making it suitable for institutional evaluation policies.
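The fixed quotas stated above (Q1: ranks 1-50, Q2: 51-100, Q3: 101-200, Q4: 201+) can be sketched as a simple rank-to-partition mapping. The second function shows the conventional compound annual growth rate formula; whether GSR computes its citation CAGR indicator exactly this way is an assumption, and both function names are hypothetical rather than taken from the GSR codebase.

```python
def partition(rank: int) -> str:
    """Map a 1-based rank to the fixed-quota partition quoted in the abstract."""
    if rank < 1:
        raise ValueError("rank must be a positive 1-based integer")
    if rank <= 50:
        return "Q1"
    if rank <= 100:
        return "Q2"
    if rank <= 200:
        return "Q3"
    return "Q4"


def citation_cagr(first_year: float, last_year: float, years: int) -> float:
    """Conventional CAGR: (last / first) ** (1 / years) - 1.

    Assumption: this is the textbook formula, not necessarily GSR's exact one.
    """
    if first_year <= 0 or years <= 0:
        raise ValueError("first_year and years must be positive")
    return (last_year / first_year) ** (1 / years) - 1


for rank in (1, 50, 51, 100, 101, 200, 201, 500):
    print(rank, partition(rank))
```

Because the quotas are absolute rank cut-offs rather than percentiles, a list of 500 venues places 50 in Q1, 50 in Q2, 100 in Q3, and the remaining 300 in Q4.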

2026-05-17T21:19:00Z Zhikai Yu http://arxiv.org/abs/2605.20168v1 One in Eight OpenAlex Abstracts Has Integrity Issues 2026-05-19T17:53:13Z

Scientific abstracts are increasingly used as primary data in computational metascience research, yet the quality of these abstracts in widely used bibliographic databases has not been systematically examined. We assess the integrity of 10,000 randomly sampled English-language journal abstracts from OpenAlex using a two-stage annotation protocol combining human expert review and large language model classification. We identify seven distinct failure modes and find that 12% of abstracts have integrity issues, with insufficient content and misplaced metadata being the most prevalent. We discuss implications for downstream research and describe a forthcoming community portal to support collective annotation efforts.

2026-05-19T17:53:13Z 10 pages, 5 figures Seorin Kim Vincent Holst Vincent Ginis http://arxiv.org/abs/2511.11010v3 GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs 2026-05-18T18:39:36Z

Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) - to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as "redacted documents" or "pie charts." We detail the constituent components of GovScape, including the search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. Accordingly, we outline steps that we have already begun pursuing toward multimodal search at the 100+ million PDF scale. GovScape can be found at https://www.govscape.net.
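The cost figure above can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the abstract; the compute cost is stated as approximate, so the derived values are approximate as well, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
TOTAL_PDFS = 10_015_993
TOTAL_PAGES = 70_958_487
ESTIMATED_COMPUTE_USD = 1_500  # "approximately $1,500"

pages_per_dollar = TOTAL_PAGES / ESTIMATED_COMPUTE_USD
mean_pages_per_pdf = TOTAL_PAGES / TOTAL_PDFS
usd_per_million_pages = ESTIMATED_COMPUTE_USD / (TOTAL_PAGES / 1_000_000)

print(f"{pages_per_dollar:,.0f} pages per dollar")
print(f"{mean_pages_per_pdf:.2f} pages per PDF on average")
print(f"${usd_per_million_pages:.2f} per million pages")
```

The first value rounds to the roughly 47,000 pages per dollar reported in the abstract.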

2025-11-14T06:54:48Z 11 pages, 4 figures, 4 tables Ying-Hsiang Huang Claire Gong Shreya Shaji Alison Yan Leslie Harka Albert Du Anjali Gopal Samuel J Klein Shannon Zejiang Shen Mark Phillips Trevor Owens Kyle Deeds Benjamin Charles Germain Lee http://arxiv.org/abs/2605.18752v1 Traditional statistical representations outperform generative AI in identifying expert peer reviewers 2026-05-18T17:59:45Z

The exponential growth of scientific submissions has strained the peer review system. Despite the rapidly expanding global pool of researchers, this unprecedented scale has rendered the previous approach of manual expert identification unfeasible. Therefore, institutions have naturally turned to Large Language Models (LLMs) to automate intricate processes like expert reviewer identification. However, the reliability of these new models in accurately identifying domain experts lacks rigorous evaluation. We conduct a comprehensive empirical evaluation of statistical and AI-driven expertise identification methodologies to benchmark their reliability and limitations. Framing expert identification as an information retrieval problem, we utilize the distributed peer review system of a major international astronomical observatory, where proposal authorship serves as our proxy ground truth for domain expertise. Evaluating six retrieval methodologies utilized across observatories and computer science conferences, we demonstrate that traditional statistical representations outperform generative AI. Specifically, Term Frequency-Inverse Document Frequency successfully identified a labeled expert within the top 25 recommendations 79.5% of the time, compared to 51.5% for GPT-4o mini. Our results highlight that distinguishing subfield expertise requires fine-grained vocabulary, which is obscured by the semantic smoothing in generative methods. By establishing a rigorous evaluation framework for automated peer review, we demonstrate that transparent and reproducible statistical representations still outperform computationally expensive LLMs in specialized scientific tasks.
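To make the retrieval framing concrete, here is a minimal sketch of TF-IDF ranking with a top-k hit check, using only the Python standard library. It is an illustration under stated assumptions, not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the cosine scoring, and all names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def build_idf(token_lists):
    """Smoothed inverse document frequency computed over reviewer profiles."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the profiles are dropped."""
    return {t: count * idf[t] for t, count in Counter(tokens).items() if t in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_reviewers(proposal, profiles, k=25):
    """Rank reviewers by cosine similarity between proposal and profile text."""
    names = list(profiles)
    token_lists = [tokenize(profiles[name]) for name in names]
    idf = build_idf(token_lists)
    query = vectorize(tokenize(proposal), idf)
    scored = [(cosine(query, vectorize(tokens, idf)), name)
              for name, tokens in zip(names, token_lists)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by name
    return [name for _, name in scored[:k]]


def hit_at_k(ranked, labeled_experts):
    """True if at least one labeled expert appears in the ranked list."""
    return any(name in labeled_experts for name in ranked)


# Hypothetical toy data, for illustration only.
profiles = {
    "reviewer_a": "stellar population synthesis of dwarf galaxies",
    "reviewer_b": "exoplanet transit spectroscopy of hot jupiters",
    "reviewer_c": "binary star evolution and supernova progenitors",
}
proposal = "transit spectroscopy of an exoplanet atmosphere"
ranking = top_k_reviewers(proposal, profiles, k=3)
print(ranking, hit_at_k(ranking[:1], {"reviewer_b"}))
```

Averaging `hit_at_k` over many proposals with k=25 yields a figure of the same kind as the top-25 rates quoted in the abstract, though the paper's actual evaluation protocol may differ.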

2026-05-18T17:59:45Z Vicente Amado Olivo Tereza Jerabkova Jakub Klencki John Carpenter Mario Malički Ferdinando Patat Louis-Gregory Strolger Wolfgang Kerzendorf http://arxiv.org/abs/2605.18715v1 Global training and the collaborative structure of elite U.S. science 2026-05-18T17:46:53Z

Globally trained scientific labor is a substantial component of U.S. universities, yet the organizational mechanisms linking foreign degree training to elite scientific output remain poorly understood. We link comprehensive U.S. faculty rosters to more than 12 million OpenAlex-indexed faculty-publication observations from 2011 to 2020. Faculty with non-U.S. degrees constitute one-tenth of the U.S. professoriate but account for larger shares of total publications and top-1% cited papers. This overrepresentation is concentrated in high-output disciplinary domains and research-intensive institutions. Within institution-domain-rank-year strata, however, differences in top-1% output, FWCI, and corresponding-author share attenuate sharply, indicating that much of the aggregate pattern reflects organizational placement rather than large within-context citation advantages. Collaboration structure further differentiates foreign- and domestically trained faculty: mixed domestic-foreign faculty teams exhibit substantially elevated elite-output rates, and the association attenuates strongly after accounting for team size, suggesting that collaboration scale is central to the pattern. Topic-distinctiveness analyses show little evidence that foreign-degree faculty occupy unusually rare research niches. Overall, foreign-degree training is best understood less as an individual productivity attribute than as a structural feature of elite U.S. science, operating through institutional concentration and collaborative integration.

2026-05-18T17:46:53Z Erjia Yan Chaoqun Ni Xiang Zheng