https://arxiv.org/api/IT71D6lb3H5lqOkeFg/DG1XR300 2026-06-14T10:25:06Z 6065 660 15 http://arxiv.org/abs/2508.06004v1 When a Paper Has 1000 Authors: Rethinking Citation Metrics in the Era of LLMs 2025-08-08T04:18:26Z

Author-level citation metrics provide a practical, interpretable, and scalable signal of scholarly influence in a complex research ecosystem. They have been widely used as a proxy in hiring decisions. However, the past five years have seen the rapid emergence of large-scale publications in the field of large language models and foundation models, with papers featuring hundreds to thousands of co-authors and receiving tens of thousands of citations within months. For example, Gemini has 1361 authors and has been cited around 4600 times in 19 months. In such cases, traditional metrics, such as total citation count and the $h$-index, fail to meaningfully distinguish individual contributions. Therefore, we pose the following research question: How can one identify standout researchers among thousands of co-authors in large-scale LLM papers? This question is particularly important in scenarios such as academic hiring and funding decisions. In this paper, we introduce a novel citation metric designed to address this challenge by balancing contributions across large-scale and small-scale publications. We propose the SBCI index, analyze its theoretical properties, and evaluate its behavior on synthetic publication datasets. Our results demonstrate that the proposed metric provides a more robust and discriminative assessment of individual scholarly impact in the era of large-scale collaborations.
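
As an illustration of the traditional metrics named in the abstract above, here is a minimal Python sketch of total citation count and the $h$-index. The author records are hypothetical and invented for this example. The abstract does not define the SBCI index, so it is not reproduced here.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites < rank:
            break
        h = rank
    return h


# Hypothetical records (assumption): both authors share one 4600-citation
# paper and otherwise have the same citation counts, whatever each of them
# actually contributed to the shared paper.
lead_contributor = [4600, 12, 9, 7]
minor_contributor = [4600, 12, 9, 7]

for name, record in [("lead", lead_contributor), ("minor", minor_contributor)]:
    print(name, "total:", sum(record), "h-index:", h_index(record))
```

Both records get the same total and the same $h$-index, because neither metric takes into account the number of co-authors or an author's role in a paper.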

2025-08-08T04:18:26Z Weihang Guo Zhao Song Jiahao Zhang http://arxiv.org/abs/2508.04612v1 A Reproducible, Scalable Pipeline for Synthesizing Autoregressive Model Literature 2025-08-06T16:33:20Z

The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical. We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories, filters them for relevance, extracts metadata, hyper-parameters and reported results, clusters topics, produces retrieval-augmented summaries and generates containerised scripts for re-running selected experiments. Quantitative evaluation on 50 manually-annotated papers shows F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification. Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers. Three case studies -- AWD-LSTM on WikiText-2, Transformer-XL on WikiText-103 and an autoregressive music model on the Lakh MIDI dataset -- confirm that the extracted settings support faithful reproduction, achieving test perplexities within 1--3% of the original reports.

2025-08-06T16:33:20Z 9 pages Faruk Alpay Bugra Kilictas Hamdi Alakkad http://arxiv.org/abs/2508.04213v1 A Hybrid AI Methodology for Generating Ontologies of Research Topics from Scientific Paper Corpora 2025-08-06T08:48:14Z

Taxonomies and ontologies of research topics (e.g., MeSH, UMLS, CSO, NLM) play a central role in providing the primary framework through which intelligent systems can explore and interpret the literature. However, these resources have traditionally been manually curated, a process that is time-consuming, prone to obsolescence, and limited in granularity. This paper presents Sci-OG, a semi-automated methodology for generating research topic ontologies, employing a multi-step approach: 1) Topic Discovery, extracting potential topics from research papers; 2) Relationship Classification, determining semantic relationships between topic pairs; and 3) Ontology Construction, refining and organizing topics into a structured ontology. The relationship classification component, which constitutes the core of the system, integrates an encoder-based language model with features describing topic occurrence in the scientific literature. We evaluate this approach against a range of alternative solutions using a dataset of 21,649 manually annotated semantic triples. Our method achieves the highest F1 score (0.951), surpassing various competing approaches, including a fine-tuned SciBERT model and several LLM baselines, such as the fine-tuned GPT4-mini. Our work is corroborated by a use case which illustrates the practical application of our system to extend the CSO ontology in the area of cybersecurity. The presented solution is designed to improve the accessibility, organization, and analysis of scientific knowledge, thereby supporting advancements in AI-enabled literature management and research exploration.

2025-08-06T08:48:14Z Alessia Pisu Livio Pompianu Francesco Osborne Diego Reforgiato Recupero Daniele Riboni Angelo Salatino http://arxiv.org/abs/2508.04024v1 Identity Theft in AI Conference Peer Review 2025-08-06T02:36:52Z

We discuss newly uncovered cases of identity theft in the scientific peer-review process within artificial intelligence (AI) research, with broader implications for other academic procedures. We detail how dishonest researchers exploit the peer-review system by creating fraudulent reviewer profiles to manipulate paper evaluations, leveraging weaknesses in reviewer recruitment workflows and identity verification processes. The findings highlight the critical need for stronger safeguards against identity theft in peer review and academia at large, and to this end, we also propose mitigating strategies.

2025-08-06T02:36:52Z Nihar B. Shah Melisa Bok Xukun Liu Andrew McCallum http://arxiv.org/abs/2508.03962v1 Accelerating Scientific Discovery with Multi-Document Summarization of Impact-Ranked Papers 2025-08-05T22:56:09Z

The growing volume of scientific literature makes it challenging for scientists to move from a list of papers to a synthesized understanding of a topic. Because new papers arrive daily, even a scientist who has identified a promising set of papers still faces the tedious task of reading through dozens of titles and abstracts one by one to make sense of occasionally conflicting findings. To address this critical bottleneck in the research workflow, we introduce a summarization feature to BIP! Finder, a scholarly search engine that ranks literature based on distinct impact aspects like popularity and influence. Our approach enables users to generate two types of summaries from top-ranked search results: a concise summary for at-a-glance comprehension and a more comprehensive literature review-style summary for deeper, better-organized understanding. This capability dynamically leverages BIP! Finder's existing impact-based ranking and filtering features to generate context-sensitive, synthesized narratives that can significantly accelerate literature discovery and comprehension.

2025-08-05T22:56:09Z Paris Koloveas Serafeim Chatzopoulos Dionysis Diamantis Christos Tryfonopoulos Thanasis Vergoulis http://arxiv.org/abs/2508.03828v1 MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources 2025-08-05T18:18:17Z

We introduce MegaWika 2, a large, multilingual dataset of Wikipedia articles with their citations and scraped web sources; articles are represented in a rich data structure, and scraped source texts are stored inline with precise character offsets of their citations in the article text. MegaWika 2 is a major upgrade from the original MegaWika, spanning six times as many articles and twice as many fully scraped citations. Both MegaWika and MegaWika 2 support report generation research; whereas MegaWika also focused on supporting question answering and retrieval applications, MegaWika 2 is designed to support fact checking and analyses across time and language.

2025-08-05T18:18:17Z Samuel Barham Chandler May Benjamin Van Durme http://arxiv.org/abs/1802.06015v2 Interdisciplinarity Revealed by Transitive Reduction of Citation Networks 2025-08-04T13:28:14Z

We investigate the impact of transitive reduction on citation networks. Our hypothesis is that documents which lose fewer citations under transitive reduction are likely to be interdisciplinary, while a large loss of citations suggests a document is primarily cited within a single discipline. We test this hypothesis by using an artificial model of a citation network and by using data on citations from three sources: academic papers, court decisions and patents. Where needed, we applied modularity-based clustering techniques on a network defined using bibliographic coupling to classify documents by topic. A cluster-dependent measure was then used to classify the nodes as interdisciplinary or intradisciplinary. Our results provide strong support for our hypothesis in three of the four cases, with somewhat weaker but still positive support in the case of patents.
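
The central operation in the abstract above, transitive reduction of a citation network, can be sketched in a few lines of Python. This is a minimal illustration that assumes the network is acyclic (papers only cite earlier work). The toy graph and the helper names `reachable`, `transitive_reduction`, and `in_degree` are invented for this example. The abstract does not specify the paper's own cluster-dependent measure, so it is not implemented here.

```python
def reachable(graph, start):
    """Nodes reachable from `start` by following one or more citation edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def transitive_reduction(graph):
    """Drop every edge u -> v that is implied by a longer path from u to v.

    Assumes `graph` is a DAG mapping each document to the set it cites.
    """
    reduced = {}
    for u, cited in graph.items():
        redundant = set()
        for w in cited:
            # Anything u cites that is also reachable via w is redundant.
            redundant |= reachable(graph, w) & cited
        reduced[u] = set(cited) - redundant
    return reduced


def in_degree(graph):
    """Number of citations each document receives."""
    counts = {u: 0 for u in graph}
    for cited in graph.values():
        for v in cited:
            counts[v] = counts.get(v, 0) + 1
    return counts


# Toy network (hypothetical): D cites C, B, A; C cites B, A; B cites A.
graph = {"D": {"C", "B", "A"}, "C": {"B", "A"}, "B": {"A"}, "A": set()}
reduced = transitive_reduction(graph)
before, after = in_degree(graph), in_degree(reduced)
for doc in sorted(graph):
    print(doc, "citations before:", before[doc], "after:", after[doc])
```

In this toy chain, document A keeps only one of its three citations. Under the stated hypothesis, a large loss like this suggests a document cited mainly within a single line of work, while a document that keeps most of its citations is more likely to be interdisciplinary.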

2018-02-16T16:32:07Z New version completely reworked with new title and additional author. Twenty pages including appendices. Previous title was "Diversity from the Topology of Citation Networks" H. AlMuhanna V. Vasiliauskaite T. S. Evans http://arxiv.org/abs/2508.02379v1 USRN Discovery Pilot: Increasing the Discoverability of Open Access Content Through a National Network 2025-08-04T13:06:37Z

This paper presents the results of the USRN Discovery Pilot Project, a collaboration of SPARC, the Confederation of Open Access Repositories (COAR), CORE and Antleaf, to enhance the discoverability of research papers in US repositories by leveraging CORE as an indexing service for USRN repositories. The project conducted actions in three strategic areas: assessing and quantitatively measuring discoverability and barriers to it at the beginning and end of the pilot project; conducting interventions to increase discoverability; and supporting interventions with technology and guidelines (provided by CORE services) to minimise effort and maximise effect. The key results of the project are as follows. Around three-quarters of a million research outputs held in the selected US repositories have been made discoverable, a 50% increase compared to the year before. The project has also made available the CORE Data Provider's Guide as well as a selection of new and improved tools to support repositories in increasing their discoverability. These include the CORE Reindexing Button and Index Notification modules, Fresh Finds and the USRN Desirable Characteristics for Digital Publication Repositories checking tool. The project team is now exploring ways to scale out this work to include more repositories.

2025-08-04T13:06:37Z 8 pages, Presented at the 20th International Conference on Open Repositories, June 15-18 2025, Chicago, Illinois, USA Petr Knoth Paul Walk Matteo Cancellieri Micheal Upshall Halyna Torchylo Jennifer Beamer Kathleen Shearer Heather Joseph http://arxiv.org/abs/2508.02335v1 Interoperable verification and dissemination of software assets in repositories using COAR Notify 2025-08-04T12:13:26Z

The discoverability, attribution, and reusability of open research software are often hindered by its obscurity within academic manuscripts. To address this, the SoFAIR project (2024-2025) introduces a comprehensive workflow leveraging machine learning tools for extracting software mentions from research papers. The project integrates repository systems, authors, and services like HAL and Software Heritage to ensure proper archiving, citation, and accessibility of research software in alignment with FAIR principles. To enable interoperable communication across the various systems we present an integration of the COAR Notify Protocol, which facilitates automated, interoperable communication among repositories and authors to validate and disseminate software mentions. This paper outlines the SoFAIR workflow and the implementation of the COAR Notify Protocol, emphasising its potential to enhance the visibility and credibility of research software as first-class bibliographic records.

2025-08-04T12:13:26Z 8 pages. Presented at the 20th International Conference on Open Repositories, June 15-18 2025, Chicago, Illinois, USA Matteo Cancellieri Martin Docekal David Pride Morane Gruenpeter David Douard Petr Knoth http://arxiv.org/abs/2508.02084v1 SSBD Ontology: A Two-Tier Approach for Interoperable Bioimaging Metadata 2025-08-04T05:51:55Z

Advanced bioimaging technologies have enabled the large-scale acquisition of multidimensional data, yet effective metadata management and interoperability remain significant challenges. To address these issues, we propose a new ontology-driven framework for the Systems Science of Biological Dynamics Database (SSBD) that adopts a two-tier architecture. The core layer provides a class-centric structure referencing existing biomedical ontologies, supporting both SSBD:repository -- which focuses on rapid dataset publication with minimal metadata -- and SSBD:database, which is enhanced with biological and imaging-related annotations. Meanwhile, the instance layer represents actual imaging dataset information as Resource Description Framework individuals that are explicitly linked to the core classes. This layered approach aligns flexible instance data with robust ontological classes, enabling seamless integration and advanced semantic queries. By coupling flexibility with rigor, the SSBD Ontology promotes interoperability, data reuse, and the discovery of novel biological mechanisms. Moreover, our solution aligns with the Recommended Metadata for Biological Images guidelines and fosters compatibility. Ultimately, our approach contributes to establishing a Findable, Accessible, Interoperable, and Reusable data ecosystem within the bioimaging community.

2025-08-04T05:51:55Z Accepted to the 24th International Semantic Web Conference Resource Track (ISWC 2025) Yuki Yamagata Koji Kyoda Hiroya Itoga Emi Fujisawa Shuichi Onami http://arxiv.org/abs/2506.15237v2 Dissecting the gender divide: Authorship and acknowledgment in scientific publications 2025-08-04T02:30:17Z

The issue of gender bias in scientific publications has been a topic of ongoing debate. One aspect of this debate concerns whether women receive equal credit for their contributions compared to men. Conventional wisdom suggests that women are more likely to be acknowledged than listed as co-authors. In this study, we analyze data from over 20,000 authors and 60,000 acknowledged individuals across nine disciplines in open-access journals. Our results confirm persistent gender disparities: women are more frequently acknowledged than credited as co-authors, especially in roles involving investigation and analysis. To account for status and disciplinary effects, we examined collaboration pairs composed of highly cited and less-cited scholars. In collaborations, highly cited scholars are more likely to be listed as an author regardless of gender. Notably, highly cited women in such pairs are even more likely to be co-authors than their male counterparts. Our findings suggest that power dynamics and perceived success heavily influence the distribution of credit in scientific publishing. These results underscore the role of status dynamics in shaping authorship and call for a more nuanced understanding of how gender, power, and recognition interact in scientific publishing. Our findings offer valuable insights for scholars, editors, and funders committed to advancing equity in science.

2025-06-18T08:19:01Z 23 pages, 7 figures Keigo Kusumegi Daniel E. Acuña Yukie Sano http://arxiv.org/abs/2508.01882v1 A Global South Strategy for Evaluating Research Value with ChatGPT 2025-08-03T18:32:09Z

Research evaluation is important for appointments, promotions, departmental assessments, and national science strategy monitoring. Whilst Global North universities often have sufficient senior researchers for effective peer review and enough trust in citation data to use it for supporting indicators, the same is less likely to be true in the Global South. Moreover, Global South research priorities may not align well with citation-based indicators. This article introduces a ChatGPT-based strategy designed to address both limitations, applying it to Mauritius. The strategy involves giving ChatGPT instructions about how to evaluate the quality of research from the perspective of a given Global South nation and then using it to score articles based on these criteria. Results from Mauritius show that ChatGPT's scores for 1,566 journal articles published between 2015 and 2021 have an almost zero correlation with both ChatGPT research quality scores and citation rates. A word association thematic analysis of articles with relatively high scores for value to Mauritius identified a range of plausible themes, including education, policy relevance, and industrial production. Higher scoring articles also tended to mention the country or an important commercial sector in the abstract. Whilst the evidence suggests that assessing the direct value to a country of journal articles using ChatGPT gives plausible results, this approach should be used cautiously because it has unknown accuracy and ignores the wider value of research contributions.

2025-08-03T18:32:09Z Robin Nunkoo Mike Thelwall http://arxiv.org/abs/2508.02760v1 Towards a Manifesto for Cyber Humanities: Paradigms, Ethics, and Prospects 2025-08-03T17:33:24Z

The accelerated evolution of digital infrastructures and algorithmic systems is reshaping how the humanities engage with knowledge and culture. Rooted in the traditions of Digital Humanities and Digital Humanism, the concept of "Cyber Humanities" proposes a critical reconfiguration of humanistic inquiry for the post-digital era. This Manifesto introduces a flexible framework that integrates ethical design, sustainable digital practices, and participatory knowledge systems grounded in human-centered approaches. By means of a Decalogue of foundational principles, the Manifesto invites the scientific community to critically examine and reimagine the algorithmic infrastructures that influence culture, creativity, and collective memory. Rather than being a simple extension of existing practices, "Cyber Humanities" should be understood as a foundational paradigm for humanistic inquiry in a computationally mediated world. Keywords: Cyber Humanities, Digital Humanities, Transdisciplinary Epistemology, Algorithmic Reflexivity, Human-centered AI, Ethics-by-Design, Knowledge Ecosystems, Digital Sovereignty, Cognitive Infrastructures

2025-08-03T17:33:24Z 18 pages, 1 table, 48 references, to appear in: 1st. IEEE Int. Conf. on "Cyber Humanities" Giovanni Adorni Emanuele Bellini http://arxiv.org/abs/2508.02740v1 Who Gets Cited? Gender- and Majority-Bias in LLM-Driven Reference Selection 2025-08-02T13:27:32Z

Large language models (LLMs) are rapidly being adopted as research assistants, particularly for literature review and reference recommendation, yet little is known about whether they introduce demographic bias into citation workflows. This study systematically investigates gender bias in LLM-driven reference selection using controlled experiments with pseudonymous author names. We evaluate several LLMs (GPT-4o, GPT-4o-mini, Claude Sonnet, and Claude Haiku) by varying gender composition within candidate reference pools and analyzing selection patterns across fields. Our results reveal two forms of bias: a persistent preference for male-authored references and a majority-group bias that favors whichever gender is more prevalent in the candidate pool. These biases are amplified in larger candidate pools and only modestly attenuated by prompt-based mitigation strategies. Field-level analysis indicates that bias magnitude varies across scientific domains, with social sciences showing the least bias. Our findings indicate that LLMs can reinforce or exacerbate existing gender imbalances in scholarly recognition. Effective mitigation strategies are needed to avoid perpetuating existing gender disparities in scientific citation practices before integrating LLMs into high-stakes academic workflows.

2025-08-02T13:27:32Z Jiangen He http://arxiv.org/abs/2512.00001v1 Identifying and extracting Data Access Statements from full-text academic articles 2025-08-01T14:47:12Z

A Data Access Statement (DAS) is a formal declaration detailing how and where the underlying research data associated with a publication can be accessed. It promotes transparency, reproducibility, and compliance with funder and publisher data-sharing requirements. Funders such as Plan S, the European Union, UKRI, and NIH emphasise the inclusion of DAS in publications, underscoring its growing importance. While a DAS enhances research by increasing transparency, discoverability, and data quality while clarifying access protocols and elevating datasets as first-class research outputs, the repository community faces challenges in managing and curating DAS as a standard metadata component. Manual DAS curation remains labour-intensive and time-consuming, hindering efficient data-sharing practices. CORE has co-designed with the repository community a module that uses machine learning to identify and extract DAS from full-text articles. This tool facilitates the automated encoding, curation, and validation of DAS within metadata, reducing manual workload and improving metadata quality. This integration aligns with CORE's objective to enhance repository services by providing enriched metadata and supporting compliance with funder requirements. By streamlining DAS management and expanding metadata frameworks, CORE contributes to a more accessible and interconnected scholarly ecosystem, fostering data discoverability and reuse.

2025-08-01T14:47:12Z Presented at the Open Repositories Conference 2025, Chicago, Illinois David Pride Matteo Cancellieri Petr Knoth