https://arxiv.org/api/IT71D6lb3H5lqOkeFg/DG1XR300
2026-06-14T10:25:06Z
606566015

http://arxiv.org/abs/2508.06004v1
When a Paper Has 1000 Authors: Rethinking Citation Metrics in the Era of LLMs
2025-08-08T04:18:26Z
Author-level citation metrics provide a practical, interpretable, and scalable signal of scholarly influence in a complex research ecosystem. They have been widely used as a proxy in hiring decisions. However, the past five years have seen the rapid emergence of large-scale publications in the field of large language models and foundation models, with papers featuring hundreds to thousands of co-authors and receiving tens of thousands of citations within months. For example, Gemini has 1361 authors and has been cited around 4600 times in 19 months. In such cases, traditional metrics, such as total citation count and the $h$-index, fail to meaningfully distinguish individual contributions. Therefore, we pose the following research question: How can one identify standout researchers among thousands of co-authors in large-scale LLM papers? This question is particularly important in scenarios such as academic hiring and funding decisions. In this paper, we introduce a novel citation metric designed to address this challenge by balancing contributions across large-scale and small-scale publications. We propose the SBCI index, analyze its theoretical properties, and evaluate its behavior on synthetic publication datasets. Our results demonstrate that the proposed metric provides a more robust and discriminative assessment of individual scholarly impact in the era of large-scale collaborations.
2025-08-08T04:18:26Z
Weihang Guo, Zhao Song, Jiahao Zhang

http://arxiv.org/abs/2508.04612v1
A Reproducible, Scalable Pipeline for Synthesizing Autoregressive Model Literature
2025-08-06T16:33:20Z
The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical.
We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories, filters them for relevance, extracts metadata, hyper-parameters and reported results, clusters topics, produces retrieval-augmented summaries and generates containerised scripts for re-running selected experiments. Quantitative evaluation on 50 manually-annotated papers shows F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification. Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers. Three case studies -- AWD-LSTM on WikiText-2, Transformer-XL on WikiText-103 and an autoregressive music model on the Lakh MIDI dataset -- confirm that the extracted settings support faithful reproduction, achieving test perplexities within 1--3% of the original reports.
2025-08-06T16:33:20Z
9 pages
Faruk Alpay, Bugra Kilictas, Hamdi Alakkad

http://arxiv.org/abs/2508.04213v1
A Hybrid AI Methodology for Generating Ontologies of Research Topics from Scientific Paper Corpora
2025-08-06T08:48:14Z
Taxonomies and ontologies of research topics (e.g., MeSH, UMLS, CSO, NLM) play a central role in providing the primary framework through which intelligent systems can explore and interpret the literature. However, these resources have traditionally been manually curated, a process that is time-consuming, prone to obsolescence, and limited in granularity. This paper presents Sci-OG, a semi-automated methodology for generating research topic ontologies, employing a multi-step approach: 1) Topic Discovery, extracting potential topics from research papers; 2) Relationship Classification, determining semantic relationships between topic pairs; and 3) Ontology Construction, refining and organizing topics into a structured ontology.
The relationship classification component, which constitutes the core of the system, integrates an encoder-based language model with features describing topic occurrence in the scientific literature. We evaluate this approach against a range of alternative solutions using a dataset of 21,649 manually annotated semantic triples. Our method achieves the highest F1 score (0.951), surpassing various competing approaches, including a fine-tuned SciBERT model and several LLM baselines, such as the fine-tuned GPT4-mini. Our work is corroborated by a use case which illustrates the practical application of our system to extend the CSO ontology in the area of cybersecurity. The presented solution is designed to improve the accessibility, organization, and analysis of scientific knowledge, thereby supporting advancements in AI-enabled literature management and research exploration.
2025-08-06T08:48:14Z
Alessia Pisu, Livio Pompianu, Francesco Osborne, Diego Reforgiato Recupero, Daniele Riboni, Angelo Salatino

http://arxiv.org/abs/2508.04024v1
Identity Theft in AI Conference Peer Review
2025-08-06T02:36:52Z
We discuss newly uncovered cases of identity theft in the scientific peer-review process within artificial intelligence (AI) research, with broader implications for other academic procedures. We detail how dishonest researchers exploit the peer-review system by creating fraudulent reviewer profiles to manipulate paper evaluations, leveraging weaknesses in reviewer recruitment workflows and identity verification processes. The findings highlight the critical need for stronger safeguards against identity theft in peer review and academia at large, and to this end, we also propose mitigating strategies.
2025-08-06T02:36:52Z
Nihar B. Shah, Melisa Bok, Xukun Liu, Andrew McCallum

http://arxiv.org/abs/2508.03962v1
Accelerating Scientific Discovery with Multi-Document Summarization of Impact-Ranked Papers
2025-08-05T22:56:09Z
The growing volume of scientific literature makes it challenging for scientists to move from a list of papers to a synthesized understanding of a topic. Because of the constant influx of new papers on a daily basis, even if a scientist identifies a promising set of papers, they still face the tedious task of individually reading through dozens of titles and abstracts to make sense of occasionally conflicting findings. To address this critical bottleneck in the research workflow, we introduce a summarization feature to BIP! Finder, a scholarly search engine that ranks literature based on distinct impact aspects like popularity and influence. Our approach enables users to generate two types of summaries from top-ranked search results: a concise summary for an instantaneous at-a-glance comprehension and a more comprehensive literature review-style summary for greater, better-organized comprehension. This ability dynamically leverages BIP! Finder's already existing impact-based ranking and filtering features to generate context-sensitive, synthesized narratives that can significantly accelerate literature discovery and comprehension.
2025-08-05T22:56:09Z
Paris Koloveas, Serafeim Chatzopoulos, Dionysis Diamantis, Christos Tryfonopoulos, Thanasis Vergoulis

http://arxiv.org/abs/2508.03828v1
MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources
2025-08-05T18:18:17Z
We introduce MegaWika 2, a large, multilingual dataset of Wikipedia articles with their citations and scraped web sources; articles are represented in a rich data structure, and scraped source texts are stored inline with precise character offsets of their citations in the article text.
MegaWika 2 is a major upgrade from the original MegaWika, spanning six times as many articles and twice as many fully scraped citations. Both MegaWika and MegaWika 2 support report generation research; whereas MegaWika also focused on supporting question answering and retrieval applications, MegaWika 2 is designed to support fact checking and analyses across time and language.
2025-08-05T18:18:17Z
Samuel Barham, Chandler May, Benjamin Van Durme

http://arxiv.org/abs/1802.06015v2
Interdisciplinarity Revealed by Transitive Reduction of Citation Networks
2025-08-04T13:28:14Z
We investigate the impact of transitive reduction on citation networks. Our hypothesis is that documents which lose fewer citations under transitive reduction are likely to be interdisciplinary, while a large loss of citations suggests a document is primarily cited within a single discipline. We test this hypothesis by using an artificial model of a citation network and by using data on citations from three sources: academic papers, court decisions and patents. Where needed, we applied modularity-based clustering techniques on a network defined using bibliographic coupling to classify documents by topic. A cluster-dependent measure was then used to classify the nodes as interdisciplinary or intradisciplinary. Our results provide strong support for our hypothesis in three of the four cases, with somewhat weaker but still positive support in the case of patents.
2018-02-16T16:32:07Z
New version completely reworked with new title and additional author. Twenty pages including appendices. Previous title was "Diversity from the Topology of Citation Networks"
H. AlMuhanna, V. Vasiliauskaite, T. S. Evans

http://arxiv.org/abs/2508.02379v1
USRN Discovery Pilot: Increasing the Discoverability of Open Access Content Through a National Network
2025-08-04T13:06:37Z
This paper presents the results of the USRN Discovery Pilot Project, a collaboration of SPARC, the Confederation of Open Access Repositories (COAR), CORE and Antleaf, to enhance the discoverability of research papers in US repositories leveraging CORE as an indexing service for USRN repositories. The project conducted actions in three strategic areas: assessing and quantitatively measuring discoverability and barriers to it at the beginning and end of the pilot project, conducting interventions to increase discoverability, and supporting interventions by technology and guidelines (provided by CORE services), to minimise effort and maximise effect. The key results of the project include: around three-quarters of a million research outputs held in the selected US repositories have been made discoverable (a 50% increase) compared to the year before; the project has made available the CORE Data Provider's Guide as well as a selection of new and improved tools to support repositories in increasing their discoverability. These include the CORE Reindexing Button and Index Notification modules, Fresh Finds and the USRN Desirable Characteristics for Digital Publication Repositories checking tool. The project team is now exploring ways to scale out this work to include more repositories.
2025-08-04T13:06:37Z
8 pages, Presented at the 20th International Conference on Open Repositories, June 15-18 2025, Chicago, Illinois, USA
Petr Knoth, Paul Walk, Matteo Cancellieri, Micheal Upshall, Halyna Torchylo, Jennifer Beamer, Kathleen Shearer, Heather Joseph

http://arxiv.org/abs/2508.02335v1
Interoperable verification and dissemination of software assets in repositories using COAR Notify
2025-08-04T12:13:26Z
The discoverability, attribution, and reusability of open research software are often hindered by its obscurity within academic manuscripts.
To address this, the SoFAIR project (2024-2025) introduces a comprehensive workflow leveraging machine learning tools for extracting software mentions from research papers. The project integrates repository systems, authors, and services like HAL and Software Heritage to ensure proper archiving, citation, and accessibility of research software in alignment with FAIR principles. To enable interoperable communication across the various systems, we present an integration of the COAR Notify Protocol, which facilitates automated, interoperable communication among repositories and authors to validate and disseminate software mentions. This paper outlines the SoFAIR workflow and the implementation of the COAR Notify Protocol, emphasising its potential to enhance the visibility and credibility of research software as first-class bibliographic records.
2025-08-04T12:13:26Z
8 pages. Presented at the 20th International Conference on Open Repositories, June 15-18 2025, Chicago, Illinois, USA
Matteo Cancellieri, Martin Docekal, David Pride, Morane Gruenpeter, David Douard, Petr Knoth

http://arxiv.org/abs/2508.02084v1
SSBD Ontology: A Two-Tier Approach for Interoperable Bioimaging Metadata
2025-08-04T05:51:55Z
Advanced bioimaging technologies have enabled the large-scale acquisition of multidimensional data, yet effective metadata management and interoperability remain significant challenges. To address these issues, we propose a new ontology-driven framework for the Systems Science of Biological Dynamics Database (SSBD) that adopts a two-tier architecture. The core layer provides a class-centric structure referencing existing biomedical ontologies, supporting both SSBD:repository -- which focuses on rapid dataset publication with minimal metadata -- and SSBD:database, which is enhanced with biological and imaging-related annotations.
Meanwhile, the instance layer represents actual imaging dataset information as Resource Description Framework individuals that are explicitly linked to the core classes. This layered approach aligns flexible instance data with robust ontological classes, enabling seamless integration and advanced semantic queries. By coupling flexibility with rigor, the SSBD Ontology promotes interoperability, data reuse, and the discovery of novel biological mechanisms. Moreover, our solution aligns with the Recommended Metadata for Biological Images guidelines and fosters compatibility. Ultimately, our approach contributes to establishing a Findable, Accessible, Interoperable, and Reusable data ecosystem within the bioimaging community.
2025-08-04T05:51:55Z
Accepted to the 24th International Semantic Web Conference Resource Track (ISWC 2025)
Yuki Yamagata, Koji Kyoda, Hiroya Itoga, Emi Fujisawa, Shuichi Onami

http://arxiv.org/abs/2506.15237v2
Dissecting the gender divide: Authorship and acknowledgment in scientific publications
2025-08-04T02:30:17Z
The issue of gender bias in scientific publications has been a topic of ongoing debate. One aspect of this debate concerns whether women receive equal credit for their contributions compared to men. Conventional wisdom suggests that women are more likely to be acknowledged than listed as co-authors. In this study, we analyze data from over 20,000 authors and 60,000 acknowledged individuals across nine disciplines in open-access journals. Our results confirm persistent gender disparities: women are more frequently acknowledged than credited as co-authors, especially in roles involving investigation and analysis. To account for status and disciplinary effects, we examined collaboration pairs composed of highly cited and less-cited scholars. In collaborations, highly cited scholars are more likely to be listed as an author regardless of gender. Notably, highly cited women in such pairs are even more likely to be co-authors than their male counterparts.
Our findings suggest that power dynamics and perceived success heavily influence the distribution of credit in scientific publishing. These results underscore the role of status dynamics in shaping authorship and call for a more nuanced understanding of how gender, power, and recognition interact in scientific publishing. Our findings offer valuable insights for scholars, editors, and funders committed to advancing equity in science.
2025-06-18T08:19:01Z
23 pages, 7 figures
Keigo Kusumegi, Daniel E. Acuña, Yukie Sano

http://arxiv.org/abs/2508.01882v1
A Global South Strategy for Evaluating Research Value with ChatGPT
2025-08-03T18:32:09Z
Research evaluation is important for appointments, promotions, departmental assessments, and national science strategy monitoring. Whilst Global North universities often have sufficient senior researchers for effective peer review and enough trust in citation data to use it for supporting indicators, the same is less likely to be true in the Global South. Moreover, Global South research priorities may not align well with citation-based indicators. This article introduces a ChatGPT-based strategy designed to address both limitations, applying it to Mauritius. The strategy involves giving ChatGPT instructions about how to evaluate the quality of research from the perspective of a given Global South nation and then using it to score articles based on these criteria. Results from Mauritius show that ChatGPT's scores for 1,566 journal articles published between 2015 and 2021 have an almost zero correlation with both ChatGPT research quality scores and citation rates. A word association thematic analysis of articles with relatively high scores for value to Mauritius identified a range of plausible themes, including education, policy relevance, and industrial production. Higher scoring articles also tended to mention the country or an important commercial sector in the abstract.
Whilst the evidence suggests that assessing the direct value to a country of journal articles using ChatGPT gives plausible results, this approach should be used cautiously because it has unknown accuracy and ignores the wider value of research contributions.
2025-08-03T18:32:09Z
Robin Nunkoo, Mike Thelwall

http://arxiv.org/abs/2508.02760v1
Towards a Manifesto for Cyber Humanities: Paradigms, Ethics, and Prospects
2025-08-03T17:33:24Z
The accelerated evolution of digital infrastructures and algorithmic systems is reshaping how the humanities engage with knowledge and culture. Rooted in the traditions of Digital Humanities and Digital Humanism, the concept of "Cyber Humanities" proposes a critical reconfiguration of humanistic inquiry for the post-digital era. This Manifesto introduces a flexible framework that integrates ethical design, sustainable digital practices, and participatory knowledge systems grounded in human-centered approaches. By means of a Decalogue of foundational principles, the Manifesto invites the scientific community to critically examine and reimagine the algorithmic infrastructures that influence culture, creativity, and collective memory.
Rather than being a simple extension of existing practices, "Cyber Humanities" should be understood as a foundational paradigm for humanistic inquiry in a computationally mediated world.
Keywords: Cyber Humanities, Digital Humanities, Transdisciplinary Epistemology, Algorithmic Reflexivity, Human-centered AI, Ethics-by-Design, Knowledge Ecosystems, Digital Sovereignty, Cognitive Infrastructures
2025-08-03T17:33:24Z
18 pages, 1 table, 48 references, to appear in: 1st. IEEE Int. Conf. on "Cyber Humanities"
Giovanni Adorni, Emanuele Bellini

http://arxiv.org/abs/2508.02740v1
Who Gets Cited? Gender- and Majority-Bias in LLM-Driven Reference Selection
2025-08-02T13:27:32Z
Large language models (LLMs) are rapidly being adopted as research assistants, particularly for literature review and reference recommendation, yet little is known about whether they introduce demographic bias into citation workflows. This study systematically investigates gender bias in LLM-driven reference selection using controlled experiments with pseudonymous author names. We evaluate several LLMs (GPT-4o, GPT-4o-mini, Claude Sonnet, and Claude Haiku) by varying gender composition within candidate reference pools and analyzing selection patterns across fields. Our results reveal two forms of bias: a persistent preference for male-authored references and a majority-group bias that favors whichever gender is more prevalent in the candidate pool. These biases are amplified in larger candidate pools and only modestly attenuated by prompt-based mitigation strategies. Field-level analysis indicates that bias magnitude varies across scientific domains, with social sciences showing the least bias. Our findings indicate that LLMs can reinforce or exacerbate existing gender imbalances in scholarly recognition.
Effective mitigation strategies are needed to avoid perpetuating existing gender disparities in scientific citation practices before integrating LLMs into high-stakes academic workflows.
2025-08-02T13:27:32Z
Jiangen He

http://arxiv.org/abs/2512.00001v1
Identifying and extracting Data Access Statements from full-text academic articles
2025-08-01T14:47:12Z
A Data Access Statement (DAS) is a formal declaration detailing how and where the underlying research data associated with a publication can be accessed. It promotes transparency, reproducibility, and compliance with funder and publisher data-sharing requirements. Funders such as Plan S, the European Union, UKRI, and NIH emphasise the inclusion of DAS in publications, underscoring its growing importance. While a DAS enhances research by increasing transparency, discoverability, and data quality while clarifying access protocols and elevating datasets as first-class research outputs, the repository community faces challenges in managing and curating DAS as a standard metadata component. Manual DAS curation remains labour-intensive and time-consuming, hindering efficient data-sharing practices. CORE has co-designed with the repository community a module that uses machine learning to identify and extract DAS from full-text articles. This tool facilitates the automated encoding, curation, and validation of DAS within metadata, reducing manual workload and improving metadata quality. This integration aligns with CORE's objective to enhance repository services by providing enriched metadata and supporting compliance with funder requirements. By streamlining DAS management and expanding metadata frameworks, CORE contributes to a more accessible and interconnected scholarly ecosystem, fostering data discoverability and reuse.
2025-08-01T14:47:12Z
Presented at the Open Repositories Conference 2025, Chicago, Illinois
David Pride, Matteo Cancellieri, Petr Knoth