http://arxiv.org/abs/2506.07547v2 From Rapid Release to Reinforced Elite: Citation Inequality Is Stronger in Preprints than Journals 2025-06-11T12:47:21Z

Preprints have been considered primarily as a supplement to journal-based systems for the rapid dissemination of relevant scientific knowledge and have historically been supported by studies indicating that preprints and published reports have comparable authorship, references, and quality. However, as preprints increasingly serve as an independent medium for scholarly communication rather than precursors to the version of record, it remains uncertain how preprint usage is shaping scientific discourse. Our research revealed that preprint citations exhibit significantly higher inequality than journal citations, consistently across categories. This trend persisted even when controlling for age and the mean citation count of the journal matched to each of the preprint categories. We also found that the citation inequality in preprints is not solely driven by a few highly cited papers or those with no impact, but rather reflects a broader systemic effect. Whether or not the preprint is subsequently published in a journal does not significantly affect the citation inequality. Further analyses of the structural factors show that preferential attachment does not significantly contribute to citation inequality in preprints, whereas author prestige plays a substantial role. Notably, the gap in citation inequality between the preprint category and the journal is more pronounced in fields where preprints are more established, such as mathematics, physics, and high-energy physics. This highlights a potential vulnerability in preprint ecosystems where reputation-driven citation may hinder scientific diversity.
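
Illustrative note: the abstract above does not say which inequality measure the authors use. As a minimal sketch, assuming the widely used Gini coefficient as a stand-in, citation inequality over a set of papers could be computed as follows. The function name `gini` and the sample counts are hypothetical and are not taken from the paper.

```python
def gini(citations):
    """Gini coefficient of non-negative citation counts.

    Returns 0.0 when every paper has the same count and approaches 1.0
    when a few papers collect almost all citations.
    NOTE: an assumed, generic measure; the paper's own metric may differ.
    """
    values = sorted(citations)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on ascending data:
    #   G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


if __name__ == "__main__":
    # Hypothetical citation counts for two small sets of papers.
    print(gini([5, 5, 5, 5]))    # perfectly equal
    print(gini([0, 0, 0, 20]))   # one paper holds every citation
```

Comparing the value for a preprint category with the value for its matched journal gives a single number per pair, which is the kind of comparison the abstract describes.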

2025-06-09T08:38:17Z Chiaki Miura Ichiro Sakata http://arxiv.org/abs/2409.04432v3 A Survey on Knowledge Organization Systems of Research Fields: Resources and Challenges 2025-06-11T09:15:33Z

Knowledge Organization Systems (KOSs), such as term lists, thesauri, taxonomies, and ontologies, play a fundamental role in categorising, managing, and retrieving information. In the academic domain, KOSs are often adopted for representing research areas and their relationships, primarily aiming to classify research articles, academic courses, patents, books, scientific venues, domain experts, grants, software, experiment materials, and several other relevant products and agents. These structured representations of research areas, widely embraced by many academic fields, have proven effective in empowering AI-based systems to i) enhance retrievability of relevant documents, ii) enable advanced analytic solutions to quantify the impact of academic research, and iii) analyse and forecast research dynamics. This paper aims to present a comprehensive survey of the current KOS for academic disciplines. We analysed and compared 45 KOSs according to five main dimensions: scope, structure, curation, usage, and links to other KOSs. Our results reveal a very heterogeneous scenario in terms of scope, scale, quality, and usage, highlighting the need for more integrated solutions for representing research knowledge across academic fields. We conclude by discussing the main challenges and the most promising future directions.

2024-09-06T17:54:43Z Published at Quantitative Science Studies Angelo Salatino Tanay Aggarwal Andrea Mannocci Francesco Osborne Enrico Motta 10.1162/qss_a_00363 http://arxiv.org/abs/2412.08258v2 Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field 2025-06-11T08:58:18Z

Ontologies of research topics are crucial for structuring scientific knowledge, enabling scientists to navigate vast amounts of research, and forming the backbone of intelligent systems such as search engines and recommendation systems. However, manual creation of these ontologies is expensive, slow, and often results in outdated and overly general representations. As a solution, researchers have been investigating ways to automate or semi-automate the process of generating these ontologies. This paper offers a comprehensive analysis of the ability of large language models (LLMs) to identify semantic relationships between different research topics, which is a critical step in the development of such ontologies. To this end, we developed a gold standard based on the IEEE Thesaurus to evaluate the task of identifying four types of relationships between pairs of topics: broader, narrower, same-as, and other. Our study evaluates the performance of seventeen LLMs, which differ in scale, accessibility (open vs. proprietary), and model type (full vs. quantised), while also assessing four zero-shot reasoning strategies. Several models have achieved outstanding results, including Mixtral-8x7B, Dolphin-Mistral-7B, and Claude 3 Sonnet, with F1-scores of 0.847, 0.920, and 0.967, respectively. Furthermore, our findings demonstrate that smaller, quantised models, when optimised through prompt engineering, can deliver performance comparable to much larger proprietary models, while requiring significantly fewer computational resources.

2024-12-11T10:11:41Z Now accepted to Information Processing & Management. This is the camera-ready version. Tanay Aggarwal Angelo Salatino Francesco Osborne Enrico Motta http://arxiv.org/abs/2402.08640v4 Forecasting high-impact research topics via machine learning on evolving knowledge graphs 2025-06-11T07:14:34Z

The exponential growth in scientific publications poses a severe challenge for human researchers. It forces attention to more narrow sub-fields, which makes it challenging to discover new impactful research ideas and collaborations outside one's own field. While there are ways to predict a scientific paper's future citation counts, they need the research to be finished and the paper written, usually assessing impact long after the idea was conceived. Here we show how to predict the impact of onsets of ideas that have never been published by researchers. For that, we developed a large evolving knowledge graph built from more than 21 million scientific papers. It combines a semantic network created from the content of the papers and an impact network created from the historic citations of papers. Using machine learning, we can predict the dynamic of the evolving network into the future with high accuracy (AUC values beyond 0.9 for most experiments), and thereby the impact of new research directions. We envision that the ability to predict the impact of new ideas will be a crucial component of future artificial muses that can inspire new impactful and interesting scientific ideas.

2024-02-13T18:09:38Z 15 pages, 12 figures, Comments welcome! Mach. Learn.: Sci. Technol. 6 025041 (2025) Xuemei Gu Mario Krenn 10.1088/2632-2153/add6ef http://arxiv.org/abs/2412.18063v2 LMRPA: Large Language Model-Driven Efficient Robotic Process Automation for OCR 2025-06-10T09:32:11Z

This paper introduces LMRPA, a novel Large Model-Driven Robotic Process Automation (RPA) model designed to greatly improve the efficiency and speed of Optical Character Recognition (OCR) tasks. Traditional RPA platforms often suffer from performance bottlenecks when handling high-volume repetitive processes like OCR, leading to a less efficient and more time-consuming process. LMRPA allows the integration of Large Language Models (LLMs) to improve the accuracy and readability of extracted text, overcoming the challenges posed by ambiguous characters and complex text structures. Extensive benchmarks were conducted comparing LMRPA to leading RPA platforms, including UiPath and Automation Anywhere, using OCR engines like Tesseract and DocTR. The results show that LMRPA achieves superior performance, cutting processing times by up to 52%. For instance, in Batch 2 of the Tesseract OCR task, LMRPA completed the process in 9.8 seconds, whereas UiPath finished in 18.1 seconds and Automation Anywhere finished in 18.7 seconds. Similar improvements were observed with DocTR, where LMRPA outperformed other automation tools by completing tasks in 12.7 seconds, while competitors took over 20 seconds for the same process. These findings highlight the potential of LMRPA to revolutionize OCR-driven automation processes, offering a more efficient and effective alternative to existing state-of-the-art RPA models.

2024-12-24T00:21:36Z 10 pages, 1 figure, 1 algorithm Osama Hosam Abdellaif Abdelrahman Nader Ali Hamdi http://arxiv.org/abs/2506.08300v1 Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability 2025-06-10T00:11:30Z

Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.

2025-06-10T00:11:30Z Matteo Cargnelutti Catherine Brobston John Hess Jack Cushman Kristi Mukk Aristana Scourtas Kyle Courtney Greg Leppert Amanda Watson Martha Whitehead Jonathan Zittrain http://arxiv.org/abs/2506.08199v1 Extracting Information About Publication Venues Using Citation-Informed Transformers 2025-06-09T20:20:16Z

Scientific document embeddings contain a variety of rich features which can be harnessed for downstream tasks such as recommendation, ranking, and clustering. We explore which tangible insights can be drawn from scientific document embeddings to understand trends in computer science research featured across nine well-known venues. We collect approximately 60,000 scientific documents published between 2015 and 2023 and analyze their embeddings, which we produce with the SPECTER pre-trained language model. In particular, we examine whether similarity between two venues can be measured using the embeddings of the scientific documents they admit for publication. Our findings indicate that some venues within computer science are indistinguishable when only considering the distributions of their document embeddings. We additionally examine whether any two venues are becoming increasingly similar over time and identify a trend of convergence within some venues in our analysis. We discuss the implications of these results and the potential impact on new scientific contributions.

2025-06-09T20:20:16Z Brian D. Zimmerman Joshua Folkins Olga Vechtomova http://arxiv.org/abs/2506.08196v1 No Stupid Questions: An Analysis of Question Query Generation for Citation Recommendation 2025-06-09T20:13:32Z

Existing techniques for citation recommendation are constrained by their adherence to article contents and metadata. We leverage GPT-4o-mini's latent expertise as an inquisitive assistant by instructing it to ask questions which, when answered, could expose new insights about an excerpt from a scientific article. We evaluate the utility of these questions as retrieval queries, measuring their effectiveness in retrieving and ranking masked target documents. In some cases, generated questions ended up being better queries than extractive keyword queries generated by the same model. We additionally propose MMR-RBO, a variation of Maximal Marginal Relevance (MMR) using Rank-Biased Overlap (RBO) to identify which questions will perform competitively with the keyword baseline. As all question queries yield unique result sets, we contend that there are no stupid questions.
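
Illustrative note: the abstract above names MMR-RBO but does not give its details. As a minimal sketch, the standard greedy Maximal Marginal Relevance selection it builds on can be written as below. The names `mmr_select`, `relevance`, `similarity` and `lam` are hypothetical; in the paper's variant the similarity would presumably be Rank-Biased Overlap between the result lists of two queries, which is not reproduced here.

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.5):
    """Greedy Maximal Marginal Relevance (generic form, not the paper's MMR-RBO).

    candidates: list of hashable items (e.g. generated questions)
    relevance:  dict mapping each item to a relevance score
    similarity: function (item, item) -> similarity in [0, 1]
    k:          number of items to select
    lam:        trade-off; 1.0 = relevance only, 0.0 = diversity only
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(item):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Hypothetical scores: "a" and "b" are near-duplicates, "c" is distinct.
    rel = {"a": 0.9, "b": 0.85, "c": 0.5}
    sim = lambda x, y: 1.0 if {x, y} == {"a", "b"} else 0.0
    print(mmr_select(["a", "b", "c"], rel, sim, k=2, lam=0.5))
```

With `lam=0.5` the near-duplicate "b" is passed over in favour of the less relevant but distinct "c"; with `lam=1.0` the selection falls back to pure relevance ranking.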

2025-06-09T20:13:32Z 6 pages, 5 figures, 2 tables Brian D. Zimmerman Julien Aubert-Béduchaud Florian Boudin Akiko Aizawa Olga Vechtomova http://arxiv.org/abs/2506.07748v1 Research quality evaluation by AI in the era of Large Language Models: Advantages, disadvantages, and systemic effects 2025-06-09T13:32:06Z

Artificial Intelligence (AI) technologies like ChatGPT now threaten bibliometrics as the primary generators of research quality indicators. They are already used in at least one research quality evaluation system and evidence suggests that they are used informally by many peer reviewers. Since using bibliometrics to support research evaluation continues to be controversial, this article reviews the corresponding advantages and disadvantages of AI-generated quality scores. From a technical perspective, generative AI based on Large Language Models (LLMs) equals or surpasses bibliometrics in most important dimensions, including accuracy (mostly higher correlations with human scores), and coverage (more fields, more recent years) and may reflect more research quality dimensions. Like bibliometrics, current LLMs do not "measure" research quality, however. On the clearly negative side, LLM biases are currently unknown for research evaluation, and LLM scores are less transparent than citation counts. From a systemic perspective, the key issue is how introducing LLM-based indicators into research evaluation will change the behaviour of researchers. Whilst bibliometrics encourage some authors to target journals with high impact factors or to try to write highly cited work, LLM-based indicators may push them towards writing misleading abstracts and overselling their work in the hope of impressing the AI. Moreover, if AI-generated journal indicators replace impact factors, then this would encourage journals to allow authors to oversell their work in abstracts, threatening the integrity of the academic record.

2025-06-09T13:32:06Z Mike Thelwall http://arxiv.org/abs/2506.06753v1 Influential scientists shape knowledge flows between science and IGO policy 2025-06-07T10:50:00Z

Intergovernmental organizations (IGOs) increasingly rely on scientific evidence, yet the pathways through which scientific research enters policy remain opaque. By linking 230,737 scientific papers cited in IGO policy documents (2015-2023) to their authors and collaboration networks, we identify a small group of policy-influential scientists (PI-Sci) who dominate this knowledge flow. These scientists form tightly interconnected, internationally spanning co-authorship networks and achieve policy citations shortly after publication, a distinctive feature of cumulative advantage at the science-policy interface. The concentration of influence varies by field: tightly clustered in established domains like climate modeling, and more dispersed in emerging areas like AI governance. Many PI-Sci serve on high-level advisory bodies (e.g., IPCC), and major IGOs frequently co-cite the same PI-Sci papers, indicating synchronized knowledge diffusion through shared expert networks. These findings reveal how network structure and elite brokerage shape the translation of research into global policy, highlighting opportunities to broaden the scope of knowledge that informs policy.

2025-06-07T10:50:00Z Proc. Natl. Acad. Sci. U.S.A. 123(17), e2514861123 (2026) Kimitaka Asatani Yurie Iwata Yuta Tomokiyo Basil Mahfouz Masaru Yarime Ichiro Sakata 10.1073/pnas.2514861123 http://arxiv.org/abs/2503.18236v5 Research impact evaluation based on effective authorship contribution sensitivity: h-leadership index 2025-06-06T22:57:14Z

The evaluation of a researcher's performance has traditionally relied on various bibliometric measures, with the h-index being one of the most prominent. However, the h-index only accounts for the number of citations received by a publication and does not account for other factors such as the number of authors or their specific contributions in collaborative works. Therefore, the h-index has come under scrutiny, as it has motivated academic integrity issues where non-contributing authors get authorship merely to raise their h-index. In this study, we comprehensively evaluate existing metrics in their ability to account for authorship contribution by position and introduce a novel variant of the h-index, known as the h-leadership index. The h-leadership index aims to advance the fair evaluation of academic contributions in multi-authored publications by giving importance to authorship position beyond the first and last authors, motivated by Stanford's ranking of the top 2% of world scientists. We assign weighted citations based on a modified complementary unit Gaussian curve, ensuring that the contributions of middle authors are appropriately recognised. We apply the h-leadership index to analyse the top 50 researchers across the Group of 8 (Go8) universities in Australia, demonstrating its potential to provide a more balanced assessment of research performance. We provide open-source software for extending the work further.
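
Illustrative note: as a minimal sketch of the idea in the abstract above, the standard h-index can be computed from citation counts, and a position-sensitive variant can be obtained by scaling each paper's citations by a weight before applying the same rule. The function `weighted_h_index` and its caller-supplied weights are hypothetical; the abstract does not specify the authors' modified complementary Gaussian curve, so no particular weighting function is assumed here.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def weighted_h_index(papers):
    """h-index over position-weighted citations.

    papers: iterable of (citations, weight) pairs, where weight in [0, 1]
    reflects the researcher's authorship position on that paper.
    ASSUMPTION: weights are supplied by the caller; the paper's actual
    weighting curve is not reproduced here.
    """
    return h_index([count * weight for count, weight in papers])


if __name__ == "__main__":
    # Hypothetical record: five papers with raw citation counts.
    print(h_index([10, 8, 5, 4, 3]))
    # Same record, with lower weights on two papers.
    print(weighted_h_index([(10, 1.0), (8, 0.5), (5, 0.2), (4, 1.0), (3, 1.0)]))
```

Down-weighting two of the papers lowers the index in this example, which shows how authorship position can change a ranking that raw citation counts leave untouched.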

2025-03-23T23:06:50Z Hardik A. Jain Rohitash Chandra http://arxiv.org/abs/2410.02701v2 Impact of a reclassification of Web of Science articles on bibliometric indicators 2025-06-06T08:26:55Z

This work aims at evaluating a reclassification of Web of Science articles implemented at OST. Articles from the 254 scientific categories of the Web of Science were reclassified at article level into 242 modified categories and 11 disciplines using the method of S. Milojević (2020). The reclassification is based on the categories of each paper's references, and it no longer assigns papers to multiple or multidisciplinary categories. It improves the accuracy and the modularity of the WoS classification. As there are important changes in document assignment at the lowest level, usual indicators such as disciplinary profiles or field-normalized indicators are significantly modified. This study examines some of these modifications to provide explanations for the recipients of OST reports. Changes in specialization indexes reveal specific journal choices by scientists. In a sample of 25 countries, Brazil and China offer examples of facilities or constraints for selecting journals to publish certain research works.

2024-10-03T17:26:16Z 24 pages, 25 figures Agénor Lahatte Élisabeth de Turckheim http://arxiv.org/abs/2505.20944v2 International collaboration of Ukrainian scholars: Effects of Russia's full-scale invasion of Ukraine 2025-06-04T14:55:52Z

This study explores the effects of Russia's full-scale invasion of Ukraine on the international collaboration of Ukrainian scholars. First and foremost, Ukrainian scholars deserve respect for continuing to publish despite life-threatening conditions, mental strain, shelling and blackouts. In 2022-2023, universities gained more from international collaboration than the NASU. The percentage of internationally co-authored articles remained unchanged for the NASU, while it increased for universities. In 2023, 40.8% of articles published by the NASU and 32.2% of articles published by universities were internationally co-authored. However, these figures are still much lower than in developed countries (60-70%). The citation impact of internationally co-authored articles remained statistically unchanged for the NASU but increased for universities. The highest share of internationally co-authored articles published by the NASU in both periods was in the physical sciences and engineering. However, the citation impact of these articles declined in 2022-2023, nearly erasing their previous citation advantage over university publications. Universities consistently outperformed the NASU in the citation impact of internationally co-authored articles in biomedical and health sciences across both periods. International collaboration can help Ukrainian scholars get through this difficult time. In turn, they can contribute to the strengthening of Europe.

2025-05-27T09:28:21Z Higher Education (2026) Myroslava Hladchenko 10.1007/s10734-025-01577-y http://arxiv.org/abs/2506.09056v1 MetaInfoSci: An Integrated Web Tool for Scholarly Data Analysis 2025-06-04T07:21:22Z

The exponential increase in academic publications has made it increasingly difficult for researchers to remain up to date and systematically synthesize knowledge scattered across vast and fragmented research domains. Literature reviews, particularly those supported by bibliometric methods, have become essential in organizing prior findings and guiding future research directions. While numerous tools exist for bibliometric analysis and network science, there is currently no single platform that integrates the full range of features from both domains. Researchers are often required to navigate multiple software environments, many of which lack customizable visualizations, cross-database integration, and AI-assisted result summarization. Addressing these limitations, this study introduces MetaInfoSci at www.metainfosci.com, a comprehensive, web-based platform designed to unify bibliometric, scientometric, and network analytical capabilities. The platform supports tailored query design, merges data from diverse sources, enables rich and adaptable visual outputs, and provides automated, AI-driven summaries of analytical results. This integrated approach aims to enhance the accessibility, efficiency, and depth of scientific literature analysis for scholars across disciplines.

2025-06-04T07:21:22Z Kiran Sharmaa Parul Khurana Ziya Uddina http://arxiv.org/abs/2506.03587v1 Preface to the Special Issue of the TAL Journal on Scholarly Document Processing 2025-06-04T05:35:39Z

The rapid growth of scholarly literature makes it increasingly difficult for researchers to keep up with new knowledge. Automated tools are now more essential than ever to help navigate and interpret this vast body of information. Scientific papers pose unique difficulties, with their complex language, specialized terminology, and diverse formats, requiring advanced methods to extract reliable and actionable insights. Large language models (LLMs) offer new opportunities, enabling tasks such as literature reviews, writing assistance, and interactive exploration of research. This special issue of the TAL journal highlights research addressing these challenges and, more broadly, research on natural language processing and information retrieval for scholarly and scientific documents.

2025-06-04T05:35:39Z Traitement Automatique des Langues (TAL), volume 25, n°2/2024 Florian Boudin Akiko Aizawa