https://arxiv.org/api/tCKpNm9sgIF0ZonwSPe+KqD7Aj0
2026-06-14T17:33:32Z
606576515

http://arxiv.org/abs/2506.07547v2
From Rapid Release to Reinforced Elite: Citation Inequality Is Stronger in Preprints than Journals
2025-06-11T12:47:21Z
Preprints have been considered primarily as a supplement to journal-based systems for the rapid dissemination of relevant scientific knowledge and have historically been supported by studies indicating that preprints and published reports have comparable authorship, references, and quality. However, as preprints increasingly serve as an independent medium for scholarly communication rather than precursors to the version of record, it remains uncertain how preprint usage is shaping scientific discourse. Our research revealed that the preprint citations exhibit significantly higher inequality than journal citations, consistently among categories. This trend persisted even when controlling for age and the mean citation count of the journal matched to each of the preprint categories. We also found that the citation inequality in preprints is not solely driven by a few highly cited papers or those with no impact, but rather reflects a broader systemic effect. Whether the preprint is subsequently published in a journal or not does not significantly affect the citation inequality. Further analyses of the structural factors show that preferential attachment does not significantly contribute to citation inequality in preprints, whereas author prestige plays a substantial role. Notably, the gap in citation inequality between the preprint category and the journal is more pronounced in fields where preprints are more established, such as mathematics, physics, and high-energy physics.
This highlights a potential vulnerability in preprint ecosystems where reputation-driven citation may hinder scientific diversity.
2025-06-09T08:38:17Z
Chiaki Miura, Ichiro Sakata

http://arxiv.org/abs/2409.04432v3
A Survey on Knowledge Organization Systems of Research Fields: Resources and Challenges
2025-06-11T09:15:33Z
Knowledge Organization Systems (KOSs), such as term lists, thesauri, taxonomies, and ontologies, play a fundamental role in categorising, managing, and retrieving information. In the academic domain, KOSs are often adopted for representing research areas and their relationships, primarily aiming to classify research articles, academic courses, patents, books, scientific venues, domain experts, grants, software, experiment materials, and several other relevant products and agents. These structured representations of research areas, widely embraced by many academic fields, have proven effective in empowering AI-based systems to i) enhance retrievability of relevant documents, ii) enable advanced analytic solutions to quantify the impact of academic research, and iii) analyse and forecast research dynamics. This paper aims to present a comprehensive survey of the current KOS for academic disciplines. We analysed and compared 45 KOSs according to five main dimensions: scope, structure, curation, usage, and links to other KOSs. Our results reveal a very heterogeneous scenario in terms of scope, scale, quality, and usage, highlighting the need for more integrated solutions for representing research knowledge across academic fields.
We conclude by discussing the main challenges and the most promising future directions.
2024-09-06T17:54:43Z
Published at Quantitative Science Studies
Angelo Salatino, Tanay Aggarwal, Andrea Mannocci, Francesco Osborne, Enrico Motta
10.1162/qss_a_00363

http://arxiv.org/abs/2412.08258v2
Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field
2025-06-11T08:58:18Z
Ontologies of research topics are crucial for structuring scientific knowledge, enabling scientists to navigate vast amounts of research, and forming the backbone of intelligent systems such as search engines and recommendation systems. However, manual creation of these ontologies is expensive, slow, and often results in outdated and overly general representations. As a solution, researchers have been investigating ways to automate or semi-automate the process of generating these ontologies. This paper offers a comprehensive analysis of the ability of large language models (LLMs) to identify semantic relationships between different research topics, which is a critical step in the development of such ontologies. To this end, we developed a gold standard based on the IEEE Thesaurus to evaluate the task of identifying four types of relationships between pairs of topics: broader, narrower, same-as, and other. Our study evaluates the performance of seventeen LLMs, which differ in scale, accessibility (open vs. proprietary), and model type (full vs. quantised), while also assessing four zero-shot reasoning strategies. Several models have achieved outstanding results, including Mixtral-8x7B, Dolphin-Mistral-7B, and Claude 3 Sonnet, with F1-scores of 0.847, 0.920, and 0.967, respectively.
Furthermore, our findings demonstrate that smaller, quantised models, when optimised through prompt engineering, can deliver performance comparable to much larger proprietary models, while requiring significantly fewer computational resources.
2024-12-11T10:11:41Z
Now accepted to Information Processing & Management. This is the camera-ready version.
Tanay Aggarwal, Angelo Salatino, Francesco Osborne, Enrico Motta

http://arxiv.org/abs/2402.08640v4
Forecasting high-impact research topics via machine learning on evolving knowledge graphs
2025-06-11T07:14:34Z
The exponential growth in scientific publications poses a severe challenge for human researchers. It forces attention to more narrow sub-fields, which makes it challenging to discover new impactful research ideas and collaborations outside one's own field. While there are ways to predict a scientific paper's future citation counts, they need the research to be finished and the paper written, usually assessing impact long after the idea was conceived. Here we show how to predict the impact of onsets of ideas that have never been published by researchers. For that, we developed a large evolving knowledge graph built from more than 21 million scientific papers. It combines a semantic network created from the content of the papers and an impact network created from the historic citations of papers. Using machine learning, we can predict the dynamic of the evolving network into the future with high accuracy (AUC values beyond 0.9 for most experiments), and thereby the impact of new research directions. We envision that the ability to predict the impact of new ideas will be a crucial component of future artificial muses that can inspire new impactful and interesting scientific ideas.
2024-02-13T18:09:38Z
15 pages, 12 figures, Comments welcome!
Mach. Learn.: Sci. Technol. 
6 025041 (2025)
Xuemei Gu, Mario Krenn
10.1088/2632-2153/add6ef

http://arxiv.org/abs/2412.18063v2
LMRPA: Large Language Model-Driven Efficient Robotic Process Automation for OCR
2025-06-10T09:32:11Z
This paper introduces LMRPA, a novel Large Model-Driven Robotic Process Automation (RPA) model designed to greatly improve the efficiency and speed of Optical Character Recognition (OCR) tasks. Traditional RPA platforms often suffer from performance bottlenecks when handling high-volume repetitive processes like OCR, leading to a less efficient and more time-consuming process. LMRPA allows the integration of Large Language Models (LLMs) to improve the accuracy and readability of extracted text, overcoming the challenges posed by ambiguous characters and complex text structures. Extensive benchmarks were conducted comparing LMRPA to leading RPA platforms, including UiPath and Automation Anywhere, using OCR engines like Tesseract and DocTR. The results show that LMRPA achieves superior performance, cutting the processing times by up to 52%. For instance, in Batch 2 of the Tesseract OCR task, LMRPA completed the process in 9.8 seconds, where UiPath finished in 18.1 seconds and Automation Anywhere finished in 18.7 seconds. Similar improvements were observed with DocTR, where LMRPA outperformed other automation tools conducting the same process by completing tasks in 12.7 seconds, while competitors took over 20 seconds to do the same.
These findings highlight the potential of LMRPA to revolutionize OCR-driven automation processes, offering a more efficient and effective alternative solution to the existing state-of-the-art RPA models.
2024-12-24T00:21:36Z
10 pages, 1 figure, 1 algorithm
Osama Hosam Abdellaif, Abdelrahman Nader, Ali Hamdi

http://arxiv.org/abs/2506.08300v1
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability
2025-06-10T00:11:30Z
Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available.
This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.
2025-06-10T00:11:30Z
Matteo Cargnelutti, Catherine Brobston, John Hess, Jack Cushman, Kristi Mukk, Aristana Scourtas, Kyle Courtney, Greg Leppert, Amanda Watson, Martha Whitehead, Jonathan Zittrain

http://arxiv.org/abs/2506.08199v1
Extracting Information About Publication Venues Using Citation-Informed Transformers
2025-06-09T20:20:16Z
Scientific document embeddings contain a variety of rich features which can be harnessed for downstream tasks such as recommendation, ranking, and clustering. We explore which tangible insights can be drawn from scientific document embeddings to understand trends in computer science research featured across nine well-known venues. We collect approximately 60,000 scientific documents published between 2015 and 2023 and analyze their embeddings, which we produce with the SPECTER pre-trained language model. In particular, we examine whether similarity between two venues can be measured using the embeddings of the scientific documents they admit for publication. Our findings indicate that some venues within computer science are indistinguishable when only considering the distributions of their document embeddings. We additionally examine whether any two venues are becoming increasingly similar over time and identify a trend of convergence within some venues in our analysis. We discuss the implications of these results and the potential impact on new scientific contributions.
2025-06-09T20:20:16Z
Brian D. Zimmerman, Joshua Folkins, Olga Vechtomova

http://arxiv.org/abs/2506.08196v1
No Stupid Questions: An Analysis of Question Query Generation for Citation Recommendation
2025-06-09T20:13:32Z
Existing techniques for citation recommendation are constrained by their adherence to article contents and metadata.
We leverage GPT-4o-mini's latent expertise as an inquisitive assistant by instructing it to ask questions which, when answered, could expose new insights about an excerpt from a scientific article. We evaluate the utility of these questions as retrieval queries, measuring their effectiveness in retrieving and ranking masked target documents. In some cases, generated questions ended up being better queries than extractive keyword queries generated by the same model. We additionally propose MMR-RBO, a variation of Maximal Marginal Relevance (MMR) using Rank-Biased Overlap (RBO) to identify which questions will perform competitively with the keyword baseline. As all question queries yield unique result sets, we contend that there are no stupid questions.
2025-06-09T20:13:32Z
6 pages, 5 figures, 2 tables
Brian D. Zimmerman, Julien Aubert-Béduchaud, Florian Boudin, Akiko Aizawa, Olga Vechtomova

http://arxiv.org/abs/2506.07748v1
Research quality evaluation by AI in the era of Large Language Models: Advantages, disadvantages, and systemic effects
2025-06-09T13:32:06Z
Artificial Intelligence (AI) technologies like ChatGPT now threaten bibliometrics as the primary generators of research quality indicators. They are already used in at least one research quality evaluation system and evidence suggests that they are used informally by many peer reviewers. Since using bibliometrics to support research evaluation continues to be controversial, this article reviews the corresponding advantages and disadvantages of AI-generated quality scores. From a technical perspective, generative AI based on Large Language Models (LLMs) equals or surpasses bibliometrics in most important dimensions, including accuracy (mostly higher correlations with human scores), and coverage (more fields, more recent years) and may reflect more research quality dimensions. Like bibliometrics, current LLMs do not "measure" research quality, however.
On the clearly negative side, LLM biases are currently unknown for research evaluation, and LLM scores are less transparent than citation counts. From a systemic perspective, the key issue is how introducing LLM-based indicators into research evaluation will change the behaviour of researchers. Whilst bibliometrics encourage some authors to target journals with high impact factors or to try to write highly cited work, LLM-based indicators may push them towards writing misleading abstracts and overselling their work in the hope of impressing the AI. Moreover, if AI-generated journal indicators replace impact factors, then this would encourage journals to allow authors to oversell their work in abstracts, threatening the integrity of the academic record.
2025-06-09T13:32:06Z
Mike Thelwall

http://arxiv.org/abs/2506.06753v1
Influential scientists shape knowledge flows between science and IGO policy
2025-06-07T10:50:00Z
Intergovernmental organizations (IGOs) increasingly rely on scientific evidence, yet the pathways through which scientific research enters policy remain opaque. By linking 230,737 scientific papers cited in IGO policy documents (2015-2023) to their authors and collaboration networks, we identify a small group of policy-influential scientists (PI-Sci) who dominate this knowledge flow. These scientists form tightly interconnected, internationally spanning co-authorship networks and achieve policy citations shortly after publication, a distinctive feature of cumulative advantage at the science-policy interface. The concentration of influence varies by field: tightly clustered in established domains like climate modeling, and more dispersed in emerging areas like AI governance. Many PI-Sci serve on high-level advisory bodies (e.g., IPCC), and major IGOs frequently co-cite the same PI-Sci papers, indicating synchronized knowledge diffusion through shared expert networks.
These findings reveal how network structure and elite brokerage shape the translation of research into global policy, highlighting opportunities to broaden the scope of knowledge that informs policy.
2025-06-07T10:50:00Z
Proc. Natl. Acad. Sci. U.S.A. 123(17), e2514861123 (2026)
Kimitaka Asatani, Yurie Iwata, Yuta Tomokiyo, Basil Mahfouz, Masaru Yarime, Ichiro Sakata
10.1073/pnas.2514861123

http://arxiv.org/abs/2503.18236v5
Research impact evaluation based on effective authorship contribution sensitivity: h-leadership index
2025-06-06T22:57:14Z
The evaluation of a researcher's performance has traditionally relied on various bibliometric measures, with the h-index being one of the most prominent. However, the h-index only accounts for the number of citations received in a publication and does not account for other factors such as the number of authors or their specific contributions in collaborative works. Therefore, the h-index has come under scrutiny as it has motivated academic integrity issues where non-contributing authors get authorship merely for raising their h-index. In this study, we comprehensively evaluate existing metrics in their ability to account for authorship contribution by their position and introduce a novel variant of the h-index, known as the h-leadership index. The h-leadership index aims to advance the fair evaluation of academic contributions in multi-authored publications by giving importance to authorship position beyond the first and last authors, motivated by Stanford's ranking of the top 2% of world scientists. We assign weighted citations based on a modified complementary unit Gaussian curve, ensuring that the contributions of middle authors are appropriately recognised. We apply the h-leadership index to analyse the top 50 researchers across the Group of 8 (Go8) universities in Australia, demonstrating its potential to provide a more balanced assessment of research performance.
We provide open-source software for extending the work further.
2025-03-23T23:06:50Z
Hardik A. Jain, Rohitash Chandra

http://arxiv.org/abs/2410.02701v2
Impact of a reclassification of Web of Science articles on bibliometric indicators
2025-06-06T08:26:55Z
This work aims at evaluating a reclassification of Web of Science articles implemented at OST. Articles from the 254 scientific categories of the Web of Science were reclassified at article level in 242 modified categories and 11 disciplines using the method of S. Milojević (2020). The reclassification is based on paper references categories and it no longer assigns papers to multiple or to multidisciplinary categories. It improves the accuracy and the modularity of the WoS classification. As there are important changes in document assignment at the lowest level, usual indicators such as disciplinary profiles or field normalized indicators are significantly modified. This study examines some of these modifications to provide explanations for the recipients of OST reports. Changes in specialization indexes reveal specific journal choices by scientists. In a sample of 25 countries, Brazil and China offer examples of facilities or constraints for selecting journals to publish certain research works.
2024-10-03T17:26:16Z
24 pages, 25 figures
Agénor Lahatte, Élisabeth de Turckheim

http://arxiv.org/abs/2505.20944v2
International collaboration of Ukrainian scholars: Effects of Russia's full-scale invasion of Ukraine
2025-06-04T14:55:52Z
This study explores the effects of Russia's full-scale invasion of Ukraine on the international collaboration of Ukrainian scholars. First and foremost, Ukrainian scholars deserve respect for continuing to publish despite life-threatening conditions, mental strain, shelling and blackouts. In 2022-2023, universities gained more from international collaboration than the NASU. The percentage of internationally co-authored articles remained unchanged for the NASU, while it increased for universities.
In 2023, 40.8% of articles published by the NASU and 32.2% of articles published by universities were internationally co-authored. However, these figures are still much lower than in developed countries (60-70%). The citation impact of internationally co-authored articles remained statistically unchanged for the NASU but increased for universities. The highest share of internationally co-authored articles published by the NASU in both periods was in the physical sciences and engineering. However, the citation impact of these articles declined in 2022-2023, nearly erasing their previous citation advantage over university publications. Universities consistently outperformed the NASU in the citation impact of internationally co-authored articles in biomedical and health sciences across both periods. International collaboration can help Ukrainian scholars to go through this difficult time. In turn, they can contribute to the strengthening of Europe.
2025-05-27T09:28:21Z
Higher Education (2026)
Myroslava Hladchenko
10.1007/s10734-025-01577-y

http://arxiv.org/abs/2506.09056v1
MetaInfoSci: An Integrated Web Tool for Scholarly Data Analysis
2025-06-04T07:21:22Z
The exponential increase in academic publications has made it increasingly difficult for researchers to remain up to date and systematically synthesize knowledge scattered across vast and fragmented research domains. Literature reviews, particularly those supported by bibliometric methods, have become essential in organizing prior findings and guiding future research directions. While numerous tools exist for bibliometric analysis and network science, there is currently no single platform that integrates the full range of features from both domains. Researchers are often required to navigate multiple software environments, many of which lack customizable visualizations, cross-database integration, and AI-assisted result summarization.
Addressing these limitations, this study introduces MetaInfoSci at www.metainfosci.com, a comprehensive, web-based platform designed to unify bibliometric, scientometric, and network analytical capabilities. The platform supports tailored query design, merges data from diverse sources, enables rich and adaptable visual outputs, and provides automated, AI-driven summaries of analytical results. This integrated approach aims to enhance the accessibility, efficiency, and depth of scientific literature analysis for scholars across disciplines.
2025-06-04T07:21:22Z
Kiran Sharmaa, Parul Khurana, Ziya Uddina

http://arxiv.org/abs/2506.03587v1
Preface to the Special Issue of the TAL Journal on Scholarly Document Processing
2025-06-04T05:35:39Z
The rapid growth of scholarly literature makes it increasingly difficult for researchers to keep up with new knowledge. Automated tools are now more essential than ever to help navigate and interpret this vast body of information. Scientific papers pose unique difficulties, with their complex language, specialized terminology, and diverse formats, requiring advanced methods to extract reliable and actionable insights. Large language models (LLMs) offer new opportunities, enabling tasks such as literature reviews, writing assistance, and interactive exploration of research. This special issue of the TAL journal highlights research addressing these challenges and, more broadly, research on natural language processing and information retrieval for scholarly and scientific documents.
2025-06-04T05:35:39Z
Traitement Automatique des Langues (TAL), volume 25, n°2/2024
Florian Boudin, Akiko Aizawa