https://arxiv.org/api/LAf2p5ax8WmUabsdIbvPhCEMWDI 2026-03-22T14:49:33Z 5870 75 15 http://arxiv.org/abs/2510.13201v2 Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences 2026-02-09T06:41:45Z The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system. 2025-10-15T06:41:06Z ICLR 2026. https://papercopilot.com/ Jing Yang Qiyao Wei Jiaxin Pei http://arxiv.org/abs/2602.07869v1 Recall, Risk, and Governance in Automated Proposal Screening for Research Funding: Evidence from a National Funding Programme 2026-02-08T08:49:08Z Research funding agencies are increasingly exploring automated tools to support early-stage proposal screening. Recent advances in large language models (LLMs) have generated optimism regarding their use for text-based evaluation, yet their institutional suitability for high-stakes screening decisions remains underexplored. 
In particular, there is limited empirical evidence on how automated screening systems perform when evaluated against institutional error costs. This study compares two automated approaches for proposal screening against the priorities of a national funding call: a transparent, rule-based method using term frequency-inverse document frequency (TF-IDF) with domain-specific keyword engineering, and a semantic classification approach based on a large language model. Using selection committee decisions as ground truth for 959 proposals, we evaluate performance with particular attention to error structure. The results show that the TF-IDF-based approach outperforms the LLM-based system across standard metrics, achieving substantially higher recall (78.95% vs 45.82%) and producing far fewer false negatives (68 vs 175). The LLM-based system excludes more than half of the proposals ultimately selected by the committee. While false positives can be corrected through subsequent peer review, false negatives represent an irrecoverable exclusion from expert evaluation. By foregrounding error asymmetry and institutional context, this study demonstrates that the suitability of automated screening systems depends not on model sophistication alone, but on how their error profiles, transparency, and auditability align with research evaluation practice. These findings suggest that evaluation design and error tolerance should guide the use of AI-assisted screening tools in research funding more broadly. 2026-02-08T08:49:08Z 6 tables Chandan G. Nagarajappa Moumita Koley Avinash Kumar Rabindra Panigrahy Pramod Kumar Arya http://arxiv.org/abs/2504.04464v2 In which fields do ChatGPT scores align better than citations with research quality? 
2026-02-08T06:23:23Z Although citation-based indicators are widely used for research evaluation, they are not useful for recently published research, reflect only one of the three common dimensions of research quality, and have little value in some social sciences, arts and humanities. Large Language Models (LLMs) have been shown to address some of these weaknesses, with ChatGPT-4o mini showing the most promising results, although on incomplete data. This article reports by far the largest-scale evaluation of ChatGPT-4o mini yet and also evaluates its larger sibling ChatGPT-4o and ChatGPT-5 mini. Based on comparisons between LLM scores, averaged over 5 repetitions, and departmental average quality scores for 107,212 UK-based refereed journal articles, ChatGPT-4o is marginally better than ChatGPT-4o mini in most of the 34 field-based Units of Assessment (UoAs) tested, although combining both gives better results than either one. ChatGPT-4o scores have a positive correlation with research quality in 33 of the 34 UoAs, with the results being statistically significant in 31. The most substantial exception is Physics, for which citations are more useful. ChatGPT-4o scores had a higher correlation with research quality than long-term citation rates in 21 out of 34 UoAs and a higher correlation than short-term citation rates in 26 out of 34 UoAs. ChatGPT-5 mini has even stronger correlations overall. In summary, the results give the first large-scale evidence that ChatGPT-4o and ChatGPT-5 mini are competitive with citations as new research quality indicator sources. 2025-04-06T12:25:41Z Mike Thelwall http://arxiv.org/abs/2602.07664v1 Assessing the impact of Open Research Information Infrastructures using NLP driven full-text Scientometrics: A case study of the LXCat open-access platform 2026-02-07T19:15:40Z Open research information (ORI) infrastructures play a central role in shaping how scientific knowledge is produced, disseminated, validated, and reused across the research lifecycle. 
While the visibility of such ORI infrastructures is often assessed through citation-based metrics, in this study, we present a full-text, natural language processing (NLP) driven scientometric framework to systematically quantify the impact of ORI infrastructures beyond citation counts, using the LXCat platform for low temperature plasma (LTP) research as a representative case study. The modeling of LTPs and interpretation of LTP experiments rely heavily on accurate data, much of which is hosted on LXCat, a community-driven, open-access platform central to the LTP research ecosystem. To investigate the scholarly impact of the LXCat platform over the past decade, we analyzed a curated corpus of full-text research articles citing three foundational LXCat publications. We present a comprehensive pipeline that integrates chemical entity recognition, dataset and solver mention extraction, affiliation based geographic mapping and topic modeling to extract fine-grained patterns of data usage that reflect implicit research priorities, data practices, differential reliance on specific databases, evolving modes of data reuse and coupling within scientific workflows, and thematic evolution. Importantly, our proposed methodology is domain-agnostic and transferable to other ORI contexts, and highlights the utility of NLP in quantifying the role of scientific data infrastructures and offers a data-driven reflection on how open-access platforms like LXCat contribute to shaping research directions. This work presents a scalable scientometric framework that has the potential to support evidence based evaluation of ORI platforms and to inform infrastructure design, governance, sustainability, and policy for future development. 
2026-02-07T19:15:40Z Kalp Pandya Khushi Shah Nirmal Shah Nakshi Shah Bhaskar Chaudhury http://arxiv.org/abs/2506.18804v3 Breakthrough Asymmetries across Disciplines and Countries: A Network approach to Structural Complexity of Scientific Progress 2026-02-06T15:45:21Z Science is driven by community endeavors across diverse fields and specializations, forming a complex structure that renders conventional performance evaluation methods inadequate. Using established indicators, the network-based normalized citation score, and the disruptive index, combined with the GENEPY algorithm, we evaluate the complexity rank of countries based on their breakthrough performance across 89 subfields of physical sciences, drawing on nearly 60 million articles (1900-2023). This quality-focused integrated approach reveals pronounced asymmetries: while countries such as the United States, Israel, and several in Europe sustain long-term structural advantages, emerging nations show rapid gains in later decades. A power-law relationship between aggregated breakthrough performance and countries' R&D expenditure underscores the unequal and scale-dependent nature of global science. These results demonstrate that scientific advancement arises not from uniform growth but from asymmetric complexity, offering actionable insights for policymakers and funding agencies aiming to foster sustainable, high-quality research ecosystems. 2025-06-23T16:08:52Z 21 pages and 12 figures (including Supplementary information) Adarsh Raghuvanshi Hrishidev Unni Vinayak Anirban Chakraborti http://arxiv.org/abs/2510.19585v3 Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark 2026-02-06T13:01:47Z This paper presents a novel task of extracting low-resourced and noisy Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. 
The results demonstrate that reliable Latin detection with contemporary zero-shot models is achievable, yet these models lack a functional comprehension of Latin. This study establishes a comprehensive baseline for processing Latin within mixed-language corpora, supporting quantitative analysis in intellectual history and historical linguistics. Both the dataset and code are available at https://github.com/COMHIS/EACL26-detect-latin. 2025-10-22T13:37:52Z Accepted by the EACL 2026 main conference. Code and data available at https://github.com/COMHIS/EACL26-detect-latin Yu Wu Ke Shu Jonas Fischer Lidia Pivovarova David Rosson Eetu Mäkelä Mikko Tolonen http://arxiv.org/abs/2602.06607v1 Beyond Pairwise Distance: Cognitive Traversal Distance as a Holistic Measure of Scientific Novelty 2026-02-06T11:11:01Z Scientific novelty is a critical construct in bibliometrics and is commonly measured by aggregating pairwise distances between the knowledge units underlying a paper. While prior work has refined how such distances are computed, less attention has been paid to how dyadic relations are aggregated to characterize novelty at the paper level. We address this limitation by introducing a network-based indicator, Cognitive Traversal Distance (CTD). Conceptualizing the historical literature as a weighted knowledge network, CTD is defined as the length of the shortest path required to connect all knowledge units associated with a paper. CTD provides a paper-level novelty measure that reflects the minimal structural distance needed to integrate multiple knowledge units, moving beyond mean- or quantile-based aggregation of pairwise distances. Using 27 million biomedical publications indexed by OpenAlex and Medical Subject Headings (MeSH) as standardized knowledge units, we evaluate CTD against expert-based novelty benchmarks from F1000Prime-recommended papers and Nobel Prize-winning publications. CTD consistently outperforms conventional aggregation-based indicators. 
We further show that MeSH-based CTD is less sensitive to novelty driven by the emergence of entirely new conceptual labels, clarifying its scope relative to recent text-based measures. 2026-02-06T11:11:01Z Yi Xiang Pascal Welke Chengzhi Zhang Jian Wang http://arxiv.org/abs/2508.20747v2 An analysis of the effects of open science indicators on citations in the French Open Science Monitor 2026-02-06T10:35:33Z This study investigates the correlation of citation impact with various open science indicators (OSI) within the French Open Science Monitor (FOSM), a dataset comprising approximately 900,000 publications authored by French authors from 2020 to 2022. By integrating data from OpenAlex and Crossref, we analyze open science indicators such as the presence of a pre-print, data sharing, and software sharing in 576,537 publications in the FOSM dataset. Our analysis reveals a positive correlation between these OSI and citation counts. In our most complete citation prediction model, pre-prints are correlated with a significant positive effect of 19% on citation counts, software sharing with 13.5%, and data sharing with 14.3%. We find large variations in the correlations of OSI with citations across research disciplines, and observe that the open access status of publications is correlated with an 8.6% increase in citations in our model. While these results remain observational and are limited to the scope of the analysis, they suggest a consistent correlation between citation advantages and open science indicators. Our results may be valuable to policy makers, funding agencies, researchers, publishers, institutions, and other stakeholders who are interested in understanding the academic impacts, or effects, of open science practices. 
2025-08-28T13:07:50Z Giovanni Colavizza Lauren Cadwallader Iain Hrynaszkiewicz http://arxiv.org/abs/2602.06510v1 Implications of Russia's full-scale invasion of Ukraine for the international mobility of Ukrainian scholars 2026-02-06T09:00:49Z This study examines the implications of Russia's full-scale invasion of Ukraine for the international mobility of Ukrainian scholars. The dataset, drawn from the CWTS in-house Scopus database, includes Ukrainian scholars who were internationally mobile between 2020 and 2023. The analysis focuses on scholars affiliated with universities and the National Academy of Sciences of Ukraine (NASU) prior to moving abroad. The findings reveal an increase in the number of internationally mobile scholars in 2022-2023, driven primarily by rising mobility from universities. For NASU-affiliated scholars, Russia was the top destination country in 2020-2021 but fell to fourth place in 2022-2023, overtaken by Germany, China, and Poland. For university-affiliated scholars, Poland, Germany, and Russia consistently ranked as the top three destination countries across both periods. Statistical tests indicate no significant difference in mean Field-Normalized Citation Impact (FNCI) between scholars who were internationally mobile in 2020-2021 and those mobile in 2022-2023. However, the share of internationally mobile scholars with articles among the top 10% most cited globally increased among those previously affiliated with universities, while it declined among those affiliated with NASU. In both periods, the proportion of scholars with articles in the top 10% most cited globally, published during the five years prior to changing their country of affiliation, was higher among internationally mobile scholars than among those who remained affiliated with Ukrainian institutions. Whether this mobility constitutes a brain drain requires further research. 
If effectively leveraged, international mobility may strengthen Ukraine's integration into global scientific networks, support post-war recovery, and contribute to a more resilient, internationally connected, and competitive academic system. 2026-02-06T09:00:49Z Scientometrics, 130(11), 6109-6133 (2025) Myroslava Hladchenko 10.1007/s11192-025-05450-8 http://arxiv.org/abs/2602.05930v1 Compound Deception in Elite Peer Review: A Failure Mode Taxonomy of 100 Fabricated Citations at NeurIPS 2025 2026-02-05T17:43:35Z Large language models (LLMs) are increasingly used in academic writing workflows, yet they frequently hallucinate by generating citations to sources that do not exist. This study analyzes 100 AI-generated hallucinated citations that appeared in papers accepted by the 2025 Conference on Neural Information Processing Systems (NeurIPS), one of the world's most prestigious AI conferences. Despite review by 3-5 expert researchers per paper, these fabricated citations evaded detection, appearing in 53 published papers (approx. 1% of all accepted papers). We develop a five-category taxonomy that classifies hallucinations by their failure mode: Total Fabrication (66%), Partial Attribute Corruption (27%), Identifier Hijacking (4%), Placeholder Hallucination (2%), and Semantic Hallucination (1%). Our analysis reveals a critical finding: every hallucination (100%) exhibited compound failure modes. The distribution of secondary characteristics was dominated by Semantic Hallucination (63%) and Identifier Hijacking (29%), which often appeared alongside Total Fabrication to create a veneer of plausibility and false verifiability. These compound structures exploit multiple verification heuristics simultaneously, explaining why peer review fails to detect them. The distribution exhibits a bimodal pattern: 92% of contaminated papers contain 1-2 hallucinations (minimal AI use) while 8% contain 4-13 hallucinations (heavy reliance). 
These findings demonstrate that current peer review processes do not include effective citation verification and that the problem extends beyond NeurIPS to other major conferences, government reports, and professional consulting. We propose mandatory automated citation verification at submission as an implementable solution to prevent fabricated citations from becoming normalized in scientific literature. 2026-02-05T17:43:35Z Samar Ansari http://arxiv.org/abs/2602.05867v1 The Case of the Mysterious Citations 2026-02-05T16:46:27Z Mysterious citations are routinely appearing in peer-reviewed publications throughout the scientific community. In this paper, we develop an automated pipeline and examine the proceedings of four major high-performance computing conferences, comparing the accuracy of citations between the 2021 and 2025 proceedings. While none of the 2021 papers contained mysterious citations, every 2025 proceeding did, impacting 2-6% of published papers. In addition, we observe a sharp rise in paper title and authorship errors, motivating the need for stronger citation-verification practice. No author within our dataset acknowledged using AI to generate citations even though all four conference policies required it, indicating current policies are insufficient. 2026-02-05T16:46:27Z Amanda Bienz Carl Pearson Simon Garcia de Gonzalo http://arxiv.org/abs/2602.05836v1 An FWCI decomposition of Science Foundation Ireland funding 2026-02-05T16:23:52Z In response to the 2008 global financial crisis, Science Foundation Ireland (SFI), now Research Ireland, pivoted to research with potential socioeconomic impact. Given that the latter can encompass higher technology readiness levels, which typically correlate with lower academic impact, it is interesting to understand how academic impact holds up in SFI funded research. 
Here we decompose SFI \textit{Investigator Awards} - arguably the most academic funding call - into $3,243$ constituent publications and field weighted citation impact (FWCI) values searchable in the SCOPUS database. Given that citation counts are skewed, we highlight the limitation of FWCI as a paper metric, which naively restricts one to comparisons of average FWCI ($\overline{\mathrm{FWCI}}$) in large samples. Neglecting publications with $\textrm{FWCI} < 0.1$ ($8.8\%$), SFI funded publications are well approximated by a lognormal distribution with $\mu = -0.0761^{+0.017}_{-0.0039}$ and $\sigma = 0.933^{+0.011}_{-0.012}$ at the $95\%$ confidence level. This equates to an $\overline{\mathrm{FWCI}} = 1.433^{+0.029}_{-0.015}$, well above $\overline{\mathrm{FWCI}}=1$ internationally. Broken down by award, we correct $\overline{\mathrm{FWCI}}$ for small samples using simulations and find $\sim 67\%$ exceed \textit{median} international academic interest, thus exhibiting a positive correlation between the potential for socioeconomic impact and academic interest. 2026-02-05T16:23:52Z 7 pages, 5 figures, comments welcome Eoin Ó Colgáin http://arxiv.org/abs/2508.12735v2 Citation accuracy, citation noise, and citation bias: A foundation of citation analysis 2026-02-05T14:11:57Z Citation analysis is widely used in research evaluation to assess the impact of scientific papers. These analyses rest on the assumption that citation decisions by authors are accurate, representing the flow of knowledge from cited to citing papers. However, in practice, researchers often cite for reasons that are not related to the fact that there has been (intellectual) input from previous papers. Citations made for rhetorical reasons or without reading the cited work compromise the value of citations as an instrument for research evaluation. Past research on threats to the accuracy of citations has mainly focused on citation bias as the primary concern. 
In this paper, we argue that citation noise - the undesirable variance in citation decisions - represents an equally critical but underexplored challenge in citation analysis. We define and differentiate two types of citation noise: citation level noise and citation pattern noise. Each type of noise is described in terms of how it arises and the specific ways it can undermine the validity of citation-based research assessments. By conceptually differentiating citation noise from citation accuracy and citation bias, we propose a framework for the foundation of citation analysis. We discuss strategies and interventions to minimize citation noise, aiming to improve the reliability and validity of citation analysis in research evaluation. We recommend that the current professional reform movement in research evaluation, such as the Coalition for Advancing Research Assessment (CoARA), pick up these strategies and interventions as an additional building block for careful, responsible use of bibliometric indicators in research evaluation. 2025-08-18T09:01:03Z Lutz Bornmann Christian Leibel http://arxiv.org/abs/2502.19360v2 Sustaining Knowledge Infrastructures: Asking the Right Questions and Listening for Answers 2026-02-05T10:20:19Z Sustaining knowledge infrastructures (KIs) remains a persistent issue that requires continued engagement from diverse stakeholders as new questions and values arise in relation to KI maintenance. We draw on existing academic literature, practical experience with KI projects, and our discussions at a 2024 workshop for researchers and practitioners exploring KI evaluation to pose five questions for KI project managers to consider when thinking about how to make their KIs evolve sustainably over time. These questions include reflecting on sustainability throughout the life cycle of KIs, communicating evolving visions and values, engaging communities, right-sizing a KI, and developing an iterative process for decision-making. 
Reflecting on these themes, we suggest, can support KI stakeholders to evolve, not necessarily grow, to meet the needs and values of their communities. How these themes are discussed will necessarily vary by funding sources, disciplines, governance, communities, and other contextual factors. However, adopting a deliberate and strategic approach to KI sustainability and aligning the invisible infrastructural work of KI maintenance with the outward-facing institutional work is, we argue, relevant to all KIs. 2025-02-26T17:56:55Z Kathleen Gregory Jonathan Zurbach Kalpana Shankar Matthew Mayernik Malcolm Campbell Verduyn Louise Bezuidenhout Andrew Treloar http://arxiv.org/abs/2602.05211v1 Quantifying the Knowledge Proximity Between Academic and Industry Research: An Entity and Semantic Perspective 2026-02-05T02:12:47Z Academia and industry are characterized by a reciprocal shaping and dynamic feedback mechanism. Despite distinct institutional logics, they have adapted closely in collaborative publishing and talent mobility, demonstrating tension between institutional divergence and intensive collaboration. Existing studies on their knowledge proximity mainly rely on macro indicators such as the number of collaborative papers or patents, lacking an analysis of knowledge units in the literature. This has led to an insufficient grasp of fine-grained knowledge proximity between industry and academia, potentially undermining collaboration frameworks and resource allocation efficiency. To remedy this limitation, this study quantifies the trajectory of academia-industry co-evolution through fine-grained entities and semantic space. In the entity measurement part, we extract fine-grained knowledge entities via pre-trained models, measure sequence overlaps using cosine similarity, and analyze topological features through complex network analysis. 
At the semantic level, we employ unsupervised contrastive learning to quantify convergence in semantic spaces by measuring cross-institutional textual similarities. Finally, we use citation distribution patterns to examine correlations between bidirectional knowledge flows and similarity. Analysis reveals that knowledge proximity between academia and industry rises, particularly following technological change. This provides textual evidence of bidirectional adaptation in co-evolution. Additionally, academia's knowledge dominance weakens during technological paradigm shifts. The dataset and code for this paper can be accessed at https://github.com/tinierZhao/Academic-Industrial-associations. 2026-02-05T02:12:47Z Technological Forecasting & Social Change, 2026 Hongye Zhao Yi Zhao Chengzhi Zhang
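The entity-overlap step described in the abstract above (cosine similarity between academic and industry entity profiles) can be illustrated with a minimal sketch. This is not the authors' code: the entity lists, the `cosine_similarity` helper, and the use of raw mention counts as vector weights are all assumptions made for illustration. The paper's actual implementation is in the repository linked in the abstract.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse entity-count vectors."""
    # Only entities present in both profiles contribute to the dot product;
    # Counter returns 0 for missing keys without inserting them.
    dot = sum(count * b[entity] for entity, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical entity mentions for one year (illustrative values only).
academia = Counter(["transformer", "bert", "attention", "bert"])
industry = Counter(["transformer", "bert", "gpu", "latency"])

similarity = cosine_similarity(academia, industry)
print(f"entity-level proximity: {similarity:.3f}")
```

Computing this score per year for the two sectors would give a proximity time series; whether the paper weights entities by raw counts, TF-IDF, or some other scheme is not stated in the abstract.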