AnalyticsGPT: An LLM Workflow for Scientometric Question Answering

2026-02-10T14:23:55Z

This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the "science of science." When compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase, namely the need for named-entity recognition of academic entities within questions and multi-faceted data retrieval involving scientometric indices, e.g., impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition, planning, and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: https://github.com/lyvykhang/llm-agents-scientometric-qa/tree/acl.

AgentCAT: An LLM Agent for Extracting and Analyzing Catalytic Reaction Data from Chemical Engineering Literature

2026-02-10T04:30:11Z

This paper presents a large language model (LLM) agent named AgentCAT, which extracts and analyzes catalytic reaction data from chemical engineering papers. AgentCAT serves as an alternative to overcome the long-standing data bottleneck in the chemical engineering field, and its natural-language-based interactive data analysis functionality is friendly to the community. AgentCAT also presents a formal abstraction and challenge analysis of the catalytic reaction data extraction task in an artificial intelligence-friendly manner. This abstraction would help the artificial intelligence community understand this problem and in turn would attract more attention to address it. Technically, the complex catalytic process leads to a complicated dependency structure in catalytic reaction data with respect to elementary reaction steps, molecular behaviors, measurement evidence, etc. This dependency structure makes it challenging to guarantee the correctness and completeness of data extraction, as well as to represent the extracted data for analysis. AgentCAT addresses this challenge and makes four technical contributions: (1) a schema-governed extraction pipeline with progressive schema evolution, enabling robust data extraction from chemical engineering papers; (2) a dependency-aware reaction-network knowledge graph that links catalysts/active sites, synthesis-derived descriptors, mechanistic claims with evidence, and macroscopic outcomes, preserving process coupling and traceability; (3) a general querying module that supports natural-language exploration and visualization over the constructed graph for cross-paper analysis; (4) an evaluation on $\sim$800 peer-reviewed chemical engineering publications demonstrating the effectiveness of AgentCAT.

LitBench: A Graph-Centric Large Language Model Benchmarking Tool For Literature Tasks

2026-02-10T04:12:29Z

While large language models (LLMs) have become the de facto framework for literature-related tasks, they still struggle to function as domain-specific literature agents due to their inability to connect pieces of knowledge and reason across domain-specific contexts, terminologies, and nomenclatures. This challenge underscores the need for a tool that facilitates such domain-specific adaptation and enables rigorous benchmarking across literature tasks. To that end, we introduce LitBench, a benchmarking tool designed to enable the development and evaluation of domain-specific LLMs tailored to literature-related tasks. At its core, LitBench uses a data curation process that generates domain-specific literature sub-graphs and constructs training and evaluation datasets based on the textual attributes of the resulting nodes and edges. The tool is designed for flexibility, supporting the curation of literature graphs across any domain chosen by the user, whether high-level fields or specialized interdisciplinary areas. In addition to dataset curation, LitBench defines a comprehensive suite of literature tasks, ranging from node and edge level analyses to advanced applications such as related work generation. These tasks enable LLMs to internalize domain-specific knowledge and relationships embedded in the curated graph during training, while also supporting rigorous evaluation of model performance. Our results show that small domain-specific LLMs trained and evaluated on LitBench datasets achieve competitive performance compared to state-of-the-art models like GPT-4o and DeepSeek-R1. To enhance accessibility and ease of use, we open-source the tool along with an AI agent tool that streamlines data curation, model training, and evaluation.

Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences

2026-02-09T06:41:45Z

The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system.

Recall, Risk, and Governance in Automated Proposal Screening for Research Funding: Evidence from a National Funding Programme

2026-02-08T08:49:08Z

Research funding agencies are increasingly exploring automated tools to support early-stage proposal screening. Recent advances in large language models (LLMs) have generated optimism regarding their use for text-based evaluation, yet their institutional suitability for high-stakes screening decisions remains underexplored. In particular, there is limited empirical evidence on how automated screening systems perform when evaluated against institutional error costs. This study compares two automated approaches for proposal screening against the priorities of a national funding call: A transparent, rule-based method using term frequency-inverse document frequency (TF-IDF) with domain-specific keyword engineering, and a semantic classification approach based on a large language model. Using selection committee decisions as ground truth for 959 proposals, we evaluate performance with particular attention to error structure. The results show that the TF-IDF-based approach outperforms the LLM-based system across standard metrics, achieving substantially higher recall (78.95\% vs 45.82\%) and producing far fewer false negatives (68 vs 175). The LLM-based system excludes more than half of the proposals ultimately selected by the committee. While false positives can be corrected through subsequent peer review, false negatives represent an irrecoverable exclusion from expert evaluation. By foregrounding error asymmetry and institutional context, this study demonstrates that the suitability of automated screening systems depends not on model sophistication alone, but on how their error profiles, transparency, and auditability align with research evaluation practice. These findings suggest that evaluation design and error tolerance should guide the use of AI-assisted screening tools in research funding more broadly.
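The reported recall and false-negative figures can be cross-checked with a few lines of arithmetic. The sketch below assumes 323 committee-selected proposals; that number is not stated in the abstract and is inferred here only because it is the count for which both recall values and both false-negative counts agree. The function names are illustrative.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN): the share of selected proposals that were kept."""
    return true_positives / (true_positives + false_negatives)


# ASSUMPTION: 323 selected proposals is inferred from the reported figures,
# not stated in the abstract.
SELECTED = 323


def recall_from_false_negatives(false_negatives: int, selected: int = SELECTED) -> float:
    """Derive recall when only the false-negative count and the positive total are known."""
    true_positives = selected - false_negatives
    return recall(true_positives, false_negatives)


tfidf_recall = recall_from_false_negatives(68)   # TF-IDF approach: 68 false negatives
llm_recall = recall_from_false_negatives(175)    # LLM approach: 175 false negatives

print(f"TF-IDF recall: {tfidf_recall:.2%}")
print(f"LLM recall:    {llm_recall:.2%}")
print(f"LLM share of selected proposals excluded: {175 / SELECTED:.1%}")
```

Under that assumption the two recall values match the reported 78.95% and 45.82%, and the LLM-based system's exclusion share comes out above one half, in line with the abstract's statement.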

In which fields do ChatGPT scores align better than citations with research quality?

2026-02-08T06:23:23Z

Although citation-based indicators are widely used for research evaluation, they are not useful for recently published research, reflect only one of the three common dimensions of research quality, and have little value in some social sciences, arts and humanities. Large Language Models (LLMs) have been shown to address some of these weaknesses, with ChatGPT-4o mini showing the most promising results, although on incomplete data. This article reports by far the largest scale evaluation of ChatGPT-4o mini yet and also evaluates its larger sibling ChatGPT-4o and ChatGPT-5 mini. Based on comparisons between LLM scores, averaged over 5 repetitions, and departmental average quality scores for 107,212 UK-based refereed journal articles, ChatGPT-4o is marginally better than ChatGPT-4o mini in most of the 34 field-based Units of Assessment (UoAs) tested, although combining both gives better results than either one. ChatGPT-4o scores have a positive correlation with research quality in 33 of the 34 UoAs, with the results being statistically significant in 31. The most substantial exception is Physics, for which citations are more useful. ChatGPT-4o scores had a higher correlation with research quality than long term citation rates in 21 out of 34 UoAs and a higher correlation than short term citation rates in 26 out of 34 UoAs. ChatGPT-5 mini has even stronger correlations overall. In summary, the results give the first large scale evidence that ChatGPT-4o and ChatGPT-5 mini are competitive with citations as new research quality indicator sources.

Assessing the impact of Open Research Information Infrastructures using NLP driven full-text Scientometrics: A case study of the LXCat open-access platform

2026-02-07T19:15:40Z

Open research information (ORI) infrastructures play a central role in shaping how scientific knowledge is produced, disseminated, validated, and reused across the research lifecycle. While the visibility of such ORI infrastructures is often assessed through citation-based metrics, in this study, we present a full-text, natural language processing (NLP)-driven scientometric framework to systematically quantify the impact of ORI infrastructures beyond citation counts, using the LXCat platform for low temperature plasma (LTP) research as a representative case study. The modeling of LTPs and interpretation of LTP experiments rely heavily on accurate data, much of which is hosted on LXCat, a community-driven, open-access platform central to the LTP research ecosystem. To investigate the scholarly impact of the LXCat platform over the past decade, we analyzed a curated corpus of full-text research articles citing three foundational LXCat publications. We present a comprehensive pipeline that integrates chemical entity recognition, dataset and solver mention extraction, affiliation-based geographic mapping, and topic modeling to extract fine-grained patterns of data usage that reflect implicit research priorities, data practices, differential reliance on specific databases, evolving modes of data reuse and coupling within scientific workflows, and thematic evolution. Importantly, our proposed methodology is domain-agnostic and transferable to other ORI contexts; it highlights the utility of NLP in quantifying the role of scientific data infrastructures and offers a data-driven reflection on how open-access platforms like LXCat contribute to shaping research directions. This work presents a scalable scientometric framework that has the potential to support evidence-based evaluation of ORI platforms and to inform infrastructure design, governance, sustainability, and policy for future development.

Breakthrough Asymmetries across Disciplines and Countries: A Network approach to Structural Complexity of Scientific Progress

2026-02-06T15:45:21Z

Science is driven by community endeavors across diverse fields and specializations, forming a complex structure that renders conventional performance evaluation methods inadequate. Using established indicators, the network-based normalized citation score, and the disruptive index, combined with the GENEPY algorithm, we evaluate the complexity rank of countries based on their breakthrough performance across 89 subfields of physical sciences, drawing on nearly 60 million articles (1900-2023). This quality-focused integrated approach reveals pronounced asymmetries: while countries such as the United States, Israel, and several in Europe sustain long-term structural advantages, emerging nations show rapid gains in later decades. A power-law relationship between aggregated breakthrough performance and countries' R&D expenditure underscores the unequal and scale-dependent nature of global science. These results demonstrate that scientific advancement arises not from uniform growth but from asymmetric complexity, offering actionable insights for policymakers and funding agencies aiming to foster sustainable, high-quality research ecosystems.

Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark

2026-02-06T13:01:47Z

This paper presents a novel task of extracting low-resourced and noisy Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary zero-shot models is achievable, yet these models lack a functional comprehension of Latin. This study establishes a comprehensive baseline for processing Latin within mixed-language corpora, supporting quantitative analysis in intellectual history and historical linguistics. Both the dataset and code are available at https://github.com/COMHIS/EACL26-detect-latin.

Beyond Pairwise Distance: Cognitive Traversal Distance as a Holistic Measure of Scientific Novelty

2026-02-06T11:11:01Z

Scientific novelty is a critical construct in bibliometrics and is commonly measured by aggregating pairwise distances between the knowledge units underlying a paper. While prior work has refined how such distances are computed, less attention has been paid to how dyadic relations are aggregated to characterize novelty at the paper level. We address this limitation by introducing a network-based indicator, Cognitive Traversal Distance (CTD). Conceptualizing the historical literature as a weighted knowledge network, CTD is defined as the length of the shortest path required to connect all knowledge units associated with a paper. CTD provides a paper-level novelty measure that reflects the minimal structural distance needed to integrate multiple knowledge units, moving beyond mean- or quantile-based aggregation of pairwise distances. Using 27 million biomedical publications indexed by OpenAlex and Medical Subject Headings (MeSH) as standardized knowledge units, we evaluate CTD against expert-based novelty benchmarks from F1000Prime-recommended papers and Nobel Prize-winning publications. CTD consistently outperforms conventional aggregation-based indicators. We further show that MeSH-based CTD is less sensitive to novelty driven by the emergence of entirely new conceptual labels, clarifying its scope relative to recent text-based measures.
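To make the idea of a traversal distance concrete, here is a minimal sketch under one possible reading of the definition: the shortest open route through a weighted knowledge network that visits every knowledge unit of a paper. The exact formulation in the paper may differ (for example, a Steiner-tree style connection). The toy graph, its weights, and all function names are hypothetical, and the brute-force search is only practical for a handful of units.

```python
from itertools import permutations

INF = float("inf")


def all_pairs_shortest_paths(nodes, edges):
    """Floyd-Warshall on an undirected weighted graph given as {(u, v): weight}."""
    dist = {u: {v: (0.0 if u == v else INF) for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def traversal_distance(units, dist):
    """Length of the shortest route visiting every unit once, in any order.

    ASSUMPTION: this brute-force route search is one illustrative reading of
    'shortest path required to connect all knowledge units'.
    """
    units = list(units)
    if len(units) < 2:
        return 0.0
    best = INF
    for order in permutations(units):
        length = sum(dist[a][b] for a, b in zip(order, order[1:]))
        best = min(best, length)
    return best


# Hypothetical toy network: nodes stand in for knowledge units (e.g. MeSH terms),
# weights for distances derived from the historical literature.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"): 1, ("B", "C"): 2, ("A", "C"): 5, ("C", "D"): 1}

dist = all_pairs_shortest_paths(nodes, edges)
ctd = traversal_distance(["A", "C", "D"], dist)
print("traversal distance:", ctd)
```

In this toy case the route A, C, D uses the indirect A-B-C connection (length 3) rather than the direct A-C edge (length 5). The mean of the three pairwise shortest-path distances (3, 1, and 4) would instead be about 2.67, which illustrates how a traversal-based measure differs from pairwise aggregation.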

Implications of Russia's full-scale invasion of Ukraine for the international mobility of Ukrainian scholars

2026-02-06T09:00:49Z

This study examines the implications of Russia's full-scale invasion of Ukraine for the international mobility of Ukrainian scholars. The dataset, drawn from the CWTS in-house Scopus database, includes Ukrainian scholars who were internationally mobile between 2020 and 2023. The analysis focuses on scholars affiliated with universities and the National Academy of Sciences of Ukraine (NASU) prior to moving abroad. The findings reveal an increase in the number of internationally mobile scholars in 2022-2023, driven primarily by rising mobility from universities. For NASU-affiliated scholars, Russia was the top destination country in 2020-2021 but fell to fourth place in 2022-2023, overtaken by Germany, China, and Poland. For university-affiliated scholars, Poland, Germany, and Russia consistently ranked as the top three destination countries across both periods. Statistical tests indicate no significant difference in mean Field-Weighted Citation Impact (FNCI) between scholars who were internationally mobile in 2020-2021 and those mobile in 2022-2023. However, the share of internationally mobile scholars with articles among the top 10% most cited globally increased among those previously affiliated with universities, while it declined among those affiliated with NASU. In both periods, the proportion of scholars with articles in the top 10% most cited globally, published during the five years prior to changing their country of affiliation, was higher among internationally mobile scholars than among those who remained affiliated with Ukrainian institutions. Whether this mobility constitutes a brain drain requires further research. If effectively leveraged, international mobility may strengthen Ukraine's integration into global scientific networks, support post-war recovery, and contribute to a more resilient, internationally connected, and competitive academic system.

Compound Deception in Elite Peer Review: A Failure Mode Taxonomy of 100 Fabricated Citations at NeurIPS 2025

2026-02-05T17:43:35Z

Large language models (LLMs) are increasingly used in academic writing workflows, yet they frequently hallucinate by generating citations to sources that do not exist. This study analyzes 100 AI-generated hallucinated citations that appeared in papers accepted by the 2025 Conference on Neural Information Processing Systems (NeurIPS), one of the world's most prestigious AI conferences. Despite review by 3-5 expert researchers per paper, these fabricated citations evaded detection, appearing in 53 published papers (approx. 1% of all accepted papers). We develop a five-category taxonomy that classifies hallucinations by their failure mode: Total Fabrication (66%), Partial Attribute Corruption (27%), Identifier Hijacking (4%), Placeholder Hallucination (2%), and Semantic Hallucination (1%). Our analysis reveals a critical finding: every hallucination (100%) exhibited compound failure modes. The distribution of secondary characteristics was dominated by Semantic Hallucination (63%) and Identifier Hijacking (29%), which often appeared alongside Total Fabrication to create a veneer of plausibility and false verifiability. These compound structures exploit multiple verification heuristics simultaneously, explaining why peer review fails to detect them. The distribution exhibits a bimodal pattern: 92% of contaminated papers contain 1-2 hallucinations (minimal AI use) while 8% contain 4-13 hallucinations (heavy reliance). These findings demonstrate that current peer review processes do not include effective citation verification and that the problem extends beyond NeurIPS to other major conferences, government reports, and professional consulting. We propose mandatory automated citation verification at submission as an implementable solution to prevent fabricated citations from becoming normalized in scientific literature.

The Case of the Mysterious Citations

2026-02-05T16:46:27Z

Mysterious citations are routinely appearing in peer-reviewed publications throughout the scientific community. In this paper, we develop an automated pipeline and examine the proceedings of four major high-performance computing conferences, comparing the accuracy of citations between the 2021 and 2025 proceedings. While none of the 2021 papers contained mysterious citations, all four 2025 proceedings did, impacting 2-6\% of published papers. In addition, we observe a sharp rise in paper title and authorship errors, motivating the need for stronger citation-verification practices. No author within our dataset acknowledged using AI to generate citations even though all four conference policies required such disclosure, indicating current policies are insufficient.

An FWCI decomposition of Science Foundation Ireland funding

2026-02-05T16:23:52Z

In response to the 2008 global financial crisis, Science Foundation Ireland (SFI), now Research Ireland, pivoted to research with potential socioeconomic impact. Given that the latter can encompass higher technology readiness levels, which typically correlate with lower academic impact, it is interesting to understand how academic impact holds up in SFI funded research. Here we decompose SFI \textit{Investigator Awards} - arguably the most academic funding call - into $3,243$ constituent publications and field weighted citation impact (FWCI) values searchable in the SCOPUS database. Given that citation counts are skewed, we highlight the limitation of FWCI as a paper metric, which naively restricts one to comparisons of average FWCI ($\overline{\mathrm{FWCI}}$) in large samples. Neglecting publications with $\mathrm{FWCI} < 0.1$ ($8.8\%$), SFI funded publications are well approximated by a lognormal distribution with $\mu = -0.0761^{+0.017}_{-0.0039}$ and $\sigma = 0.933^{+0.011}_{-0.012}$ at the $95\%$ confidence level. This equates to an $\overline{\mathrm{FWCI}} = 1.433^{+0.029}_{-0.015}$, well above $\overline{\mathrm{FWCI}}=1$ internationally. Broken down by award, we correct $\overline{\mathrm{FWCI}}$ for small samples using simulations and find $\sim 67\%$ exceed \textit{median} international academic interest, thus exhibiting a positive correlation between the potential for socioeconomic impact and academic interest.
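The step from the fitted lognormal parameters to an average FWCI can be illustrated with the standard lognormal identities, mean $= e^{\mu + \sigma^2/2}$ and median $= e^{\mu}$. The sketch below plugs in the central values quoted above; it assumes the reported average is the mean of the fitted distribution and ignores the quoted uncertainty ranges, so it only approximates the published figure.

```python
import math


def lognormal_mean(mu: float, sigma: float) -> float:
    """Mean of a lognormal distribution: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


def lognormal_median(mu: float) -> float:
    """Median of a lognormal distribution: exp(mu)."""
    return math.exp(mu)


# Central values of the fit quoted in the abstract (uncertainties ignored here).
MU = -0.0761
SIGMA = 0.933

mean_fwci = lognormal_mean(MU, SIGMA)
median_fwci = lognormal_median(MU)

print(f"mean FWCI of fitted lognormal:   {mean_fwci:.3f}")
print(f"median FWCI of fitted lognormal: {median_fwci:.3f}")
```

The mean comes out at roughly 1.43, close to the reported average, while the median falls below 1. The gap between the two reflects the skew of citation counts, which is why averages and medians need to be compared with care.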

Citation accuracy, citation noise, and citation bias: A foundation of citation analysis

2026-02-05T14:11:57Z

Citation analysis is widely used in research evaluation to assess the impact of scientific papers. These analyses rest on the assumption that citation decisions by authors are accurate, representing the flow of knowledge from cited to citing papers. However, in practice, researchers often cite for reasons that are not related to the fact that there has been (intellectual) input from previous papers. Citations made for rhetorical reasons or without reading the cited work compromise the value of citations as an instrument for research evaluation. Past research on threats to the accuracy of citations has mainly focused on citation bias as the primary concern. In this paper, we argue that citation noise - the undesirable variance in citation decisions - represents an equally critical but underexplored challenge in citation analysis. We define and differentiate two types of citation noise: citation level noise and citation pattern noise. Each type of noise is described in terms of how it arises and the specific ways it can undermine the validity of citation-based research assessments. By conceptually differentiating citation noise from citation accuracy and citation bias, we propose a framework for the foundation of citation analysis. We discuss strategies and interventions to minimize citation noise, aiming to improve the reliability and validity of citation analysis in research evaluation. We recommend that the current professional reform movement in research evaluation, such as the Coalition for Advancing Research Assessment (CoARA), pick up these strategies and interventions as an additional building block for careful, responsible use of bibliometric indicators in research evaluation.