https://arxiv.org/api/NQNhYqr9YCvbSYTtFDl73VccnGs 2026-06-10T05:57:42Z 6061 120 15 http://arxiv.org/abs/2604.25487v1 A contemporary science map through the lens of IEEE and ACM periodicals 2026-04-28T10:44:27Z

ACM and IEEE are the two premier associations in computing and electrical/electronics engineering; they publish and organize, respectively, the great majority of the periodicals and conferences serving these disciplines. Science is a constantly evolving process, and these publication fora are expected to follow the trends. In this article, we focus on the periodicals published by the two associations and seek to detect and/or confirm contemporary science trends as reflected in the titles of recently established periodicals. Our study is qualitative rather than quantitative, aiming to reveal patterns that the reader can immediately comprehend and validate. Among the most notable patterns, we see a growing preference of both associations for the open access mode of publication; we also observe ACM's orientation toward AI-focused periodicals and, most importantly, a significant thematic overlap among periodicals of the same association, which holds for both ACM and IEEE.

2026-04-28T10:44:27Z George Margaritis Dionysios Kritsas Dimitrios Katsaros Yannis Manolopoulos http://arxiv.org/abs/2508.18090v2 Named Entity Recognition of Historical Texts via Large Language Model 2026-04-28T09:40:50Z

Large language models (LLMs) have demonstrated remarkable versatility across a wide range of natural language processing tasks and domains. One such task is Named Entity Recognition (NER), which involves identifying and classifying proper names in text, such as people, organizations, locations, dates, and other specific entities. NER plays a crucial role in extracting information from unstructured textual data, enabling downstream applications such as information retrieval from unstructured text. Traditionally, NER is addressed using supervised machine learning approaches, which require large amounts of annotated training data. However, historical texts present a unique challenge, as the annotated datasets are often scarce or nonexistent, due to the high cost and expertise required for manual labeling. In addition, the variability and noise inherent in historical language, such as inconsistent spelling and archaic vocabulary, further complicate the development of reliable NER systems for these sources. In this study, we explore the feasibility of applying LLMs to NER in historical documents using zero-shot and few-shot prompting strategies, which require little to no task-specific training data. Our experiments, conducted on the HIPE-2022 (Identifying Historical People, Places and other Entities) dataset, show that LLMs can achieve reasonably strong performance on NER tasks in this setting. While their performance falls short of fully supervised models trained on domain-specific annotations, the results are nevertheless promising. These findings suggest that LLMs offer a viable and efficient alternative for information extraction in low-resource or historically significant corpora, where traditional supervised methods are infeasible.

2025-08-25T14:52:11Z Shibingfeng Zhang Giovanni Colavizza http://arxiv.org/abs/2511.21745v2 AI-Augmented Bibliometric Framework: A Paradigm Shift with Agentic AI for Dynamic, Snippet-Based Research Analysis 2026-04-28T06:09:50Z

Our paper introduces a generative, multi-agent AI framework designed to overcome the rigidity, limited flexibility and technical barriers of current bibliometric tools. The objective is to enable researchers to perform fully dynamic, code-based scientometric analysis using natural language (NL) instructions, eliminating the need for specialized programming skills while expanding analytical depth. Methodologically, the system integrates four coordinated AI agents: a custom analytics generator, a full-paper retriever, including a Retrieval-Augmented Generation (RAG)-based researcher assistant and an automated report generator. User queries are translated into executable Python scripts, run within a sandbox ensuring safety, reproducibility and auditability. The framework supports automated data cleaning, construction of co-authorship and citation networks, temporal analyses, topic modeling, embedding-based clustering and synthesis of research gaps. Each analytical session produces an exportable, end-to-end report. The novelty lies in unifying NL-to-code scientometrics, multimodal full-paper retrieval, agentic exploration and dynamic metric creation in a single adaptive environment, capabilities absent in existing platforms such as VOSviewer, Bibliometrix and SciMAT. Unlike static GUI-based workflows, the proposed framework supports iterative what-if analysis, hybrid indicators and user-driven pipeline modification. Results demonstrate that the framework generates valid analysis scripts, retrieves and synthesizes full papers, identifies frontier themes and produces reproducible scientometric outputs. It establishes a new paradigm for accessible, interactive and extensible bibliometric knowledge.

2025-11-22T08:41:38Z Adela Bara Simona-Vasilica Oprea http://arxiv.org/abs/2604.25057v1 CiteRadar: A Citation Intelligence Platform for Researcher Profiling and Geographic Visualization 2026-04-27T23:12:48Z

Understanding the geographic reach and community structure of one's scholarly citations is increasingly valuable for career development, grant applications, and collaboration discovery -- yet accessible tools for answering these questions remain scarce. Existing bibliometric platforms either require costly institutional subscriptions or expose only aggregate citation counts without granular per-author metadata. We present CiteRadar, an open-source system that accepts a single Google Scholar user identifier and automatically produces a structured output folder containing: the author's complete publication list, all retrieved citing papers with enriched author metadata, two ranked author tables (by citation frequency and by h-index), a plain-text statistical summary, and a self-contained interactive HTML world map -- all from a single command-line invocation. CiteRadar integrates five heterogeneous data sources -- Google Scholar, OpenAlex, CrossRef, Semantic Scholar, and OpenStreetMap Nominatim -- through a carefully engineered five-stage pipeline. Key technical contributions include: (1) a Scholar meta-string parser resilient to Unicode non-breaking-space separators, a pervasive but undocumented quirk in Scholar's HTML that silently corrupts venue and year fields when unhandled; (2) a two-stage author disambiguation system using stop-word-filtered institution name similarity to guard against the well-known same-name entity-merging failure mode in bibliometric databases, demonstrated to eliminate h-index attribution errors of up to 9x the correct value; (3) an OpenAlex web-URL to API-URL conversion fix that raises the fraction of author records with city-level location data from 0% to ~60%; and (4) a logarithmically-scaled interactive Folium world map with per-city researcher popups, rendered as a fully self-contained HTML file.
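The non-breaking-space issue named in contribution (1) can be illustrated with a minimal Python sketch. It is not CiteRadar's actual parser: the function name `parse_meta`, the assumed "authors - venue, year - publisher" layout of the meta string, and the sample input are all hypothetical, chosen only to show why normalizing U+00A0 before splitting matters.

```python
import re

NBSP = "\u00a0"  # Unicode non-breaking space
YEAR = r"(1[5-9]\d{2}|20\d{2})"  # plausible publication years (assumption)


def parse_meta(meta: str) -> dict:
    """Split a hypothetical 'authors - venue, year - publisher' meta string."""
    # Normalize non-breaking spaces first; otherwise "\u00a0- " is not
    # recognized as a " - " separator and the fields run together.
    cleaned = meta.replace(NBSP, " ")
    parts = [p.strip() for p in cleaned.split(" - ")]
    authors = parts[0] if parts else ""
    venue_year = parts[1] if len(parts) > 1 else ""

    match = re.search(r"\b" + YEAR + r"\b", venue_year)
    year = int(match.group(1)) if match else None
    # Drop a trailing ", YYYY" so only the venue name remains.
    venue = re.sub(r",?\s*\b" + YEAR + r"\b\s*$", "", venue_year).strip()
    return {"authors": authors, "venue": venue or None, "year": year}


sample = "A Smith, B Jones\u00a0- Journal of Testing, 2020\u00a0- example.org"
record = parse_meta(sample)
```

Splitting the same sample on " - " without the normalization step yields a single undivided field, which is the silent corruption of venue and year that the abstract describes.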

2026-04-27T23:12:48Z Chenxu Niu Yiming Sun http://arxiv.org/abs/2508.09079v2 Exploring the Shape of Economics: A Multilayer Network Analysis of Social Communities and Intellectual Similarity Among Journals Before and After the 2008 Financial Crisis 2026-04-27T13:07:04Z

This paper develops a multilayer network approach for exploring the evolution of scientific disciplines, using the case of economics before and after the 2008 global financial crisis as a large-scale empirical testing ground. The units of analysis are journals, linked by social and intellectual relationships. The analysis covers all journals indexed in EconLit across three years (2006, 2012 and 2019). In the most recent year (2019), the dataset includes 909 journals, over 30,000 editorial board members, more than 260,000 authors, 134,000 articles, and nearly 2 million cited references. For each period, we model journals as connected in a four-layer multiplex network: the social relationships are based on shared editors (interlocking editorship) and shared authors (interlocking authorship), while the intellectual ones are based on shared references (bibliographic coupling) and textual similarity between articles. These four layers are integrated using Similarity Network Fusion to produce unified similarity networks from which journal communities are identified. Comparing the field across the three periods reveals a high degree of structural continuity. Although research topics changed after the crisis, the fundamental social and intellectual relationships among journals remained remarkably stable. A major result of the analysis is that editorial networks play the dominant role in shaping hierarchies and legitimizing knowledge production within the discipline. Whether this finding holds in other scientific disciplines remains an open question for future research.

2025-08-12T16:58:23Z 66 pages, 3 figures, 7 tables Alberto Baccini Lucio Barabesi Carlo Debernardi http://arxiv.org/abs/2602.12206v3 Making the complete OpenAIRE citation graph easily accessible through compact data representation 2026-04-27T12:28:30Z

The OpenAIRE graph contains a large citation graph dataset, with over 200 million publications and over 2 billion citations. The current graph is available as a dump with metadata which, when uncompressed, totals ~2.5 TB. This makes it hard to process on conventional computers. To make this network more accessible to the community, we provide a processed OpenAIRE graph that is downscaled to fit in 16 GB of RAM while preserving the full graph structure. In addition, we offer the processed data in a very simple format, which allows for straightforward further manipulation. We also provide (1) a Python pipeline, which can be used to process future releases of the OpenAIRE graph, and (2) a larger version of the dataset including more publication fields, such as the title and the list of authors.

2026-02-12T17:44:36Z This work has been funded by a grant from the Programme Johannes Amos Comenius under the Ministry of Education, Youth and Sports of the Czech Republic, CZ.02.01.01/00/23_025/0008711 Abstract updated to reflect new version Joakim Skarding Pavel Sanda 10.5334/johd.520 http://arxiv.org/abs/2604.23827v1 Are Digital Humanities really committed to open? An exploratory study on the availability of methodological workflows and open peer review practices 2026-04-26T18:13:22Z

Open Science has become a central framework for promoting transparency, accessibility, and inclusiveness in scholarly research. While the Digital Humanities (DH) community has long embraced openness in terms of research outputs, less attention seems to have been paid to the openness of the methodological and evaluative processes underlying knowledge production. This paper presents an exploratory study that investigates the current state of openness in DH research practices, focusing specifically on research data management documentation and peer review processes. In particular, this study addresses two research questions: (1) to what extent DH publications that describe data explicitly reference external documentation detailing data creation and management processes; and (2) how widely open peer review practices are adopted across DH conferences and journals. The results revealed a limited adoption of open methodological practices. Only a small fraction of the analysed articles provided explicit, reusable documentation of data creation workflows, and no references to data management plans or formal research data management documentation were found. An even more critical picture emerges from the analysis of peer review practices: the vast majority of DH venues continue to rely on traditional single- or double-blind review models, with open peer review adopted in only a few isolated cases.

2026-04-26T18:13:22Z Silvio Peroni http://arxiv.org/abs/2604.23820v1 The software space of science 2026-04-26T17:51:44Z

Science advances not only through the accumulation of facts but also through the evolution of tools. Crucially, tools are rarely used in isolation. They form tool portfolios, combinations shaped by a discipline's workflows and analytical demands. Software, near-ubiquitous in modern research and traceable across the published literature, offers a unique window to study tool use in science. Here, we map the software space of science by analyzing mentions of software in 1.3 million publications from 2004 to 2021. We construct a network of 520 software tools linked by disciplinary co-usage, with link strength weighted by proximity based on revealed comparative advantage. This network reveals a structured landscape in which tools cluster into 8 functional communities, including computing and statistics, wet lab instrumentation, and several bioinformatics specializations, with each discipline occupying a distinct position reflecting its characteristic tool portfolios. The breadth of a discipline's tool portfolio is shaped by the nature of its research workflow: fields combining experimental and computational tasks draw on multiple communities, while those with narrower methodological demands concentrate in one. These structural differences are stable across the observation period. At the same time, across all broad disciplinary categories, disciplinary tool portfolios are crystallizing, settling on a common set of tools.
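The link weighting mentioned in the abstract above can be illustrated with a short Python sketch. It assumes the standard definitions from the product-space literature, which the abstract does not spell out: RCA as a discipline's share of mentions of a tool divided by the tool's share of all mentions, and proximity as the minimum of the two conditional probabilities of co-usage among disciplines with RCA at or above 1. The function names, the threshold, and the toy counts are hypothetical.

```python
from collections import defaultdict


def rca(mentions):
    """mentions: {(discipline, tool): count} -> {(discipline, tool): RCA}."""
    disc_total = defaultdict(float)
    tool_total = defaultdict(float)
    total = 0.0
    for (disc, tool), count in mentions.items():
        disc_total[disc] += count
        tool_total[tool] += count
        total += count
    # RCA = (share of the tool within the discipline) / (share of the tool overall)
    return {
        (disc, tool): (count / disc_total[disc]) / (tool_total[tool] / total)
        for (disc, tool), count in mentions.items()
        if count > 0
    }


def proximity(rca_values, threshold=1.0):
    """Tool-tool proximity: min of P(a | b) and P(b | a) over disciplines."""
    users = defaultdict(set)  # tool -> disciplines using it with RCA >= threshold
    for (disc, tool), value in rca_values.items():
        if value >= threshold:
            users[tool].add(disc)
    tools = sorted(users)
    links = {}
    for i, a in enumerate(tools):
        for b in tools[i + 1:]:
            shared = len(users[a] & users[b])
            if shared:
                # min(shared/|a|, shared/|b|) equals shared / max(|a|, |b|)
                links[(a, b)] = shared / max(len(users[a]), len(users[b]))
    return links


toy_mentions = {("bio", "BLAST"): 8, ("bio", "R"): 2,
                ("stats", "R"): 9, ("stats", "BLAST"): 1}
toy_rca = rca(toy_mentions)
toy_links = proximity(toy_rca)
```

In the toy data each tool is over-represented in exactly one discipline, so no link is produced. A link appears only when at least one discipline has RCA at or above the threshold for both tools.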

2026-04-26T17:51:44Z Zhouming Wu Dakota Murray http://arxiv.org/abs/2604.23699v1 Beyond coauthorship: semantic structure and phantom collaborators in transportation research, 1967--2025 2026-04-26T13:26:42Z

We present a semantic-structural atlas of transportation research built from 120,323 papers across 34 peer-reviewed journals published between 1967 and 2025, roughly an order of magnitude larger than and a decade beyond Sun and Rahwan's (2017) coauthorship study. We use OpenAlex and Crossref as open, CC0-licensed data sources, resolve author identity through OpenAlex author IDs, ORCID records, and manual alias resolution, and embed every paper with SPECTER2 with Arora-style whitening concatenated with concept TF-IDF and venue linear-discriminant projections. On this substrate we report three findings. First, Leiden on the author-level semantic k-nearest-neighbor graph yields 23 topic communities that agree only weakly with the 172 coauthor communities (normalized mutual information 0.23), opening room for a predictive layer that neither source encodes alone. Second, a multiplex Leiden partition combining both edge types recovers 181 communities and localizes where collaboration and topic structure decouple. Third -- the paper's core methodological contribution -- we define phantom collaborators, pairs of authors who are top-K semantic neighbors yet at least 3 hops apart in the coauthor graph, and show via a temporal hold-out (training cutoff 2019) that phantom pairs become real coauthors in 2020-2025 at a rate 16 to 33 times above random, popularity-weighted, and same-venue baselines, with a 68-fold monotone gradient between the highest- and lowest-similarity buckets. All artifacts are released as a live, reproducible web atlas at https://choi-seongjin.github.io/transport-atlas/.
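The phantom-collaborator definition in the abstract above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the top-K semantic neighbor lists are taken as given, a pair qualifies if either author lists the other as a neighbor, and authors with no connecting coauthor path are treated as at least 3 hops apart. All function and variable names are hypothetical.

```python
from collections import deque


def hop_distance(graph, source, target, cutoff):
    """Breadth-first search; shortest hop count if <= cutoff, else None."""
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == cutoff:
            continue  # do not expand beyond the cutoff radius
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None


def phantom_pairs(semantic_neighbors, coauthors, min_hops=3):
    """Pairs that are semantic neighbors but >= min_hops apart as coauthors."""
    pairs = set()
    for author, neighbors in semantic_neighbors.items():
        for other in neighbors:
            if other == author:
                continue
            # A path of at most min_hops - 1 hops means the pair is too close.
            if hop_distance(coauthors, author, other, min_hops - 1) is None:
                pairs.add(tuple(sorted((author, other))))
    return pairs


# Toy coauthor chain a - b - c - d (undirected adjacency lists).
coauthors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
semantic_neighbors = {"a": ["d", "b", "c"], "d": ["a"], "b": ["c"]}
found = phantom_pairs(semantic_neighbors, coauthors)
```

In the toy chain only the pair (a, d) qualifies, since a and d are exactly three hops apart while every other listed pair is within two hops.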

2026-04-26T13:26:42Z Seongjin Choi http://arxiv.org/abs/2604.23430v1 Automating Categorization of Scientific Texts with In-Context Learning and Prompt-Chaining in Large Language Models 2026-04-25T19:52:21Z

The relentless expansion of scientific literature presents significant challenges for navigation and knowledge discovery. Within Research Information Retrieval, established tasks such as text summarization and classification remain crucial for enabling researchers and practitioners to effectively navigate this vast landscape, and efforts have increasingly focused on developing advanced research information systems. These systems aim not only to provide standard keyword-based search functionalities but also to incorporate capabilities for automatic content categorization within knowledge-intensive organizations across academia and industry. This study systematically evaluates the performance of off-the-shelf Large Language Models (LLMs) in analyzing scientific texts according to a given classification scheme. We utilized the hierarchical ORKG taxonomy as a classification framework, employing the FORC dataset as ground truth. We investigated the effectiveness of advanced prompt engineering strategies, namely In-Context Learning (ICL) and Prompt Chaining, and experimentally explored the influence of the LLMs' temperature hyperparameter on classification accuracy. Our experiments demonstrate that Prompt Chaining yields superior classification accuracy compared to pure ICL, particularly when applied to the nested structure of the ORKG taxonomy. LLMs with prompt chaining outperform the state-of-the-art models for domain (1st level) prediction and show even better performance for subject (2nd level) prediction compared to the older BERT model. However, LLMs are not yet able to perform well in classifying the topic (3rd level) of research areas based on this specific hierarchical taxonomy, as they only reach about 50% accuracy even with prompt chaining.

2026-04-25T19:52:21Z 25 pages Gautam Kishore Shahi Oliver Hummel http://arxiv.org/abs/2601.15485v2 The Rise of Large Language Models and the Direction and Impact of US Federal Research Funding 2026-04-25T17:03:43Z

Federal research funding shapes the direction, diversity, and impact of the US scientific enterprise. Large language models (LLMs) are rapidly diffusing into scientific practice, holding substantial promise while raising widespread concerns. Despite growing attention to AI use in scientific writing and evaluation, little is known about how the rise of LLMs is reshaping the public funding landscape. Here, we examine LLM involvement at key stages of the federal funding pipeline by combining two complementary data sources: confidential National Science Foundation (NSF) and National Institutes of Health (NIH) proposal submissions from two large US R1 universities, including funded, unfunded, and pending proposals, and the full population of publicly released NSF and NIH awards. We find that LLM use rises sharply beginning in 2023 and exhibits a bimodal distribution, indicating a clear split between minimal and substantive use. Across both private submissions and public awards, higher LLM involvement is consistently associated with lower semantic distinctiveness, positioning projects closer to recently funded work within the same agency. The consequences of this shift are agency-dependent. LLM use is positively associated with proposal success and higher subsequent publication output at NIH, whereas no comparable associations are observed at NSF. Notably, the productivity gains at NIH are concentrated in non-hit papers rather than the most highly cited work. Together, these findings provide large-scale evidence that the rise of LLMs is reshaping how scientific ideas are positioned, selected, and translated into publicly funded research, with implications for portfolio governance, research diversity, and the long-run impact of science.

2026-01-21T21:37:08Z 41 pages, 23 figures, 12 tables Yifan Qian Zhe Wen Alexander C. Furnas Yue Bai Erzhuo Shao Dashun Wang http://arxiv.org/abs/2508.20747v3 An analysis of the effects of open science indicators on citations in the French Open Science Monitor 2026-04-24T14:46:20Z

This study investigates the correlation of citation impact with various open science indicators (OSI) within the French Open Science Monitor (FOSM), a dataset comprising approximately 900,000 publications authored by French authors from 2020 to 2022. By integrating data from OpenAlex and Crossref, we analyze open science indicators such as the presence of a pre-print, data sharing, and software sharing in 576,537 publications in the FOSM dataset. Our analysis reveals a positive correlation between these OSI and citation counts. Considering our most complete citation prediction model, we find that pre-prints are correlated with a significant positive effect of 19% on citation counts, software sharing with 13.5%, and data sharing with 14.3%. We find large variations in the correlations of OSI with citations across research disciplines, and observe that the open access status of publications is correlated with an 8.6% increase in citations in our model. While these results remain observational and are limited to the scope of the analysis, they suggest a consistent correlation between citation advantages and open science indicators. Our results may be valuable to policy makers, funding agencies, researchers, publishers, institutions, and other stakeholders who are interested in understanding the academic impacts, or effects, of open science practices.

2025-08-28T13:07:50Z Giovanni Colavizza Lauren Cadwallader Iain Hrynaszkiewicz http://arxiv.org/abs/2604.22539v1 Evolving Thematic Map Design in Academic Cartography: A Thirty-Year Study Based on Multilingual Journals 2026-04-24T13:28:11Z

Thematic maps play a central role in academic communication, yet their large-scale design evolution has rarely been examined empirically. This study presents a longitudinal and multilingual analysis of thematic map design practices in academic cartography from 1990 to 2020. We compile a corpus of 45,732 research articles from sixteen authoritative Chinese- and English-language journals and extract 23,928 maps using computer vision and large-model-based document parsing to build a structured dataset. Map design characteristics are quantified across three dimensions: map elements, color design, and layout structure. Results show that Chinese- and English-language academic maps share highly similar structural conventions, typically employing restrained color palettes with neutral dominant hues, low saturation, high brightness, and limited hue diversity, as well as centered layouts with high main-map occupation ratios. Differences exist in that English-language maps show slightly greater hue richness and compactness, whereas Chinese-language maps historically rely more on neutral hues and integrated layouts. Temporal analysis reveals parallel evolutionary trends in both groups, including increasing element richness, legend usage, and hue diversity, alongside stable layout structures. Overall, the findings suggest that academic map design evolution is characterized more by institutional convergence than cultural divergence.

2026-04-24T13:28:11Z Zhiwei Wei Chenxi Song Tazhu Wang Fan Wu Hua Liao Su Ding Nai Yang http://arxiv.org/abs/2604.22458v1 Opening Pandora's box: Paper mills in conference proceedings 2026-04-24T11:22:45Z

Paper mills are a growing threat to the integrity of science, yet their penetration in conference proceedings remains underexplored despite conferences being more important than journals in some scientific subfields. This study aims to identify papers in conference proceedings whose titles have been offered for sale on social media platforms. We collected more than 4,000 unique publication offers from more than 200 social media channels and used semi-automated methods along with human assessment to match offers with papers published in IEEE conference proceedings. We identified 1,720 papers in 286 IEEE conference proceedings, accounting for up to 23.51% of an individual conference. These problematic papers are co-authored by more than 6,500 researchers from over 3,500 affiliations in 55 countries. The identified papers demonstrate collaboration anomalies, high diversity of affiliations per paper, citation manipulation, a predominance of six-author papers, and content-based irregularities. Our findings show that paper mills are a large, organized, and often public market that commercializes scientific misconduct, not limited to papers, but infiltrating multiple parts of the research ecosystem.

2026-04-24T11:22:45Z Anna Abalkina Marie Kunešová Yagmur Ozturk Solal Pirelli http://arxiv.org/abs/2312.16523v2 Mapping bibliographic metadata collections: the case of OpenCitations Meta and OpenAlex 2026-04-24T10:38:18Z

This study describes the methodology and analyses the results of the process of mapping entities between two large open bibliographic metadata collections, OpenCitations Meta and OpenAlex. The primary objective of this mapping is to integrate OpenAlex internal identifiers into the existing metadata of bibliographic resources in OpenCitations Meta, thereby interlinking and aligning these collections. Furthermore, analysing the output of the mapping provides a unique perspective on the consistency and accuracy of bibliographic metadata, offering a valuable tool for identifying potential inconsistencies in the processed data.

2023-12-27T11:04:13Z Elia Rizzetto Silvio Peroni