http://arxiv.org/abs/2505.13276v1 CHAD-KG: A Knowledge Graph for Representing Cultural Heritage Objects and Digitisation Paradata 2025-05-19T15:59:17Z

This paper presents CHAD-KG, a knowledge graph designed to describe bibliographic metadata and digitisation paradata of cultural heritage objects in exhibitions, museums, and collections. It also documents the related data model and materialisation engine. Originally based on two tabular datasets, the data was converted into RDF according to CHAD-AP, an OWL application profile built on standards like CIDOC-CRM, LRMoo, CRMdig, and Getty AAT. A reproducible pipeline, developed with a Morph-KGC extension, was used to generate the graph. CHAD-KG now serves as the main metadata source for the Digital Twin of the temporary exhibition titled "The Other Renaissance - Ulisse Aldrovandi and The Wonders Of The World", and for other collections related to the digitisation work under development in a nationwide funded project, i.e. Project CHANGES (https://fondazionechanges.org). To ensure accessibility and reuse, it offers a SPARQL endpoint, a user interface, and open documentation, and is published on Zenodo under a CC0 license. The project improves the semantic interoperability of cultural heritage data, with future work aiming to extend the data model and materialisation pipeline to better capture the complexities of acquisition and digitisation, further enrich the dataset and broaden its relevance to similar initiatives.

2025-05-19T15:59:17Z Sebastian Barzaghi Arianna Moretti Ivan Heibi Silvio Peroni http://arxiv.org/abs/2503.15772v2 Detecting LLM-Generated Peer Reviews 2025-05-19T01:40:25Z

The integrity of peer review is fundamental to scientific progress, but the rise of large language models (LLMs) has introduced concerns that some reviewers may rely on these tools to generate reviews rather than writing them independently. Although some venues have banned LLM-assisted reviewing, enforcement remains difficult as existing detection tools cannot reliably distinguish between fully generated reviews and those merely polished with AI assistance. In this work, we address the challenge of detecting LLM-generated reviews. We consider the approach of performing indirect prompt injection via the paper's PDF, prompting the LLM to embed a covert watermark in the generated review, and subsequently testing for presence of the watermark in the review. We identify and address several pitfalls in naïve implementations of this approach. Our primary contribution is a rigorous watermarking and detection framework that offers strong statistical guarantees. Specifically, we introduce watermarking schemes and hypothesis tests that control the family-wise error rate across multiple reviews, achieving higher statistical power than standard corrections such as Bonferroni, while making no assumptions about the nature of human-written reviews. We explore multiple indirect prompt injection strategies -- including font-based embedding and obfuscated prompts -- and evaluate their effectiveness under various reviewer defense scenarios. Our experiments find high success rates in watermark embedding across various LLMs. We also empirically find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice. In contrast, we find that Bonferroni-style corrections are too conservative to be useful in this setting.
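
For reference, the Bonferroni baseline that the abstract compares against can be sketched as below. This is only the standard correction, not the paper's watermark test; the function name and the example p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True where the null hypothesis is rejected.

    Bonferroni controls the family-wise error rate at level alpha by
    comparing each p-value to alpha / m, where m is the number of tests.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values, one per review tested for a watermark.
example_p_values = [0.0004, 0.012, 0.03, 0.2]
decisions = bonferroni_reject(example_p_values, alpha=0.05)
```

Because the per-test threshold shrinks as alpha / m, the correction rejects less and less often as the number of reviews grows, which is the conservativeness the abstract refers to.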

2025-03-20T01:11:35Z 27 pages, 2 figures Vishisht Rao Aounon Kumar Himabindu Lakkaraju Nihar B. Shah http://arxiv.org/abs/2505.11946v1 Let's have a chat with the EU AI Act 2025-05-17T10:24:08Z

As artificial intelligence (AI) regulations evolve and the regulatory landscape becomes more complex, ensuring compliance with ethical guidelines and legal frameworks remains a challenge for AI developers. This paper introduces an AI-driven self-assessment chatbot designed to assist users in navigating the European Union AI Act and related standards. Leveraging a Retrieval-Augmented Generation (RAG) framework, the chatbot enables real-time, context-aware compliance verification by retrieving relevant regulatory texts and providing tailored guidance. By integrating both public and proprietary standards, it streamlines regulatory adherence, reduces complexity, and fosters responsible AI development. The paper explores the chatbot's architecture, comparing naive and graph-based RAG models, and discusses its potential impact on AI governance.

2025-05-17T10:24:08Z Adam Kovari Yasin Ghafourian Csaba Hegedus Belal Abu Naim Kitti Mezei Pal Varga Markus Tauber http://arxiv.org/abs/2505.15837v1 Web2Wiki: Characterizing Wikipedia Linking Across the Web 2025-05-17T00:52:24Z

Wikipedia is one of the most visited websites globally, yet its role beyond its own platform remains largely unexplored. In this paper, we present the first large-scale analysis of how Wikipedia is referenced across the Web. Using a dataset from Common Crawl, we identify over 90 million Wikipedia links spanning 1.68% of Web domains and examine their distribution, context, and function. Our analysis of English Wikipedia reveals three key findings: (1) Wikipedia is most frequently cited by news and science websites for informational purposes, while commercial websites reference it less often. (2) The majority of Wikipedia links appear within the main content rather than in boilerplate or user-generated sections, highlighting their role in structured knowledge presentation. (3) Most links (95%) serve as explanatory references rather than as evidence or attribution, reinforcing Wikipedia's function as a background knowledge provider. While this study focuses on English Wikipedia, our publicly released Web2Wiki dataset includes links from multiple language editions, supporting future research on Wikipedia's global influence on the Web.
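
A minimal sketch of the link-identification step described above, assuming links arrive as plain URL strings. The function name and the host pattern are illustrative assumptions, not the Web2Wiki pipeline.

```python
import re
from urllib.parse import urlparse

# Matches hosts such as "en.wikipedia.org" or "de.m.wikipedia.org" (mobile).
_WIKI_HOST = re.compile(r"^(?P<lang>[a-z\-]+)\.(?:m\.)?wikipedia\.org$")


def parse_wikipedia_link(url):
    """Return (language_edition, article_title), or None if not an article link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    match = _WIKI_HOST.match(host)
    if match is None or match.group("lang") == "www":
        return None
    if not parsed.path.startswith("/wiki/"):
        return None
    title = parsed.path[len("/wiki/"):]
    if not title:
        return None
    # The title is returned as found in the URL (percent-encoding is not decoded).
    return match.group("lang"), title
```

Grouping the returned language codes would separate English Wikipedia links from those pointing to other language editions.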

2025-05-17T00:52:24Z 13 pages, 3 figures, 5 tables Veniamin Veselovsky Tiziano Piccardi Ashton Anderson Robert West Akhil Arora http://arxiv.org/abs/2409.09636v2 Towards understanding evolution of science through language model series 2025-05-16T08:12:41Z

We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenizations and "one model to rule them all", AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model pretrained from scratch on the full text of 1.7 million arXiv papers published until 2008 and a collection of models progressively trained on arXiv papers on an annual basis. We demonstrate the effectiveness of AnnualBERT models by showing that they not only have comparable performance on standard tasks but also achieve state-of-the-art performance on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then utilize probing tasks to quantify the models' behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models not only to improve performance on scientific text processing tasks but also to provide insights into the development of scientific discourse over time. The series of models is available at https://huggingface.co/jd445/AnnualBERTs.

2024-09-15T07:15:05Z Junjie Dong Zhuoqi Lyu Qing Ke http://arxiv.org/abs/2505.18180v1 Clustering scientific publications: lessons learned through experiments with a real citation network 2025-05-15T14:27:53Z

Clustering scientific publications can reveal underlying research structures within bibliographic databases. Graph-based clustering methods, such as spectral, Louvain, and Leiden algorithms, are frequently utilized due to their capacity to effectively model citation networks. However, their performance may degrade when applied to real-world data. This study evaluates the performance of these clustering algorithms on a citation graph comprising approx. 700,000 papers and 4.6 million citations extracted from Web of Science. The results show that while scalable methods like Louvain and Leiden perform efficiently, their default settings often yield poor partitioning. Meaningful outcomes require careful parameter tuning, especially for large networks with uneven structures, including a dense core and loosely connected papers. These findings highlight practical lessons about the challenges of large-scale data, method selection and tuning based on specific structures of bibliometric clustering tasks.
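
One commonly tuned parameter in Louvain-style methods is the resolution of the modularity quality function. The sketch below computes that function for a given partition, assuming an undirected simple graph given as an edge list. The toy graph and names are hypothetical, and the study's actual settings are not stated here.

```python
from collections import defaultdict


def modularity(edges, community_of, resolution=1.0):
    """Modularity of a partition: sum over communities c of
    L_c / m - resolution * (d_c / (2 * m)) ** 2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    intra_edges = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            intra_edges[community_of[u]] += 1
    return sum(
        intra_edges[c] / m - resolution * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
q_split = modularity(toy_edges, split)
q_merged = modularity(toy_edges, merged)
```

Raising `resolution` penalises large communities more strongly, so the optimiser tends to return more, smaller clusters. That is one reason default settings may partition a network with a dense core poorly.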

2025-05-15T14:27:53Z Vu Thi Huong Thorsten Koch http://arxiv.org/abs/2505.09995v1 A Survey on Open-Source Edge Computing Simulators and Emulators: The Computing and Networking Convergence Perspective 2025-05-15T06:17:56Z

Edge computing, with its low latency, dynamic scalability, and location awareness, along with the convergence of computing and communication paradigms, has been successfully applied in critical domains such as industrial IoT, smart healthcare, smart homes, and public safety. This paper provides a comprehensive survey of open-source edge computing simulators and emulators, presented in our GitHub repository (https://github.com/qijianpeng/awesome-edge-computing), emphasizing the convergence of computing and networking paradigms. By examining more than 40 tools, including CloudSim, NS-3, and others, we identify the strengths and limitations in simulating and emulating edge environments. This survey classifies these tools into three categories: packet-level, application-level, and emulators. Furthermore, we evaluate them across five dimensions, ranging from resource representation to resource utilization. The survey highlights the integration of different computing paradigms, packet processing capabilities, support for edge environments, user-defined metric interfaces, and scenario visualization. The findings aim to guide researchers in selecting appropriate tools for developing and validating advanced computing and networking technologies.

2025-05-15T06:17:56Z 10 pages, 2 figures, 5 tables Jianpeng Qi Chao Liu Xiao Zhang Lei Wang Rui Wang Junyu Dong Yanwei Yu http://arxiv.org/abs/2503.16623v4 ICLR Points: How Many ICLR Publications Is One Paper in Each Area? 2025-05-14T16:27:13Z

Scientific publications significantly impact academic-related decisions in computer science, where top-tier conferences are particularly influential. However, efforts required to produce a publication differ drastically across various subfields. While existing citation-based studies compare venues within areas, cross-area comparisons remain challenging due to differing publication volumes and citation practices. To address this gap, we introduce the concept of ICLR points, defined as the average effort required to produce one publication at top-tier machine learning conferences such as ICLR, ICML, and NeurIPS. Leveraging comprehensive publication data from DBLP (2019--2023) and faculty information from CSRankings, we quantitatively measure and compare the average publication effort across 27 computer science sub-areas. Our analysis reveals significant differences in average publication effort, validating anecdotal perceptions: systems conferences generally require more effort per publication than AI conferences. We further demonstrate the utility of the ICLR points metric by evaluating publication records of universities, current faculties and recent faculty candidates. Our findings highlight how using this metric enables more meaningful cross-area comparisons in academic evaluation processes. Lastly, we discuss the metric's limitations and caution against its misuse, emphasizing the necessity of holistic assessment criteria beyond publication metrics alone.

2025-03-20T18:23:35Z Zhongtang Luo http://arxiv.org/abs/2505.08533v1 How are research data referenced? The use case of the research data repository RADAR 2025-05-13T13:03:49Z

Publishing research data aims to improve the transparency of research results and facilitate the reuse of datasets. In both cases, referencing the datasets that were used is recommended. Research data repositories can support data referencing through various measures and also benefit from it, for example using this information to demonstrate their impact. However, the literature shows that the practice of formally citing research data is not widespread, data metrics are not yet established, and effective incentive structures are lacking. This article examines how often and in what form datasets published via the research data repository RADAR are referenced. For this purpose, the data sources Google Scholar, DataCite Event Data and the Data Citation Corpus were analyzed. The analysis shows that 27.9 % of the datasets in the repository were referenced at least once. 21.4 % of these references were (also) present in the reference lists and are therefore considered data citations. Datasets were referenced often in data availability statements. A comparison of the three data sources showed that there was little overlap in the coverage of references. In most cases (75.8 %), data and referencing objects were published in the same year. Two definition approaches were considered to investigate data reuse. 118 RADAR datasets were referenced more than once. Only 21 references had no overlaps in the authorship information -- these datasets were referenced by researchers that were not involved in data collection.

2025-05-13T13:03:49Z Dorothea Strecker Kerstin Soltau Felix Bach http://arxiv.org/abs/2505.07577v2 From raw affiliations to organization identifiers 2025-05-13T09:27:19Z

Accurate affiliation matching, which links affiliation strings to standardized organization identifiers, is critical for improving research metadata quality, facilitating comprehensive bibliometric analyses, and supporting data interoperability across scholarly knowledge bases. Existing approaches fail to handle the complexity of affiliation strings that often include mentions of multiple organizations or extraneous information. In this paper, we present AffRo, a novel approach designed to address these challenges, leveraging advanced parsing and disambiguation techniques. We also introduce AffRoDB, an expert-curated dataset to systematically evaluate affiliation matching algorithms, ensuring robust benchmarking. Results demonstrate the effectiveness of AffRo in accurately identifying organizations from complex affiliation strings.

2025-05-12T13:57:47Z 16 pages, 3 figures, 3 tables Myrto Kallipoliti Serafeim Chatzopoulos Miriam Baglioni Eleni Adamidi Paris Koloveas Thanasis Vergoulis http://arxiv.org/abs/2505.07912v1 SciCom Wiki: Fact-Checking and FAIR Knowledge Distribution for Scientific Videos and Podcasts 2025-05-12T13:38:20Z

Democratic societies need accessible, reliable information. Videos and podcasts have established themselves as the medium of choice for civic dissemination, but also as carriers of misinformation. The emerging Science Communication Knowledge Infrastructure (SciCom KI) curating non-textual media is still fragmented and not adequately equipped to scale against the content flood. Our work sets out to support the SciCom KI with a central, collaborative platform, the SciCom Wiki, to facilitate FAIR (findable, accessible, interoperable, reusable) media representation and the fact-checking of their content, particularly for videos and podcasts. Building an open-source service system centered around Wikibase, we survey requirements from 53 stakeholders, refine these in 11 interviews, and evaluate our prototype based on these requirements with another 14 participants. To address the most requested feature, fact-checking, we developed a neurosymbolic computational fact-checking approach, converting heterogeneous media into knowledge graphs. This increases machine-readability and allows comparing statements against equally represented ground-truth. Our computational fact-checking tool was iteratively evaluated through 10 expert interviews, and a public user survey with 43 participants verified the necessity and usability of our tool. Overall, our findings identified several needs to systematically support the SciCom KI. The SciCom Wiki, as a FAIR digital library complementing our neurosymbolic computational fact-checking framework, was found suitable to address the raised requirements. Further, we identified that the SciCom KI is severely underdeveloped regarding FAIR knowledge and related systems facilitating its collaborative creation and curation. Our system can provide a central knowledge node, yet a collaborative effort is required to scale against the imminent (mis-)information flood.

2025-05-12T13:38:20Z 18 pages, 10 figures, submitted to TPDL 2025 Tim Wittenborg Constantin Sebastian Tremel Niklas Stehr Oliver Karras Markus Stocker Sören Auer http://arxiv.org/abs/2504.20081v3 Billions at Stake: How Self-Citation Adjusted Metrics Can Transform Equitable Research Funding 2025-05-11T20:34:24Z

Citation metrics serve as the cornerstone of scholarly impact evaluation despite their well-documented vulnerability to inflation through self-citation practices. This paper introduces the Self-Citation Adjusted Index (SCAI), a sophisticated metric designed to recalibrate citation counts by accounting for discipline-specific self-citation patterns. Through comprehensive analysis of 5,000 researcher profiles across diverse disciplines, we demonstrate that excessive self-citation inflates traditional metrics by 10-20%, potentially misdirecting billions in research funding. Recent studies confirm that self-citation patterns exhibit significant gender disparities, with men self-citing up to 70% more frequently than women, exacerbating existing inequalities in academic recognition. Our open-source implementation provides comprehensive tools for calculating SCAI and related metrics, offering a more equitable assessment of research impact that reduces the gender citation gap by approximately 8.5%. This work contributes to the paradigm shift toward transparent, nuanced, and equitable research evaluation methodologies in academia, with direct implications for funding allocation decisions that collectively amount to over $100 billion annually in the United States alone.

2025-04-25T09:20:09Z 8 Pages Rahul Vishwakarma Sinchan Banerjee http://arxiv.org/abs/2505.06938v1 A digital perspective on the role of a stemma in material-philological transmission studies 2025-05-11T11:05:16Z

Taking its point of departure in the recent developments in the field of digital humanities and the increasing automatisation of scholarly workflows, this study explores the implications of digital approaches to textual traditions for the broader field of textual scholarship. It argues that the relative simplicity of creating computer-generated stemmas allows us to view the stemma codicum as a research tool rather than the final product of our scholarly investigation. Using the Old Norse saga of Hrómundur as a case study, this article demonstrates that stemmas can serve as a starting point for exploring textual traditions further. In doing so, they enable us to address research questions that otherwise remain unanswered. The article is accompanied by datasets used to generate stemmas for the Hrómundar saga tradition as well as two custom Python scripts. The scripts are designed to convert XML-based textual data, encoded according to the TEI Guidelines, into the input format used for the analysis in the PHYLIP package to generate unrooted trees of relationships between texts.
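
The kind of conversion described above can be sketched as follows. This is not the article's scripts. It assumes the variants are encoded as TEI `<app>` elements whose `<rdg>` children carry `wit` attributes, and it writes a PHYLIP-style discrete character matrix (a header line with the number of taxa and characters, then one row per witness with the name padded to ten characters). The sample readings are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_phylip(tei_xml):
    """Turn each <app> into one character column; each distinct <rdg> is a state."""
    root = ET.fromstring(tei_xml)
    columns = []       # one dict per <app>: witness siglum -> state digit
    witnesses = set()
    for app in root.iterfind(".//tei:app", TEI_NS):
        column = {}
        for state, rdg in enumerate(app.findall("tei:rdg", TEI_NS)):
            if state > 9:
                raise ValueError("more than 10 readings in one <app>")
            for ref in rdg.get("wit", "").split():
                siglum = ref.lstrip("#")
                column[siglum] = str(state)
                witnesses.add(siglum)
        columns.append(column)
    names = sorted(witnesses)
    lines = [f"{len(names)} {len(columns)}"]
    for name in names:
        # "?" marks a witness with no reading at this variant location.
        states = "".join(col.get(name, "?") for col in columns)
        lines.append(f"{name[:10]:<10}{states}")
    return "\n".join(lines)


# Hypothetical three-witness sample with two variant locations.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
<app><rdg wit="#A #B">konungr</rdg><rdg wit="#C">kongur</rdg></app>
<app><rdg wit="#A">var</rdg><rdg wit="#B #C">er</rdg></app>
</p></body></text></TEI>"""

phylip_text = tei_to_phylip(SAMPLE)
```

Which PHYLIP program consumes the matrix, and which state symbols it accepts, depends on the analysis and is not specified here.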

2025-05-11T11:05:16Z Katarzyna Anna Kapitan http://arxiv.org/abs/2505.06107v1 Differentiating Emigration from Return Migration of Scholars Using Name-Based Nationality Detection Models 2025-05-09T15:03:39Z

Most web and digital trace data do not include information about an individual's nationality due to privacy concerns. The lack of data on nationality can create challenges for migration research. It can lead to a left-censoring issue since we are uncertain about the migrant's country of origin. Once we observe an emigration event, if we know the nationality, we can differentiate it from return migration. We propose methods to detect the nationality with the least available data, i.e., full names. We use the detected nationality in comparison with the country of academic origin, which is a common approach in studying the migration of researchers. We gathered 2.6 million unique name-nationality pairs from Wikipedia and categorized them into families of nationalities with three granularity levels to use as our training data. Using a character-based machine learning model, we achieved a weighted F1 score of 84% for the broadest and 67% for the most granular, country-level categorization. In our empirical study, we used the trained and tested model to assign nationality to 8+ million scholars' full names in Scopus data. Our results show that using the country of first publication as a proxy for nationality underestimates the size of return flows, especially for countries with a more diverse academic workforce, such as the USA, Australia, and Canada. We found that around 48% of emigration from the USA was return migration once we used the country of name origin, in contrast to 33% based on academic origin. In the most recent period, 79% of scholars whose affiliation has consistently changed from the USA to China, and are considered emigrants, have Chinese names in contrast to 41% with a Chinese academic origin. Our proposed methods for addressing left-censoring issues are beneficial for other research that uses digital trace data to study migration.
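
The weighted F1 score reported above is the per-class F1 averaged with weights proportional to each class's share of the true labels. A minimal sketch of that standard definition follows; the labels are hypothetical, and this is not the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical labels for four names.
example_true = ["A", "A", "A", "B"]
example_pred = ["A", "A", "B", "B"]
example_score = weighted_f1(example_true, example_pred)
```

Because the weights follow class frequency, frequent nationality groups dominate the score, which is worth keeping in mind when comparing the broad and country-level results.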

2025-05-09T15:03:39Z Accepted to appear @ ICWSM 2025. The link to the camera-ready paper will be added soon Faeze Ghorbanpour Thiago Zordan Malaguth Aliakbar Akbaritabar http://arxiv.org/abs/2207.11116v2 Science of science -- Citation models and research evaluation 2025-05-09T07:22:04Z

Citations in science are studied from several perspectives, including scientometrics and the science of science. In this chapter I briefly review some of the literature on citations, citation distributions and models of citations. These citations feature prominently in another part of the literature, which deals with research evaluation and the role of metrics and indicators in that process. Here I briefly review part of the discussion in research evaluation. This also touches on the subject of how citations relate to peer review. Finally, I conclude by trying to integrate the two literatures. The fundamental problem in research evaluation is that research quality is unobservable. This has consequences for the conclusions that we can draw from quantitative studies of citations and citation models. The term "indicators" is a relevant concept in this context, which I try to clarify. Causality is important for properly understanding indicators, especially when indicators are used in practice: when we act on indicators, we enter causal territory. Even when an indicator might have been valid, the consequences of its use may invalidate it. By combining citation models with proper causal reasoning and acknowledging the fundamental problem about unobservable research quality, we may hope to make progress.

2022-07-22T14:44:41Z This is a draft. The final version will be available in Handbook of Computational Social Science edited by Taha Yasseri, forthcoming 2025, Edward Elgar Publishing Ltd V. A. Traag