http://arxiv.org/abs/2505.17945v1 Towards Industrial Convergence : Understanding the evolution of scientific norms and practices in the field of AI 2025-05-23T14:22:15Z

In the field of artificial intelligence (AI) research, there seems to be a rapprochement between academics and industrial forces. The aim of this study is to assess whether, and to what extent, industrial domination of the field and the increasingly frequent moves between academia and industry have led academics to adopt industrial norms and practices. Using bibliometric information and data on scientific code, we aimed to understand the practices of academic and industrial researchers: how they choose, invest in, and succeed across multiple concurrent artifacts. Our results show that, although both actors write papers and code, their practices and the norms guiding them differ greatly. Nevertheless, it appears that the presence of industry researchers in academic studies leads to practices leaning toward the industrial side, but also to greater success in both artifacts, suggesting that if convergence is occurring, it is passing through those mixed teams rather than through purely academic or purely industrial studies.

2025-05-23T14:22:15Z 26 pages, 14 figures Antoine Houssard http://arxiv.org/abs/2503.20793v2 Semantic Web and Software Agents -- A Forgotten Wave of Artificial Intelligence? 2025-05-23T10:45:18Z

The history of Artificial Intelligence (AI) is a narrative of waves -- rising optimism followed by crashing disappointments. AI winters, such as the early 2000s, are often remembered as barren periods of innovation. This paper argues that such a perspective overlooks a crucial wave of AI that seems to be forgotten: the rise of the Semantic Web, which is based on knowledge representation, logic, and reasoning, and its interplay with intelligent Software Agents. Fast forward to today, and ChatGPT has reignited AI enthusiasm, built on deep learning and advanced neural models. However, before Large Language Models (LLMs) dominated the conversation, another ambitious vision emerged -- one where AI-driven Software Agents autonomously served Web users based on a structured, machine-interpretable Web. The Semantic Web aimed to transform the World Wide Web into an ecosystem where AI could reason, understand, and act. Between 2000 and 2010, this vision sparked a significant research boom, only to fade into obscurity as AI's mainstream narrative shifted elsewhere. Today, as LLMs edge toward autonomous execution, we revisit this overlooked wave. By analyzing its academic impact through bibliometric data, we highlight the Semantic Web's role in AI history and its untapped potential for modern Software Agent development. Recognizing this forgotten chapter not only deepens our understanding of AI's cyclical evolution but also offers key insights for integrating emerging technologies.

2025-03-20T12:55:48Z 28 pages, 9 figures Tapio Pitkäranta Eero Hyvönen http://arxiv.org/abs/2504.07199v3 SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog 2025-05-23T10:14:35Z

We present SemEval-2025 Task 5: LLMs4Subjects, a shared task on automated subject tagging for scientific and technical records in English and German using the GND taxonomy. Participants developed LLM-based systems to recommend top-k subjects, evaluated through quantitative metrics (precision, recall, F1-score) and qualitative assessments by subject specialists. Results highlight the effectiveness of LLM ensembles, synthetic data generation, and multilingual processing, offering insights into applying LLMs for digital library classification.
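
The quantitative metrics named above can be illustrated with a minimal sketch of precision, recall, and F1 over a top-k recommendation list. This uses the standard textbook definitions; the abstract does not state the task's exact evaluation protocol (choice of k, averaging across records), so the function name, the cut-off, and the subject identifiers below are assumptions made purely for illustration.

```python
def precision_recall_f1_at_k(ranked, gold, k):
    """Score the top-k of a ranked subject list against a gold set."""
    top_k = ranked[:k]
    hits = sum(1 for subject in top_k if subject in gold)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical subject identifiers, not real GND entries.
ranked = ["gnd:A", "gnd:B", "gnd:C", "gnd:D", "gnd:E"]
gold = {"gnd:B", "gnd:D", "gnd:Z"}

p, r, f1 = precision_recall_f1_at_k(ranked, gold, k=4)
print(f"P@4={p:.3f} R@4={r:.3f} F1@4={f1:.3f}")
```

In this toy case two of the four recommended subjects are correct and one gold subject is never recommended, so precision and recall diverge, which is why a shared task would typically report both alongside F1.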

2025-04-09T18:26:46Z 10 pages, 4 figures, Accepted as SemEval 2025 Task 5 description paper Jennifer D'Souza Sameer Sadruddin Holger Israel Mathias Begoin Diana Slawig http://arxiv.org/abs/2505.18222v1 A Domain Ontology for Modeling the Book of Purification in Islam 2025-05-23T08:55:59Z

This paper aims to address a gap in major Islamic topics by developing an ontology for the Book of Purification in Islam. Many authoritative Islamic texts begin with the Book of Purification, as it is essential for performing prayer (the second pillar of Islam after Shahadah, the profession of faith) and other religious duties such as Umrah and Hajj. The ontology development strategy followed six key steps: (1) domain identification, (2) knowledge acquisition, (3) conceptualization, (4) classification, (5) integration and implementation, and (6) ontology generation. This paper includes examples of the constructed tables and classifications. The focus is on the design and analysis phases, as technical implementation is beyond the scope of this study. However, an initial implementation is provided to illustrate the steps of the proposed strategy. The developed ontology ensures reusability by formally defining and encoding the key concepts, attributes, and relationships related to the Book of Purification. This structured representation is intended to support knowledge sharing and reuse.

2025-05-23T08:55:59Z 9 pages Hessa Alawwad http://arxiv.org/abs/2505.16550v1 Towards Machine-actionable FAIR Digital Objects with a Typing Model that Enables Operations 2025-05-22T11:38:19Z

FAIR Digital Objects support research data management aligned with the FAIR principles. To be machine-actionable, they must support operations that interact with their contents. This can be achieved by associating operations with FAIR-DO data types. However, current typing models and Data Type Registries lack support for type-associated operations. In this work, we introduce a typing model that describes type-associated and technology-agnostic FAIR Digital Object Operations in a machine-actionable way, building and improving on the existing concepts. In addition, we introduce the Integrated Data Type and Operations Registry with Inheritance System, a prototypical implementation of this model that integrates inheritance mechanisms for data types, a rule-based validation system, and the computation of type-operation associations. Our approach significantly improves the machine-actionability of FAIR Digital Objects, paving the way towards dynamic, interoperable, and reproducible research workflows.

2025-05-22T11:38:19Z Maximilian Inckmann Nicolas Blumenröhr Rossella Aversa 10.1109/eScience65000.2025.00022 http://arxiv.org/abs/2505.16506v1 Utilizing citation index and synthetic quality measure to compare Wikipedia languages across various topics 2025-05-22T10:41:55Z

This study presents a comparative analysis of 55 Wikipedia language editions employing a citation index alongside a synthetic quality measure. Specifically, we identified the most significant Wikipedia articles within distinct topical areas, selecting the top 10, top 25, and top 100 most cited articles in each topic and language version. This index was built on the basis of wikilinks between Wikipedia articles in each language version; to construct it, we processed 6.6 billion page-to-page link records. Next, we used a quality score for each Wikipedia article, a synthetic measure scaled from 0 to 100. This approach enabled quality comparison of Wikipedia articles even between language versions with different quality grading schemes. Our results highlight disparities among Wikipedia language editions, revealing strengths and gaps in content coverage and quality across topics.
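
One way to picture a wikilink-based citation index is as a count of incoming links per article, followed by a per-topic top-N selection. The sketch below is only an illustration under stated assumptions: the abstract does not say whether duplicate links or self-links are filtered, so de-duplication, the tie-breaking rule, and the names `top_cited_by_topic`, `links`, and `topic_of` are all hypothetical.

```python
from collections import Counter, defaultdict


def top_cited_by_topic(links, topic_of, n):
    """Rank articles by number of distinct incoming wikilinks within each topic."""
    # Assumption: count each (source, target) pair once and ignore self-links.
    unique_links = {(src, dst) for src, dst in links if src != dst}
    inlinks = Counter(dst for _, dst in unique_links)

    by_topic = defaultdict(list)
    for article, count in inlinks.items():
        if article in topic_of:
            by_topic[topic_of[article]].append((article, count))

    # Sort by descending in-link count, breaking ties alphabetically.
    return {
        topic: sorted(items, key=lambda item: (-item[1], item[0]))[:n]
        for topic, items in by_topic.items()
    }


# Toy page-to-page link records for one language version.
links = [("A", "B"), ("C", "B"), ("A", "B"), ("B", "D"),
         ("C", "D"), ("A", "D"), ("D", "E"), ("E", "E")]
topic_of = {"B": "science", "D": "science", "E": "arts"}

print(top_cited_by_topic(links, topic_of, n=1))
```

At the scale of billions of link records the same counting would need a streaming or distributed implementation; the in-memory version here only shows the logic.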

2025-05-22T10:41:55Z Presented at the Wiki Workshop 2025 Włodzimierz Lewoniewski Krzysztof Węcel Witold Abramowicz http://arxiv.org/abs/2505.16330v1 SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers 2025-05-22T07:34:59Z

Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper's novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we use different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground truth labels to obtain prediction results. The results indicate that using introduction, results and discussion is most appropriate for assessing the novelty of a paper, while the use of the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important sections for the task of novelty score prediction. The code and dataset for this paper can be accessed at https://github.com/njust-winchy/SC4ANM.
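
The idea of feeding different section combinations to a model can be sketched by enumerating the non-empty subsets of the four IMRaD sections and concatenating the chosen text. This is an illustrative sketch, not the code from the linked repository: the abstract does not state how many combinations were actually evaluated or how sections are joined, and the helper names and separator below are assumptions.

```python
from itertools import combinations

SECTIONS = ("introduction", "methods", "results", "discussion")


def section_combinations(sections=SECTIONS):
    """Yield every non-empty subset of sections, preserving document order."""
    for size in range(1, len(sections) + 1):
        yield from combinations(sections, size)


def build_input(paper, combo, separator="\n\n"):
    """Concatenate the text of the chosen sections, skipping any the paper lacks."""
    return separator.join(paper[name] for name in combo if name in paper)


# Hypothetical paper with placeholder section text.
paper = {
    "introduction": "We study ...",
    "methods": "We train ...",
    "results": "Accuracy improves ...",
    "discussion": "These findings suggest ...",
}

combos = list(section_combinations())
print(len(combos))  # 2**4 - 1 non-empty subsets
print(build_input(paper, ("introduction", "results")))
```

Each combined string would then be passed to a PLM or LLM together with the reviewer-assigned novelty score as the label, so that prediction quality can be compared across combinations.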

2025-05-22T07:34:59Z Expert Systems With Applications, 2025 Wenqing Wu Chengzhi Zhang Tong Bao Yi Zhao 10.1016/j.eswa.2025.126778 http://arxiv.org/abs/2505.16061v1 Internal and External Impacts of Natural Language Processing Papers 2025-05-21T22:25:58Z

We investigate the impacts of NLP research published in top-tier conferences (i.e., ACL, EMNLP, and NAACL) from 1979 to 2024. By analyzing citations from research articles and external sources such as patents, media, and policy documents, we examine how different NLP topics are consumed both within the academic community and by the broader public. Our findings reveal that language modeling has the widest internal and external influence, while linguistic foundations have lower impacts. We also observe that internal and external impacts generally align, but topics like ethics, bias, and fairness show significant attention in policy documents with much fewer academic citations. Additionally, external domains exhibit distinct preferences, with patents focusing on practical NLP applications and media and policy documents engaging more with the societal implications of NLP models.

2025-05-21T22:25:58Z 7 pages; Accepted to ACL 2025 Yu Zhang http://arxiv.org/abs/2505.15948v1 Citation Parsing and Analysis with Language Models 2025-05-21T19:06:17Z

A key type of resource needed to address global inequalities in knowledge production and dissemination is a tool that can support journals in understanding how knowledge circulates. The absence of such a tool has resulted in comparatively less information about networks of knowledge sharing in the Global South. In turn, this gap authorizes the exclusion of researchers and scholars from the South in indexing services, reinforcing colonial arrangements that de-center and minoritize those scholars. In order to support citation network tracking on a global scale, we investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We assembled a dataset of matched plaintext and annotated citations from preprints and published research papers. Then, we evaluated a number of open-weight language models on the annotation task. We find that, even out of the box, today's language models achieve high levels of accuracy on identifying the constituent components of each citation, outperforming state-of-the-art methods. Moreover, the smallest model we evaluated, Qwen3-0.6B, can parse all fields with high accuracy in $2^5$ passes, suggesting that post-training is likely to be effective in producing small, robust citation parsing models. Such a tool could greatly improve the fidelity of citation networks and thus meaningfully improve research indexing and discovery, as well as further metascientific research.

2025-05-21T19:06:17Z Presented at the Workshop on Open Citations & Open Scholarly Metadata 2025 Parth Sarin Juan Pablo Alperin http://arxiv.org/abs/2508.00829v1 Open Science, Open Innovation? The Role of Open Access in Patenting Activity 2025-05-21T14:02:44Z

Scientific knowledge is a key driver of technological innovation, shaping industrial development and policy decisions worldwide. Understanding how patents incorporate scientific research is essential for assessing the role of academic discoveries in technological progress. Non-Patent References (NPRs) provide a useful indicator of this relationship by revealing the extent to which patents draw upon scientific literature. Here, we show that reliance on scientific research in patents varies significantly across regions. Oceania and Europe display stronger engagement with scientific knowledge, while the Americas exhibit lower reliance. Moreover, NPRs are more likely to be open access than the average scientific publication, a trend that intensifies when Sci-Hub availability is considered. These results highlight the transformative impact of Open Science on global innovation dynamics. By facilitating broader access to research, Open Science strengthens the link between academia and industry, underscoring the need for policies that promote equitable and science-based innovation, particularly in developing regions.

2025-05-21T14:02:44Z Abdelghani Maddi GEMASS Ahmad Yaman Abdin Francesco Fdp de Pretis http://arxiv.org/abs/2505.15384v1 A two-stage model for factors influencing citation counts 2025-05-21T11:22:03Z

This work aims to study a count response random variable, the number of citations of a research paper, affected by some explanatory variables through a suitable regression model. Because the count variable is overdispersed (the sample variance is larger than the sample mean), the classical Poisson regression model does not seem appropriate. We concentrate attention on the negative binomial regression model, which allows the variance of each measurement to be a function of its predicted value. Nevertheless, the citation process may be divided into two stages: the first determines whether a paper receives any citations at all, and the second determines the intensity of the citations. A hurdle model for separating the documents with citations and those without citations is considered. The dataset for the empirical application consisted of 43,190 research papers in the field of Economics and Business from 2014-2021, obtained from The Lens database. Citation counts and social attention scores for each article were gathered from the Altmetric database. The main findings indicate that both collaboration and funding have a positive impact on citation counts and reduce the likelihood of receiving zero citations. Higher journal impact factors lead to higher citation counts, while lower peer review ratings lead to fewer citations and a higher probability of zero citations. Mentions in news, blogs, and social media have varying but generally limited effects on citation counts. Open access via repositories (green OA) correlates with higher citation counts and a lower probability of zero citations. In contrast, OA via the publisher's website without an explicit open license (bronze OA) is associated with higher citation counts but also with a higher probability of zero citations.
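
The two-stage structure can be made concrete with a minimal sketch of a hurdle negative binomial probability mass function. It assumes the common NB2 parameterization (variance = mu + alpha * mu^2) and a zero-truncated negative binomial for the positive counts; the abstract does not specify the authors' exact parameterization, and covariates are omitted, so `pi_zero`, `mu`, and `alpha` are fixed illustrative values rather than fitted quantities.

```python
import math


def nb_pmf(y, mu, alpha):
    """Negative binomial pmf with mean mu and dispersion alpha (NB2 form)."""
    r = 1.0 / alpha          # shape parameter
    p = r / (r + mu)         # success probability
    log_pmf = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
               + r * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)


def hurdle_nb_pmf(y, pi_zero, mu, alpha):
    """Hurdle model: a point mass at zero plus a zero-truncated NB for y >= 1."""
    if y == 0:
        return pi_zero
    # Rescale the NB so that the probabilities for y >= 1 sum to (1 - pi_zero).
    return (1.0 - pi_zero) * nb_pmf(y, mu, alpha) / (1.0 - nb_pmf(0, mu, alpha))


# Illustrative values only: 30% of papers uncited, mean intensity 3, dispersion 0.8.
pi_zero, mu, alpha = 0.3, 3.0, 0.8
for y in range(4):
    print(y, round(hurdle_nb_pmf(y, pi_zero, mu, alpha), 4))
```

Separating the zero probability from the count intensity is what lets a single covariate, such as bronze OA in the findings above, raise expected citations among cited papers while also raising the probability of zero citations.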

2025-05-21T11:22:03Z 33 pages, 3 figures, 7 tables Pablo Dorta-González Emilio Gómez-Déniz http://arxiv.org/abs/2505.15042v1 GitHub Repository Complexity Leads to Diminished Web Archive Availability 2025-05-21T02:51:30Z

Software is often developed using version control software, such as Git, and hosted on centralized Web hosts, such as GitHub and GitLab. These Web-hosted software repositories are made available to users in the form of traditional HTML Web pages for each source file and directory, as well as a presentational home page and various descriptive pages. We examined more than 12,000 Web-hosted Git repository project home pages, primarily from GitHub, to measure how well their presentational components are preserved in the Internet Archive, as well as the source trees of the collected GitHub repositories to assess the extent to which their source code has been preserved. We found that more than 31% of the archived repository home pages examined exhibited some form of minor page damage and 1.6% exhibited major page damage. We also found that of the source trees analyzed, less than 5% of their source files were archived, on average, with the majority of repositories not having source files saved in the Internet Archive at all. The highest concentration of archived source files was among those linked directly from repositories' home pages, at a rate of 14.89% across all available repositories, dropping off sharply at deeper levels of a repository's directory tree.

2025-05-21T02:51:30Z David Calano Michele C. Weigle Michael L. Nelson 10.1145/3717867.3717920 http://arxiv.org/abs/2505.18196v1 Comment on "Politicizing science funding undermines public trust in science, academic freedom, and the unbiased generation of knowledge" 2025-05-21T02:13:56Z

In a commentary published in mid-2024 (to which the present work is a direct response), a number of scientists argue that U.S. funding agencies have "politicized" the process by which grants are awarded, in service of diversifying the scientific workforce. The commentary in question, however, makes numerous unfounded assertions while recycling citations to a fusillade of opinion essays written by the same cabal of authors, in an effort to resemble a work of serious scholarship. Basic fact-checking is provided here, demonstrating numerous claims that are unsupported by the source material and readily debunked. The present work also serves to document the reality of inclusion and diversity plans for scientific grant proposals to U.S. funding agencies, as they existed at the end of 2024. It is intended as a bulwark against retroactive false narratives, as the U.S. moves into a period of intense antagonism towards diversity, equity, and inclusion activities.

2025-05-21T02:13:56Z Comment on https://dx.doi.org/10.3389/frma.2024.1418065 John M. Herbert http://arxiv.org/abs/2505.14328v1 From Metadata to Storytelling: A Framework For 3D Cultural Heritage Visualization on RDF Data 2025-05-20T13:15:09Z

This paper introduces a pipeline for integrating semantic metadata, 3D models, and storytelling, enhancing cultural heritage digitization. Using the Aldrovandi Digital Twin case study, it outlines a reusable workflow combining RDF-driven narratives and data visualization for creating interactive experiences to facilitate access to cultural heritage.

2025-05-20T13:15:09Z Sebastian Barzaghi Simona Colitti Arianna Moretti Giulia Renda http://arxiv.org/abs/2505.14149v1 Enhancing Keyphrase Extraction from Academic Articles Using Section Structure Information 2025-05-20T09:57:34Z

The exponential increase in academic papers has significantly increased the time required for researchers to access relevant literature. Keyphrase Extraction (KPE) offers a solution to this situation by enabling researchers to efficiently retrieve relevant literature. Current studies on KPE from academic articles aim to improve the performance of extraction models through innovative approaches using Title and Abstract as input corpora. However, the semantic richness of keywords is significantly constrained by the length of the abstract. While full-text-based KPE can address this issue, it simultaneously introduces noise, which significantly diminishes KPE performance. To address this issue, this paper utilized the structural features and section texts obtained from the section structure information of academic articles to extract keyphrases from academic papers. The approach consists of two main parts: (1) exploring the effect of seven structural features on KPE models, and (2) integrating the extraction results from all section texts used as input corpora for KPE models via a keyphrase integration algorithm to obtain the keyphrase integration result. Furthermore, this paper also examined the effect of the classification quality of section structure on KPE performance. The results show that incorporating structural features improves KPE performance, though different features have varying effects on model efficacy. The keyphrase integration approach yields the best performance, and the classification quality of section structure can affect KPE performance. These findings indicate that using the section structure information of academic articles contributes to effective KPE from academic articles. The code and dataset supporting this study are available at https://github.com/yan-xinyi/SSB_KPE.

2025-05-20T09:57:34Z Scientometrics, 2025 Chengzhi Zhang Xinyi Yan Lei Zhao Yingyi Zhang 10.1007/s11192-025-05286-2