http://arxiv.org/abs/2508.14595v1 Mathematical proof concerning the additivity problem of nonlinear normalized citation counts 2025-08-20T10:24:56Z

The question of whether nonlinear normalized citation counts can be added is critically important in scientometrics because it bears on the theoretical foundation of the field's underlying computations. In this paper, we provide rigorous mathematical proofs for the key theorems underlying this fundamental issue. Based on these proofs, we arrive at the following conclusion: a nonlinear normalization method for citation counts must be a non-equidistant transformation; consequently, the resulting nonlinear normalized citation counts are no longer equidistant and therefore cannot be added. Furthermore, because our mathematical proofs are established over the real number domain, we also derive a more general conclusion that applies to data transformations over the real number domain across various scientific fields: a nonlinear transformation becomes a non-equidistant transformation only if it satisfies a certain regularity condition, for example, if it is a continuous function, is monotonic over a certain interval, or has its domain restricted to the rational numbers. In such cases, the resulting nonlinear data are no longer equidistant and therefore cannot be added. This general conclusion can be applied broadly to linear and nonlinear transformation problems and offers significant insights for addressing the misuse of nonlinear data.
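As a hedged illustration of the non-equidistance argument (not taken from the paper), consider the logarithmic transformation f(c) = ln(1 + c), chosen here only as an assumed example of a nonlinear normalization. Equal steps in raw citation counts map to unequal steps after the transformation, and two pairs of papers with the same raw total (0 + 8 and 2 + 6) receive different transformed sums:

```latex
\begin{align*}
f(c) &= \ln(1 + c) \\
f(1) - f(0) &= \ln 2 \approx 0.693 \\
f(2) - f(1) &= \ln\tfrac{3}{2} \approx 0.405 \\
f(0) + f(8) &= \ln 1 + \ln 9 \approx 2.197 \\
f(2) + f(6) &= \ln 3 + \ln 7 = \ln 21 \approx 3.045
\end{align*}
```

Under this assumed transformation, the sum of transformed values depends on how the citations are distributed and not only on their total, which is the sense in which such values cannot be meaningfully added.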

2025-08-20T10:24:56Z Xing Wang Zhihui Zhang http://arxiv.org/abs/2508.14139v1 The Statistical Validation of Innovation Lens 2025-08-19T13:47:24Z

Information overload and the rapid pace of scientific advancement make it increasingly difficult to evaluate new research proposals and allocate resources to them. Is there a structure to scientific discovery that could inform such decisions? We present statistical evidence for such structure by training a classifier that successfully predicts high-citation research papers from 2010 to 2024 in the Computer Science, Physics, and PubMed domains.

2025-08-19T13:47:24Z 7 pages, 6 figures Giacomo Radaelli Jonah Lynch http://arxiv.org/abs/2508.18146v1 Red alert: Millions of "homeless" publications in Scopus should be resettled 2025-08-18T15:10:30Z

Scopus is increasingly regarded as a high-quality and reliable data source for research and evaluation of scientific and scholarly activity. However, a puzzling phenomenon has come to light: millions of records with author affiliation information collected in Scopus are oddly labeled as "country-undefined" by Scopus, a problem rarely seen in its counterpart, Web of Science. This huge number of "homeless" records in Scopus is unacceptable for a widely used, high-quality bibliographic database. Using data from the past 124 years, this brief communication probes these affiliated but country-undefined records in Scopus. Our analysis identifies four primary causes for these "homeless" records: incomplete author affiliation addresses, Scopus' inability to recognize different variants of country/territory names, misspelled country/territory names in author affiliation addresses, and Scopus' inability to correctly split and identify clean affiliation addresses. To address this pressing issue, we put forward several recommendations to relevant stakeholders, with the aim of resettling millions of "homeless" records in Scopus and reducing their potential impact on Scopus-based literature retrieval, analysis, and evaluation.

2025-08-18T15:10:30Z J Assoc Inf Sci Technol, 2025 Weishu Liu Haifeng Wang 10.1002/asi.25011 http://arxiv.org/abs/2509.09687v1 Demonstrating Narrative Pattern Discovery from Biomedical Literature 2025-08-18T10:07:14Z

Digital libraries maintain extensive collections of knowledge and need to provide effective access paths for their users. For instance, PubPharm, the specialized information service for Pharmacy in Germany, provides and develops access paths to its underlying biomedical document collection. In brief, PubPharm supports traditional keyword-based search, search for chemical structures, and novel graph-based discovery workflows, e.g., listing or searching for interactions between different pharmaceutical entities. This paper introduces a new search functionality, called narrative pattern mining, that allows users to explore context-relevant entities and entity interactions. We interviewed five domain experts to verify the usefulness of our prototype.

2025-08-18T10:07:14Z Accepted Demo at TPDL2025, 10 pages, 3 figures Hermann Kroll Pascal Sackhoff Bill Matthias Thang Christin Katharina Kreutz Wolf-Tilo Balke http://arxiv.org/abs/2504.08171v2 An open framework for archival, reproducible, and transparent science 2025-08-16T18:10:06Z

Digital computational outputs are now ubiquitous in the research workflow, and the way in which these data are stored and cataloged is becoming more standardized across fields of research. However, even with accessible data and code, the barrier to recreating figures and reproducing scientific findings remains high. What is generally missing is the computational environment and associated pipelines in which the data and code are executed to generate figures. The archival, reproducible, and transparent science (ARTS) open framework incorporates containers, version control systems, and persistent archives, through which all data, code, and figures related to a research project can be stored together and easily recreated, and which serve as an accessible platform for long-term sharing and validation. If the underlying principles behind this framework are broadly adopted, it will improve the reproducibility and transparency of research.

2025-04-10T23:45:11Z 15 pages, 6 figures Sabar Dasgupta Paul Nuyujukian http://arxiv.org/abs/1909.08191v3 Exploring Scholarly Data by Semantic Query on Knowledge Graph Embedding Space 2025-08-15T22:37:16Z

The trend toward open science has enabled several open scholarly datasets that include millions of papers and authors. Managing, exploring, and utilizing such large and complicated datasets effectively is challenging. In recent years, the knowledge graph has emerged as a universal data format for representing knowledge about heterogeneous entities and their relationships. The knowledge graph can be modeled by knowledge graph embedding methods, which represent entities and relations as embedding vectors in semantic space and then model the interactions between these embedding vectors. However, the semantic structures in the knowledge graph embedding space are not well studied; as a result, knowledge graph embedding methods are usually used only for knowledge graph completion and not for data representation and analysis. In this paper, we propose to analyze these semantic structures based on the well-studied word embedding space and use them to support data exploration. We also define semantic queries, which are algebraic operations between the embedding vectors in the knowledge graph embedding space, to solve queries such as similarity and analogy between the entities in the original datasets. We then design a general framework for data exploration by semantic queries and discuss the solution to some traditional scholarly data exploration tasks. We also propose some new interesting tasks that can be solved based on the uncanny semantic structures of the embedding space.
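The idea of a semantic query as an algebraic operation on embedding vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' framework: the entity names, the three-dimensional toy vectors, and the helper functions `cosine_similarity` and `analogy` are all hypothetical, and the analogy is answered with the vector-offset heuristic familiar from word embeddings.

```python
import numpy as np


def cosine_similarity(query, matrix):
    """Cosine similarity between one query vector and every row of matrix."""
    query = query / np.linalg.norm(query)
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return rows @ query


def analogy(names, vectors, a, b, c):
    """Return the entity d that best completes 'a is to b as c is to d'.

    Uses the offset heuristic: d is the entity closest to b - a + c.
    """
    index = {name: i for i, name in enumerate(names)}
    query = vectors[index[b]] - vectors[index[a]] + vectors[index[c]]
    scores = cosine_similarity(query, vectors)
    # Exclude the three input entities from the candidate answers.
    for name in (a, b, c):
        scores[index[name]] = -np.inf
    return names[int(np.argmax(scores))]


# Hypothetical toy embeddings; real ones would come from a trained model.
NAMES = ["author_1", "paper_1", "author_2", "paper_2", "venue_1"]
VECTORS = np.array([
    [1.0, 0.0, 0.0],   # author_1
    [1.0, 1.0, 0.0],   # paper_1 = author_1 plus a shared offset
    [0.0, 0.0, 1.0],   # author_2
    [0.0, 1.0, 1.0],   # paper_2 = author_2 plus the same offset
    [-1.0, 0.0, 0.5],  # venue_1, a distractor
])

if __name__ == "__main__":
    print(analogy(NAMES, VECTORS, "author_1", "paper_1", "author_2"))
```

A similarity query is the simpler case: rank all entities by `cosine_similarity` against a single entity's vector.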

2019-09-17T04:32:00Z TPDL 2019; remove details from the appendix for official dataset publication later Hung Nghiep Tran Atsuhiro Takasu http://arxiv.org/abs/2508.11499v1 Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models 2025-08-15T14:20:58Z

Historical handwritten text recognition (HTR) is essential for unlocking the cultural and scholarly value of archival documents, yet digitization is often hindered by scarce transcriptions, linguistic variation, and highly diverse handwriting styles. In this study, we apply TrOCR, a state-of-the-art transformer-based HTR model, to 16th-century Latin manuscripts authored by Rudolf Gwalther. We investigate targeted image preprocessing and a broad suite of data augmentation techniques, introducing four novel augmentation methods designed specifically for historical handwriting characteristics. We also evaluate ensemble learning approaches to leverage the complementary strengths of augmentation-trained models. On the Gwalther dataset, our best single-model augmentation (Elastic) achieves a Character Error Rate (CER) of 1.86, while a top-5 voting ensemble achieves a CER of 1.60, representing a 50% relative improvement over the best reported TrOCR_BASE result and a 42% improvement over the previous state of the art. These results highlight the impact of domain-specific augmentations and ensemble strategies in advancing HTR performance for historical manuscripts.
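For readers unfamiliar with the metric, a minimal sketch of how a Character Error Rate is commonly computed is given below. It assumes the usual definition (Levenshtein edit distance divided by reference length) and assumes the figures above are percentages; the abstract does not state the exact evaluation script, and the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, using one rolling row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER in percent: character edits needed per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def relative_improvement(baseline, improved):
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # One substituted character in a seven-character reference.
    print(character_error_rate("dominus", "domimus"))
    # Ensemble CER versus best single-model CER, both from the abstract.
    print(relative_improvement(1.86, 1.60))
```

Applied to the two figures reported above, `relative_improvement(1.86, 1.60)` puts the ensemble's gain over the best single model at roughly 14%.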

2025-08-15T14:20:58Z Erez Meoded http://arxiv.org/abs/2412.06606v2 Vulnerability of Text-Matching in ML/AI Conference Reviewer Assignments to Collusions 2025-08-15T02:10:25Z

In the peer review process of top-tier machine learning (ML) and artificial intelligence (AI) conferences, reviewers are assigned to papers through automated methods. These assignment algorithms consider two main factors: (1) reviewers' expressed interests indicated by their bids for papers, and (2) reviewers' domain expertise inferred from the similarity between the text of their previously published papers and the submitted manuscripts. A significant challenge these conferences face is the existence of collusion rings, where groups of researchers manipulate the assignment process to review each other's papers, providing positive evaluations regardless of their actual quality. Most efforts to combat collusion rings have focused on preventing bid manipulation, under the assumption that the text similarity component is secure. In this paper, we demonstrate that even in the absence of bidding, colluding reviewers and authors can exploit the machine learning based text-matching component of reviewer assignment used at top ML/AI venues to get assigned their target paper. We also highlight specific vulnerabilities within this system and offer suggestions to enhance its robustness.

2024-12-09T15:55:20Z Accepted to 34th USENIX Security Symposium (USENIX Security 25) Jhih-Yi Hsieh Aditi Raghunathan Nihar B. Shah http://arxiv.org/abs/2508.10467v1 FIRESPARQL: A LLM-based Framework for SPARQL Query Generation over Scholarly Knowledge Graphs 2025-08-14T09:08:50Z

Question answering over Scholarly Knowledge Graphs (SKGs) remains a challenging task due to the complexity of scholarly content and the intricate structure of these graphs. Large Language Model (LLM) approaches could be used to translate natural language questions (NLQs) into SPARQL queries; however, these LLM-based approaches struggle with SPARQL query generation due to limited exposure to SKG-specific content and the underlying schema. We identified two main types of errors in the LLM-generated SPARQL queries: (i) structural inconsistencies, such as missing or redundant triples in the queries, and (ii) semantic inaccuracies, where incorrect entities or properties appear in the queries despite a correct query structure. To address these issues, we propose FIRESPARQL, a modular framework that supports fine-tuned LLMs as a core component, with optional context provided via retrieval-augmented generation (RAG) and a SPARQL query correction layer. We evaluate the framework on the SciQA Benchmark using various configurations (zero-shot, zero-shot with RAG, one-shot, fine-tuning, and fine-tuning with RAG) and compare the performance with baseline and state-of-the-art approaches. We measure query accuracy using BLEU and ROUGE metrics, and query result accuracy using relaxed exact match (RelaxedEM), with respect to the gold standards containing the NLQs, SPARQL queries, and the results of the queries. Experimental results demonstrate that fine-tuning achieves the highest overall performance, reaching 0.90 ROUGE-L for query accuracy and 0.85 RelaxedEM for result accuracy on the test set.
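To make the query-accuracy metric concrete, the following is a minimal sketch of a ROUGE-L F1 score between a gold SPARQL query and a generated one. It assumes whitespace tokenization and an equally weighted F-measure, and the example queries and the `ex:` prefix are hypothetical; the paper's actual evaluation tooling is not described in the abstract and may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace tokens (precision and recall weighted equally)."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical queries: the candidate uses one wrong property name.
GOLD = "SELECT ?paper WHERE { ?paper ex:hasTopic ?topic }"
GENERATED = "SELECT ?paper WHERE { ?paper ex:topic ?topic }"

if __name__ == "__main__":
    print(rouge_l_f1(GOLD, GENERATED))
```

In this toy case a single wrong property still yields a high overlap score, which illustrates why the abstract reports a separate measure of result accuracy alongside the textual query metrics.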

2025-08-14T09:08:50Z Accepted at 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K) Xueli Pan Victor de Boer Jacco van Ossenbruggen http://arxiv.org/abs/2508.09936v1 Quo Vadis Handwritten Text Generation for Handwritten Text Recognition? 2025-08-13T16:39:18Z

The digitization of historical manuscripts presents significant challenges for Handwritten Text Recognition (HTR) systems, particularly when dealing with small, author-specific collections that diverge from the training data distributions. Handwritten Text Generation (HTG) techniques, which generate synthetic data tailored to specific handwriting styles, offer a promising solution to address these challenges. However, the effectiveness of various HTG models in enhancing HTR performance, especially in low-resource transcription settings, has not been thoroughly evaluated. In this work, we systematically compare three state-of-the-art styled HTG models (representing the generative adversarial, diffusion, and autoregressive paradigms for HTG) to assess their impact on HTR fine-tuning. We analyze how visual and linguistic characteristics of synthetic data influence fine-tuning outcomes and provide quantitative guidelines for selecting the most effective HTG model. The results of our analysis provide insights into the current capabilities of HTG methods and highlight key areas for further improvement in their application to low-resource HTR.

2025-08-13T16:39:18Z Accepted at ICCV Workshop VisionDocs Vittorio Pippi Konstantina Nikolaidou Silvia Cascianelli George Retsinas Giorgos Sfikas Rita Cucchiara Marcus Liwicki http://arxiv.org/abs/2508.09535v1 AI Blob! LLM-Driven Recontextualization of Italian Television Archives 2025-08-13T06:38:32Z

This paper introduces AI Blob!, an experimental system designed to explore the potential of semantic cataloging and Large Language Models (LLMs) for the retrieval and recontextualization of archival television footage. Drawing methodological inspiration from Italian television programs such as Blob (RAI Tre, 1989-), AI Blob! integrates automatic speech recognition (ASR), semantic embeddings, and retrieval-augmented generation (RAG) to organize and reinterpret archival content. The system processes a curated dataset of 1,547 Italian television videos by transcribing audio, segmenting it into sentence-level units, and embedding these segments into a vector database for semantic querying. Upon user input of a thematic prompt, the LLM generates a range of linguistically and conceptually related queries, guiding the retrieval and recombination of audiovisual fragments. These fragments are algorithmically selected and structured into narrative sequences producing montages that emulate editorial practices of ironic juxtaposition and thematic coherence. By foregrounding dynamic, content-aware retrieval over static metadata schemas, AI Blob! demonstrates how semantic technologies can facilitate new approaches to archival engagement, enabling novel forms of automated narrative construction and cultural analysis. The project contributes to ongoing debates in media historiography and AI-driven archival research, offering both a conceptual framework and a publicly available dataset to support further interdisciplinary experimentation.

2025-08-13T06:38:32Z Preprint 16th Media Mutations International Conference (pp. 123-133) 2026 Roberto Balestri 10.66062/PHBQ6517 http://arxiv.org/abs/2507.18840v2 How is science discussed on Bluesky? 2025-08-11T15:20:17Z

Amid the migration of academics from X, the social media platform Bluesky has emerged as a potential alternative. To assess its viability and relevance for science communication, this study presents the first large-scale analysis of scholarly article dissemination on Bluesky, exploring its potential as a new source of social media metrics. We collected and analysed over 2.6 million Bluesky posts referencing 532,302 scholarly articles from January 2023 to July 2025, integrating metadata from the OpenAlex database. Temporal trends, disciplinary coverage, language use, textual characteristics, and user engagement were examined. A sharp increase in scholarly activity on Bluesky was observed from November 2024 to January 2025, coinciding with broader academic shifts away from X. As on X, Bluesky posts primarily concern the health, social, and environmental sciences and are predominantly written in English. Nevertheless, Bluesky posts demonstrate substantially higher levels of interaction (likes, reposts, replies, and quotes) and greater textual originality than previously reported for X, suggesting both stronger interactive and more interpretive engagement. These findings highlight Bluesky's emerging role as a credible platform for science communication and a promising source for altmetrics. The platform may facilitate not only early visibility of research outputs but also more meaningful scholarly dialogue in the evolving social media landscape.

2025-07-24T22:31:28Z 33 pages, 9 figures Er-Te Zheng Xiaorui Jiang Zhichao Fang Mike Thelwall http://arxiv.org/abs/2508.08347v1 Exploring the Technical Knowledge Interaction of Global Digital Humanities: Three-decade Evidence from Bibliometric-based perspectives 2025-08-11T12:27:39Z

Digital Humanities (DH) is an interdisciplinary field that integrates computational methods with humanities scholarship to investigate innovative topics. Each academic discipline follows a unique developmental path shaped by the topics researchers investigate and the methods they employ. With the help of bibliometric analysis, most previous studies have examined DH across multiple dimensions such as research hotspots, co-author networks, and institutional rankings. However, these studies have often been limited in their ability to provide deep insights into the current state of technological advancements and topic development in DH. As a result, their conclusions tend to remain superficial or lack interpretability in understanding how methods and topics interrelate in the field. To address this gap, this study introduces a new concept, Topic-Method Composition (TMC), which refers to a hybrid knowledge structure generated by the co-occurrence of specific research topics and their corresponding methods. In particular, by analyzing the interactions between TMCs, we can see more clearly the intersection and integration of digital technology and humanistic subjects in DH. Moreover, this study develops a TMC-based workflow combining bibliometric analysis, topic modeling, and network analysis to analyze the development characteristics and patterns of research disciplines. Applied to large-scale bibliometric data, this workflow enables a detailed view of the knowledge structures and provides a tool adaptable to other fields.

2025-08-11T12:27:39Z Proceedings of 2025 Digital Humanities Conference Jiayi Li Chengxi Yan Yurong Zeng Zhichao Fang Huiru Wang http://arxiv.org/abs/2508.07196v1 Can Smaller Large Language Models Evaluate Research Quality? 2025-08-10T06:18:40Z

Although both Google Gemini (1.5 Flash) and ChatGPT (4o and 4o-mini) give research quality evaluation scores that correlate positively with expert scores in nearly all fields, and more strongly than citations in most, it is not known whether this is true for smaller Large Language Models (LLMs). In response, this article assesses Google's Gemma-3-27b-it, a downloadable LLM (60Gb). The results for 104,187 articles show that Gemma-3-27b-it scores correlate positively with an expert research quality score proxy for all 34 Units of Assessment (broad fields) from the UK Research Excellence Framework 2021. The Gemma-3-27b-it correlations have 83.8% of the strength of ChatGPT 4o and 94.7% of the strength of ChatGPT 4o-mini correlations. Unlike the two larger LLMs, Gemma-3-27b-it produces correlations that do not increase substantially when the scores are averaged across five repetitions, its scores tend to be lower, and its reports are relatively uniform in style. Overall, the results show that research quality score estimation can be conducted by offline LLMs, so this capability is not an emergent property of the largest LLMs. Moreover, score improvement through repetition is not a universal feature of LLMs. In conclusion, although the largest LLMs still have the highest research evaluation score estimation capability, smaller ones can also be used for this task, which can be helpful for saving costs or when secure offline processing is needed.

2025-08-10T06:18:40Z Mike Thelwall http://arxiv.org/abs/2501.05821v2 Analysing the coverage of the University of Bologna's bibliographic and citation metadata in OpenCitations collections 2025-08-08T08:41:01Z

This study focuses on analysing the coverage of publications' metadata available in the Current Research Information System (CRIS) infrastructure of the University of Bologna (UNIBO), implemented by the IRIS platform, within an authoritative source of open research information, i.e. OpenCitations. The analysis considers data regarding the publication entities alongside the citation links. We precisely quantify the proportion of UNIBO IRIS publications included in OpenCitations, examine their types, and evaluate the number of citations in OpenCitations that involve IRIS publications. Our methodology filters and transforms data dumps of IRIS and OpenCitations, creating novel datasets used for the analysis. Our findings reveal that only 36% of IRIS is covered in OpenCitations, with journal articles exhibiting the highest coverage. We identified 5,129,406 citation links pointing to UNIBO IRIS publications. From a purely quantitative perspective, comparing our results with broader proprietary services like Scopus and Web of Science reveals a comparable quantitative coverage in the number of IRIS bibliographic resources included in all the systems analysed (OpenCitations, Scopus and Web of Science) as well as in the number of citations received by them.

2025-01-10T10:00:21Z Erica Andreose Salvatore Di Marzo Ivan Heibi Silvio Peroni Leonardo Zilli