Exploring Citation Diversity in Scholarly Literature: An Entropy-Based Approach

2025-04-02T13:43:53Z

This study explores the citation diversity in scholarly literature, analyzing different patterns of citations observed within different countries and academic disciplines. We examine citation distributions across top institutions within certain countries and find that the higher end of the distribution follows a Power Law or Pareto Law pattern; the scaling exponent of the Pareto Law varies depending on the number of top institutions included in the analysis. By adopting a novel entropy-based diversity measure, our findings reveal that countries with both small and large economies tend to cluster similarly in terms of citation diversity. The composition of countries within each group changes as the number of top institutions considered in the analysis varies. Moreover, we analyze citation diversity among award-winning scientists across six scientific disciplines, finding significant variations. We also explore the evolution of citation diversity over the past century across multiple fields. A gender-based study in several disciplines confirms varying citation diversities among male and female scientists. Our innovative citation diversity measure stands out as a valuable tool for assessing the unevenness of citation distributions, providing deeper insights that go beyond what traditional citation counts alone can reveal. This comprehensive analysis enhances our understanding of global scientific contributions and fosters a more equitable view of academic achievements.
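The abstract does not define its entropy-based diversity measure, so the sketch below uses normalized Shannon entropy as a stand-in to show how the unevenness of a citation distribution can be scored. The function name and the sample citation counts are hypothetical and are not taken from the study.

```python
import math

def normalized_entropy(citations):
    """Shannon entropy of citation shares, scaled to the range [0, 1].

    Assumption: this is a generic diversity score, not the paper's measure.
    1.0 means citations are spread evenly; values near 0 mean a few
    institutions hold almost all citations.
    """
    total = sum(citations)
    if total <= 0 or len(citations) < 2:
        return 0.0
    # Zero-citation entries contribute nothing to the entropy sum.
    shares = [c / total for c in citations if c > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    # Divide by the maximum possible entropy for this many institutions.
    return entropy / math.log(len(citations))

# Hypothetical citation counts for the top four institutions of two countries.
even = normalized_entropy([100, 100, 100, 100])
skewed = normalized_entropy([970, 10, 10, 10])
print(f"even: {even:.3f}, skewed: {skewed:.3f}")
```

Dividing by the logarithm of the number of institutions keeps scores comparable when the number of top institutions included in the analysis changes.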

Semantic Information Management in Low-Temperature Plasma Science and Technology with VIVO

2025-04-02T08:21:51Z

Digital research data management is increasingly integrated across universities and research institutions, addressing the handling of research data throughout its lifecycle according to the FAIR data principles (Findable, Accessible, Interoperable, Reusable). Recent emphasis on the semantic and interlinking aspects of research data, e.g., by using ontologies and knowledge graphs, further enhances findability and reusability. This work presents a framework for creating and maintaining a knowledge graph specifically for low-temperature plasma (LTP) science and technology. The framework leverages a domain-specific ontology called Plasma-O, along with the VIVO software as a platform for semantic information management in LTP research. While some research fields are already prepared to use ontologies and knowledge graphs for information management, their application in LTP research is nascent. This work aims to bridge this gap by providing a framework that not only improves research data management but also fosters community participation in building the domain-specific ontology and knowledge graph based on the published materials. The results may also support other research fields in the practical use of knowledge graphs for semantic information management.

LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

2025-04-01T13:03:33Z

Extracting structured information from unstructured text is crucial for modeling real-world processes, but traditional schema mining relies on semi-structured data, limiting scalability. This paper introduces schema-miner, a novel tool that combines large language models with human feedback to automate and refine schema extraction. Through an iterative workflow, it organizes properties from text, incorporates expert input, and integrates domain-specific ontologies for semantic depth. Applied to materials science--specifically atomic layer deposition--schema-miner demonstrates that expert-guided LLMs generate semantically rich schemas suitable for diverse real-world applications.

Multimodal LLMs for OCR, OCR Post-Correction, and Named Entity Recognition in Historical Documents

2025-04-01T04:21:34Z

We explore how multimodal Large Language Models (mLLMs) can help researchers transcribe historical documents, extract relevant historical information, and construct datasets from historical sources. Specifically, we investigate the capabilities of mLLMs in performing (1) Optical Character Recognition (OCR), (2) OCR Post-Correction, and (3) Named Entity Recognition (NER) tasks on a set of city directories published in German between 1754 and 1870. First, we benchmark the off-the-shelf transcription accuracy of both mLLMs and conventional OCR models. We find that the best-performing mLLM model significantly outperforms conventional state-of-the-art OCR models and other frontier mLLMs. Second, we are the first to introduce multimodal post-correction of OCR output using mLLMs. We find that this novel approach leads to a drastic improvement in transcription accuracy and consistently produces highly accurate transcriptions (<1% CER), without any image pre-processing or model fine-tuning. Third, we demonstrate that mLLMs can efficiently recognize entities in transcriptions of historical documents and parse them into structured dataset formats. Our findings provide early evidence for the long-term potential of mLLMs to introduce a paradigm shift in the approaches to historical data collection and document transcription.
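For readers unfamiliar with the metric behind the "<1% CER" figure, character error rate is commonly computed as the character-level edit distance between a transcription and its ground truth, divided by the length of the ground truth. The sketch below follows that common definition; the abstract does not state the exact evaluation procedure, and the sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate relative to the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)

# Hypothetical directory entry with one misread character ("i" read as "l").
print(f"{cer('Schneider', 'Schnelder'):.3f}")
```

Under this definition, a CER below 1% corresponds to fewer than one character edit per hundred reference characters.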

Libraries, Digital Libraries, and Data: Forty years, Four Challenges

2025-04-01T03:41:10Z

"Digital libraries" is an umbrella term that encompasses the automation of library services, online catalogs, information retrieval systems, multi-media databases, data archives, and other internet-facing collections of digital resources. Clifford Lynch has played pivotal roles in the technical development, institutionalization, policy, practice, and dissemination of digital libraries for more than 40 years. Beginning with his foundational role in building MELVYL for the University of California in the early 1980s -- the first internet-native online open access library catalog -- through his convening roles in open access and open data in the 21st century, his career is marked by multiple milestones of innovation. Clifford Lynch's career has traced the trajectory of digital libraries and knowledge infrastructures. Over the course of these 40 years, research libraries have faced four categories of challenges: invisible infrastructure, content and collections, preservation and access, and institutional boundaries. These challenges have become yet more complex in an era of open access, open data, and evolving regimes of intellectual property and scholarly publishing. As the digital library communities have merged and diverged over this time span, we collectively face challenges for at least the next 40 years to sustain access to current resources while growing the next generations of digital libraries and librarians.

From Content Creation to Citation Inflation: A GenAI Case Study

2025-03-30T12:17:26Z

This paper investigates the presence and impact of questionable, AI-generated academic papers on widely used preprint repositories, with a focus on their role in citation manipulation. Motivated by suspicious patterns observed in publications related to our ongoing research on GenAI-enhanced cybersecurity, we identify clusters of questionable papers and profiles. These papers frequently exhibit minimal technical content, repetitive structure, unverifiable authorship, and mutually reinforcing citation patterns among a recurring set of authors. To assess the feasibility and implications of such practices, we conduct a controlled experiment: generating a fake paper using GenAI, embedding citations to suspected questionable publications, and uploading it to one such repository (ResearchGate). Our findings demonstrate that such papers can bypass platform checks, remain publicly accessible, and contribute to inflating citation metrics like the H-index and i10-index. We present a detailed analysis of the mechanisms involved, highlight systemic weaknesses in content moderation, and offer recommendations for improving platform accountability and preserving academic integrity in the age of GenAI.
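The two metrics named above have standard definitions: the h-index is the largest h such that h papers have at least h citations each, and the i10-index is the number of papers with at least 10 citations. The sketch below computes both for a hypothetical author profile to show how a single extra citation per paper can shift them; the numbers are illustrative only and do not come from the study.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)

# Hypothetical per-paper citation counts for one profile.
profile = [25, 12, 9, 4, 3, 0]
# The same profile after one fabricated paper cites every item once.
inflated = [count + 1 for count in profile]

print(h_index(profile), i10_index(profile))
print(h_index(inflated), i10_index(inflated))
```

In this example the paper sitting at 9 citations crosses the i10 threshold after a single added citation, which illustrates why threshold-based metrics are sensitive to small numbers of planted citations.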

Is Journal Citation Indicator a good metric for Art & Humanities Journals currently?

2025-03-30T11:17:39Z

Probably Not. Journal Citation Indicator (JCI) was introduced to address the limitations of traditional metrics like the Journal Impact Factor (JIF), particularly its inability to normalize citation impact across different disciplines. This study reveals that JCI faces significant challenges in field normalization for Art & Humanities journals, as evidenced by much lower correlations with a more granular, paper-level metric, CNCI-CT. A detailed analysis of Architecture journals highlights how journal-level misclassification and the interdisciplinary nature of content exacerbate these issues, leading to less reliable evaluations. We recommend improving journal classification systems or adopting paper-level normalization methods, potentially supported by advanced AI techniques, to enhance the accuracy and effectiveness of JCI for Art & Humanities disciplines.

Mapping the changing structure of science through diachronic periodical embeddings

2025-03-30T02:44:58Z

Understanding the changing structure of science over time is essential to elucidating how science evolves. We develop diachronic embeddings of scholarly periodicals to quantify "semantic changes" of periodicals across decades, allowing us to track the evolution of research topics and identify rapidly developing fields. By mapping periodicals within a physical-life-health triangle, we reveal an evolving interdisciplinary science landscape, finding an overall trend toward specialization for most periodicals but increasing interdisciplinarity for bioscience periodicals. Analyzing a periodical's trajectory within this triangle over time allows us to visualize how its research focus shifts. Furthermore, by monitoring the formation of local clusters of periodicals, we can identify emerging research topics such as AIDS research and nanotechnology in the 1980s. Our work offers novel quantification in the science of science and provides a quantitative lens to examine the evolution of science, which may facilitate future investigations into the emergence and development of research fields.

Student-Powered Digital Scholarship CoLab Project in the HKUST Library: Develop a Chinese Named-Entity Recognition (NER) Tool within One Semester from the Ground Up

2025-03-29T04:15:34Z

Starting in February 2024, the HKUST Library further extended the scope of AI literacy to AI utilization, fostering student involvement in using state-of-the-art technologies in Library-initiated projects under a scheme named "Digital Scholarship (DS) CoLab". A key focus of the DS CoLab scheme has been cultivating talent and enabling students to use advanced technologies in practical contexts. It aims to reinforce the library's role as a catalyst and hub for multidisciplinary collaboration and to cultivate a "can-do spirit" among university members. The Library offers 1-2 projects per year for students to engage with advanced technologies in practical contexts while supporting the Library in tackling challenges and streamlining operational tasks. The tool introduced in this paper was mainly developed by two of the authors, Sherry Yip Sau Lai and Berry Han Liuruo, as part-time student helpers under one of our DS CoLab projects in the 2024 Spring Semester (February to May 2024). This paper details the complete journey of developing a Chinese Named-Entity Recognition (NER) Tool from the ground up within one semester, from ideation, initial research, and planning through execution to a viable product. The collaborative spirit fostered by this project, with students playing a central role, exemplifies the power and potential of innovative educational models that prioritize hands-on learning with student involvement.

On the Alignment of Post-Publication Reviews & Bibliometric and Altmetric Impact -- A Case Study on Expert Statements from the Science Media Center Germany

2025-03-28T16:41:41Z

In the context of academic publishing and peer review, this study investigates the relationship between post-publication expert evaluations, their agreement levels, and the subsequent scientific and public recognition of the reviewed research. Using expert statements from the Science Media Center Germany as a dataset, we analyze Research in Context reviews to examine the alignment between qualitative post-publication assessments and bibliometric as well as altmetric indicators. We employ a Large Language Model to translate unstructured expert reviews into a structured rating scheme. Furthermore, we correlate these evaluations with citation counts from the Web of Science and alternative impact metrics such as the Altmetric Attention Score, news mentions, and Mendeley readership statistics from the Altmetric Explorer. We investigate the alignment of positive or critical post-publication reviews and high or low citation or altmetric counts.

Historical Ink: Exploring Large Language Models for Irony Detection in 19th-Century Spanish

2025-03-28T16:33:24Z

This study explores the use of large language models (LLMs) to enhance datasets and improve irony detection in 19th-century Latin American newspapers. Two strategies were employed to evaluate the efficacy of BERT and GPT-4o models in capturing the subtle, nuanced nature of irony, through both multi-class and binary classification tasks. First, we implemented dataset enhancements focused on enriching emotional and contextual cues; however, these showed limited impact on historical language analysis. The second strategy, a semi-automated annotation process, effectively addressed class imbalance and augmented the dataset with high-quality annotations. Despite the challenges posed by the complexity of irony, this work contributes to the advancement of sentiment analysis through two key contributions: introducing a new historical Spanish dataset tagged for sentiment analysis and irony detection, and proposing a semi-automated annotation methodology where human expertise is crucial for refining LLM results, enriched by incorporating historical and cultural contexts as core features.

Tackling paper mills requires us to prevent future contamination and clean up the past -- the case of the journal Bioengineered

2025-03-28T05:59:44Z

Introduction: The Taylor & Francis journal Bioengineered has been targeted by paper mills. The goal of this study is to identify problematic articles published in Bioengineered during the period 2010 to 2024. Methods: Dimensions was used to search for articles that contained the terms mouse OR mice OR rat OR rats in title or abstract, published in Bioengineered between January 1st 2010 and December 31st 2024. All articles were assessed by eye and by using software to detect inappropriate image duplication and manipulation. An article was classified as problematic if it contained inappropriate image duplication or manipulation or had been previously retracted. Problematic articles were reported on PubPeer by the authors, if they had not been reported previously. All included articles were assessed for post-publication editorial decisions. Results: We have excluded all articles published in 2024 from further analysis, as these were all retraction notices. We assessed the remaining 878 articles, of which 226 (25.7%) were identified as problematic, of which 35 had been previously retracted. One retracted article was later de-retracted. One article received a correction. None of the included articles received an expression of concern or the Taylor & Francis "under investigation" pop-up. Conclusions: Taylor & Francis's lack of editorial action has left the scientific community vulnerable to reading and citing hundreds of problematic articles published in Bioengineered. To uphold scientific integrity, Taylor & Francis should use the findings of this study as a starting point to systematically identify all compromised articles in Bioengineered and take appropriate editorial action.

Mapping the Digital Diplomatic Infrastructure: A Comparative Evaluation of Global Online Directories for Diplomatic Missions

2025-03-27T16:08:15Z

This study provides a comparative evaluation of global diplomatic mission directories. DiplomaticMonitor.org, EmbassyPages.com, and WikiData.org are strategically selected among the top ten global services. After analyzing nearly all available online global diplomatic directory services, these three platforms are selected as they represent fundamentally different approaches to creating worldwide diplomatic mission databases. Using official diplomatic lists from over 150 countries as benchmarks, we assessed data coverage, accuracy, and update frequency across these platforms. DiplomaticMonitor consistently outperforms its counterparts in structure, completeness, and timeliness, accurately reflecting ambassadorial appointment cycles and maintaining high precision across contact and personnel records. EmbassyPages, despite strong search engine visibility and widespread usage, exhibits significant data currency issues, with markedly diminished ambassadorial accuracy attributable to delayed refresh cycles. WikiData offers valuable historical documentation and open-source accessibility but lacks the consistency and verification protocols necessary for reliable real-time diplomatic information. Our findings highlight the critical challenge posed by the absence of a standardized global diplomatic mission registry. In this fragmented landscape, methodologically rigorous third-party platforms can occasionally surpass government-published records in quality and utility. The research demonstrates that in contemporary digital diplomacy, data reliability correlates less with institutional provenance than with disciplined, transparent, and consistent data stewardship practices.

A Quantitative Approach to Evaluating Open-Source EHR Systems for Indian Healthcare

2025-03-27T14:55:03Z

The increasing use of Electronic Health Records (EHR) has emphasized the need for standardization and interoperability in healthcare data management. The Ministry of Health and Family Welfare, Government of India, has introduced the Electronic Health Record Minimum Data Set (EHRMDS) to facilitate uniformity in clinical documentation. However, the compatibility of Open-Source Electronic Health Record Systems (OS-EHRS) with EHRMDS remains largely unexplored. This study conducts a systematic assessment of the alignment between EHRMDS and commonly utilized OS-EHRS to determine the most appropriate system for healthcare environments in India. A quantitative closeness analysis was performed by comparing the metadata elements of EHRMDS with those of 10 selected OS-EHRS. Using crosswalk methodologies based on syntactic and semantic similarity, the study measured the extent of metadata alignment. Results indicate that OpenEMR exhibits the highest compatibility with EHRMDS, covering 73.81% of its metadata elements, while OpenClinic shows the least alignment at 33.33%. Additionally, the analysis identified 47 metadata elements present in OS-EHRS but absent in EHRMDS, suggesting the need for an extended metadata schema. By bridging gaps in clinical metadata, this study contributes to enhancing the interoperability of EHR systems in India. The findings provide valuable insights for healthcare policymakers and organizations seeking to adopt OS-EHRS aligned with national standards. Keywords. EHR metadata, electronic health record systems, EHRMDS, meta data, structured vocabularies, metadata crosswalk, methodologies and tools, SNOMED-CT, UMLS terms.
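As a rough illustration of how a crosswalk coverage percentage of this kind can be computed, the sketch below matches element names after simple normalization and through a hand-made synonym table. It is a simplified stand-in for the syntactic and semantic similarity methods mentioned above, not the study's procedure; the function names, element names, and synonym table are all hypothetical.

```python
def normalize(name):
    """Lowercase a name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def crosswalk(standard, system, synonyms=None):
    """Return (coverage percent, matched standard elements, extra system elements).

    Assumption: a standard element counts as covered when its normalized
    name, or the normalized form of one of its listed synonyms, appears
    among the system's element names.
    """
    if not standard:
        raise ValueError("standard element list must not be empty")
    synonyms = synonyms or {}
    system_keys = {normalize(name): name for name in system}
    matched, used = [], set()
    for element in standard:
        candidates = {normalize(element)}
        candidates.update(normalize(s) for s in synonyms.get(element, []))
        hits = candidates.intersection(system_keys)
        if hits:
            matched.append(element)
            used.update(hits)
    coverage = 100.0 * len(matched) / len(standard)
    # System elements that no standard element mapped to.
    extra = [name for key, name in system_keys.items() if key not in used]
    return coverage, matched, extra

# Hypothetical element names, for illustration only.
standard = ["Patient Name", "Date of Birth", "Blood Group"]
system = ["patient_name", "DOB", "insurance_id"]
coverage, matched, extra = crosswalk(
    standard, system, synonyms={"Date of Birth": ["DOB"]}
)
print(f"{coverage:.2f}% covered; extra system elements: {extra}")
```

The `extra` list plays the same role as the elements the study reports as present in OS-EHRS but absent from EHRMDS.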

Resilience and Volatility in Academic Publishing, The Case of the University of Maribor 2004-2023

2025-03-27T12:10:22Z

This article investigates the dynamics of academic publishing resilience and volatility at Slovenia's University of Maribor (UM) from 2004 to 2023. This period was marked by significant economic pressures and policy shifts, including changes to higher education legislation and university funding. Using UM's employment data and OpenAlex publication records, the study examines the relationship between employed researcher numbers and unique authors publishing under the UM affiliation. Despite a substantial decrease in researcher employment during the 2009-2013 economic recession and austerity phase, the number of unique authors publishing with UM affiliation surprisingly increased. This growth was driven by factors such as a shift towards project-based funding, contributions from an expanding doctoral student cohort, and increased international collaborations. Analysis of author turnover reveals a notable contrast: high short-term volatility (annual churn rates of ~40-50%) versus significant mid-term stability (5-year churn rates of ~8-10%). Survival analysis confirms this trend, showing high initial attrition among publishing authors but long-term persistence for a core group. Furthermore, co-authorship network analysis indicates the UM research network has become more resilient over time. A critical finding is a fundamental shift in network structure around 2016, transitioning from disassortative to assortative mixing, signaling profound changes in collaboration dynamics. The findings carry implications for research policy and university management, highlighting the necessity of balancing short-term performance indicators with the long-term stability and resilience essential for a thriving research community.
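The abstract does not say how churn is defined, so the sketch below assumes one plausible reading: churn is the share of authors active in one period who do not publish in the next, with periods formed by pooling consecutive years. The function names and the toy data are hypothetical.

```python
def churn_rate(previous_authors, current_authors):
    """Share of previously active authors absent from the current period."""
    previous = set(previous_authors)
    if not previous:
        raise ValueError("previous period has no authors")
    return len(previous - set(current_authors)) / len(previous)

def windowed_churn(authors_by_year, window):
    """Churn between consecutive, non-overlapping windows of `window` years.

    Assumption: the years in authors_by_year are consecutive; trailing
    years that do not fill a whole window are ignored.
    """
    years = sorted(authors_by_year)
    pools = []
    for start in range(0, len(years) - window + 1, window):
        pool = set()
        for year in years[start:start + window]:
            pool |= set(authors_by_year[year])
        pools.append(pool)
    return [churn_rate(a, b) for a, b in zip(pools, pools[1:])]

# Toy data: author "a" publishes every year, the others only occasionally.
data = {
    2004: {"a", "b"},
    2005: {"a", "c"},
    2006: {"a", "b"},
    2007: {"a", "d"},
}
print(windowed_churn(data, window=1))
print(windowed_churn(data, window=2))
```

In the toy data, authors who skip a year count as lost in the annual comparison but not in the pooled one, so the longer window yields a lower rate. This matches the direction of the contrast reported above, although the study's actual definition may differ.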