https://arxiv.org/api/4Xz/+05eTC+vtX578rbqIf6C+fg 2026-06-14T18:29:10Z 6065 780 15 http://arxiv.org/abs/2506.03321v1 Enhancing Automatic PT Tagging for MEDLINE Citations Using Transformer-Based Models 2025-06-03T19:06:51Z

We investigated the feasibility of predicting Medical Subject Headings (MeSH) Publication Types (PTs) from MEDLINE citation metadata using pre-trained Transformer-based models BERT and DistilBERT. This study addresses limitations in the current automated indexing process, which relies on legacy NLP algorithms. We evaluated monolithic multi-label classifiers and binary classifier ensembles to enhance the retrieval of biomedical literature. Results demonstrate the potential of Transformer models to significantly improve PT tagging accuracy, paving the way for scalable, efficient biomedical indexing.

2025-06-03T19:06:51Z 26 pages, 8 tables, 3 figures Victor H. Cid James Mork http://arxiv.org/abs/2502.19030v2 Sampling nodes and hyperedges via random walks on large hypergraphs 2025-06-03T15:46:14Z

Hypergraphs provide a fundamental framework for representing complex systems involving interactions among three or more entities. As empirical hypergraphs grow in size, characterizing their structural properties becomes increasingly challenging due to computational complexity and, in some cases, restricted access to complete data, requiring efficient sampling methods. Random walks offer a practical approach to hypergraph sampling, as they rely solely on local neighborhood information from nodes and hyperedges. In this study, we investigate methods for simultaneously sampling nodes and hyperedges via random walks on large hypergraphs. First, we compare three existing random walks in the context of hypergraph sampling and identify an advantage of the so-called higher-order random walk. Second, by extending an established technique for graphs to the case of hypergraphs, we present a non-backtracking variant of the higher-order random walk. We derive theoretical results on estimators based on the non-backtracking higher-order random walk and validate them through numerical simulations on large empirical hypergraphs. Third, we apply the non-backtracking higher-order random walk to a large hypergraph of co-authorships indexed in the OpenAlex database, where full access to the data is not readily available. Despite the relatively small sample size, our estimates largely align with previous findings on author productivity, team size, and the prevalence of open-access publications. Our findings contribute to the development of analysis methods for large hypergraphs, offering insights into sampling strategies and estimation techniques applicable to real-world complex systems.

2025-02-26T10:36:38Z Applied Network Science Vol. 10, Article number: 19 (2025) Kazuki Nakajima Masanao Kodakari Masaki Aida 10.1007/s41109-025-00704-z http://arxiv.org/abs/2501.10727v2 In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review 2025-06-02T12:18:57Z

Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static -- they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings of datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifact and dataset. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at http://inthepicture.itu.dk/.

2025-01-18T11:03:59Z ACM Conference on Fairness, Accountability, and Transparency - FAccT 2025 Amelia Jiménez-Sánchez Natalia-Rozalia Avlona Sarah de Boer Víctor M. Campello Aasa Feragen Enzo Ferrante Melanie Ganz Judy Wawira Gichoya Camila González Steff Groefsema Alessa Hering Adam Hulman Leo Joskowicz Dovile Juodelyte Melih Kandemir Thijs Kooi Jorge del Pozo Lérida Livie Yumeng Li Andre Pacheco Tim Rädsch Mauricio Reyes Théo Sourget Bram van Ginneken David Wen Nina Weng Jack Junchi Xu Hubert Dariusz Zając Maria A. Zuluaga Veronika Cheplygina 10.1145/3715275.3732035 http://arxiv.org/abs/2410.09510v2 SciEvo: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis 2025-06-01T02:58:47Z

Understanding the creation, evolution, and dissemination of scientific knowledge is crucial for bridging diverse subject areas and addressing complex global challenges such as pandemics, climate change, and ethical AI. Scientometrics, the quantitative and qualitative study of scientific literature, provides valuable insights into these processes. We introduce SciEvo, a longitudinal scientometric dataset with over two million academic publications, providing comprehensive contents information and citation graphs to support cross-disciplinary analyses. SciEvo is easy to use and available across platforms, including GitHub, Kaggle, and HuggingFace. Using SciEvo, we conduct a temporal study spanning over 30 years to explore key questions in scientometrics: the evolution of academic terminology, citation patterns, and interdisciplinary knowledge exchange. Our findings reveal critical insights, such as disparities in epistemic cultures, knowledge production modes, and citation practices. For example, rapidly developing, application-driven fields like LLMs exhibit significantly shorter citation age (2.48 years) compared to traditional theoretical disciplines like oral history (9.71 years). Our data and analytic tools can be accessed at https://github.com/Ahren09/SciEvo.

2024-10-12T12:16:57Z We have renamed our dataset from "Scito2M" to "SciEvo" Yiqiao Jin Yijia Xiao Yiyang Wang Jindong Wang http://arxiv.org/abs/2303.16750v2 A Gold Standard Dataset for the Reviewer Assignment Problem 2025-05-30T08:34:07Z

Many peer-review venues are using algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score" -- a numerical estimate of the expertise of a reviewer in reviewing a paper -- and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose the algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better algorithms is the lack of publicly available gold-standard data. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously. Using our dataset, we compare several widely used similarity algorithms and offer key insights. First, all algorithms exhibit significant error, with misranking rates between 12%-30% in easier cases and 36%-43% in harder ones. Second, most specialized algorithms are designed to work with titles and abstracts of papers, and in this regime the SPECTER2 algorithm performs best. Interestingly, classical TF-IDF matches SPECTER2 in accuracy when given access to full submission texts. In contrast, off-the-shelf LLMs lag behind specialized approaches.

2023-03-23T16:15:03Z Ivan Stelmakh John Wieting Sarina Xi Graham Neubig Nihar B. Shah http://arxiv.org/abs/2503.09257v5 A Global Dataset Mapping the AI Innovation from Academic Research to Industrial Patents 2025-05-30T02:55:05Z

In the rapidly evolving field of artificial intelligence (AI), mapping innovation patterns and understanding effective technology transfer from research to applications are essential for economic growth. However, existing data infrastructures suffer from fragmentation, incomplete coverage, and insufficient evaluative capacity. Here, we present DeepInnovationAI, a comprehensive global dataset containing three structured files. DeepPatentAI.csv: Contains 2,356,204 patent records with 8 field-specific attributes. DeepDiveAI.csv: Encompasses 3,511,929 academic publications with 13 metadata fields. These two datasets leverage large language models, multilingual text analysis and dual-layer BERT classifiers to accurately identify AI-related content, while utilizing hypergraph analysis to create robust innovation metrics. Additionally, DeepCosineAI.csv: By applying semantic vector proximity analysis, this file contains 3,511,929 most relevant paper-patent pairs, each described by 3 metadata fields, to facilitate the identification of potential knowledge flows. DeepInnovationAI enables researchers, policymakers, and industry leaders to anticipate trends and identify collaboration opportunities. With extensive temporal and geographical scope, it supports detailed analysis of technological development patterns and international competition dynamics, establishing a foundation for modeling AI innovation and technology transfer processes.

2025-03-12T10:56:02Z 38 pages and 4 figures Haixing Gong Hui Zou Xingzhou Liang Shiyuan Meng Pinlong Cai Xingcheng Xu Jingjing Qu http://arxiv.org/abs/2505.24091v1 Temporally Extending Existing Web Archive Collections for Longitudinal Analysis 2025-05-30T00:38:28Z

The Environmental Governance and Data Initiative (EDGI) regularly crawled US federal environmental websites between 2016 and 2020 to capture changes between two presidential administrations. However, because it does not include the previous administration ending in 2008, the collection is unsuitable for answering our research question, Were the website terms deleted by the Trump administration (2017--2021) added by the Obama administration (2009--2017)? Thus, like many researchers using the Wayback Machine's holdings for historical analysis, we do not have access to a complete collection suiting our needs. To answer our research question, we must extend the EDGI collection back to January, 2008. This includes discovering relevant pages that were not included in the EDGI collection that persisted through 2020, not just going further back in time with the existing pages. We pieced together artifacts collected by various organizations for their purposes through many means (Save Page Now, Archive-It, and more) in order to curate a dataset sufficient for our intentions. In this paper, we contribute a methodology to extend existing web archive collections temporally to enable longitudinal analysis, including a dataset extended with this methodology. We use our new dataset to analyze our question, Were the website terms deleted by the Trump administration added by the Obama administration? We find that 81 percent of the pages in the dataset changed between 2008 and 2020, and that 87 percent of the pages with terms deleted by the Trump administration were terms added during the Obama administration.

2025-05-30T00:38:28Z Presented at RESAW 2025; 23 pages, 14 figures, 8 tables Lesley Frew Michael L. Nelson Michele C. Weigle http://arxiv.org/abs/2506.03180v1 Knowledge Graphs for Digitized Manuscripts in Jagiellonian Digital Library Application 2025-05-29T14:49:24Z

Digitizing cultural heritage collections has become crucial for preservation of historical artifacts and enhancing their availability to the wider public. Galleries, libraries, archives and museums (GLAM institutions) are actively digitizing their holdings and creates extensive digital collections. Those collections are often enriched with metadata describing items but not exactly their contents. The Jagiellonian Digital Library, standing as a good example of such an effort, offers datasets accessible through protocols like OAI-PMH. Despite these improvements, metadata completeness and standardization continue to pose substantial obstacles, limiting the searchability and potential connections between collections. To deal with these challenges, we explore an integrated methodology of computer vision (CV), artificial intelligence (AI), and semantic web technologies to enrich metadata and construct knowledge graphs for digitized manuscripts and incunabula.

2025-05-29T14:49:24Z Jan Ignatowicz Krzysztof Kutt Grzegorz J. Nalepa http://arxiv.org/abs/2503.13503v3 SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models 2025-05-29T02:56:23Z

In recent years, the rapid advancement of Artificial Intelligence (AI) technologies, particularly Large Language Models (LLMs), has revolutionized the paradigm of scientific discovery, establishing AI-for-Science (AI4Science) as a dynamic and evolving field. However, there is still a lack of an effective framework for the overall assessment of AI4Science, particularly from a holistic perspective on data quality and model capability. Therefore, in this study, we propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science from both scientific data and LLM perspectives. First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions: Quality, FAIRness, Explainability, and Compliance-which are subdivided into 15 sub-dimensions. Drawing on data resource papers published between 2018 and 2023 in peer-reviewed journals, we present recommendation lists of AI-ready datasets for Earth, Life, and Materials Sciences, making a novel and original contribution to the field. Concurrently, to assess the capabilities of LLMs across multiple scientific disciplines, we establish 16 assessment dimensions based on five core indicators Knowledge, Understanding, Reasoning, Multimodality, and Values spanning Mathematics, Physics, Chemistry, Life Sciences, and Earth and Space Sciences. Using the developed benchmark datasets, we have conducted a comprehensive evaluation of over 50 representative open-source and closed source LLMs. All the results are publicly available and can be accessed online at www.scihorizon.cn/en.

2025-03-12T11:34:41Z Chuan Qin Xin Chen Chengrui Wang Pengmin Wu Xi Chen Yihang Cheng Jingyi Zhao Meng Xiao Xiangchao Dong Qingqing Long Boya Pan Han Wu Chengzan Li Yuanchun Zhou Hui Xiong Hengshu Zhu http://arxiv.org/abs/2505.22702v1 Are there stars in Bluesky after the return of Donald Trump to the White House? 2025-05-28T17:49:18Z

This study examines the shift in the scientific community from X (formerly Twitter) to Bluesky, its impact on scientific communication, and consequently on social metrics (altmetrics). We analysed 14,497 publications from multidisciplinary and Library and Information Science (LIS) journals between January 2024 and March 2025. The results reveal a notable increase in Bluesky activity for multidisciplinary journals in November 2024, likely influenced by political and platform changes, with mentions multiplying for journals like Nature and Science. In LIS, the adoption of Bluesky is different and shows marked variation between European and United States journals. Although Bluesky remains a minority platform compared to X over the whole period, when focusing on user engagement after the United States elections, we see a much more even distribution between the two platforms. In two LIS journals, Bluesky even surpasses X, while in most others, the difference in user engagement was no longer as pronounced, marking a significant change from previous patterns in altmetrics.

2025-05-28T17:49:18Z arXiv admin note: substantial text overlap with arXiv:2412.05624 Wenceslao Arroyo-Machado Nicolas Robinson-Garcia Daniel Torres-Salinas 10.1016/j.joi.2025.101700 http://arxiv.org/abs/2505.20103v2 SCIRGC: Multi-Granularity Citation Recommendation and Citation Sentence Preference Alignment 2025-05-27T14:05:49Z

Citations are crucial in scientific research articles as they highlight the connection between the current study and prior work. However, this process is often time-consuming for researchers. In this study, we propose the SciRGC framework, which aims to automatically recommend citation articles and generate citation sentences for citation locations within articles. The framework addresses two key challenges in academic citation generation: 1) how to accurately identify the author's citation intent and find relevant citation papers, and 2) how to generate high-quality citation sentences that align with human preferences. We enhance citation recommendation accuracy in the citation article recommendation module by incorporating citation networks and sentiment intent, and generate reasoning-based citation sentences in the citation sentence generation module by using the original article abstract, local context, citation intent, and recommended articles as inputs. Additionally, we propose a new evaluation metric to fairly assess the quality of generated citation sentences. Through comparisons with baseline models and ablation experiments, the SciRGC framework not only improves the accuracy and relevance of citation recommendations but also ensures the appropriateness of the generated citation sentences in context, providing a valuable tool for interdisciplinary researchers.

2025-05-26T15:09:10Z 15 pages, 7 figures Xiangyu Li Jingqiang Chen http://arxiv.org/abs/2505.21162v1 Leveraging GANs for citation intent classification and its impact on citation network analysis 2025-05-27T13:16:09Z

Citations play a fundamental role in the scientific ecosystem, serving as a foundation for tracking the flow of knowledge, acknowledging prior work, and assessing scholarly influence. In scientometrics, they are also central to the construction of quantitative indicators. Not all citations, however, serve the same function: some provide background, others introduce methods, or compare results. Therefore, understanding citation intent allows for a more nuanced interpretation of scientific impact. In this paper, we adopted a GAN-based method to classify citation intents. Our results revealed that the proposed method achieves competitive classification performance, closely matching state-of-the-art results with substantially fewer parameters. This demonstrates the effectiveness and efficiency of leveraging GAN architectures combined with contextual embeddings in intent classification task. We also investigated whether filtering citation intents affects the centrality of papers in citation networks. Analyzing the network constructed from the unArXiv dataset, we found that paper rankings can be significantly influenced by citation intent. All four centrality metrics examined- degree, PageRank, closeness, and betweenness - were sensitive to the filtering of citation types. The betweenness centrality displayed the greatest sensitivity, showing substantial changes in ranking when specific citation intents were removed.

2025-05-27T13:16:09Z Journal of Informetrics 20 (2) 101791, 2026 Davi A. Bezerra Filipi N. Silva Diego R. Amancio 10.1016/j.joi.2026.101791 http://arxiv.org/abs/2504.02769v3 Curbing the Ramifications of Authorship Abuse in Science 2025-05-26T17:50:46Z

Research performance is often measured using bibliometric indicators, such as publication count, total citations, and $h$-index. These metrics influence career advancements, salary adjustments, administrative opportunities, funding prospects, and professional recognition. However, the reliance on these metrics has also made them targets for manipulation, misuse, and abuse. One primary ethical concern is authorship abuse, which includes paid, ornamental, exploitative, cartel, and colonial authorship. These practices are prevalent because they artificially enhance multiple bibliometric indicators all at once. Our study confirms a significant rise in the mean and median number of authors per publication across multiple disciplines over the last 34 years. While it is important to identify the cases of authorship abuse, a thorough investigation of every paper proves impractical. In this study, we propose a credit allocation scheme based on the reciprocals of the Fibonacci numbers, designed to adjust credit for individual contributions while systematically reducing credit for potential authorship abuse. The proposed scheme aligns with rigorous authorship guidelines from scientific associations, which mandate significant contributions across most phases of a study, while accommodating more lenient guidelines from scientific publishers, which recognize authorship for minimal contributions. We recalibrate the traditional bibliometric indicators to emphasize author contribution rather than participation in publications. Additionally, we propose a new indicator, $T^{\prime}$-index, to assess researchers' leading and contributing roles in their publications. Our proposed credit allocation scheme mitigates the effects of authorship abuse and promotes a more ethical scientific ecosystem.

2025-04-03T17:06:25Z Md Somir Khan Mehmet Engin Tozal http://arxiv.org/abs/2506.03165v1 What Does Information Science Offer for Data Science Research?: A Review of Data and Information Ethics Literature 2025-05-26T14:07:42Z

This paper reviews literature pertaining to the development of data science as a discipline, current issues with data bias and ethics, and the role that the discipline of information science may play in addressing these concerns. Information science research and researchers have much to offer for data science, owing to their background as transdisciplinary scholars who apply human-centered and social-behavioral perspectives to issues within natural science disciplines. Information science researchers have already contributed to a humanistic approach to data ethics within the literature and an emphasis on data science within information schools all but ensures that this literature will continue to grow in coming decades. This review article serves as a reference for the history, current progress, and potential future directions of data ethics research within the corpus of information science literature.

2025-05-26T14:07:42Z Journal of Data and Information Science (2022) Brady D. Lund Ting Wang 10.2478/jdis-2022-0018 http://arxiv.org/abs/2502.19679v4 Architectural Vulnerability and Reliability Challenges in AI Text Annotation: A Survey-Inspired Framework with Independent Probability Assessment 2025-05-26T00:23:51Z

Large Language Models, despite their power, have a fundamental architectural vulnerability stemming from their causal transformer design -- order sensitivity. This architectural constraint may distorts classification outcomes when prompt elements like label options are reordered, revealing a theoretical gap between accuracy metrics and true model reliability. The paper conceptualizes this vulnerability through the lens of survey methodology, where respondent biases parallel LLM positional dependencies. Empirical evidence using the F1000 biomedical dataset across three scales of LLaMA3.1 models (8B, 70B, 405B) demonstrates that these architectural constraints produce inconsistent annotations under controlled perturbations. The paper advances a practical solution for social science - Independent Probability Assessment - which decouples label evaluation to circumvent positional bias inherent in sequential processing. This approach yields an information-theoretic reliability measure (R-score) that quantifies annotation robustness at the case level. The findings establish that architectural vulnerabilities in causal transformers require methodological innovations beyond accuracy metrics to ensure valid social science inference, as demonstrated through downstream regression analyses where order-sensitive annotations significantly alter substantive conclusions about scientific impact.

2025-02-27T01:42:10Z 7 figures Linzhuo li