https://arxiv.org/api/4Xz/+05eTC+vtX578rbqIf6C+fg
2026-06-14T18:29:10Z
606578015

http://arxiv.org/abs/2506.03321v1
Enhancing Automatic PT Tagging for MEDLINE Citations Using Transformer-Based Models
2025-06-03T19:06:51Z
We investigated the feasibility of predicting Medical Subject Headings (MeSH) Publication Types (PTs) from MEDLINE citation metadata using pre-trained Transformer-based models BERT and DistilBERT. This study addresses limitations in the current automated indexing process, which relies on legacy NLP algorithms. We evaluated monolithic multi-label classifiers and binary classifier ensembles to enhance the retrieval of biomedical literature. Results demonstrate the potential of Transformer models to significantly improve PT tagging accuracy, paving the way for scalable, efficient biomedical indexing.
2025-06-03T19:06:51Z
26 pages, 8 tables, 3 figures
Victor H. Cid, James Mork

http://arxiv.org/abs/2502.19030v2
Sampling nodes and hyperedges via random walks on large hypergraphs
2025-06-03T15:46:14Z
Hypergraphs provide a fundamental framework for representing complex systems involving interactions among three or more entities. As empirical hypergraphs grow in size, characterizing their structural properties becomes increasingly challenging due to computational complexity and, in some cases, restricted access to complete data, requiring efficient sampling methods. Random walks offer a practical approach to hypergraph sampling, as they rely solely on local neighborhood information from nodes and hyperedges. In this study, we investigate methods for simultaneously sampling nodes and hyperedges via random walks on large hypergraphs. First, we compare three existing random walks in the context of hypergraph sampling and identify an advantage of the so-called higher-order random walk. Second, by extending an established technique for graphs to the case of hypergraphs, we present a non-backtracking variant of the higher-order random walk.
We derive theoretical results on estimators based on the non-backtracking higher-order random walk and validate them through numerical simulations on large empirical hypergraphs. Third, we apply the non-backtracking higher-order random walk to a large hypergraph of co-authorships indexed in the OpenAlex database, where full access to the data is not readily available. Despite the relatively small sample size, our estimates largely align with previous findings on author productivity, team size, and the prevalence of open-access publications. Our findings contribute to the development of analysis methods for large hypergraphs, offering insights into sampling strategies and estimation techniques applicable to real-world complex systems.
2025-02-26T10:36:38Z
Applied Network Science Vol. 10, Article number: 19 (2025)
Kazuki Nakajima, Masanao Kodakari, Masaki Aida
10.1007/s41109-025-00704-z

http://arxiv.org/abs/2501.10727v2
In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review
2025-06-02T12:18:57Z
Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static -- they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings of datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications.
Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifacts and datasets. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at http://inthepicture.itu.dk/.
2025-01-18T11:03:59Z
ACM Conference on Fairness, Accountability, and Transparency - FAccT 2025
Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Sarah de Boer, Víctor M. Campello, Aasa Feragen, Enzo Ferrante, Melanie Ganz, Judy Wawira Gichoya, Camila González, Steff Groefsema, Alessa Hering, Adam Hulman, Leo Joskowicz, Dovile Juodelyte, Melih Kandemir, Thijs Kooi, Jorge del Pozo Lérida, Livie Yumeng Li, Andre Pacheco, Tim Rädsch, Mauricio Reyes, Théo Sourget, Bram van Ginneken, David Wen, Nina Weng, Jack Junchi Xu, Hubert Dariusz Zając, Maria A. Zuluaga, Veronika Cheplygina
10.1145/3715275.3732035

http://arxiv.org/abs/2410.09510v2
SciEvo: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis
2025-06-01T02:58:47Z
Understanding the creation, evolution, and dissemination of scientific knowledge is crucial for bridging diverse subject areas and addressing complex global challenges such as pandemics, climate change, and ethical AI. Scientometrics, the quantitative and qualitative study of scientific literature, provides valuable insights into these processes. We introduce SciEvo, a longitudinal scientometric dataset with over two million academic publications, providing comprehensive content information and citation graphs to support cross-disciplinary analyses. SciEvo is easy to use and available across platforms, including GitHub, Kaggle, and HuggingFace.
Using SciEvo, we conduct a temporal study spanning over 30 years to explore key questions in scientometrics: the evolution of academic terminology, citation patterns, and interdisciplinary knowledge exchange. Our findings reveal critical insights, such as disparities in epistemic cultures, knowledge production modes, and citation practices. For example, rapidly developing, application-driven fields like LLMs exhibit significantly shorter citation age (2.48 years) compared to traditional theoretical disciplines like oral history (9.71 years). Our data and analytic tools can be accessed at https://github.com/Ahren09/SciEvo.
2024-10-12T12:16:57Z
We have renamed our dataset from "Scito2M" to "SciEvo"
Yiqiao Jin, Yijia Xiao, Yiyang Wang, Jindong Wang

http://arxiv.org/abs/2303.16750v2
A Gold Standard Dataset for the Reviewer Assignment Problem
2025-05-30T08:34:07Z
Many peer-review venues are using algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score" -- a numerical estimate of the expertise of a reviewer in reviewing a paper -- and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose the algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better algorithms is the lack of publicly available gold-standard data. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously.
Using our dataset, we compare several widely used similarity algorithms and offer key insights. First, all algorithms exhibit significant error, with misranking rates between 12%-30% in easier cases and 36%-43% in harder ones. Second, most specialized algorithms are designed to work with titles and abstracts of papers, and in this regime the SPECTER2 algorithm performs best. Interestingly, classical TF-IDF matches SPECTER2 in accuracy when given access to full submission texts. In contrast, off-the-shelf LLMs lag behind specialized approaches.
2023-03-23T16:15:03Z
Ivan Stelmakh, John Wieting, Sarina Xi, Graham Neubig, Nihar B. Shah

http://arxiv.org/abs/2503.09257v5
A Global Dataset Mapping the AI Innovation from Academic Research to Industrial Patents
2025-05-30T02:55:05Z
In the rapidly evolving field of artificial intelligence (AI), mapping innovation patterns and understanding effective technology transfer from research to applications are essential for economic growth. However, existing data infrastructures suffer from fragmentation, incomplete coverage, and insufficient evaluative capacity. Here, we present DeepInnovationAI, a comprehensive global dataset containing three structured files. DeepPatentAI.csv: Contains 2,356,204 patent records with 8 field-specific attributes. DeepDiveAI.csv: Encompasses 3,511,929 academic publications with 13 metadata fields. These two datasets leverage large language models, multilingual text analysis and dual-layer BERT classifiers to accurately identify AI-related content, while utilizing hypergraph analysis to create robust innovation metrics. Additionally, DeepCosineAI.csv: By applying semantic vector proximity analysis, this file contains 3,511,929 most relevant paper-patent pairs, each described by 3 metadata fields, to facilitate the identification of potential knowledge flows. DeepInnovationAI enables researchers, policymakers, and industry leaders to anticipate trends and identify collaboration opportunities.
With extensive temporal and geographical scope, it supports detailed analysis of technological development patterns and international competition dynamics, establishing a foundation for modeling AI innovation and technology transfer processes.
2025-03-12T10:56:02Z
38 pages and 4 figures
Haixing Gong, Hui Zou, Xingzhou Liang, Shiyuan Meng, Pinlong Cai, Xingcheng Xu, Jingjing Qu

http://arxiv.org/abs/2505.24091v1
Temporally Extending Existing Web Archive Collections for Longitudinal Analysis
2025-05-30T00:38:28Z
The Environmental Governance and Data Initiative (EDGI) regularly crawled US federal environmental websites between 2016 and 2020 to capture changes between two presidential administrations. However, because it does not include the previous administration ending in 2008, the collection is unsuitable for answering our research question, Were the website terms deleted by the Trump administration (2017--2021) added by the Obama administration (2009--2017)? Thus, like many researchers using the Wayback Machine's holdings for historical analysis, we do not have access to a complete collection suiting our needs. To answer our research question, we must extend the EDGI collection back to January 2008. This includes discovering relevant pages that were not included in the EDGI collection that persisted through 2020, not just going further back in time with the existing pages. We pieced together artifacts collected by various organizations for their purposes through many means (Save Page Now, Archive-It, and more) in order to curate a dataset sufficient for our intentions. In this paper, we contribute a methodology to extend existing web archive collections temporally to enable longitudinal analysis, including a dataset extended with this methodology. We use our new dataset to analyze our question, Were the website terms deleted by the Trump administration added by the Obama administration?
We find that 81 percent of the pages in the dataset changed between 2008 and 2020, and that 87 percent of the pages with terms deleted by the Trump administration were terms added during the Obama administration.
2025-05-30T00:38:28Z
Presented at RESAW 2025; 23 pages, 14 figures, 8 tables
Lesley Frew, Michael L. Nelson, Michele C. Weigle

http://arxiv.org/abs/2506.03180v1
Knowledge Graphs for Digitized Manuscripts in Jagiellonian Digital Library Application
2025-05-29T14:49:24Z
Digitizing cultural heritage collections has become crucial for the preservation of historical artifacts and enhancing their availability to the wider public. Galleries, libraries, archives and museums (GLAM institutions) are actively digitizing their holdings and creating extensive digital collections. Those collections are often enriched with metadata describing items but not exactly their contents. The Jagiellonian Digital Library, standing as a good example of such an effort, offers datasets accessible through protocols like OAI-PMH. Despite these improvements, metadata completeness and standardization continue to pose substantial obstacles, limiting the searchability and potential connections between collections. To deal with these challenges, we explore an integrated methodology of computer vision (CV), artificial intelligence (AI), and semantic web technologies to enrich metadata and construct knowledge graphs for digitized manuscripts and incunabula.
2025-05-29T14:49:24Z
Jan Ignatowicz, Krzysztof Kutt, Grzegorz J. Nalepa

http://arxiv.org/abs/2503.13503v3
SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
2025-05-29T02:56:23Z
In recent years, the rapid advancement of Artificial Intelligence (AI) technologies, particularly Large Language Models (LLMs), has revolutionized the paradigm of scientific discovery, establishing AI-for-Science (AI4Science) as a dynamic and evolving field.
However, there is still a lack of an effective framework for the overall assessment of AI4Science, particularly from a holistic perspective on data quality and model capability. Therefore, in this study, we propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science from both scientific data and LLM perspectives. First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions: Quality, FAIRness, Explainability, and Compliance, which are subdivided into 15 sub-dimensions. Drawing on data resource papers published between 2018 and 2023 in peer-reviewed journals, we present recommendation lists of AI-ready datasets for Earth, Life, and Materials Sciences, making a novel and original contribution to the field. Concurrently, to assess the capabilities of LLMs across multiple scientific disciplines, we establish 16 assessment dimensions based on five core indicators (Knowledge, Understanding, Reasoning, Multimodality, and Values) spanning Mathematics, Physics, Chemistry, Life Sciences, and Earth and Space Sciences. Using the developed benchmark datasets, we have conducted a comprehensive evaluation of over 50 representative open-source and closed-source LLMs. All the results are publicly available and can be accessed online at www.scihorizon.cn/en.
2025-03-12T11:34:41Z
Chuan Qin, Xin Chen, Chengrui Wang, Pengmin Wu, Xi Chen, Yihang Cheng, Jingyi Zhao, Meng Xiao, Xiangchao Dong, Qingqing Long, Boya Pan, Han Wu, Chengzan Li, Yuanchun Zhou, Hui Xiong, Hengshu Zhu

http://arxiv.org/abs/2505.22702v1
Are there stars in Bluesky after the return of Donald Trump to the White House?
2025-05-28T17:49:18Z
This study examines the shift in the scientific community from X (formerly Twitter) to Bluesky, its impact on scientific communication, and consequently on social metrics (altmetrics).
We analysed 14,497 publications from multidisciplinary and Library and Information Science (LIS) journals between January 2024 and March 2025. The results reveal a notable increase in Bluesky activity for multidisciplinary journals in November 2024, likely influenced by political and platform changes, with mentions multiplying for journals like Nature and Science. In LIS, the adoption of Bluesky is different and shows marked variation between European and United States journals. Although Bluesky remains a minority platform compared to X over the whole period, when focusing on user engagement after the United States elections, we see a much more even distribution between the two platforms. In two LIS journals, Bluesky even surpasses X, while in most others, the difference in user engagement was no longer as pronounced, marking a significant change from previous patterns in altmetrics.
2025-05-28T17:49:18Z
arXiv admin note: substantial text overlap with arXiv:2412.05624
Wenceslao Arroyo-Machado, Nicolas Robinson-Garcia, Daniel Torres-Salinas
10.1016/j.joi.2025.101700

http://arxiv.org/abs/2505.20103v2
SCIRGC: Multi-Granularity Citation Recommendation and Citation Sentence Preference Alignment
2025-05-27T14:05:49Z
Citations are crucial in scientific research articles as they highlight the connection between the current study and prior work. However, this process is often time-consuming for researchers. In this study, we propose the SciRGC framework, which aims to automatically recommend citation articles and generate citation sentences for citation locations within articles. The framework addresses two key challenges in academic citation generation: 1) how to accurately identify the author's citation intent and find relevant citation papers, and 2) how to generate high-quality citation sentences that align with human preferences.
We enhance citation recommendation accuracy in the citation article recommendation module by incorporating citation networks and sentiment intent, and generate reasoning-based citation sentences in the citation sentence generation module by using the original article abstract, local context, citation intent, and recommended articles as inputs. Additionally, we propose a new evaluation metric to fairly assess the quality of generated citation sentences. Through comparisons with baseline models and ablation experiments, the SciRGC framework not only improves the accuracy and relevance of citation recommendations but also ensures the appropriateness of the generated citation sentences in context, providing a valuable tool for interdisciplinary researchers.
2025-05-26T15:09:10Z
15 pages, 7 figures
Xiangyu Li, Jingqiang Chen

http://arxiv.org/abs/2505.21162v1
Leveraging GANs for citation intent classification and its impact on citation network analysis
2025-05-27T13:16:09Z
Citations play a fundamental role in the scientific ecosystem, serving as a foundation for tracking the flow of knowledge, acknowledging prior work, and assessing scholarly influence. In scientometrics, they are also central to the construction of quantitative indicators. Not all citations, however, serve the same function: some provide background, others introduce methods, or compare results. Therefore, understanding citation intent allows for a more nuanced interpretation of scientific impact. In this paper, we adopted a GAN-based method to classify citation intents. Our results revealed that the proposed method achieves competitive classification performance, closely matching state-of-the-art results with substantially fewer parameters. This demonstrates the effectiveness and efficiency of leveraging GAN architectures combined with contextual embeddings in the intent classification task. We also investigated whether filtering citation intents affects the centrality of papers in citation networks.
Analyzing the network constructed from the unArXiv dataset, we found that paper rankings can be significantly influenced by citation intent. All four centrality metrics examined (degree, PageRank, closeness, and betweenness) were sensitive to the filtering of citation types. The betweenness centrality displayed the greatest sensitivity, showing substantial changes in ranking when specific citation intents were removed.
2025-05-27T13:16:09Z
Journal of Informetrics 20 (2) 101791, 2026
Davi A. Bezerra, Filipi N. Silva, Diego R. Amancio
10.1016/j.joi.2026.101791

http://arxiv.org/abs/2504.02769v3
Curbing the Ramifications of Authorship Abuse in Science
2025-05-26T17:50:46Z
Research performance is often measured using bibliometric indicators, such as publication count, total citations, and $h$-index. These metrics influence career advancements, salary adjustments, administrative opportunities, funding prospects, and professional recognition. However, the reliance on these metrics has also made them targets for manipulation, misuse, and abuse. One primary ethical concern is authorship abuse, which includes paid, ornamental, exploitative, cartel, and colonial authorship. These practices are prevalent because they artificially enhance multiple bibliometric indicators all at once. Our study confirms a significant rise in the mean and median number of authors per publication across multiple disciplines over the last 34 years. While it is important to identify the cases of authorship abuse, a thorough investigation of every paper proves impractical. In this study, we propose a credit allocation scheme based on the reciprocals of the Fibonacci numbers, designed to adjust credit for individual contributions while systematically reducing credit for potential authorship abuse.
The proposed scheme aligns with rigorous authorship guidelines from scientific associations, which mandate significant contributions across most phases of a study, while accommodating more lenient guidelines from scientific publishers, which recognize authorship for minimal contributions. We recalibrate the traditional bibliometric indicators to emphasize author contribution rather than participation in publications. Additionally, we propose a new indicator, $T^{\prime}$-index, to assess researchers' leading and contributing roles in their publications. Our proposed credit allocation scheme mitigates the effects of authorship abuse and promotes a more ethical scientific ecosystem.
2025-04-03T17:06:25Z
Md Somir Khan, Mehmet Engin Tozal

http://arxiv.org/abs/2506.03165v1
What Does Information Science Offer for Data Science Research?: A Review of Data and Information Ethics Literature
2025-05-26T14:07:42Z
This paper reviews literature pertaining to the development of data science as a discipline, current issues with data bias and ethics, and the role that the discipline of information science may play in addressing these concerns. Information science research and researchers have much to offer for data science, owing to their background as transdisciplinary scholars who apply human-centered and social-behavioral perspectives to issues within natural science disciplines. Information science researchers have already contributed to a humanistic approach to data ethics within the literature and an emphasis on data science within information schools all but ensures that this literature will continue to grow in coming decades. This review article serves as a reference for the history, current progress, and potential future directions of data ethics research within the corpus of information science literature.
2025-05-26T14:07:42Z
Journal of Data and Information Science (2022)
Brady D. Lund, Ting Wang
10.2478/jdis-2022-0018

http://arxiv.org/abs/2502.19679v4
Architectural Vulnerability and Reliability Challenges in AI Text Annotation: A Survey-Inspired Framework with Independent Probability Assessment
2025-05-26T00:23:51Z
Large Language Models, despite their power, have a fundamental architectural vulnerability stemming from their causal transformer design: order sensitivity. This architectural constraint may distort classification outcomes when prompt elements like label options are reordered, revealing a theoretical gap between accuracy metrics and true model reliability. The paper conceptualizes this vulnerability through the lens of survey methodology, where respondent biases parallel LLM positional dependencies. Empirical evidence using the F1000 biomedical dataset across three scales of LLaMA3.1 models (8B, 70B, 405B) demonstrates that these architectural constraints produce inconsistent annotations under controlled perturbations. The paper advances a practical solution for social science, Independent Probability Assessment, which decouples label evaluation to circumvent positional bias inherent in sequential processing. This approach yields an information-theoretic reliability measure (R-score) that quantifies annotation robustness at the case level. The findings establish that architectural vulnerabilities in causal transformers require methodological innovations beyond accuracy metrics to ensure valid social science inference, as demonstrated through downstream regression analyses where order-sensitive annotations significantly alter substantive conclusions about scientific impact.
2025-02-27T01:42:10Z
7 figures
Linzhuo li