http://arxiv.org/abs/2504.18889v1
Which kind of research papers influence policymaking
Updated: 2025-04-26T10:45:51Z
This study examines the use of evidence in policymaking by analysing a range of journal and article attributes, as well as online engagement metrics. It employs a large-scale citation analysis of nearly 150,000 articles covering diverse policy topics. The findings highlight that scholarly citations exert the strongest positive influence on policy citations. Articles from journals with a higher citation impact and larger Mendeley readership are cited more frequently in policy documents. Other online engagements, such as news and blog mentions, also boost policy citations, while mentions on social media X have a negative effect. The finding that highly cited and widely read papers are also frequently referenced in policy documents likely reflects the perception among policymakers that such research is more trustworthy. In contrast, papers that derive their influence primarily from social media tend to be cited less often in policy contexts.
Published: 2025-04-26T10:45:51Z
Comment: 30 pages, 6 tables
Authors: Pablo Dorta-González

http://arxiv.org/abs/2301.10140v2
The Semantic Scholar Open Data Platform
Updated: 2025-04-25T18:51:54Z
The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to-date, with 200M+ papers, 80M+ authors, 550M+ paper-authorship edges, and 2.4B+ citation edges.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings. In this paper, we describe the components of the S2 data processing pipeline and the associated APIs offered by the platform. We will update this living document to reflect changes as we add new data offerings and improve existing services.
Published: 2023-01-24T17:13:08Z
Comment: 8 pages, 6 figures
Authors: Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Kelsey MacMillan, Tyler Murray, Chris Newell, Smita Rao, Shaurya Rohatgi, Paul Sayre, Zejiang Shen, Amanpreet Singh, Luca Soldaini, Shivashankar Subramanian, Amber Tanaka, Alex D. Wade, Linda Wagner, Lucy Lu Wang, Chris Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine Van Zuylen, Daniel S. Weld

http://arxiv.org/abs/2504.05181v2
Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval
Updated: 2025-04-24T23:04:52Z
Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task, allowing for end-to-end optimization toward a unified global retrieval objective. However, existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively. While reinforcement learning-based methods, such as reinforcement learning from relevance feedback (RLRF), aim to address this misalignment through reward modeling, they introduce significant complexity, requiring the optimization of an auxiliary reward function followed by reinforcement fine-tuning, which is computationally expensive and often unstable.
To address these challenges, we propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking, eliminating the need for explicit reward modeling and reinforcement learning. Experimental results on benchmark datasets, including MS MARCO document and Natural Questions, show that DDRO outperforms reinforcement learning-based methods, achieving a 7.4% improvement in MRR@10 for MS MARCO and a 19.9% improvement for Natural Questions. These findings highlight DDRO's potential to enhance retrieval effectiveness with a simplified optimization approach. By framing alignment as a direct optimization problem, DDRO simplifies the ranking optimization pipeline of GenIR models while offering a viable alternative to reinforcement learning-based methods.
Published: 2025-04-07T15:27:37Z
Comment: 12 pages, 3 figures. SIGIR '25 Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 13-18, 2025, Padua, Italy. Code and pretrained models available at: https://github.com/kidist-amde/ddro/
Journal reference: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), pages 1327-1338, 2025
Authors: Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke
DOI: 10.1145/3726302.3730023

http://arxiv.org/abs/2504.06314v2
Beyond authorship: Analyzing contributions in PLOS ONE and the challenges of appropriate attribution
Updated: 2025-04-24T07:33:17Z
This study aims to evaluate the accuracy of authorship attributions in scientific publications, focusing on the fairness and precision of individual contributions within academic works. The study analyzes 81,823 publications from the journal PLOS ONE, covering the period from January 2018 to June 2023. It examines the authorship attributions within these publications to try and determine the prevalence of inappropriate authorship.
It also investigates the demographic and professional profiles of affected authors, exploring trends and potential factors contributing to inaccuracies in authorship. Surprisingly, 9.14% of articles feature at least one author with inappropriate authorship, affecting over 14,000 individuals (2.56% of the sample). Inappropriate authorship is more concentrated in Asia, Africa, and specific European countries like Italy. Established researchers with significant publication records and those affiliated with companies or nonprofits show higher instances of potential monetary authorship. Our findings are based on contributions as declared by the authors, which implies a degree of trust in their transparency. However, this reliance on self-reporting may introduce biases or inaccuracies into the dataset. Further research could employ additional verification methods to enhance the reliability of the findings. These findings have significant implications for journal publishers, highlighting the necessity for robust control mechanisms to ensure the integrity of authorship attributions. Moreover, researchers must exercise discernment in determining when to acknowledge a contributor and when to include them in the author list. Addressing these issues is crucial for maintaining the credibility and fairness of academic publications.
Published: 2025-04-08T06:47:52Z
Journal reference: Abdelghani Maddi, Jaime A. Teixeira da Silva. Beyond authorship: Analyzing contributions in PLOS ONE and the challenges of appropriate attribution[J]. Journal of Data and Information Science, 2024
Authors: Abdelghani Maddi, Jaime A. Teixeira da Silva
DOI: 10.2478/jdis-2024-0015

http://arxiv.org/abs/2504.17189v1
Metadata Augmentation using NLP, Machine Learning and AI chatbots: A comparison
Updated: 2025-04-24T01:53:29Z
Recent advances in machine learning and artificial intelligence have provided more alternatives for the implementation of repetitive or monotonous tasks.
However, the development of AI tools has not been straightforward, and use case exploration and workflow integration are still ongoing challenges. In this work, we present a detailed qualitative analysis of the performance and user experience of popular commercial AI chatbots when used for document classification with limited data. We report the results for a real-world example of metadata augmentation in an academic library environment. We compare the results of AI chatbots with other machine learning and natural language processing methods such as XGBoost and BERT-based fine tuning, and share insights from our experience. We found that AI chatbots perform similarly to one another while outperforming the machine learning methods we tested, showing their advantage when the method relies on local data for training. We also found that while working with AI chatbots is easier than with code, getting useful results from them still represents a challenge for the user. Furthermore, we encountered alarming conceptual errors in the output of some chatbots, such as not being able to count the number of lines of our inputs and explaining the mistake as "human error". Although this is not complete evidence that AI chatbots can be effectively used for metadata classification, we believe that the information provided in this work can be useful to librarians and data curators in developing pathways for the integration and use of AI tools for data curation or metadata augmentation tasks.
Published: 2025-04-24T01:53:29Z
Authors: Alfredo González-Espinoza, Dom Jebbia, Haoyong Lan

http://arxiv.org/abs/2504.08619v4
Analyzing 16,193 LLM Papers for Fun and Profits
Updated: 2025-04-22T19:13:09Z
Large Language Models (LLMs) are reshaping the landscape of computer science research, driving significant shifts in research priorities across diverse conferences and fields.
This study provides a comprehensive analysis of the publication trend of LLM-related papers in 77 top-tier computer science conferences over the past six years (2019-2024). We approach this analysis from four distinct perspectives: (1) We investigate how LLM research is driving topic shifts within major conferences. (2) We adopt a topic modeling approach to identify various areas of LLM-related topic growth and reveal the topics of concern at different conferences. (3) We explore distinct contribution patterns of academic and industrial institutions. (4) We study the influence of national origins on LLM development trajectories. Synthesizing the findings from these diverse analytical angles, we derive ten key insights that illuminate the dynamics and evolution of the LLM research ecosystem.
Published: 2025-04-11T15:24:23Z
Authors: Zhiqiu Xia, Lang Zhu, Bingzhe Li, Feng Chen, Qiannan Li, Chunhua Liao, Feiyi Wang, Hang Liu

http://arxiv.org/abs/2504.15504v1
Are Widely Known Findings Easier to Retract?
Updated: 2025-04-22T00:51:22Z
Failures of retraction are common in science. Why do these failures occur? And, relatedly, what makes findings harder or easier to retract? We use data from Microsoft Academic Graph, Retraction Watch, and Altmetric (including retracted papers, citation records, and Altmetric scores and mentions) to test recently proposed answers to these questions. A recent study by LaCroix et al. employs simple network models to argue that the social spread of scientific information helps explain failures of retraction. One prediction of their models is that widely known or well established results, surprisingly, should be easier to retract, since their retraction is more relevant to more scientists. Our results support this conclusion. We find that highly cited papers show more significant reductions in citation after retraction and garner more attention to their retractions as they occur.
Published: 2025-04-22T00:51:22Z
Comment: 13 pages, 2 figures, Under review
Authors: Shahan Ali Memon, Jevin D. West, Cailin O'Connor

http://arxiv.org/abs/2408.05587v3
COARA will not save science from the tyranny of administrative evaluation
Updated: 2025-04-19T15:40:47Z
The Coalition for Advancing Research Assessment (CoARA) agreement is a cornerstone in the ongoing efforts to reform research evaluation. CoARA advocates for administrative evaluations of research that rely on peer review, supported by responsible metrics, as beneficial for both science and society. Its principles can be critically examined through the lens of Philip Kitcher's concept of well-ordered science in a democratic society. From Kitcher's perspective, CoARA's approach faces two significant challenges: definitions of quality and impact are determined by governments or evaluation institutions rather than emerging from broad public deliberation, and a select group of scientists is empowered to assess research based on these predefined criteria. This creates susceptibility to both the "tyranny of expertise" and the "tyranny of ignorance" that Kitcher cautions against. Achieving Kitcher's ideal would require limiting administrative evaluations to essential tasks, such as researcher recruitment and project funding, while establishing procedures grounded in principles of fairness.
Published: 2024-08-10T16:13:25Z
Comment: 20 pages
Journal reference: Research Evaluation 2025
Authors: Alberto Baccini
DOI: 10.1093/reseval/rvaf024

http://arxiv.org/abs/2504.20061v1
Research Power Ranking: Adapting the Elo System to Quantify Scientist Evaluation
Updated: 2025-04-19T08:39:29Z
This paper presents an original model for assessing scientific productivity, research power ranking (RPR), which is based on the adaptation of the Elo rating system to the context of scientific activity. Unlike traditional scientometric indicators, RPR accounts for the dynamics, multidimensionality, and relativity of research power.
The model comprises three components (fundamental, applied, and commercial activity), each represented by a separate rating and updated on the basis of probabilistic scientific games analogous to chess matches. The scientific rating of each researcher is calculated as a weighted sum of the components, allowing the model to reflect not only their current position but also their career trajectory, including phase transitions, breakthroughs, and changes in scientific style. Numerical simulations were conducted for both the individual trajectories and population-level distributions of the researchers. Phase diagrams were constructed, and a typology of scientific styles was formulated. The results demonstrate that RPR can serve as a universal tool for objective assessment, strategic planning, and visualization of scientific reputation in both academic and applied environments.
Published: 2025-04-19T08:39:29Z
Comment: 31 pages, 7 figures, 8 tables
Authors: Eldar Knar

http://arxiv.org/abs/2504.13404v1
Design Priorities in Digital Gateways: A Comparative Study of Authentication and Usability in Academic Library Alliances
Updated: 2025-04-18T01:52:52Z
Purpose: This study examines the design and functionality of university library login pages across academic alliances (IVY Plus, BTAA, JULAC, JVU) to identify how these interfaces align with institutional priorities and user needs. It explores consensus features, design variations, and emerging trends in authentication, usability, and security.
Methodology: A multi-method approach was employed: screenshots and HTML files from 46 institutions were analyzed through categorization, statistical analysis, and comparative evaluation. Features were grouped into authentication mechanisms, usability, security/compliance, and library-specific elements.
Findings: Core functionalities (e.g., ID/password, privacy policies) were consistent across alliances. Divergences emerged in feature emphasis: mature alliances (e.g., BTAA) prioritized resource accessibility with streamlined interfaces, while emerging consortia (e.g., JVU) emphasized cybersecurity (IP restrictions, third-party integrations). Usability features, particularly multilingual support, drove cross-alliance differences. The results highlighted regional and institutional influences, with older alliances favoring simplicity and newer ones adopting security-centric designs.
Originality/Value: This is the first systematic comparison of login page designs across academic alliances, offering insights into how regional, technological, and institutional factors shape digital resource access. Findings inform best practices for balancing security, usability, and accessibility in library interfaces.
Keywords: Academic library consortia, Login page design, User authentication, User experience, Security compliance.
Published: 2025-04-18T01:52:52Z
Authors: Rui Shang, Bingjie Huang

http://arxiv.org/abs/2504.13387v1
Bibliometric Analysis of Scientific Publications on Blockchain Research and Applications
Updated: 2025-04-18T00:29:13Z
Since the introduction of Bitcoin in 2008, blockchain technology has garnered widespread attention. Scholars from various research fields, countries, and institutions have published a significant number of papers on this subject. However, there is currently a lack of comprehensive analysis specifically focusing on the scientific publications in the field of blockchain.
To conduct a comprehensive analysis, we compiled a corpus of 41,497 publications in blockchain research from 2008 to 2023 using the Clarivate databases. Through bibliometric and citation analyses, we gained valuable insights into the field. Our study offers an overview of the blockchain research landscape, including country, institution, authorship, and subject categories. Additionally, we identified Emerging Research Areas (ERA) using the co-citation clustering approach, examining factors such as recency, growth, and contributions from different countries/regions. Furthermore, we identified influential publications based on citation velocity and analyzed five representative Research Fronts in detail. This analysis provides a fine-grained examination of specific areas within blockchain research. Our findings contribute to understanding evolving trends, emerging applications, and potential directions for future research in the multidisciplinary field of blockchain.
Published: 2025-04-18T00:29:13Z
Authors: Lingfeng Bao, Jiameng Yang, Xiaohu Yang, Chunming Rong

http://arxiv.org/abs/2504.12195v1
Validating and monitoring bibliographic and citation data in OpenCitations collections
Updated: 2025-04-16T15:47:44Z
Purpose. The increasing emphasis on data quantity in research infrastructures has highlighted the need for equally robust mechanisms ensuring data quality, particularly in bibliographic and citation datasets. This paper addresses the challenge of maintaining high-quality open research information within OpenCitations, a community-guided Open Science Infrastructure, by introducing tools for validating and monitoring bibliographic metadata and citation data.
Methods. We developed a custom validation tool tailored to the OpenCitations Data Model (OCDM), designed to detect and explain ingestion errors from heterogeneous sources, whether due to upstream data inconsistencies or internal software bugs. Additionally, a quality monitoring tool was created to track known data issues post-publication. These tools were applied in two scenarios: (1) validating metadata and citations from Matilda, a potential future source, and (2) monitoring data quality in the existing OpenCitations Meta dataset.
Results. The validation tool successfully identified a variety of structural and semantic issues in the Matilda dataset, demonstrating its precision. The monitoring tool enabled the detection of recurring problems in the OpenCitations Meta collection, as well as their quantification. Together, these tools proved effective in enhancing the reliability of OpenCitations' published data.
Conclusion. The presented validation and monitoring tools represent a step toward ensuring high-quality bibliographic data in open research infrastructures, though they are limited to the data model adopted by OpenCitations. Future developments are aimed at expanding to additional data sources, with particular regard to crowdsourced data.
Published: 2025-04-16T15:47:44Z
Authors: Ivan Heibi, Silvio Peroni, Elia Rizzetto

http://arxiv.org/abs/2407.21067v2
Socio-cognitive Networks between Researchers: Investigating Scientific Dualities with the Group-Oriented Relational Hyperevent Model
Updated: 2025-04-15T20:40:28Z
Understanding why researchers cite certain works remains a key question in the study of scientific networks. Prior research has identified factors such as relevance, group cohesion, and source crediting. However, the interplay between cognitive and social dimensions in citation behavior - often conceptualized as a socio-cognitive network - is frequently overlooked, particularly regarding the intermediary steps that lead to a citation. Since a citation first requires a work to be published by a set of authors, we examine how the structure of coauthorship networks influences citation patterns. To investigate this relationship, we analyze the citation and collaboration behavior of Chilean astronomers from 2013 to 2015 using the Group-Oriented Relational Hyperevent Model, which allows us to study coauthorship and citation networks in a joint framework. Our findings suggest that when selecting which works to cite, authors favor recent research and maintain cognitive continuity across cited works.
At the same time, we observe that coherent groups - closely connected coauthors - tend to be co-cited more frequently in subsequent publications, reinforcing the interdependence of collaboration and citation networks.
Published: 2024-07-28T09:52:37Z
Authors: Alejandro Espinosa-Rada, Jürgen Lerner, Cornelius Fritz

http://arxiv.org/abs/2504.11492v1
Language and Knowledge Representation: A Stratified Approach
Updated: 2025-04-14T20:18:10Z
The thesis proposes the problem of representation heterogeneity to emphasize the fact that heterogeneity is an intrinsic property of any representation, wherein, different observers encode different representations of the same target reality in a stratified manner using different concepts, language and knowledge (as well as data). The thesis then advances a top-down solution approach to the above stratified problem of representation heterogeneity in terms of several solution components, namely: (i) a representation formalism stratified into concept level, language level, knowledge level and data level to accommodate representation heterogeneity, (ii) a top-down language representation using Universal Knowledge Core (UKC), UKC namespaces and domain languages to tackle the conceptual and language level heterogeneity, (iii) a top-down knowledge representation using the notions of language teleontology and knowledge teleontology to tackle the knowledge level heterogeneity, (iv) the usage and further development of the existing LiveKnowledge catalog for enforcing iterative reuse and sharing of language and knowledge representations, and, (v) the kTelos methodology integrating the solution components above to iteratively generate the language and knowledge representations absolving representation heterogeneity. The thesis also includes proof-of-concepts of the language and knowledge representations developed for two international research projects - DataScientia (data catalogs) and JIDEP (materials modelling).
Finally, the thesis concludes with future lines of research.
Published: 2025-04-14T20:18:10Z
Comment: Doctor of Philosophy (Ph.D) in Information Engineering and Computer Science, DISI, University of Trento, Italy
Authors: Mayukh Bagchi

http://arxiv.org/abs/2506.21819v1
SciMantify -- A Hybrid Approach for the Evolving Semantification of Scientific Knowledge
Updated: 2025-04-14T07:57:55Z
Scientific publications, primarily digitized as PDFs, remain static and unstructured, limiting the accessibility and reusability of the contained knowledge. At best, scientific knowledge from publications is provided in tabular formats, which lack semantic context. A more flexible, structured, and semantic representation is needed to make scientific knowledge understandable and processable by both humans and machines. We propose an evolution model of knowledge representation, inspired by the 5-star Linked Open Data (LOD) model, with five stages and defined criteria to guide the stepwise transition from a digital artifact, such as a PDF, to a semantic representation integrated in a knowledge graph (KG). Based on an exemplary workflow implementing the entire model, we developed a hybrid approach, called SciMantify, leveraging tabular formats of scientific knowledge, e.g., results from secondary studies, to support its evolving semantification. In the approach, humans and machines collaborate closely by performing semantic annotation tasks (SATs) and refining the results to progressively improve the semantic representation of scientific knowledge. We implemented the approach in the Open Research Knowledge Graph (ORKG), an established platform for improving the findability, accessibility, interoperability, and reusability of scientific knowledge.
A preliminary user experiment showed that the approach simplifies the preprocessing of scientific knowledge, reduces the effort for the evolving semantification, and enhances the knowledge representation through better alignment with the KG structures.
Published: 2025-04-14T07:57:55Z
Comment: Accepted at the 25th International Conference on Web Engineering 2025
Authors: Lena John, Kheir Eddine Farfar, Sören Auer, Oliver Karras