http://arxiv.org/abs/2601.15432v1 MEDFORD in a Box: Improvements and Future Directions for a Metadata Description Language 2026-01-21T19:56:57Z Scientific research metadata is vital to ensure the validity, reusability, and cost-effectiveness of research efforts. The MEDFORD metadata language was previously introduced to simplify the process of writing and maintaining metadata for non-programmers. However, barriers to entry and usability remain, including limited automatic validation, difficulty of data transport, and user unfamiliarity with text file editing. To address these issues, we introduce MEDFORD-in-a-Box (MIAB), a documentation ecosystem to facilitate researcher adoption and earlier metadata capture. MIAB contains many improvements, including an updated MEDFORD parser with expanded validation routines and BagIt export capability. MIAB also includes an improved VS Code extension that supports these changes through a visual IDE. By simplifying metadata generation, this new tool supports the creation of correct, consistent, and reusable metadata, ultimately improving research reproducibility. 2026-01-21T19:56:57Z Extended version of "Cross-Referencing Metadata Through an Extension of the MEDFORD Language" from MTSR 2024 Polina Shpilker Benjamin Stubbs Michael Sayers Yumin Lee Lenore Cowen Donna Slonim Shaun Wallace Alva Couch Noah M. Daniels http://arxiv.org/abs/2601.14916v1 Citation of scientific evidence from video description and its association with attention and impact 2026-01-21T11:57:02Z This study investigates how YouTube content creators utilize scientific evidence in videos. Log-linear regression examines the influence of alternative communication channels on video creators in Biotechnology, using data from 81,302 papers (2018-2023). 
This reveals a positive association with news articles and Wikipedia pages, but a negative association with scientific papers, policy documents, and patents. Despite the potential for enriching discussions, science video creators seem to favor materials with wider public attention over influential science, technology, and policy papers. These findings suggest a need for improved dissemination strategies for scientific research. Authors, universities, and journals should consider how their work can be made more accessible and engaging for science communicators on video. 2026-01-21T11:57:02Z 18 pages, 6 tables Pablo Dorta-González María Isabel Dorta-González http://arxiv.org/abs/2601.14823v1 Archives, archival bond, and digital representation: A case study with the International Image Interoperability Framework 2026-01-21T09:50:32Z Within the archival sector, digitization has long been a strategic initiative to ensure greater availability of historical documents. In recent years, the promotion of guidelines and standards, combined with technological advancements, has established methodologies and best practices and developed tools to facilitate massive digitization projects. However, despite the availability of technological solutions and guidelines, digitization is intended mostly to scan documents and make the outcome images available online. This practice can be problematic in representing the complex fonds structure made of relations, the archival bond that establishes the natural ordering of documents into archival units. This is particularly relevant when the fonds also has a multimedia component, such as an audiovisual component, that is often reproduced on different platforms disconnected from textual documents. This article addresses the challenges linked to digitization in the archival sector and proposes a methodological framework for representing fonds with respect to their native organization. 
For this purpose, the International Image Interoperability Framework (IIIF) is employed to configure a specific model that respects the archive's hierarchical structure. In particular, this model is configured to maintain the archival bond and enhance the resource's semantic aspect to make the IIIF model semantically interoperable. To demonstrate the adaptability of the framework to the archival domain, in this work, the "PCI-Unitelefilm" fonds of the Fondazione Archivio Audiovisivo del Movimento Operaio e Democratico (AAMOD) served as the case study. 2026-01-21T09:50:32Z Martin Critelli AMU, LERMA http://arxiv.org/abs/2601.14753v1 Many-to-many. Usability challenges of entity reconciliation in art history and photographic studies 2026-01-21T08:13:39Z This article investigates challenges in reconciling heterogeneous records across cultural institutions, focusing on art historical photo archives within the PHAROS consortium. Through case studies, the study analyses reconciliation workflows and cataloguing traditions, with attention to institutional contexts, data granularities, and modelling strategies. Reconciliation is seldom a one-to-one operation. Ambiguities, incomplete data, shifting attributions, and varying practices shape outcomes. Strategies observed include the creation of anonymous or collective entities, the use of umbrella terms, the addition of uncertainty qualifiers, and reticence when ambiguity cannot be resolved. The article highlights the need to model uncertainty explicitly, offering a framework that connects technical reconciliation methods with institutional practices. Insights from PHAROS provide guidance for designing more robust, interoperable, and sustainable cultural heritage infrastructures. 
2026-01-21T08:13:39Z Journal of documentation (ahead of print), 1-19 (2025) Marilena Daquino Francesca Mambelli Artem Kozlov 10.1108/JD-09-2025-0284 http://arxiv.org/abs/2601.14429v1 Measuring the State of Open Science in Transportation Using Large Language Models 2026-01-20T19:39:52Z Open science initiatives have strengthened scientific integrity and accelerated research progress across many fields, but the state of their practice within transportation research remains under-investigated. Key features of open science, defined here as data and code availability, are difficult to extract due to the inherent complexity of the field. Previous work has either been limited to small-scale studies due to the labor-intensive nature of manual analysis or has relied on large-scale bibliometric approaches that sacrifice contextual richness. This paper introduces an automatic and scalable feature-extraction pipeline to measure data and code availability in transportation research. We employ Large Language Models (LLMs) for this task and validate their performance against a manually curated dataset and through an inter-rater agreement analysis. We applied this pipeline to examine 10,724 research articles published in the Transportation Research Part series of journals between 2019 and 2024. Our analysis found that only 5% of quantitative papers shared a code repository, 4% of quantitative papers shared a data repository, and about 3% of papers shared both, with trends differing across journals, topics, and geographic regions. We found no significant difference in citation counts or review duration between papers that provided data and code and those that did not, suggesting a misalignment between open science efforts and traditional academic metrics. Consequently, encouraging these practices will likely require structural interventions from journals and funding agencies to supplement the lack of direct author incentives. 
The pipeline developed in this study can be readily scaled to other journals, representing a critical step toward the automated measurement and monitoring of open science practices in transportation research. 2026-01-20T19:39:52Z Junyi Ji Ruth Lu Linda Belkessa Liming Wang Silvia Varotto Yongqi Dong Nicolas Saunier Mostafa Ameli Gregory S. Macfarlane Bahman Madadi Cathy Wu http://arxiv.org/abs/2601.17036v1 LLM-Generated or Human-Written? Comparing Review and Non-Review Papers on ArXiv 2026-01-19T21:21:42Z ArXiv recently prohibited the upload of unpublished review papers to its servers in the Computer Science domain, citing a high prevalence of LLM-generated content in these categories. However, this decision was not accompanied by quantitative evidence. In this work, we investigate this claim by measuring the proportion of LLM-generated content in review vs. non-review research papers in recent years. Using two high-quality detection methods, we find a substantial increase in LLM-generated content across both review and non-review papers, with a higher prevalence in review papers. However, when considering the number of LLM-generated papers published in each category, the estimates of non-review LLM-generated papers are almost six times higher. Furthermore, we find that this policy will affect papers in certain domains far more than others, with the CS subdiscipline Computers & Society potentially facing cuts of 50%. Our analysis provides an evidence-based framework for evaluating such policy decisions, and we release our code to facilitate future investigations at: https://github.com/yanaiela/llm-review-arxiv. 
2026-01-19T21:21:42Z Yanai Elazar Maria Antoniak http://arxiv.org/abs/2602.00055v1 Examining The CoVCues Dataset: Supporting COVID Infodemic Research Through A Novel User Assessment Study 2026-01-19T20:16:37Z The public confidence and trust in online healthcare information have been greatly dented following the COVID-19 pandemic, which triggered a significant rise in online health misinformation. Existing literature shows that different datasets have been created to aid with detecting false information associated with this COVID infodemic. However, most of these datasets contain mostly unimodal data, which comprise primarily textual cues, and not visual cues, like images, infographics, and other graphic data components. Prior works point to the fact that there are only a handful of multimodal datasets that support COVID misinformation identification, and they lack an organized, processed and analyzed repository of visual cues. The novel CoVCues dataset, which represents a varied set of image artifacts, addresses this gap and advocates for the use of visual cues towards detecting online health misinformation. As part of validating the contents and utility of our CoVCues dataset, we have conducted a preliminary user assessment study, where different participants have been surveyed through a set of questionnaires to determine how effectively these dataset images contribute to the user perceived information reliability. These survey responses helped provide early insights into how different stakeholder groups interpret visual cues in the context of online health information and communication. The findings from this novel user assessment study offer valuable feedback for refining our CoVCues dataset and for supporting our claim that visual cues are underutilized but useful in combating the COVID infodemic. 
To our knowledge, this user assessment study is the first work of its kind involving COVID visual cues to demonstrate the important role that our CoVCues dataset can potentially play in aiding future COVID infodemic research. 2026-01-19T20:16:37Z 10 pages, To Be Published In Proceedings Of The 1st IEEE Workshop on Healthcare and Medical Device Security, Privacy, Resilience, and Trust (IEEE HMD-SPiRiT), Accepted & Presented At The 7th IEEE International Conference on Trust, Privacy & Security in Intelligent Systems, and Applications (IEEE TPS 2025) on Nov. 11, 2025 in Pittsburgh, PA, USA Shreetika Poudel Ankur Chatterjee http://arxiv.org/abs/2601.17035v1 Deferred Acceptance Algorithm Improves Peer Review Process 2026-01-19T19:07:12Z The peer review process is essential to the success of science, but it also delays publications and absorbs considerable effort. Journals find it increasingly difficult to recruit competent reviewers. This study presents the results of an agent-based simulation that models the current peer review process. We compared it to the simulation of a new peer review process that uses the Deferred Acceptance Algorithm (DAA) to match papers to journals. The matches are just as good, while the required number of reviews and the delays are dramatically reduced. The results show that it is possible for the scientific community to significantly optimise the peer review process. 2026-01-19T19:07:12Z 25 pages, 4 figures Christoph Bartneck Richard Watt Etienne Borde Pattara Klinpibul http://arxiv.org/abs/2601.13187v1 Scientific production in the era of Large Language Models 2026-01-19T16:10:22Z Large Language Models (LLMs) are rapidly reshaping scientific research. We analyze these changes in multiple, large-scale datasets with 2.1M preprints, 28K peer review reports, and 246M online accesses to scientific documents. 
We find: 1) scientists adopting LLMs to draft manuscripts demonstrate a large increase in paper production, ranging from 23.7-89.3% depending on scientific field and author background, 2) LLM use has reversed the relationship between writing complexity and paper quality, leading to an influx of manuscripts that are linguistically complex but substantively underwhelming, and 3) LLM adopters access and cite more diverse prior work, including books and younger, less-cited documents. These findings highlight a stunning shift in scientific production that will likely require a change in how journals, funding agencies, and tenure committees evaluate scientific works. 2026-01-19T16:10:22Z This is the author's version of the work. The definitive version was published in Science on 18 Dec 2025, DOI: 10.1126/science.adw3000. Link to the Final Published Version: https://www.science.org/doi/10.1126/science.adw3000 Science, 390(6779), pp.1240-1243 (2025) Keigo Kusumegi Xinyu Yang Paul Ginsparg Mathijs de Vaan Toby Stuart Yian Yin 10.1126/science.adw3000 http://arxiv.org/abs/2601.17033v1 Big Deal cancellations and scholarly publishing: Insights from faculty and graduate student interviews 2026-01-19T15:52:13Z Big Deal cancellations are increasingly undertaken by academic librarians faced with rising subscription costs and shrinking collections budgets. While past research has focused on librarians' decision-making processes and communication strategies, this study aims to understand the perspectives and experiences of faculty and graduate students with Big Deal cancellations through interviews at three medium-sized Canadian institutions. It considers cancellations as a collaborative process of information exchange, rather than a top-down process. This study's findings can inform how and why cancellation projects can be undertaken with enhanced understandings of their lived realities of Big Deals and the current state of scholarly publishing. 
This study has been accepted by College and Research Libraries and will be published in July 2026. 2026-01-19T15:52:13Z 28 pages Madelaine Hare Philippe Mongeon Samuel Cassady Catherine A. Johnson http://arxiv.org/abs/2601.12902v1 Audit du système d'information et du modèle de gouvernance de la Bibliothèque Numérique de l'Espace universitaire Francophone (BNEUF) du projet Initiative pour le Développement du Numérique dans l'Espace Universitaire Francophone (IDNEUF) 2026-01-19T09:56:52Z This document provides an assessment of the overall structure of the BNEUF system and how it operates within the framework of the Initiative for Digital Development in French speaking Universities (IDNEUF). This report aims to support the AUF's new strategy for 2021-2025, with its new structural and governance foundations for the implementation of the Francophonie scientifique project. It was therefore decided to reorganize existing and future digital resources and services with a view to incorporating them into the future global collaborative platform for integrated services. This report provides an external assessment with new forms of organization and use of the BNEUF system. The aim is to provide the AUF project team with new avenues for optimized management of the compiled digital resources and to synergize them with the related modules of the Atlas of Expertise and the Francophone Social Network. 2026-01-19T09:56:52Z in French language Mokhtar Ben Henda MICA http://arxiv.org/abs/2511.21755v2 Who Owns the Knowledge? Copyright, GenAI, and the Future of Academic Publishing 2026-01-18T18:17:31Z The integration of generative artificial intelligence (GenAI) and large language models (LLMs) into scientific research and higher education presents a paradigm shift, offering revolutionizing opportunities while simultaneously raising profound ethical, legal, and regulatory questions. 
This study examines the complex intersection of AI and science, with a specific focus on the challenges posed to copyright law and the principles of open science. The author argues that current regulatory frameworks in key jurisdictions like the United States, China, the European Union, and the United Kingdom, while aiming to foster innovation, contain significant gaps, particularly concerning the use of copyrighted works and open science outputs for AI training. Widely adopted licensing mechanisms, such as Creative Commons, fail to adequately address the nuances of AI training, and the pervasive lack of attribution within AI systems fundamentally challenges established notions of originality. While current doctrine treats AI training as potentially fair use, this paper argues such mechanisms are inadequate and that copyright holders should retain explicit opt-out rights regardless of fair use doctrine. Instead, the author advocates for upholding authors' rights to refuse the use of their works for AI training and proposes that universities assume a leading role in shaping responsible AI governance. The conclusion is that a harmonized international legislative effort is urgently needed to ensure transparency, protect intellectual property, and prevent the emergence of an oligopolistic market structure that could prioritize commercial profit over scientific integrity and equitable knowledge production. This is a substantially expanded and revised version of a work originally presented at the 20th International Conference on Scientometrics & Informetrics (Kochetkov, 2025). 
2025-11-24T10:34:38Z The second version substantially revises the original preprint through expanded legal analysis, representation of the new technical standard (RSL 1.0), and removal of substantial material lacking direct relevance to copyright and AI training Dmitry Kochetkov http://arxiv.org/abs/2601.12306v1 Logarithmic scaling and stochastic criticality in collective attention 2026-01-18T08:13:31Z We uncover a universal scaling law governing the dispersion of collective attention and identify its underlying stochastic criticality. By analysing large-scale ensembles of Wikipedia page views, we find that the variance of logarithmic attention grows ultraslowly, $\operatorname{Var}[\ln{X(t)}]\propto\ln{t}$, in sharp contrast to the power-law scaling typically expected for diffusive processes. We show that this behaviour is captured by a minimal stochastic differential equation driven by fractional Brownian motion, in which long-range memory ($H$) and temporal decay of volatility ($η$) enter through the single exponent $ξ\equiv H-η$. At marginality, $ξ=0$, the variance grows logarithmically, marking the critical boundary between power-law growth ($ξ>0$) and saturation ($ξ<0$). By incorporating article-level heterogeneity through a Gaussian mixture model, we further reconstruct the empirical distribution of cumulative attention within the same framework. Our results place collective attention in a distinct class of non-Markovian stochastic processes, with close affinity to ageing-like and ultraslow dynamics in glassy systems. 2026-01-18T08:13:31Z Main Text: 7 pages (2 figures, 2 tables); Supplementary Materials: 5 pages (5 figures) Keisuke Okamura http://arxiv.org/abs/2505.12452v4 Missing vs. Unused Knowledge Hypothesis for Language Model Bottlenecks in Patent Understanding 2026-01-16T22:37:58Z While large language models (LLMs) excel at factual recall, the real challenge lies in knowledge application. 
A gap persists between their ability to answer complex questions and their effectiveness in performing tasks that require that knowledge. We investigate this gap using a patent classification problem that requires deep conceptual understanding to distinguish semantically similar but objectively different patents written in dense, strategic technical language. We find that LLMs often struggle with this distinction. To diagnose the source of these failures, we introduce a framework that decomposes model errors into two categories: missing knowledge and unused knowledge. Our method prompts models to generate clarifying questions and compares three settings -- raw performance, self-answered questions that activate internal knowledge, and externally provided answers that supply missing knowledge (if any). We show that most errors stem from failures to deploy existing knowledge rather than from true knowledge gaps. We also examine how models differ in constructing task-specific question-answer databases. Smaller models tend to generate simpler questions that they, and other models, can retrieve and use effectively, whereas larger models produce more complex questions that are less effective, suggesting complementary strengths across model scales. Together, our findings highlight that shifting evaluation from static fact recall to dynamic knowledge application offers a more informative view of model capabilities. 2025-05-18T15:04:02Z We open-source our patent dataset at https://huggingface.co/datasets/UchiKlab/patent_understanding Siyang Wu Honglin Bao Nadav Kunievsky James A. Evans http://arxiv.org/abs/2601.11425v1 PubMed-OCR: PMC Open Access OCR Annotations 2026-01-16T16:44:50Z PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. 
The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions. 2026-01-16T16:44:50Z Hunter Heidenreich Yosheb Getachew Olivia Dinica Ben Elliott