Archives, archival bond, and digital representation: A case study with the International Image Interoperability Framework

2026-01-21T09:50:32Z

Within the archival sector, digitization has long been a strategic initiative to ensure greater availability of historical documents. In recent years, the promotion of guidelines and standards, combined with technological advancements, has established methodologies and best practices and produced tools that facilitate massive digitization projects. However, despite the availability of technological solutions and guidelines, digitization is still mostly understood as scanning documents and publishing the resulting images online. This practice can be problematic for representing the complex, relational structure of a fonds, namely the archival bond that establishes the natural ordering of documents into archival units. This is particularly relevant when the fonds also has a multimedia component, such as an audiovisual component, that is often reproduced on different platforms disconnected from the textual documents. This article addresses the challenges linked to digitization in the archival sector and proposes a methodological framework for representing fonds with respect to their native organization. For this purpose, the International Image Interoperability Framework (IIIF) is employed to configure a specific model that respects the archive's hierarchical structure. In particular, this model is configured to maintain the archival bond and to enrich the resource's semantics so that the IIIF model is semantically interoperable. To demonstrate the adaptability of the framework to the archival domain, the "PCI-Unitelefilm" fonds of the Fondazione Archivio Audiovisivo del Movimento Operaio e Democratico (AAMOD) serves as the case study.
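As an illustration of how a hierarchical fonds might be expressed in IIIF, the following minimal sketch nests IIIF Presentation API 3.0 Collections (fonds, then series) around Manifest references (archival units). It is not the model proposed in the article: the base URL, the slugs, the labels, and the helper names `label`, `manifest_ref`, and `collection` are all hypothetical, and only the `@context`, `id`, `type`, `label`, and `items` properties come from the IIIF specification.

```python
import json

# Context URI defined by the IIIF Presentation API 3.0.
IIIF_CONTEXT = "http://iiif.io/api/presentation/3/context.json"
# Hypothetical base URL, used only for illustration.
BASE = "https://example.org/iiif"


def label(text, lang="en"):
    """Build a IIIF language map for a single label."""
    return {lang: [text]}


def manifest_ref(slug, title):
    """Reference to a Manifest (one archival unit) inside a Collection."""
    return {
        "id": f"{BASE}/manifest/{slug}",
        "type": "Manifest",
        "label": label(title),
    }


def collection(slug, title, items, top_level=False):
    """A Collection node; only the top-level resource carries @context."""
    node = {
        "id": f"{BASE}/collection/{slug}",
        "type": "Collection",
        "label": label(title),
        "items": items,
    }
    if top_level:
        node = {"@context": IIIF_CONTEXT, **node}
    return node


# Fonds -> series -> units: the nesting mirrors the archival hierarchy,
# and textual and audiovisual units sit side by side in the same series.
fonds = collection(
    "fonds",
    "Example fonds",
    [
        collection(
            "series-1",
            "Series 1",
            [
                manifest_ref("unit-1", "Unit 1: scanned documents"),
                manifest_ref("unit-2", "Unit 2: film reel"),
            ],
        )
    ],
    top_level=True,
)

print(json.dumps(fonds, indent=2))
```

Keeping the units as references inside nested Collections lets a viewer traverse the hierarchy without flattening it into a single list of images.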

Many-to-many. Usability challenges of entity reconciliation in art history and photographic studies

2026-01-21T08:13:39Z

This article investigates challenges in reconciling heterogeneous records across cultural institutions, focusing on art historical photo archives within the PHAROS consortium. Through case studies, the study analyses reconciliation workflows and cataloguing traditions, with attention to institutional contexts, data granularities, and modelling strategies. Reconciliation is seldom a one-to-one operation. Ambiguities, incomplete data, shifting attributions, and varying practices shape outcomes. Strategies observed include the creation of anonymous or collective entities, the use of umbrella terms, the addition of uncertainty qualifiers, and reticence when ambiguity cannot be resolved. The article highlights the need to model uncertainty explicitly, offering a framework that connects technical reconciliation methods with institutional practices. Insights from PHAROS provide guidance for designing more robust, interoperable, and sustainable cultural heritage infrastructures.

Measuring the State of Open Science in Transportation Using Large Language Models

2026-01-20T19:39:52Z

Open science initiatives have strengthened scientific integrity and accelerated research progress across many fields, but the state of their practice within transportation research remains under-investigated. Key features of open science, defined here as data and code availability, are difficult to extract due to the inherent complexity of the field. Previous work has either been limited to small-scale studies due to the labor-intensive nature of manual analysis or has relied on large-scale bibliometric approaches that sacrifice contextual richness. This paper introduces an automatic and scalable feature-extraction pipeline to measure data and code availability in transportation research. We employ Large Language Models (LLMs) for this task and validate their performance against a manually curated dataset and through an inter-rater agreement analysis. We applied this pipeline to examine 10,724 research articles published in the Transportation Research Part series of journals between 2019 and 2024. Our analysis found that only 5% of quantitative papers shared a code repository, 4% of quantitative papers shared a data repository, and about 3% of papers shared both, with trends differing across journals, topics, and geographic regions. We found no significant difference in citation counts or review duration between papers that provided data and code and those that did not, suggesting a misalignment between open science efforts and traditional academic metrics. Consequently, encouraging these practices will likely require structural interventions from journals and funding agencies to compensate for the lack of direct author incentives. The pipeline developed in this study can be readily scaled to other journals, representing a critical step toward the automated measurement and monitoring of open science practices in transportation research.
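The abstract does not say which statistic the inter-rater agreement analysis uses. As one common choice, the sketch below computes Cohen's kappa between two raters, for example an LLM and a human annotator labelling whether each paper shares code. The function name `cohens_kappa` and the toy labels are assumptions made for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical): 1 = "shares code", 0 = "does not".
llm_labels = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
human_labels = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohens_kappa(llm_labels, human_labels)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on 8 of 10 items, but because agreement by chance alone would already be 0.52, kappa is noticeably lower than the raw agreement rate.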

LLM-Generated or Human-Written? Comparing Review and Non-Review Papers on ArXiv

2026-01-19T21:21:42Z

ArXiv recently prohibited the upload of unpublished review papers to its servers in the Computer Science domain, citing a high prevalence of LLM-generated content in these categories. However, this decision was not accompanied by quantitative evidence. In this work, we investigate this claim by measuring the proportion of LLM-generated content in review vs. non-review research papers in recent years. Using two high-quality detection methods, we find a substantial increase in LLM-generated content across both review and non-review papers, with a higher prevalence in review papers. However, when considering the number of LLM-generated papers published in each category, the estimates of non-review LLM-generated papers are almost six times higher. Furthermore, we find that this policy will affect papers in certain domains far more than others, with the CS subdiscipline Computers & Society potentially facing cuts of 50%. Our analysis provides an evidence-based framework for evaluating such policy decisions, and we release our code to facilitate future investigations at: https://github.com/yanaiela/llm-review-arxiv.

Examining The CoVCues Dataset: Supporting COVID Infodemic Research Through A Novel User Assessment Study

2026-01-19T20:16:37Z

Public confidence and trust in online healthcare information have been greatly dented following the COVID-19 pandemic, which triggered a significant rise in online health misinformation. Existing literature shows that various datasets have been created to help detect false information associated with this COVID infodemic. However, most of these datasets are unimodal, comprising primarily textual cues rather than visual cues such as images, infographics, and other graphic components. Prior work indicates that only a handful of multimodal datasets support COVID misinformation identification, and that they lack an organized, processed, and analyzed repository of visual cues. The novel CoVCues dataset, which represents a varied set of image artifacts, addresses this gap and advocates for the use of visual cues in detecting online health misinformation. To validate the contents and utility of the CoVCues dataset, we conducted a preliminary user assessment study in which participants were surveyed through a set of questionnaires to determine how effectively the dataset images contribute to user-perceived information reliability. The survey responses provide early insights into how different stakeholder groups interpret visual cues in the context of online health information and communication. The findings offer valuable feedback for refining the CoVCues dataset and support our claim that visual cues are underutilized but useful in combating the COVID infodemic. To our knowledge, this user assessment study is the first of its kind involving COVID visual cues, and it demonstrates the important role that the CoVCues dataset can potentially play in aiding future COVID infodemic research.

Deferred Acceptance Algorithm Improves Peer Review Process

2026-01-19T19:07:12Z

The peer review process is essential to the success of science, but it also delays publications and absorbs considerable effort. Journals find it increasingly difficult to recruit competent reviewers. This study presents the results of an agent-based simulation that models the current peer review process. We compared it to a simulation of a new peer review process that uses the Deferred Acceptance Algorithm (DAA) to match papers to journals. The resulting matches are just as good, while the required number of reviews and the delays are dramatically reduced. The results show that it is possible for the scientific community to significantly optimise the peer review process.
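For readers unfamiliar with deferred acceptance, the sketch below shows the textbook paper-proposing version with journal capacities (Gale-Shapley, many-to-one). It is not the simulation from the study: how the authors obtain preferences and capacities is not stated in the abstract, and the function name `deferred_acceptance` and the toy papers and journals are hypothetical.

```python
from collections import deque


def deferred_acceptance(paper_prefs, journal_prefs, capacity):
    """Paper-proposing deferred acceptance with journal capacities.

    paper_prefs:   {paper: [journals, most preferred first]}
    journal_prefs: {journal: [acceptable papers, most preferred first]}
    capacity:      {journal: number of slots}
    Returns {journal: [papers tentatively held when proposals stop]}.
    """
    # Lower rank means the journal prefers the paper more.
    rank = {j: {p: i for i, p in enumerate(prefs)}
            for j, prefs in journal_prefs.items()}
    next_choice = {p: 0 for p in paper_prefs}  # next journal index to try
    held = {j: [] for j in journal_prefs}      # tentative acceptances
    free = deque(paper_prefs)                  # papers without a slot

    while free:
        paper = free.popleft()
        prefs = paper_prefs[paper]
        if next_choice[paper] >= len(prefs):
            continue  # preference list exhausted: paper stays unmatched
        journal = prefs[next_choice[paper]]
        next_choice[paper] += 1
        if paper not in rank[journal]:
            free.append(paper)  # journal considers the paper unacceptable
            continue
        held[journal].append(paper)
        held[journal].sort(key=rank[journal].__getitem__)
        if len(held[journal]) > capacity[journal]:
            # Over capacity: release the least preferred held paper.
            free.append(held[journal].pop())
    return held


# Toy instance (hypothetical): three papers, two single-slot journals.
paper_prefs = {"p1": ["A", "B"], "p2": ["A", "B"], "p3": ["A", "B"]}
journal_prefs = {"A": ["p2", "p1", "p3"], "B": ["p1", "p3", "p2"]}
capacity = {"A": 1, "B": 1}
matching = deferred_acceptance(paper_prefs, journal_prefs, capacity)
print(matching)
```

Acceptances stay tentative until no paper has a proposal left to make, which is what allows a journal to trade up when a better paper proposes later.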

Scientific production in the era of Large Language Models

2026-01-19T16:10:22Z

Large Language Models (LLMs) are rapidly reshaping scientific research. We analyze these changes in multiple large-scale datasets with 2.1M preprints, 28K peer review reports, and 246M online accesses to scientific documents. We find: 1) scientists adopting LLMs to draft manuscripts demonstrate a large increase in paper production, ranging from 23.7% to 89.3% depending on scientific field and author background, 2) LLM use has reversed the relationship between writing complexity and paper quality, leading to an influx of manuscripts that are linguistically complex but substantively underwhelming, and 3) LLM adopters access and cite more diverse prior work, including books and younger, less-cited documents. These findings highlight a stunning shift in scientific production that will likely require a change in how journals, funding agencies, and tenure committees evaluate scientific works.

Big Deal cancellations and scholarly publishing: Insights from faculty and graduate student interviews

2026-01-19T15:52:13Z

Big Deal cancellations are increasingly undertaken by academic librarians faced with rising subscription costs and shrinking collections budgets. While past research has focused on librarians' decision-making processes and communication strategies, this study aims to understand the perspectives and experiences of faculty and graduate students with Big Deal cancellations through interviews at three medium-sized Canadian institutions. It considers cancellations as a collaborative process of information exchange, rather than a top-down process. This study's findings can inform how and why cancellation projects are undertaken, grounded in a better understanding of faculty and graduate students' lived experiences of Big Deals and of the current state of scholarly publishing. This study has been accepted by College and Research Libraries and will be published in July 2026.

Audit du système d'information et du modèle de gouvernance de la Bibliothèque Numérique de l'Espace universitaire Francophone (BNEUF) du projet Initiative pour le Développement du Numérique dans l'Espace Universitaire Francophone (IDNEUF)

2026-01-19T09:56:52Z

This document provides an assessment of the overall structure of the BNEUF system and how it operates within the framework of the Initiative for Digital Development in French-speaking Universities (IDNEUF). This report aims to support the AUF's new strategy for 2021-2025, with its new structural and governance foundations for the implementation of the Francophonie scientifique project. It was therefore decided to reorganize existing and future digital resources and services with a view to incorporating them into the future global collaborative platform for integrated services. This report provides an external assessment and proposes new forms of organization and use of the BNEUF system. The aim is to provide the AUF project team with new avenues for optimized management of the compiled digital resources and to align them with the related modules of the Atlas of Expertise and the Francophone Social Network.

Who Owns the Knowledge? Copyright, GenAI, and the Future of Academic Publishing

2026-01-18T18:17:31Z

The integration of generative artificial intelligence (GenAI) and large language models (LLMs) into scientific research and higher education presents a paradigm shift, offering transformative opportunities while simultaneously raising profound ethical, legal, and regulatory questions. This study examines the complex intersection of AI and science, with a specific focus on the challenges posed to copyright law and the principles of open science. The author argues that current regulatory frameworks in key jurisdictions like the United States, China, the European Union, and the United Kingdom, while aiming to foster innovation, contain significant gaps, particularly concerning the use of copyrighted works and open science outputs for AI training. Widely adopted licensing mechanisms, such as Creative Commons, fail to adequately address the nuances of AI training, and the pervasive lack of attribution within AI systems fundamentally challenges established notions of originality. While current doctrine treats AI training as potentially fair use, this paper argues that such mechanisms are inadequate and that copyright holders should retain explicit opt-out rights regardless of fair use doctrine. Accordingly, the author advocates for upholding authors' rights to refuse the use of their works for AI training and proposes that universities assume a leading role in shaping responsible AI governance. The conclusion is that a harmonized international legislative effort is urgently needed to ensure transparency, protect intellectual property, and prevent the emergence of an oligopolistic market structure that could prioritize commercial profit over scientific integrity and equitable knowledge production. This is a substantially expanded and revised version of a work originally presented at the 20th International Conference on Scientometrics & Informetrics (Kochetkov, 2025).

Logarithmic scaling and stochastic criticality in collective attention

2026-01-18T08:13:31Z

We uncover a universal scaling law governing the dispersion of collective attention and identify its underlying stochastic criticality. By analysing large-scale ensembles of Wikipedia page views, we find that the variance of logarithmic attention grows ultraslowly, $\operatorname{Var}[\ln{X(t)}]\propto\ln{t}$, in sharp contrast to the power-law scaling typically expected for diffusive processes. We show that this behaviour is captured by a minimal stochastic differential equation driven by fractional Brownian motion, in which long-range memory ($H$) and temporal decay of volatility ($\eta$) enter through the single exponent $\xi\equiv H-\eta$. At marginality, $\xi=0$, the variance grows logarithmically, marking the critical boundary between power-law growth ($\xi>0$) and saturation ($\xi<0$). By incorporating article-level heterogeneity through a Gaussian mixture model, we further reconstruct the empirical distribution of cumulative attention within the same framework. Our results place collective attention in a distinct class of non-Markovian stochastic processes, with close affinity to ageing-like and ultraslow dynamics in glassy systems.
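The three regimes named in the abstract can be illustrated with a heuristic scaling sketch. The abstract does not give the stochastic differential equation, so the integrand below is an assumption chosen only to reproduce the stated behaviour: the variance is taken to accumulate at a rate proportional to $s^{2\xi-1}$ from a reference time $s=1$.

```latex
\operatorname{Var}[\ln X(t)] \;\sim\; \int_{1}^{t} s^{2\xi-1}\,\mathrm{d}s
\;=\;
\begin{cases}
\dfrac{t^{2\xi}-1}{2\xi}, & \xi \neq 0,\\[6pt]
\ln t, & \xi = 0,
\end{cases}
\qquad \xi \equiv H-\eta .
```

Under this assumption, $\xi>0$ gives power-law growth $\propto t^{2\xi}$ at large $t$, $\xi<0$ gives saturation at $1/(2|\xi|)$, and the marginal case $\xi=0$ is the limit of $(t^{2\xi}-1)/(2\xi)$ as $\xi\to 0$, which is $\ln t$.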

Missing vs. Unused Knowledge Hypothesis for Language Model Bottlenecks in Patent Understanding

2026-01-16T22:37:58Z

While large language models (LLMs) excel at factual recall, the real challenge lies in knowledge application. A gap persists between their ability to answer complex questions and their effectiveness in performing tasks that require that knowledge. We investigate this gap using a patent classification problem that requires deep conceptual understanding to distinguish semantically similar but objectively different patents written in dense, strategic technical language. We find that LLMs often struggle with this distinction. To diagnose the source of these failures, we introduce a framework that decomposes model errors into two categories: missing knowledge and unused knowledge. Our method prompts models to generate clarifying questions and compares three settings -- raw performance, self-answered questions that activate internal knowledge, and externally provided answers that supply missing knowledge (if any). We show that most errors stem from failures to deploy existing knowledge rather than from true knowledge gaps. We also examine how models differ in constructing task-specific question-answer databases. Smaller models tend to generate simpler questions that they, and other models, can retrieve and use effectively, whereas larger models produce more complex questions that are less effective, suggesting complementary strengths across model scales. Together, our findings highlight that shifting evaluation from static fact recall to dynamic knowledge application offers a more informative view of model capabilities.

PubMed-OCR: PMC Open Access OCR Annotations

2026-01-16T16:44:50Z

PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.

How Do We Engage with Other Disciplines? A Framework to Study Meaningful Interdisciplinary Discourse in Scholarly Publications

2026-01-15T23:16:49Z

With the rising popularity of interdisciplinary work and increasing institutional incentives in this direction, there is a growing need to understand how resulting publications incorporate ideas from multiple disciplines. Existing computational approaches, such as affiliation diversity, keywords, and citation patterns, do not account for how individual citations are used to advance the citing work. Although prior studies have proposed taxonomies of citation purpose to address this gap, these frameworks are not well-suited to interdisciplinary research and do not provide quantitative measures of citation engagement quality. To address these limitations, we propose a framework for the evaluation of citation engagement in interdisciplinary Natural Language Processing (NLP) publications. Our approach introduces a citation purpose taxonomy tailored to interdisciplinary work, supported by an annotation study. We demonstrate the utility of this framework through a thorough analysis of publications at the intersection of NLP and Computational Social Science.

Aletheia-Probe: A Tool for Automated Journal Assessment

2026-01-15T14:26:41Z

Assessing journal legitimacy during literature reviews, publication venue selection, and citation verification requires consulting information scattered across multiple incompatible datasets. This paper introduces Aletheia-Probe, an open-source tool that systematically aggregates curated databases and pattern analysis from multiple authoritative sources to provide transparent, confidence-scored journal assessments. The tool explicitly reports which sources were consulted, what each found, and where evidence conflicts. The tool integrates into research workflows through command-line and programmatic interfaces. It reduces manual assessment overhead while explicitly flagging uncertain cases. We present the tool's architecture, core design principles, and practical integration approach. Comprehensive empirical validation will be presented in forthcoming work.