Contested Citations: The Role of Open Access Publications in Wikipedia's Scientific Disputes

2025-10-15T20:20:16Z

Wikipedia, one of the largest online encyclopedias, relies on scientific publications as authoritative sources. The increasing prevalence of open access (OA) publishing has expanded the public availability of scientific knowledge; however, its impact on the dynamics of knowledge contestation within collaborative environments such as Wikipedia remains underexplored. To address this gap, we analyze a large-scale dataset that combines Wikipedia edit histories with metadata from scientific publications cited in disputed Wikipedia articles. Our study investigates the characteristics of scientific publications involved in disputes and examines whether OA articles are more likely to be contested than paywalled ones. We find that scientific disputes on Wikipedia are more frequent in the social sciences and humanities, where topics often involve social values and interpretative variability. Publications with higher citation counts and publications in high-impact journals are more likely to be involved in disputes. OA publications are significantly more likely to be involved in disputes and tend to be contested sooner after publication than paywalled articles. This pattern suggests that increased accessibility accelerates both engagement and scrutiny. The relationship between OA status and dispute involvement also varies across disciplines, reflecting differences in Wikipedia editorial practices and norms. These findings highlight the dual role of OA in both expanding access to scientific knowledge and increasing its visibility in contexts of public negotiation and debate. This study contributes to a broader understanding of how scientific knowledge is collaboratively constructed and contested on open platforms, offering insights for research on open science, scholarly communication, and digital knowledge governance.

Position: The Artificial Intelligence and Machine Learning Community Should Adopt a More Transparent and Regulated Peer Review Process

2025-10-15T06:44:35Z

The rapid growth of submissions to top-tier Artificial Intelligence (AI) and Machine Learning (ML) conferences has prompted many venues to transition from closed to open review platforms. Some have fully embraced open peer reviews, allowing public visibility throughout the process, while others adopt hybrid approaches, such as releasing reviews only after final decisions or keeping reviews private despite using open peer review systems. In this work, we analyze the strengths and limitations of these models, highlighting the growing community interest in transparent peer review. To support this discussion, we examine insights from Paper Copilot, a website launched two years ago to aggregate and analyze AI / ML conference data while engaging a global audience. The site has attracted over 200,000 early-career researchers, particularly those aged 18-34 from 177 countries, many of whom are actively engaged in the peer review process. Drawing on our findings, this position paper advocates for a more transparent, open, and well-regulated peer review process, aiming to foster greater community involvement and propel advancements in the field.

APC waivers and Ukraine's publishing output in Gold OA journals: Evidence from five commercial publishers

2025-10-13T20:01:59Z

This study examines the effect of article processing charge (APC) waivers on the participation of Ukrainian researchers in fully Gold Open Access (Gold OA) journals published by the five largest academic publishers - Elsevier, SAGE, Springer Nature, Taylor & Francis, and Wiley - during the period 2019-2024. These publishers were selected because, in response to the full-scale war launched against Ukraine in 2022, all five introduced emergency 100% APC-waiver policies for Ukrainian authors. Using bibliometric data from the Web of Science Core Collection, the study analyses publication trends in Ukrainian-authored articles in fully Gold OA journals of these publishers before and after 2022. The results show a marked post-2022 increase in Ukraine's Gold OA output, particularly in journals published by Springer Nature and Elsevier. Disciplinary and publisher-specific patterns are evident, with especially strong growth in the medical and applied sciences. The findings underscore the potential of targeted support measures during times of crisis, while also illustrating the inherent limitations of APC-based publishing models in fostering equitable scholarly communication.

Leveraging LLMs for Semi-Automatic Corpus Filtration in Systematic Literature Reviews

2025-10-13T13:48:29Z

The creation of systematic literature reviews (SLRs) is critical for analyzing the landscape of a research field and guiding future research directions. However, retrieving and filtering the literature corpus for an SLR is highly time-consuming and requires extensive manual effort, as keyword-based searches in digital libraries often return numerous irrelevant publications. In this work, we propose a pipeline that leverages multiple large language models (LLMs), which classify papers based on descriptive prompts and decide jointly through a consensus scheme. The entire process is human-supervised and interactively controlled via our open-source visual analytics web interface, LLMSurver, which enables real-time inspection and modification of model outputs. We evaluate our approach using ground-truth data from a recent SLR comprising over 8,000 candidate papers, benchmarking both open and commercial state-of-the-art LLMs from mid-2024 and fall 2025. Results demonstrate that our pipeline significantly reduces manual effort while achieving lower error rates than single human annotators. Furthermore, modern open-source models prove sufficient for this task, making the method accessible and cost-effective. Overall, our work demonstrates how responsible human-AI collaboration can accelerate and enhance systematic literature reviews within academic workflows.
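
The abstract does not spell out the consensus scheme, so the following is only a minimal sketch of one plausible variant: each model casts a label for a paper, and the paper is routed to a human reviewer unless enough models agree. The function name `consensus_label`, the labels `"include"`, `"exclude"`, and `"review"`, and the `min_agreement` threshold are all hypothetical and are not taken from the paper or from LLMSurver.

```python
from collections import Counter
from typing import Sequence


def consensus_label(votes: Sequence[str], min_agreement: float = 1.0) -> str:
    """Combine per-model labels into one decision (illustrative scheme only).

    Returns the most common label when it is the unique winner and its share
    of the votes reaches `min_agreement`; otherwise returns "review" so that
    a human can decide. The default of 1.0 demands unanimity.
    """
    if not votes:
        return "review"  # no model output: always defer to a human
    ranked = Counter(votes).most_common()
    label, top = ranked[0]
    # A tie for first place means there is no clear winner.
    tied = len(ranked) > 1 and ranked[1][1] == top
    if tied or top / len(votes) < min_agreement:
        return "review"
    return label


if __name__ == "__main__":
    print(consensus_label(["include", "include", "include"]))
    print(consensus_label(["include", "exclude", "include"]))
    print(consensus_label(["include", "exclude", "include"], min_agreement=0.6))
```

Lowering `min_agreement` trades human effort for a higher risk of automated misclassification; in a supervised workflow of the kind described, that threshold would presumably be tuned interactively rather than fixed in code.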

A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications

2025-10-13T13:10:47Z

We present a dataset of 833k paragraphs extracted from CC-BY licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with additional European languages represented. Each paragraph is annotated with language identification (using fastText) and scientific domain (from OpenAlex). This dataset, derived from the French Open Science Monitor corpus and processed using GROBID, enables training of text classification models and development of named entity recognition systems for scientific literature mining. The dataset is publicly available on HuggingFace https://doi.org/10.57967/hf/6679 under a CC-BY license.

Updating the Complex Systems Keyword Diagram Using Collective Feedback and Latest Literature Data

2025-10-13T01:48:58Z

The complex systems keyword diagram generated by the author in 2010 has been used widely for a variety of educational and outreach purposes, but it definitely needs a major update and reorganization. This short paper reports our recent attempt to update the keyword diagram using information collected from the following multiple sources: (a) collective feedback posted on social media, (b) recent reference books on complex systems and network science, (c) online resources on complex systems, and (d) keyword search hits obtained using OpenAlex, an open-access bibliographic catalogue of scientific publications. The data (a), (b) and (c) were used to incorporate the research community's and other public communities' perceptions of the relevant topics, whereas the data (d) was used to obtain more objective measurements of the keywords' relevance and associations from publications made in complex systems science. Results revealed differences and overlaps between public perception and actual usage of keywords in publications on complex systems. Four topical communities were obtained from the keyword association network, although they were highly intertwined with each other. We hope that the resulting network visualization of complex systems keywords provides a more up-to-date, accurate topic map of the field of complex systems as of today.

Fractional stochastic model of citation dynamics with memory and volatility

2025-10-12T07:41:44Z

Understanding the statistical laws governing citation dynamics remains a fundamental challenge in network theory and the science of science. Citation networks typically exhibit in-degree distributions well approximated by log-normal distributions yet also display power-law behaviour in the high-citation regime -- an apparent contradiction lacking a unified explanation. Here we identify a previously unrecognised phenomenon: the variance of the logarithm of citation counts per unit time follows a power law with respect to time ($t$) since publication, scaling as $t^{H}$, with $H$ constant. This discovery introduces a new challenge while simultaneously offering a crucial clue to resolving this discrepancy. We develop a stochastic model in which latent attention to publications evolves through a memory-driven process with cumulative advantage, modelled as fractional Brownian motion with Hurst parameter $H$ and volatility. We show that antipersistent fluctuations in attention ($H < 1/2$) yield log-normal citation distributions, whereas persistent attention dynamics ($H > 1/2$) favour heavy-tailed power laws, thus resolving the log-normal--power-law contradiction. Numerical simulations confirm both the $t^{H}$ law and the transition between regimes. Empirical analysis of arXiv e-prints indicates that the latent attention process is intrinsically antipersistent ($H \approx 0.13$). By linking memory effects and stochastic fluctuations in attention to broader network dynamics, our findings provide a unifying framework for understanding the evolution of collective attention in science and other attention-driven processes.
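
To make the terms "antipersistent" and "persistent" concrete, the sketch below evaluates the standard autocovariance of the unit-step increments of fractional Brownian motion (fractional Gaussian noise), $\gamma(k) = \tfrac{1}{2}\left(|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H}\right)$. This is a textbook property of fractional Brownian motion, not the paper's citation model; the function name `fgn_autocovariance` is hypothetical.

```python
def fgn_autocovariance(lag: int, hurst: float) -> float:
    """Autocovariance of unit-variance fractional Gaussian noise at an integer lag.

    Fractional Gaussian noise is the sequence of unit-step increments of
    fractional Brownian motion with Hurst parameter `hurst` (0 < hurst < 1).
    """
    if not 0.0 < hurst < 1.0:
        raise ValueError("hurst must lie strictly between 0 and 1")
    k = abs(lag)
    two_h = 2.0 * hurst
    return 0.5 * ((k + 1) ** two_h - 2.0 * k ** two_h + abs(k - 1) ** two_h)


if __name__ == "__main__":
    # Sign of the lag-1 covariance: negative means successive increments tend
    # to reverse (antipersistent), positive means they tend to continue.
    for h in (0.13, 0.5, 0.8):
        print(h, fgn_autocovariance(1, h))
```

At lag 1 the expression reduces to $\tfrac{1}{2}(2^{2H} - 2)$, which is negative for $H < 1/2$, zero for $H = 1/2$ (ordinary Brownian motion), and positive for $H > 1/2$. For the reported $H \approx 0.13$ the lag-1 covariance is therefore negative.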

From Funding to Findings (FIND): An Open Database of NSF Awards and Research Outputs

2025-10-11T20:45:42Z

Public funding plays a central role in driving scientific discovery. To better understand the link between research inputs and outputs, we introduce FIND (Funding-Impact NSF Database), an open-access dataset that systematically links NSF grant proposals to their downstream research outputs, including publication metadata and abstracts. The primary contribution of this project is the creation of a large-scale, structured dataset that enables transparency, impact evaluation, and metascience research on the returns to public funding. To illustrate the potential of FIND, we present two proof-of-concept NLP applications. First, we analyze whether the language of grant proposals can predict the subsequent citation impact of funded research. Second, we leverage large language models to extract scientific claims from both proposals and resulting publications, allowing us to measure the extent to which funded projects deliver on their stated goals. Together, these applications highlight the utility of FIND for advancing metascience, informing funding policy, and enabling novel AI-driven analyses of the scientific process.

SoK: Scope and Mission of CS&Law

2025-10-09T18:32:31Z

We systematize the intellectual scope of the ACM Computer Science and Law Symposium (CS&Law). In particular, we address the meaning and importance of the word "and" in the name of the symposium. We identify previously published papers (from CS&Law and other forums) that exemplify different aspects of the CS&Law scope and note that the scope is expected to evolve as the symposium and the community grow and change. To round out our systematization of the still nascent research area, we also discuss the mission of CS&Law: What might the symposium seek to accomplish beyond providing a forum for intellectual exchange and community formation?

The geography of novel and atypical research

2025-10-09T06:42:46Z

The production of knowledge has increasingly become a global endeavor. Yet, location-related factors, such as local working environment and national policy designs, may continue to affect what kind of science is being pursued. Here we examine the geography of the production of creative science by country, through the lens of novelty and atypicality proposed in Uzzi et al. (2013). We quantify a country's representativeness in novel and atypical science, finding persistent differences in propensity to generate creative works, even among developed countries that are large producers in science. We further cluster countries based on how their tendency to publish novel science changes over time, identifying one group of emerging countries. Our analyses point out the recent emergence of China not only as a large producer in science but also as a leader that disproportionately produces more novel and atypical research. Discipline-specific analysis indicates that China's over-production of atypical science is limited to a few disciplines, especially its most prolific ones like materials science and chemistry.

Evolve with Your Research -- Stepwise System Evolution from Document-driven to Fact-centric Research Data Management in Materials Science

2025-10-07T21:44:15Z

The digitalisation of research requires data management systems capable of supporting a broad spectrum of usage scenarios, ranging from document-oriented repositories to fully factographic environments. This paper introduces a methodological approach for the stepwise development of such systems, illustrated by the MatInf Research Data Management System (RDMS). The proposed framework combines a graph-based STAR paradigm (emphasising Statefulness, Traceability, Aim, and Result) with the SET methodology, which enables systematic Standardisation, Extraction, and Testing of research data. Together, these principles provide a pathway towards FAIR-compliant data infrastructures, facilitating reproducibility, re-use, and integration of heterogeneous materials science data. By demonstrating the gradual consolidation of research outputs into unified datasets, this study highlights how adaptive RDMS design can support accelerated scientific discovery and enhance collaborative research in large-scale projects.

A bibliometric study on mathematical oncology: interdisciplinarity, internationality, collaboration and trending topics

2025-10-07T17:46:39Z

Mathematical oncology is an interdisciplinary research field where the mathematical sciences meet cancer research. Being situated at the intersection of these two fields makes mathematical oncology highly dynamic, as practicing researchers are incentivised to quickly adapt to both technical and medical research advances. Determining the scope of mathematical oncology is therefore not straightforward; however, it is important for purposes related to funding allocation, education, scientific communication, and community organisation. To address this issue, we conduct a bibliometric analysis of mathematical oncology. We compare our results to the broader field of mathematical biology, and position our findings within theoretical science of science frameworks. Based on article metadata and citation flows, our results provide evidence that mathematical oncology has undergone a significant evolution since the 1960s, marked by increased interactions with other disciplines, geographical expansion, larger research teams, and greater diversity in studied topics. The latter finding contributes to the greater discussion on which models different research communities consider to be valuable in the era of big data and machine learning. Further, the results presented in this study provide quantitative motivation for supporting international collaboration networks so that new countries can enter and remain in the field, and indicate that mathematical oncology benefits both mathematics and the life sciences.

The Software Observatory: aggregating and analysing software metadata for trend computation and FAIR assessment

2025-10-07T09:15:02Z

In the ever-changing realm of research software development, it is crucial for the scientific community to grasp current trends to identify gaps that can potentially hinder scientific progress. Adherence to the FAIR (Findable, Accessible, Interoperable, Reusable) principles can serve as a proxy to understand those trends and provide a mechanism to propose specific actions. The Software Observatory at OpenEBench (https://openebench.bsc.es/observatory) is a novel web portal that consolidates software metadata from various sources, offering comprehensive insights into critical research software aspects. Our platform enables users to analyse trends, identify patterns and advancements within the Life Sciences research software ecosystem, and understand its evolution over time. It also evaluates research software according to FAIR principles for research software, providing scores for different indicators. Users can visualise this metadata at different levels of granularity, ranging from the entire software landscape to specific communities to individual software entries through the FAIRsoft Evaluator. The FAIRsoft Evaluator component streamlines the assessment process, helping developers efficiently evaluate and obtain guidance to improve their software's FAIRness. The Software Observatory represents a valuable resource for researchers and software developers, as well as stakeholders, promoting better software development practices and adherence to FAIR principles for research software.

Cosmos 1.0: a multidimensional map of the emerging technology frontier

2025-10-07T00:57:02Z

This paper introduces the Cosmos 1.0 dataset and describes a novel methodology for creating and mapping a universe of technologies, adjacent concepts, and entities. We utilise various source data that contain a rich diversity and breadth of contemporary knowledge. The Cosmos 1.0 dataset comprises 23,544 technology-adjacent entities (TA23k) with a hierarchical structure and eight categories of external indices. Each entity is represented by a 100-dimensional contextual embedding vector, which we use to assign it to seven thematic tech-clusters (TC7) and three meta tech-clusters (TC3). We manually verify 100 emerging technologies (ET100). This dataset is enriched with additional indices specifically developed to assess the landscape of emerging technologies, including the Technology Awareness Index, Generality Index, Deeptech, and Age of Tech Index. The dataset incorporates extensive metadata sourced from Wikipedia and linked data from third-party sources such as Crunchbase, Google Books, OpenAlex and Google Scholar, which are used to validate the relevance and accuracy of the constructed indices.

Identity resolution of software metadata using Large Language Models

2025-10-06T14:17:23Z

Software is an essential component of research. However, it has received less attention than research data. Recently, there has been an increase in efforts to acknowledge and highlight the importance of software in research activities. Structured metadata from platforms like bio.tools, Bioconductor, and Galaxy ToolShed offers valuable insights into research software in the Life Sciences. Although originally intended to support discovery and integration, this metadata can be repurposed for large-scale analysis of software practices. However, its quality and completeness vary across platforms, reflecting diverse documentation practices. To gain a comprehensive view of software development and sustainability, consolidating this metadata is necessary, but requires robust mechanisms to address its heterogeneity and scale. This article presents an evaluation of instruction-tuned large language models for the task of software metadata identity resolution, a critical step in assembling a cohesive collection of research software. Such a collection is the reference component for the Software Observatory at OpenEBench, a platform that aggregates metadata to monitor the FAIRness of research software in the Life Sciences. We benchmarked multiple models against a human-annotated gold standard, examined their behavior on ambiguous cases, and introduced an agreement-based proxy for high-confidence automated decisions. The proxy achieved high precision and statistical robustness, while also highlighting the limitations of current models and the broader challenges of automating semantic judgment in FAIR-aligned software metadata across registries and repositories.
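
The abstract does not define the agreement-based proxy, so the sketch below assumes one simple reading: a case is decided automatically only when every model returns the same verdict, and the proxy is then scored against the gold standard by how many cases it covers and how often the shared verdict is right. The function name `evaluate_agreement_proxy`, the boolean "same entity" encoding, and both reported metrics are assumptions for illustration, not the paper's definitions.

```python
from typing import Sequence, Tuple


def evaluate_agreement_proxy(
    votes_per_case: Sequence[Sequence[bool]],
    gold: Sequence[bool],
) -> Tuple[float, float]:
    """Score a unanimity rule against gold labels (illustrative only).

    votes_per_case[i] holds each model's verdict for case i (True meaning
    "these records describe the same software"); gold[i] is the human label.
    Returns (coverage, accuracy_on_agreed):
      coverage           = share of cases where all models agree
      accuracy_on_agreed = share of those cases whose shared verdict is correct
    """
    if len(votes_per_case) != len(gold):
        raise ValueError("votes_per_case and gold must have the same length")
    agreed = 0
    correct = 0
    for votes, truth in zip(votes_per_case, gold):
        # Unanimous means non-empty and containing a single distinct verdict.
        if votes and len(set(votes)) == 1:
            agreed += 1
            if votes[0] == truth:
                correct += 1
    coverage = agreed / len(gold) if gold else 0.0
    accuracy_on_agreed = correct / agreed if agreed else 0.0
    return coverage, accuracy_on_agreed


if __name__ == "__main__":
    votes = [
        [True, True, True],
        [True, False, True],
        [False, False, False],
        [True, True, True],
    ]
    gold = [True, True, False, False]
    print(evaluate_agreement_proxy(votes, gold))
```

Cases without unanimity are left undecided here, which matches the idea of reserving automation for high-confidence decisions and deferring ambiguous cases to a human.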