http://arxiv.org/abs/2604.19507v2 Market Dynamics, Governance and Open Research Metadata in the AI Era 2026-04-24T04:56:27Z

The debate about scholarly knowledge infrastructure has long been framed as a contest between openness and commercial enclosure. This framing distorts both policy and practice. The real tension lies between the persistent cost of producing and refining structured metadata under deep technological friction, and the differentiated demands distinct communities place on data quality, focus and granularity. We introduce the innovation annulus: the zone between freely available structured data and the advancing frontier of commercially refined knowledge products. This zone is a permanent, functional feature of the ecosystem -- not a pathology to eliminate. By analogy with the efficient market hypothesis, its width measures production inefficiency, set by the interplay of friction and demand. Artificial intelligence reshapes the annulus, lowering barriers to basic structuring, raising the threshold at which refinement adds value, and introducing systemic risks through unprovenanced AI-derived metadata. CRediT contributions, funding acknowledgements and AI disclosure statements illustrate the annulus lifecycle. Governance should calibrate the annulus, not abolish it: thin enough to serve research efficiently, wide enough to sustain innovation. A formal welfare framework, analogous to the Nordhaus optimal patent life, characterises the trade-offs and yields testable predictions. The Barcelona Declaration offers a promising forum for boundary governance.

2026-04-21T14:25:32Z 18 pages, 3 figures, minor changes and reference added Daniel W. Hook http://arxiv.org/abs/2604.08619v2 Doctoral Theses in France (1985-2025): A Linked Dataset of PhDs, Academic Networks, and Institutions 2026-04-23T16:46:11Z

This paper presents a comprehensive dataset of doctoral theses defended in France between 1985 and 2025, constructed from multiple national academic metadata sources. The dataset is primarily based on data from the French national thesis platform and is enriched using additional authority and bibliographic databases to improve data quality, completeness, and interoperability. The data production pipeline includes the aggregation of heterogeneous sources, the correction of inconsistent identifiers, the enrichment of person and institution records, and the construction of derived variables describing academic careers, jury participation, institutional affiliations, and thesis characteristics. Additional identifiers from major academic repositories and library catalogues are integrated to facilitate linkage with external data sources and future dataset extensions. The resulting dataset provides structured information at the thesis, individual, and institutional levels, enabling both descriptive and relational analyses. This resource is particularly suited for research on doctoral education, academic networks, supervision practices, jury composition, institutional collaboration, and the evolution of research communities over time. The paper documents the data sources, processing pipeline, feature construction, data quality issues, and limitations, with the objective of facilitating reuse of the dataset by other researchers and supporting future extensions and longitudinal analyses of the academic system.

2026-04-09T08:09:43Z 11 pages + 6 appendix pages, 7 figures, 2 tables. See https://doi.org/10.5281/zenodo.19453191 for the dataset. See https://github.com/WilliamAboucaya/phd-theses-france for the code to reproduce the dataset and figures Version 2: Fixed references to tables and figures. Modified unclear wordings in section 3. Updated values in the languages table after a minor bug fix. Standardized figures style William Aboucaya Dastan Jasim http://arxiv.org/abs/2510.04070v2 Markov kernels in Mathlib's probability library 2026-04-23T12:20:48Z

The probability folder of Mathlib, Lean's mathematical library, makes heavy use of Markov kernels. We present their definition and properties and describe the formalization of the disintegration theorem for Markov kernels. That theorem is used to define conditional probability distributions of random variables as well as posterior distributions. We then explain how Markov kernels are used in a more unusual way to get a common definition of independence and conditional independence and, following the same principles, to define sub-Gaussian random variables. Finally, we discuss the role of kernels in our formalization of entropy and Kullback-Leibler divergence.
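For readers unfamiliar with the terminology, the standard measure-theoretic definitions can be stated as follows. The notation is chosen here for illustration and is not taken from the paper or from Mathlib, and the disintegration statement is given only in its simplest form (for a probability measure on a product space, under a regularity assumption on the second factor).

```latex
A \emph{Markov kernel} from a measurable space $(X, \mathcal{A})$ to a
measurable space $(Y, \mathcal{B})$ is a map
$\kappa : X \times \mathcal{B} \to [0, 1]$ such that
\begin{itemize}
  \item for every $x \in X$, the map $B \mapsto \kappa(x, B)$ is a
        probability measure on $(Y, \mathcal{B})$;
  \item for every $B \in \mathcal{B}$, the map $x \mapsto \kappa(x, B)$ is
        $\mathcal{A}$-measurable.
\end{itemize}

\emph{Disintegration (simplest form).} Let $\rho$ be a probability measure on
$X \times Y$ with marginal $\rho_X$ on $X$, and assume $Y$ is a standard Borel
space. Then there exists a Markov kernel $\kappa$ from $X$ to $Y$ such that,
for all $A \in \mathcal{A}$ and $B \in \mathcal{B}$,
\[
  \rho(A \times B) = \int_A \kappa(x, B) \, \rho_X(\mathrm{d}x).
\]
```

Read this way, $\kappa(x, \cdot)$ plays the role of the conditional distribution of the second coordinate given that the first equals $x$.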

2025-10-05T07:22:17Z 33 pages Rémy Degenne http://arxiv.org/abs/2306.16191v2 OpenCitations Meta 2026-04-23T11:15:24Z

OpenCitations Meta is a new database for open bibliographic metadata of scholarly publications involved in the citations indexed by the OpenCitations infrastructure, adhering to Open Science principles and published under a CC0 license to promote maximum reuse. It presently incorporates bibliographic metadata for publications recorded in Crossref, DataCite and PubMed, making it the largest bibliographic metadata source using Semantic Web technologies. It assigns new globally persistent identifiers (PIDs), known as OpenCitations Meta Identifiers (OMIDs), to all bibliographic resources, enabling it both to disambiguate publications described using different external PIDs (e.g., a DOI in Crossref and a PMID in PubMed), and to handle citations involving publications lacking external PIDs. By hosting bibliographic metadata internally, OpenCitations Meta eliminates its former reliance on API calls to external resources and thus enhances performance in response to user queries. Its automated data curation, following the OpenCitations Data Model, includes deduplication, error correction, metadata enrichment and full provenance tracking, ensuring transparency and traceability of data and bolstering confidence in data integrity, a feature unparalleled in other bibliographic databases. Its commitment to Semantic Web standards ensures superior interoperability compared to other machine-readable formats, with availability via a SPARQL endpoint, REST APIs and data dumps.
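The idea of giving one internal identifier to records that share any external PID, while still covering records that have none, can be sketched with a small union-find. This is an illustrative toy only: the function name, the `local:` identifier format and the sample PIDs are assumptions made here and do not reflect the actual OpenCitations Meta pipeline or the real OMID syntax.

```python
def assign_internal_ids(records):
    """Give one internal ID to every group of records linked by a shared external PID.

    `records` is a list of sets of external identifier strings (possibly empty).
    Returns a list of internal IDs aligned with `records`.
    The "local:N" format is hypothetical, not the OMID format.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as representative for stable labels.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # external PID -> index of the first record carrying it
    for idx, pids in enumerate(records):
        for pid in pids:
            if pid in first_seen:
                union(idx, first_seen[pid])
            else:
                first_seen[pid] = idx

    labels = {}
    internal_ids = []
    for idx in range(len(records)):
        root = find(idx)
        if root not in labels:
            labels[root] = f"local:{len(labels) + 1}"
        internal_ids.append(labels[root])
    return internal_ids


# Toy input: the first two records share a DOI, the third has no external PID.
records = [
    {"doi:10.1000/example.1"},
    {"pmid:00000001", "doi:10.1000/example.1"},
    set(),
    {"pmid:00000002"},
]
ids = assign_internal_ids(records)
```

In this toy run the first two records end up with the same internal identifier, and the record without any external PID still receives its own.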

2023-06-28T13:15:02Z 31 pages, 8 figures Quantitative Science Studies 2024. 5 (1) 50-75 Arcangelo Massari Fabio Mariani Ivan Heibi Silvio Peroni David Shotton 10.1162/qss_a_00292 http://arxiv.org/abs/2604.21150v1 The State of Scientific Poster Sharing and Reuse 2026-04-22T23:29:39Z

Scientific posters are one of the most common forms of scholarly communication and contain early-stage insights with potential to accelerate scientific discovery. We investigated where posters are shared, to what extent their sharing aligns with the FAIR principles, and how commonly they are reused. We identified 86 platforms hosting posters, many of which do not assign persistent identifiers. On the 43 platforms where we were able to count, a total of 150k posters had been shared as of 2024, which is relatively low. Looking in more detail at posters shared on Zenodo and Figshare, we found that repositories do not always support structured metadata critical for poster discovery, such as conference information, and that researchers do not provide such metadata even when it is supported. We also observed that while there is some engagement with posters in terms of views and downloads, citing posters is not yet a common practice. Our recommendations are for the scientific community to encourage poster sharing and reuse and establish clear guidelines to make posters FAIR.

2026-04-22T23:29:39Z Aydan Gasimova Paapa Mensah-Kane Gerard F. Blake Sanjay Soundarajan James ONeill Bhavesh Patel http://arxiv.org/abs/2604.20548v1 Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies 2026-04-22T13:31:12Z

Scientific progress depends on the continual generation of innovative research ideas. However, the rapid growth of scientific literature has greatly increased the cost of knowledge filtering, making it harder for researchers to identify novel directions. Although existing large language model (LLM)-based methods show promise in research idea generation, the ideas they produce are often repetitive and lack depth. To address this issue, this study proposes a multi-agent iterative planning search strategy inspired by combinatorial innovation theory. The framework combines iterative knowledge search with an LLM-based multi-agent system to generate, evaluate, and refine research ideas through repeated interaction, with the goal of improving idea diversity and novelty. Experiments in the natural language processing domain show that the proposed method outperforms state-of-the-art baselines in both diversity and novelty. Further comparison with ideas derived from top-tier machine learning conference papers indicates that the quality of the generated ideas falls between that of accepted and rejected papers. These results suggest that the proposed framework is a promising approach for supporting high-quality research idea generation. The source code and dataset used in this paper are publicly available in a GitHub repository: https://github.com/ChenShuai00/MAGenIdeas. The demo is available at https://huggingface.co/spaces/cshuai20/MAGenIdeas.

2026-04-22T13:31:12Z Scientometrics Shuai Chen Chengzhi Zhang http://arxiv.org/abs/2604.20528v1 Evolution of Research Method Usage Across the Academic Careers of Library and Information Science Scholars 2026-04-22T13:06:58Z

Research methods constitute an indispensable tool for scholars engaged in scientific inquiry. Investigating how scholars use research methods throughout their careers can reveal distinct patterns in method adoption, providing valuable insights for novice researchers in selecting appropriate methods. This study employs a comprehensive dataset comprising full-text journal articles and bibliographic records from the Library and Information Science (LIS) domain. Utilizing an automated classification model based on full-text cognitive analysis, the research methods employed by LIS scholars are systematically identified. Topic modeling is then conducted using Top2Vec. Subsequently, author name disambiguation is performed, and academic age is calculated for each scholar. This study focuses on 435 senior scholars with an academic age of more than 14 years and a consistent publication record at five-year intervals, covering a total of 6,116 articles. The corpus covers 16 research method categories and 20 research topics. The findings indicate that bibliometric methods are the most frequently used across career stages, accounting for 19.61% among early-career scholars and 31.81% among senior scholars. Over the course of a scholarly career, the diversity of research methods initially increases and then declines. Furthermore, scholars exhibit a propensity for combining multiple research methods, including both conventional and unconventional pairings. Notably, the research methods most commonly used by researchers change with age and seniority.

2026-04-22T13:06:58Z Scientometrics Jiayi Hao Chengzhi Zhang http://arxiv.org/abs/2604.01965v2 Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models 2026-04-22T11:29:59Z

Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.

2026-04-02T12:28:51Z Accepted at NSLP@LREC 2026 Florian Kelber Matthias Jobst Yuni Susanti Michael Färber http://arxiv.org/abs/2604.19578v1 Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI 2026-04-21T15:33:53Z

With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is to improve the quality of academic manuscripts in aspects such as clarity and originality. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that may have been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.

2026-04-21T15:33:53Z Scientometrics Wenqing Wu Chengzhi Zhang Yi Zhao Tong Bao http://arxiv.org/abs/2604.19505v1 Enhancing Unsupervised Keyword Extraction in Academic Papers through Integrating Highlights with Abstract 2026-04-21T14:22:21Z

Automatic keyword extraction from academic papers is a key area of interest in natural language processing and information retrieval. Although previous research has mainly focused on utilizing the abstract and references for keyword extraction, this paper focuses on the highlights section - a summary describing the key findings and contributions, offering readers a quick overview of the research. Our observations indicate that highlights contain valuable keyword information that can effectively complement the abstract. To investigate the impact of incorporating highlights into unsupervised keyword extraction, we evaluate three input scenarios: using only the abstract, using only the highlights, and using a combination of both. Experiments conducted with four unsupervised models on Computer Science (CS) and Library and Information Science (LIS) datasets reveal that integrating the abstract with highlights significantly improves extraction performance. Furthermore, we examine the differences in keyword coverage and content between abstract and highlights, exploring how these variations influence extraction outcomes. The data and code are available at https://github.com/xiangyi-njust/Highlight-KPE.
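The three input scenarios can be illustrated with a minimal sketch. The frequency-based ranker below is a deliberately simple stand-in, not one of the four unsupervised models evaluated in the paper; the function names, stopword list and sample texts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on",
             "with", "we", "is", "are", "that", "this"}


def top_terms(text, k=5):
    """Rank single-word candidates by raw frequency (toy stand-in extractor)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def build_inputs(abstract, highlights):
    """Return the three input scenarios: abstract only, highlights only, both."""
    return {
        "abstract": abstract,
        "highlights": highlights,
        "abstract+highlights": abstract + " " + highlights,
    }


# Made-up sample texts for demonstration only.
abstract = "We study keyword extraction from academic papers using unsupervised ranking."
highlights = "Highlights complement the abstract. Keyword extraction improves with highlights."

results = {name: top_terms(text, k=3)
           for name, text in build_inputs(abstract, highlights).items()}
```

Swapping `top_terms` for any real unsupervised extractor leaves `build_inputs` unchanged, which is the point of the comparison: only the input text varies across scenarios.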

2026-04-21T14:22:21Z Scientometrics Yi Xiang Chengzhi Zhang http://arxiv.org/abs/2110.00601v2 Album: executable building blocks for scientific imaging routines, from sharing to LLM-assisted orchestration 2026-04-21T13:08:07Z

Open-source scientific software is a major driver of scientific progress, yet its development and reuse remain difficult in collaborative settings. Researchers repeatedly face four recurring challenges: discovering and reproducing existing routines, adapting them for new use cases, sharing and scaling them across collaborators, and stabilizing them with reproducible execution environments. We present Album, an open-source framework for packaging and sharing scientific routines as executable artifacts through two minimal primitives: (i) the solution, a Python-native executable entry point that combines machine-readable metadata, arguments, environment specifications, and lifecycle hooks; and (ii) the catalog, a decentralized, git-native distribution mechanism with indexed search and optional web rendering for discovery, provenance, and governance. Album uses a two-context execution model in which a host controller evaluates manifests and prepares per-solution environments, while lifecycle hooks execute inside isolated solution environments. This design supports reproducible execution, post-environment setup, and the composition of routines with incompatible dependencies. Album can be used in conjunction with LLM agents: solutions can be drafted and revised with LLM assistance, and an MCP interface exposes cataloged solutions as callable tools for tool-grounded discovery and orchestration. We evaluate Album through four real-world imaging deployments spanning interactive visualization of electron microscopy data, integration of multiple segmentation methods, the orchestration of cryo-electron tomography competition workflows, and mineral quantification pipelines. Overall, Album complements package managers, workflow systems, and container runtimes by making scientific routines executable, shareable artifacts. Documentation and examples are available at https://album.solutions.

2021-10-01T18:16:35Z 38 pages, 7 figures Jan Philipp Albrecht Deborah Schmidt Lucas Rieckert Maximilian Otto Kyle Harrington http://arxiv.org/abs/2604.19396v1 Scientific tools and Innovation: Big Science Facilities Yield More Novel and Interdisciplinary Knowledge 2026-04-21T12:24:11Z

Scientific tools dictate the boundaries of human knowledge, serving as the foundation for perceptions and explorations. In the era of Big Science, science is increasingly dependent on advanced analytical technologies and experimental platforms. Over the past decades, national and supranational entities have invested massive financial resources, collaborative networks, and collective intelligence to construct Big Science Facilities (BSFs) aimed at generating cutting-edge knowledge. However, empirical evaluations of these machines' actual performance in driving scientific innovation remain scarce. To address this gap, we collected 310,086 publications from 88 global BSFs and constructed a matched control dataset of approximately 3 million publications sharing the same last authors. Our analysis reveals that the utilization of BSFs has expanded significantly since the 1950s. Crucially, publications supported by these facilities exhibit higher recombinant novelty and interdisciplinary integration. Furthermore, this improvement is most pronounced in non-physical-science domains traditionally peripheral to BSFs' core focus, indicating the emergence of a powerful intra-facility knowledge spillover effect. By enriching the Facilitymetrics framework, our findings provide empirical evidence that BSFs act as vital engines for scientific discovery, offering policymakers essential metrics to justify infrastructural investments, while prompting the science of science community to reassess the profound impact of scientific tools on knowledge production.

2026-04-21T12:24:11Z 25 pages, 3 figures Mingze Zhang Yizhan Li Yutong Li Zexia Li http://arxiv.org/abs/2604.18584v1 MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval 2026-04-20T17:59:49Z

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.

2026-04-20T17:59:49Z ICLR 2026; Website: http://mathnet.mit.edu Proceedings of the International Conference on Learning Representations (ICLR), 2026 Shaden Alshammari Kevin Wen Abrar Zainal Mark Hamilton Navid Safaei Sultan Albarakati William T. Freeman Antonio Torralba http://arxiv.org/abs/2601.08841v2 Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents 2026-04-20T14:28:54Z

The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we investigate whether structured knowledge, specifically subject-predicate-object triples, improves clustering and classification of scientific papers. We present a modular pipeline that combines unsupervised clustering and supervised classification across four document representations: abstract, triples, abstract+triples, and hybrid. Using a filtered arXiv corpus, we evaluate four transformer embeddings (MiniLM, MPNet, SciBERT, SPECTER) with KMeans, GMM, and HDBSCAN, and then train downstream classifiers for subject prediction. Across a five-seed benchmark (seeds 40-44), abstract-only inputs provide the strongest and most stable classification performance, reaching 0.923 accuracy and 0.923 macro-F1 (mean). Triple-only and knowledge-infused variants do not consistently outperform this baseline. In clustering, KMeans/GMM generally outperform HDBSCAN on external validity metrics, while HDBSCAN exhibits higher noise sensitivity. We observe that naively adding extracted triples does not guarantee gains and can reduce performance depending on representation choice. These results refine the role of knowledge infusion in scientific document modeling: structured triples are informative but not universally beneficial, and their impact is strongly configuration-dependent. Our findings provide a reproducible benchmark and practical guidance for when knowledge-augmented representations help, and when strong text-only baselines remain preferable.

2025-12-19T20:17:34Z Mihael Arcan http://arxiv.org/abs/2604.18144v1 Self-referentiality and asymmetric knowledge flows between journals. The case of economics 2026-04-20T12:07:41Z

This paper investigates the evolution of self-referentiality and knowledge flows in economics journals before and after the 2008 financial crisis. Using a multi-level approach, we analyze patterns at the discipline, cluster, and journal levels, combining citational measures with a classification of journals based on intellectual similarity and social proximity. At the aggregate level, results suggest a general decline in self-referentiality, indicating increased openness across the discipline. However, this trend conceals substantial heterogeneity. At finer levels of analysis, two clusters - CORE and Finance - emerge as persistent outliers, exhibiting very high levels of self-referentiality. While Finance experienced a gradual reduction over time, the CORE shows increasing closure. By examining reference asymmetries, we uncover a hierarchical structure of knowledge flows. The CORE operates as a central hub and net exporter of knowledge to all other clusters, particularly to the traditional core fields of economics, whereas Finance acts as a net exporter only within its own domain and remains dependent on the CORE. These asymmetries are reinforced at the level of individual journals, where a small set of top journals occupies the apex of a hierarchically ordered system of knowledge transmission. We argue that these patterns reflect the interplay between intellectual dynamics and organizational structures, particularly the role of editorial networks in shaping access to publication and visibility. The findings suggest that, following the financial crisis, economics has experienced a process of increasing epistemic and organizational closure at its core, alongside greater openness in peripheral areas. This dual dynamic raises questions about the representativeness of top journals and the evolving structure of the discipline.
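To make the two quantities concrete, the sketch below computes a self-reference share and a pairwise net flow from a toy cluster-to-cluster citation table. The definitions used here (share of outgoing references that stay inside the cluster; references received minus references given) are simple assumptions for illustration and may differ from the measures used in the paper. The cluster labels and counts are invented.

```python
def self_reference_share(citations, cluster):
    """Share of a cluster's outgoing references that point back to itself."""
    outgoing = citations[cluster]
    total = sum(outgoing.values())
    return outgoing.get(cluster, 0) / total if total else 0.0


def net_export(citations, a, b):
    """References b gives to a, minus references a gives to b.

    A positive value means a is cited by b more than it cites b,
    i.e. a is a net exporter of knowledge to b under this toy definition.
    """
    return citations[b].get(a, 0) - citations[a].get(b, 0)


# citations[citing][cited] = number of references (invented toy counts).
citations = {
    "A": {"A": 80, "B": 15, "C": 5},
    "B": {"A": 40, "B": 50, "C": 10},
    "C": {"A": 30, "B": 20, "C": 50},
}

shares = {c: self_reference_share(citations, c) for c in citations}
a_to_b = net_export(citations, "A", "B")
a_to_c = net_export(citations, "A", "C")
```

In this toy table cluster A is both the most self-referential and a net exporter to the other two clusters, which is the kind of asymmetric pattern the abstract describes for a central hub.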

2026-04-20T12:07:41Z 28 pages, 7 figures Alberto Baccini Carlo Debernardi