https://arxiv.org/api/me6PFl3BtxOz8fISmMq6aIR1eJ8
2026-06-10T07:23:45Z
606113515

http://arxiv.org/abs/2604.19507v2
Market Dynamics, Governance and Open Research Metadata in the AI Era
2026-04-24T04:56:27Z
The debate about scholarly knowledge infrastructure has long been framed as a contest between openness and commercial enclosure. This framing distorts both policy and practice. The real tension lies between the persistent cost of producing and refining structured metadata under deep technological friction, and the differentiated demands distinct communities place on data quality, focus and granularity. We introduce the innovation annulus: the zone between freely available structured data and the advancing frontier of commercially refined knowledge products. This zone is a permanent, functional feature of the ecosystem -- not a pathology to eliminate. By analogy with the efficient market hypothesis, its width measures production inefficiency, set by the interplay of friction and demand. Artificial intelligence reshapes the annulus, lowering barriers to basic structuring, raising the threshold at which refinement adds value, and introducing systemic risks through unprovenanced AI-derived metadata. CRediT contributions, funding acknowledgements and AI disclosure statements illustrate the annulus lifecycle. Governance should calibrate the annulus, not abolish it: thin enough to serve research efficiently, wide enough to sustain innovation. A formal welfare framework, analogous to the Nordhaus optimal patent life, characterises the trade-offs and yields testable predictions. The Barcelona Declaration offers a promising forum for boundary governance.
2026-04-21T14:25:32Z
18 pages, 3 figures, minor changes and reference added
Daniel W. Hook

http://arxiv.org/abs/2604.08619v2
Doctoral Theses in France (1985-2025): A Linked Dataset of PhDs, Academic Networks, and Institutions
2026-04-23T16:46:11Z
This paper presents a comprehensive dataset of doctoral theses defended in France between 1985 and 2025, constructed from multiple national academic metadata sources. The dataset is primarily based on data from the French national thesis platform and is enriched using additional authority and bibliographic databases to improve data quality, completeness, and interoperability. The data production pipeline includes the aggregation of heterogeneous sources, the correction of inconsistent identifiers, the enrichment of person and institution records, and the construction of derived variables describing academic careers, jury participation, institutional affiliations, and thesis characteristics. Additional identifiers from major academic repositories and library catalogues are integrated to facilitate linkage with external data sources and future dataset extensions. The resulting dataset provides structured information at the thesis, individual, and institutional levels, enabling both descriptive and relational analyses. This resource is particularly suited for research on doctoral education, academic networks, supervision practices, jury composition, institutional collaboration, and the evolution of research communities over time. The paper documents the data sources, processing pipeline, feature construction, data quality issues, and limitations, with the objective of facilitating reuse of the dataset by other researchers and supporting future extensions and longitudinal analyses of the academic system.
2026-04-09T08:09:43Z
11 pages + 6 appendix pages, 7 figures, 2 tables. See https://doi.org/10.5281/zenodo.19453191 for the dataset. See https://github.com/WilliamAboucaya/phd-theses-france for the code to reproduce the dataset and figures. Version 2: Fixed references to tables and figures.
Modified unclear wording in section 3. Updated values in the languages table after a minor bug fix. Standardized figure style
William Aboucaya, Dastan Jasim

http://arxiv.org/abs/2510.04070v2
Markov kernels in Mathlib's probability library
2026-04-23T12:20:48Z
The probability folder of Mathlib, Lean's mathematical library, makes heavy use of Markov kernels. We present their definition and properties and describe the formalization of the disintegration theorem for Markov kernels. That theorem is used to define conditional probability distributions of random variables as well as posterior distributions. We then explain how Markov kernels are used in a more unusual way to get a common definition of independence and conditional independence and, following the same principles, to define sub-Gaussian random variables. Finally, we also discuss the role of kernels in our formalization of entropy and Kullback-Leibler divergence.
2025-10-05T07:22:17Z
33 pages
Rémy Degenne

http://arxiv.org/abs/2306.16191v2
OpenCitations Meta
2026-04-23T11:15:24Z
OpenCitations Meta is a new database for open bibliographic metadata of scholarly publications involved in the citations indexed by the OpenCitations infrastructure, adhering to Open Science principles and published under a CC0 license to promote maximum reuse. It presently incorporates bibliographic metadata for publications recorded in Crossref, DataCite and PubMed, making it the largest bibliographic metadata source using Semantic Web technologies. It assigns new globally persistent identifiers (PIDs), known as OpenCitations Meta Identifiers (OMIDs), to all bibliographic resources, enabling it both to disambiguate publications described using different external PIDs (e.g., a DOI in Crossref and a PMID in PubMed), and to handle citations involving publications lacking external PIDs.
By hosting bibliographic metadata internally, OpenCitations Meta eliminates its former reliance on API calls to external resources and thus enhances performance in response to user queries. Its automated data curation, following the OpenCitations Data Model, includes deduplication, error correction, metadata enrichment and full provenance tracking, ensuring transparency and traceability of data and bolstering confidence in data integrity, a feature unparalleled in other bibliographic databases. Its commitment to Semantic Web standards ensures superior interoperability compared to other machine-readable formats, with availability via a SPARQL endpoint, REST APIs and data dumps.
2023-06-28T13:15:02Z
31 pages, 8 figures
Quantitative Science Studies 2024. 5 (1) 50-75
Arcangelo Massari, Fabio Mariani, Ivan Heibi, Silvio Peroni, David Shotton
10.1162/qss_a_00292

http://arxiv.org/abs/2604.21150v1
The State of Scientific Poster Sharing and Reuse
2026-04-22T23:29:39Z
Scientific posters are one of the most common forms of scholarly communication and contain early-stage insights with potential to accelerate scientific discovery. We investigated where posters are shared, to what extent their sharing aligns with the FAIR principles, and how commonly they are reused. We identified 86 platforms hosting posters, with many not assigning persistent identifiers. A total of 150k posters are shared as of 2024 on the 43 platforms where we were able to count, which is relatively low. Looking in more detail at posters shared on Zenodo and Figshare, we found that repositories do not always support the structured metadata critical for poster discovery, such as conference information, and that researchers do not provide such metadata even when it is supported. We also observed that while there is some engagement with posters in terms of views and downloads, citing posters is not yet a common practice.
Our recommendations are for the scientific community to encourage poster sharing and reuse and establish clear guidelines to make posters FAIR.
2026-04-22T23:29:39Z
Aydan Gasimova, Paapa Mensah-Kane, Gerard F. Blake, Sanjay Soundarajan, James ONeill, Bhavesh Patel

http://arxiv.org/abs/2604.20548v1
Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies
2026-04-22T13:31:12Z
Scientific progress depends on the continual generation of innovative research ideas. However, the rapid growth of scientific literature has greatly increased the cost of knowledge filtering, making it harder for researchers to identify novel directions. Although existing large language model (LLM)-based methods show promise in research idea generation, the ideas they produce are often repetitive and lack depth. To address this issue, this study proposes a multi-agent iterative planning search strategy inspired by combinatorial innovation theory. The framework combines iterative knowledge search with an LLM-based multi-agent system to generate, evaluate, and refine research ideas through repeated interaction, with the goal of improving idea diversity and novelty. Experiments in the natural language processing domain show that the proposed method outperforms state-of-the-art baselines in both diversity and novelty. Further comparison with ideas derived from top-tier machine learning conference papers indicates that the quality of the generated ideas falls between that of accepted and rejected papers. These results suggest that the proposed framework is a promising approach for supporting high-quality research idea generation. The source code and dataset used in this paper are publicly available on GitHub: https://github.com/ChenShuai00/MAGenIdeas.
The demo is available at https://huggingface.co/spaces/cshuai20/MAGenIdeas.
2026-04-22T13:31:12Z
Scientometrics
Shuai Chen, Chengzhi Zhang

http://arxiv.org/abs/2604.20528v1
Evolution of Research Method Usage Across the Academic Careers of Library and Information Science Scholars
2026-04-22T13:06:58Z
Research methods constitute an indispensable tool for scholars engaged in scientific inquiry. Investigating how scholars use research methods throughout their careers can reveal distinct patterns in method adoption, providing valuable insights for novice researchers in selecting appropriate methods. This study employs a comprehensive dataset comprising full-text journal articles and bibliographic records from the Library and Information Science (LIS) domain. Utilizing an automated classification model based on full-text cognitive analysis, the research methods employed by LIS scholars are systematically identified. Topic modeling is then conducted using Top2Vec. Subsequently, author name disambiguation is performed, and academic age is calculated for each scholar. This study focuses on 435 senior scholars with an academic age of more than 14 years and a consistent publication record at five-year intervals, covering a total of 6,116 articles. The corpus covers 16 research method categories and 20 research topics. The findings indicate that bibliometric methods are the most frequently used across career stages, accounting for 19.61% among early-career scholars and 31.81% among senior scholars. Over the course of a scholarly career, the diversity of research methods initially increases and then declines. Furthermore, scholars exhibit a propensity for combining multiple research methods, including both conventional and unconventional pairings. Notably, the research methods most commonly used by researchers change with age and seniority.
2026-04-22T13:06:58Z
Scientometrics
Jiayi Hao, Chengzhi Zhang

http://arxiv.org/abs/2604.01965v2
Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models
2026-04-22T11:29:59Z
Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.
2026-04-02T12:28:51Z
Accepted at NSLP@LREC 2026
Florian Kelber, Matthias Jobst, Yuni Susanti, Michael Färber

http://arxiv.org/abs/2604.19578v1
Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI
2026-04-21T15:33:53Z
With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication.
The primary function of peer review is to improve the quality of academic manuscripts in aspects such as clarity and originality. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.
2026-04-21T15:33:53Z
Scientometrics
Wenqing Wu, Chengzhi Zhang, Yi Zhao, Tong Bao

http://arxiv.org/abs/2604.19505v1
Enhancing Unsupervised Keyword Extraction in Academic Papers through Integrating Highlights with Abstract
2026-04-21T14:22:21Z
Automatic keyword extraction from academic papers is a key area of interest in natural language processing and information retrieval.
Although previous research has mainly focused on utilizing the abstract and references for keyword extraction, this paper focuses on the highlights section - a summary describing the key findings and contributions, offering readers a quick overview of the research. Our observations indicate that highlights contain valuable keyword information that can effectively complement the abstract. To investigate the impact of incorporating highlights into unsupervised keyword extraction, we evaluate three input scenarios: using only the abstract, only the highlights, and a combination of both. Experiments conducted with four unsupervised models on Computer Science (CS) and Library and Information Science (LIS) datasets reveal that integrating the abstract with highlights significantly improves extraction performance. Furthermore, we examine the differences in keyword coverage and content between the abstract and highlights, exploring how these variations influence extraction outcomes. The data and code are available at https://github.com/xiangyi-njust/Highlight-KPE.
2026-04-21T14:22:21Z
Scientometrics
Yi Xiang, Chengzhi Zhang

http://arxiv.org/abs/2110.00601v2
Album: executable building blocks for scientific imaging routines, from sharing to LLM-assisted orchestration
2026-04-21T13:08:07Z
Open-source scientific software is a major driver of scientific progress, yet its development and reuse remain difficult in collaborative settings. Researchers repeatedly face four recurring challenges: discovering and reproducing existing routines, adapting them for new use cases, sharing and scaling them across collaborators, and stabilizing them with reproducible execution environments.
We present Album, an open-source framework for packaging and sharing scientific routines as executable artifacts through two minimal primitives: (i) the solution, a Python-native executable entry point that combines machine-readable metadata, arguments, environment specifications, and lifecycle hooks; and (ii) the catalog, a decentralized, git-native distribution mechanism with indexed search and optional web rendering for discovery, provenance, and governance. Album uses a two-context execution model in which a host controller evaluates manifests and prepares per-solution environments, while lifecycle hooks execute inside isolated solution environments. This design supports reproducible execution, post-environment setup, and the composition of routines with incompatible dependencies. Album can be used in conjunction with LLM agents: solutions can be drafted and revised with LLM assistance, and an MCP interface exposes cataloged solutions as callable tools for tool-grounded discovery and orchestration. We evaluate Album through four real-world imaging deployments spanning interactive visualization of electron microscopy data, integration of multiple segmentation methods, the orchestration of cryo-electron tomography competition workflows, and mineral quantification pipelines. Overall, Album complements package managers, workflow systems, and container runtimes by making scientific routines executable, shareable artifacts. Documentation and examples are available at https://album.solutions.
2021-10-01T18:16:35Z
38 pages, 7 figures
Jan Philipp Albrecht, Deborah Schmidt, Lucas Rieckert, Maximilian Otto, Kyle Harrington

http://arxiv.org/abs/2604.19396v1
Scientific tools and Innovation: Big Science Facilities Yield More Novel and Interdisciplinary Knowledge
2026-04-21T12:24:11Z
Scientific tools dictate the boundaries of human knowledge, serving as the foundation for perceptions and explorations.
In the era of Big Science, science is increasingly dependent on advanced analytical technologies and experimental platforms. Over the past decades, national and supranational entities have invested massive financial resources, collaborative networks, and collective intelligence to construct Big Science Facilities (BSFs) aimed at generating cutting-edge knowledge. However, empirical evaluations of these machines' actual performance in driving scientific innovation remain scarce. To address this gap, we collected 310,086 publications from 88 global BSFs and constructed a matched control dataset of approximately 3 million publications sharing the same last authors. Our analysis reveals that the utilization of BSFs has expanded significantly since the 1950s. Crucially, publications supported by these facilities exhibit higher recombinant novelty and interdisciplinary integration. Furthermore, this improvement is most pronounced in non-physical-science domains traditionally peripheral to BSFs' core focus, indicating the emergence of a powerful intra-facility knowledge spillover effect. By enriching the Facilitymetrics framework, our findings provide empirical evidence that BSFs act as vital engines for scientific discovery, offering policymakers essential metrics to justify infrastructural investments, while prompting the science of science community to reassess the profound impact of scientific tools on knowledge production.
2026-04-21T12:24:11Z
25 pages, 3 figures
Mingze Zhang, Yizhan Li, Yutong Li, Zexia Li

http://arxiv.org/abs/2604.18584v1
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
2026-04-20T17:59:49Z
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity.
We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts.
MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
2026-04-20T17:59:49Z
ICLR 2026; Website: http://mathnet.mit.edu
Proceedings of the International Conference on Learning Representations (ICLR), 2026
Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman, Antonio Torralba

http://arxiv.org/abs/2601.08841v2
Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents
2026-04-20T14:28:54Z
The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we investigate whether structured knowledge, specifically subject-predicate-object triples, improves clustering and classification of scientific papers. We present a modular pipeline that combines unsupervised clustering and supervised classification across four document representations: abstract, triples, abstract+triples, and hybrid. Using a filtered arXiv corpus, we evaluate four transformer embeddings (MiniLM, MPNet, SciBERT, SPECTER) with KMeans, GMM, and HDBSCAN, and then train downstream classifiers for subject prediction.
Across a five-seed benchmark (seeds 40-44), abstract-only inputs provide the strongest and most stable classification performance, reaching 0.923 accuracy and 0.923 macro-F1 (mean). Triple-only and knowledge-infused variants do not consistently outperform this baseline. In clustering, KMeans/GMM generally outperform HDBSCAN on external validity metrics, while HDBSCAN exhibits higher noise sensitivity. We observe that adding extracted triples naively does not guarantee gains and can reduce performance depending on representation choice.
These results refine the role of knowledge infusion in scientific document modeling: structured triples are informative but not universally beneficial, and their impact is strongly configuration-dependent. Our findings provide a reproducible benchmark and practical guidance for when knowledge-augmented representations help, and when strong text-only baselines remain preferable.
2025-12-19T20:17:34Z
Mihael Arcan

http://arxiv.org/abs/2604.18144v1
Self-referentiality and asymmetric knowledge flows between journals. The case of economics
2026-04-20T12:07:41Z
This paper investigates the evolution of self-referentiality and knowledge flows in economics journals before and after the 2008 financial crisis. Using a multi-level approach, we analyze patterns at the discipline, cluster, and journal levels, combining citational measures with a classification of journals based on intellectual similarity and social proximity. At the aggregate level, results suggest a general decline in self-referentiality, indicating increased openness across the discipline. However, this trend conceals substantial heterogeneity. At finer levels of analysis, two clusters - CORE and Finance - emerge as persistent outliers, exhibiting very high levels of self-referentiality. While Finance experienced a gradual reduction over time, the CORE shows increasing closure. By examining reference asymmetries, we uncover a hierarchical structure of knowledge flows. The CORE operates as a central hub and net exporter of knowledge to all other clusters, particularly to the traditional core fields of economics, whereas Finance acts as a net exporter only within its own domain and remains dependent on the CORE. These asymmetries are reinforced at the level of individual journals, where a small set of top journals occupies the apex of a hierarchically ordered system of knowledge transmission.
We argue that these patterns reflect the interplay between intellectual dynamics and organizational structures, particularly the role of editorial networks in shaping access to publication and visibility. The findings suggest that, following the financial crisis, economics has experienced a process of increasing epistemic and organizational closure at its core, alongside greater openness in peripheral areas. This dual dynamic raises questions about the representativeness of top journals and the evolving structure of the discipline.
2026-04-20T12:07:41Z
28 pages, 7 figures
Alberto Baccini, Carlo Debernardi