http://arxiv.org/abs/2506.20225v2 The role of preprints in open science: Accelerating knowledge transfer from science to technology 2025-06-26T06:08:57Z

Preprints have become increasingly essential in the landscape of open science, not only facilitating the exchange of knowledge within the scientific community but also bridging the gap between science and technology. However, the impact of preprints on technological innovation, given their unreviewed nature, remains unclear. This study fills this gap by conducting a comprehensive scientometric analysis of patent citations to bioRxiv preprints submitted between 2013 and 2021, measuring and assessing the contribution of preprints in accelerating knowledge transfer from science to technology. Our findings reveal a growing trend of patent citations to bioRxiv preprints, with a notable surge in 2020, primarily driven by the COVID-19 pandemic. Preprints play a critical role in accelerating innovation: they not only expedite the dissemination of scientific knowledge into technological innovation but also enhance the visibility of early research results in the patenting process, while journals remain essential for academic rigor and reliability. The substantial number of post-online-publication patent citations highlights the critical role of the open science model, particularly the "open access" effect of preprints, in amplifying the impact of science on technological innovation. This study provides empirical evidence that open science policies encouraging the early sharing of research outputs, such as preprints, contribute to more efficient linkage between science and technology, suggesting an acceleration in the pace of innovation, higher innovation quality, and economic benefits.

2025-06-25T08:13:05Z Accepted manuscript for publication in Journal of Informetrics. The final version is available at DOI:10.1016/j.joi.2025.101663 Journal of Informetrics (2025) Zhiqi Wang Yue Chen Chun Yang 10.1016/j.joi.2025.101663 http://arxiv.org/abs/2506.20918v1 Metadata Enrichment of Long Text Documents using Large Language Models 2025-06-26T00:55:47Z

In this project, we semantically enriched the metadata of long text documents, namely English-language theses and dissertations published from 1920 to 2020 and retrieved from the HathiTrust Digital Library, through a combination of manual effort and large language models. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science. Our paper shows that enriching metadata using LLMs is particularly beneficial for digital repositories because it introduces additional metadata access points that may not have originally been foreseen to accommodate various content types. This approach is particularly effective for repositories that have significant missing data in their existing metadata fields, enhancing search results and improving the accessibility of the digital repository.

2025-06-26T00:55:47Z Manika Lamba You Peng Sophie Nikolov Glen Layne-Worthey J. Stephen Downie http://arxiv.org/abs/2506.17508v2 Mapping the Evolution of Research Contributions using KnoVo 2025-06-25T06:22:45Z

This paper presents KnoVo (Knowledge Evolution), an intelligent framework designed for quantifying and analyzing the evolution of research novelty in the scientific literature. Moving beyond traditional citation analysis, which primarily measures impact, KnoVo determines a paper's novelty relative to both prior and subsequent work within its multilayered citation network. Given a target paper's abstract, KnoVo utilizes Large Language Models (LLMs) to dynamically extract dimensions of comparison (e.g., methodology, application, dataset). The target paper is then compared to related publications along these same extracted dimensions. This comparative analysis, inspired by tournament selection, yields quantitative novelty scores reflecting the relative improvement, equivalence, or inferiority of the target paper in specific aspects. By aggregating these scores and visualizing their progression, for instance, through dynamic evolution graphs and comparative radar charts, KnoVo helps researchers not only assess originality and identify similar work, but also track knowledge evolution along specific research dimensions, uncover research gaps, and explore cross-disciplinary connections. We demonstrate these capabilities through a detailed analysis of 20 diverse papers from multiple scientific fields and report on the performance of various open-source LLMs within the KnoVo framework.
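
The aggregation step described in this abstract can be illustrated with a minimal sketch. It assumes each pairwise comparison has already been reduced to an outcome of +1 (improvement), 0 (equivalence), or -1 (inferiority) on one dimension; the function name, the outcome encoding, and the averaging rule are hypothetical and are not taken from KnoVo itself.

```python
from collections import defaultdict


def novelty_scores(comparisons):
    """Average pairwise outcomes per dimension.

    `comparisons` is a list of (dimension, outcome) pairs, where outcome is
    +1 (target improves on the related paper), 0 (equivalent), or -1
    (target is inferior). This encoding is an assumption for illustration.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for dimension, outcome in comparisons:
        if outcome not in (-1, 0, 1):
            raise ValueError(f"unexpected outcome: {outcome!r}")
        totals[dimension] += outcome
        counts[dimension] += 1
    return {d: totals[d] / counts[d] for d in totals}


# Toy outcomes, as if an LLM had judged the target against related papers.
comparisons = [
    ("methodology", 1),
    ("methodology", 1),
    ("methodology", -1),
    ("dataset", 0),
    ("dataset", 1),
    ("application", -1),
]
scores = novelty_scores(comparisons)
overall = sum(scores.values()) / len(scores)  # unweighted mean over dimensions
```

Keeping the scores per dimension, rather than collapsing them immediately, is what would make radar-chart style comparisons possible.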

2025-06-20T23:17:11Z Sajratul Y. Rubaiat Syed N. Sakib Hasan M. Jamil http://arxiv.org/abs/2506.20141v1 Accept More, Reject Less: Reducing up to 19% Unnecessary Desk-Rejections over 11 Years of ICLR Data 2025-06-25T05:23:44Z

The explosive growth of AI research has driven paper submissions at flagship AI conferences to unprecedented levels, prompting many venues in 2025 (e.g., CVPR, ICCV, KDD, AAAI, IJCAI, WSDM) to enforce strict per-author submission limits and to desk-reject any excess papers by simple ID order. While this policy helps reduce reviewer workload, it may unintentionally discard valuable papers and penalize authors' efforts. In this paper, we ask whether it is possible to follow submission limits while minimizing needless rejections. We first formalize the current desk-rejection policies as an optimization problem, and then develop a practical algorithm based on linear programming relaxation and a rounding scheme. Under extensive evaluation on 11 years of real-world ICLR (International Conference on Learning Representations) data, our method preserves up to $19.23\%$ more papers without violating any author limits. Moreover, our algorithm is highly efficient in practice, with all results on ICLR data computed within at most 53.64 seconds. Our work provides a simple and practical desk-rejection strategy that significantly reduces unnecessary rejections, demonstrating strong potential to improve current CS conference submission policies.
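
To make the optimization problem concrete, here is a toy sketch that keeps as many papers as possible while no author exceeds the limit. It uses exhaustive search on a three-paper instance as a stand-in for the paper's linear programming relaxation and rounding, and the ID-order baseline below is one assumed reading of "reject excess papers by ID order"; all names are hypothetical.

```python
from itertools import combinations


def within_limits(kept, authors_of, limit):
    """True if no author appears on more than `limit` kept papers."""
    counts = {}
    for paper in kept:
        for author in authors_of[paper]:
            counts[author] = counts.get(author, 0) + 1
            if counts[author] > limit:
                return False
    return True


def id_order_baseline(authors_of, limit):
    """Assumed baseline: each author keeps their first `limit` papers by ID;
    a paper survives only if none of its authors drops it."""
    papers_by_author = {}
    for paper in sorted(authors_of):
        for author in authors_of[paper]:
            papers_by_author.setdefault(author, []).append(paper)
    rejected = set()
    for papers in papers_by_author.values():
        rejected.update(papers[limit:])
    return sorted(set(authors_of) - rejected)


def max_kept(authors_of, limit):
    """Exhaustive search for a largest feasible set (tiny inputs only)."""
    papers = sorted(authors_of)
    for size in range(len(papers), -1, -1):
        for kept in combinations(papers, size):
            if within_limits(kept, authors_of, limit):
                return list(kept)
    return []


# Paper ID -> set of authors; with a limit of 1, paper 2 conflicts with both others.
authors_of = {1: {"A"}, 2: {"A", "B"}, 3: {"B"}}
baseline = id_order_baseline(authors_of, limit=1)  # keeps only paper 1
optimal = max_kept(authors_of, limit=1)            # keeps papers 1 and 3
```

In this instance the baseline rejects paper 3 even though rejecting paper 2 already resolves both authors' excess, which is the kind of needless rejection the abstract refers to.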

2025-06-25T05:23:44Z Xiaoyu Li Zhao Song Jiahao Zhang http://arxiv.org/abs/2506.22497v1 Peer Review as Structured Commentary: Immutable Identity, Public Dialogue, and Reproducible Scholarship 2025-06-25T03:57:40Z

This paper reconceptualises peer review as structured public commentary. Traditional academic validation is hindered by anonymity, latency, and gatekeeping. We propose a transparent, identity-linked, and reproducible system of scholarly evaluation anchored in open commentary. Leveraging blockchain for immutable audit trails and AI for iterative synthesis, we design a framework that incentivises intellectual contribution, captures epistemic evolution, and enables traceable reputational dynamics. This model empowers fields from computational science to the humanities, reframing academic knowledge as a living process rather than a static credential.

2025-06-25T03:57:40Z 66 pages, 0 figures, interdisciplinary framework, includes proposed architecture and metadata layer structures Craig Steven Wright http://arxiv.org/abs/2501.10326v2 Large language models for automated scholarly paper review: A survey 2025-06-24T16:45:26Z

Large language models (LLMs) have significantly impacted human society, influencing various domains. Among them, academia is not simply a domain affected by LLMs; it is also a pivotal force in their development. In academic publication, this phenomenon is reflected in the incorporation of LLMs into the peer review mechanism for reviewing manuscripts. LLMs hold transformative potential for the full-scale implementation of automated scholarly paper review (ASPR), but they also pose new issues and challenges that need to be addressed. In this survey paper, we aim to provide a holistic view of ASPR in the era of LLMs. We begin with a survey to find out which LLMs are used to conduct ASPR. Then, we review which ASPR-related technological bottlenecks have been solved with the incorporation of LLM technology. After that, we move on to explore new methods, new datasets, new source code, and new online systems that come with LLMs for ASPR. Furthermore, we summarize the performance and issues of LLMs in ASPR, and investigate the attitudes and reactions of publishers and academia to ASPR. Lastly, we discuss the challenges and future directions associated with the development of LLMs for ASPR. This survey serves as an inspirational reference for researchers and can promote the progress of ASPR toward actual implementation.

2025-01-17T17:56:58Z Please cite the version of Information Fusion Information Fusion, Vol. 124, 103332 (2025) Zhenzhen Zhuang Jiandong Chen Hongfeng Xu Yuwen Jiang Jialiang Lin 10.1016/j.inffus.2025.103332 http://arxiv.org/abs/2406.19219v3 Metrics to Detect Small-Scale and Large-Scale Citation Orchestration 2025-06-24T16:16:32Z

Citation counts and related metrics have pervasive uses and misuses in academia and research appraisal, serving as scholarly influence and recognition measures. Hence, comprehending the citation patterns exhibited by authors is essential for assessing their research impact and contributions within their respective fields. Although the h-index, introduced by Hirsch in 2005, has emerged as a popular bibliometric indicator, it fails to account for the intricate relationships between authors and their citation patterns. This limitation becomes particularly relevant in cases where citations are strategically employed to boost the perceived influence of certain individuals or groups, a phenomenon that we term "orchestration". Orchestrated citations can introduce biases in citation rankings and therefore necessitate the identification of such patterns. Here, we use Scopus data to investigate orchestration of citations across all scientific disciplines. Orchestration could be small-scale, when the author him/herself and/or a small number of other authors use citations strategically to boost citation metrics like h-index; or large-scale, where extensive collaborations among many co-authors lead to high h-index for many/all of them. We propose three orchestration indicators: extremely low values in the ratio of citations over the square of the h-index (indicative of small-scale orchestration); extremely small number of authors who can explain at least 50% of an author's total citations (indicative of either small-scale or large-scale orchestration); and extremely large number of co-authors with more than 50 co-authored papers (indicative of large-scale orchestration). The distributions, potential thresholds based on 1% (and 5%) percentiles, and insights from these indicators are explored and put into perspective across science.
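
The three indicators named in this abstract can be sketched on toy data as follows. The input shapes, function names, and the greedy selection used for the second indicator are assumptions made for illustration; the study's exact computation on Scopus data may differ.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def citations_per_h_squared(citations):
    """Indicator 1: total citations divided by h^2 (None when h is 0)."""
    h = h_index(citations)
    return sum(citations) / (h * h) if h else None


def min_authors_explaining_half(citing_authors):
    """Indicator 2 (greedy approximation): how many citing authors are
    needed to account for at least 50% of the received citations.
    `citing_authors[i]` is the set of authors of the i-th citing record."""
    target = len(citing_authors) / 2
    uncovered = set(range(len(citing_authors)))
    covered, chosen = 0, []
    while covered < target:
        tally = {}
        for i in uncovered:
            for author in citing_authors[i]:
                tally[author] = tally.get(author, 0) + 1
        if not tally:
            break
        best = max(sorted(tally), key=tally.get)  # ties broken alphabetically
        newly = {i for i in uncovered if best in citing_authors[i]}
        uncovered -= newly
        covered += len(newly)
        chosen.append(best)
    return len(chosen)


def heavy_coauthors(papers_with, threshold=50):
    """Indicator 3: co-authors sharing more than `threshold` papers."""
    return sum(1 for n in papers_with.values() if n > threshold)


ratio = citations_per_h_squared([10, 8, 5, 4, 3])  # h = 4, so 30 / 16
explainers = min_authors_explaining_half(
    [{"x"}, {"x", "y"}, {"x"}, {"z"}, {"y"}, {"w"}]
)
heavy = heavy_coauthors({"p": 60, "q": 50, "r": 120})
```

Extreme values of each quantity would then be flagged against percentile thresholds, as the abstract describes.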

2024-06-27T14:44:43Z Iakovos Evdaimon John P. A. Ioannidis Giannis Nikolentzos Michail Chatzianastasis George Panagopoulos Michalis Vazirgiannis http://arxiv.org/abs/2506.19656v1 Video Compression for Spatiotemporal Earth System Data 2025-06-24T14:20:05Z

Large-scale Earth system datasets, from high-resolution remote sensing imagery to spatiotemporal climate model outputs, exhibit characteristics analogous to those of standard videos. Their inherent spatial, temporal, and spectral redundancies can thus be readily exploited by established video compression techniques. Here, we present xarrayvideo, a Python library for compressing multichannel spatiotemporal datasets by encoding them as videos. Our approach achieves compression ratios of up to 250x while maintaining high fidelity by leveraging standard, well-optimized video codecs through ffmpeg. We demonstrate the library's effectiveness on four real-world multichannel spatiotemporal datasets: DynamicEarthNet (very high resolution Planet images), DeepExtremeCubes (high resolution Sentinel-2 images), ERA5 (weather reanalysis data), and the SimpleS2 dataset (high resolution multichannel Sentinel-2 images), achieving Peak Signal-to-Noise Ratios (PSNRs) of 55.86, 40.60, 46.58, and 43.23 dB at 0.1 bits per pixel per band (bpppb) and 65.91, 54.28, 62.90, and 55.04 dB at 1 bpppb. We are redistributing two of these datasets, DeepExtremeCubes (2.3 Tb) and DynamicEarthNet (525 Gb), in the machine-learning-ready and cloud-ready TACO format through HuggingFace at significantly reduced sizes (270 Gb and 8.5 Gb, respectively) without compromising quality (PSNR 55.77-56.65 and 60.15). No performance loss is observed when the compressed versions of these datasets are used in their respective deep learning-based downstream tasks (next step reflectance prediction and landcover segmentation). In conclusion, xarrayvideo presents an efficient solution for handling the rapidly growing size of Earth observation datasets, making advanced compression techniques accessible and practical to the Earth science community. The library is available for use at https://github.com/IPL-UV/xarrayvideo
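
For readers unfamiliar with the two quantities reported here, the sketch below computes PSNR and bits per pixel per band in plain Python. It does not use the xarrayvideo API; the function names are hypothetical, and the bpppb definition (compressed bits divided by frames × height × width × bands) is an assumption about how the rate is normalized.

```python
import math


def psnr(original, decoded, max_value):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(MAX^2 / MSE)."""
    if not original or len(original) != len(decoded):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


def bits_per_pixel_per_band(compressed_bytes, frames, height, width, bands):
    """Compressed size in bits, normalized by the number of stored samples."""
    return compressed_bytes * 8 / (frames * height * width * bands)


# Toy 8-bit samples where every decoded value is off by one (MSE = 1).
quality_db = psnr([0, 50, 100, 150], [1, 49, 101, 149], max_value=255)

# Toy cube: 10 frames of 20 x 20 pixels with 4 bands, stored in 1000 bytes.
rate = bits_per_pixel_per_band(1000, frames=10, height=20, width=20, bands=4)
```

Reporting PSNR at a fixed bpppb, as the abstract does, makes results comparable across datasets with different resolutions and band counts.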

2025-06-24T14:20:05Z Oscar J. Pellicer-Valero Cesar Aybar Gustau Camps Valls http://arxiv.org/abs/2506.18069v2 Unfolding the Past: A Comprehensive Deep Learning Approach to Analyzing Incunabula Pages 2025-06-24T11:19:13Z

We developed a proof-of-concept method for the automatic analysis of the structure and content of incunabula pages. A custom dataset comprising 500 annotated pages from five different incunabula was created using resources from the Jagiellonian Digital Library. Each page was manually labeled with five predefined classes: Text, Title, Picture, Table, and Handwriting. Additionally, the publicly available DocLayNet dataset was utilized as supplementary training data. To perform object detection, YOLO11n and YOLO11s models were employed and trained using two strategies: a combined dataset (DocLayNet and the custom dataset) and the custom dataset alone. The highest performance (F1 = 0.94) was achieved by the YOLO11n model trained exclusively on the custom data. Optical character recognition was then conducted on regions classified as Text, using both Tesseract and Kraken OCR, with Tesseract demonstrating superior results. Subsequently, image classification was applied to the Picture class using a ResNet18 model, achieving an accuracy of 98.7% across five subclasses: Decorative_letter, Illustration, Other, Stamp, and Wrong_detection. Furthermore, the CLIP model was utilized to generate semantic descriptions of illustrations. The results confirm the potential of machine learning in the analysis of early printed books, while emphasizing the need for further advancements in OCR performance and visual content interpretation.
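
The staged pipeline in this abstract (detect regions, run OCR on Text, classify Picture regions, describe illustrations) can be sketched as a simple dispatcher. The callables passed in are hypothetical stand-ins for the YOLO, OCR, ResNet18, and CLIP components, and restricting descriptions to the Illustration subclass is an assumed reading of the abstract.

```python
def analyze_page(regions, ocr, classify_picture, describe):
    """Route each detected region to the step suited to its class.

    `regions` is a list of (label, crop) pairs as a detector might return;
    `ocr`, `classify_picture`, and `describe` are pluggable callables.
    """
    results = []
    for label, crop in regions:
        record = {"label": label}
        if label == "Text":
            record["text"] = ocr(crop)
        elif label == "Picture":
            subclass = classify_picture(crop)
            record["subclass"] = subclass
            if subclass == "Illustration":
                record["description"] = describe(crop)
        results.append(record)
    return results


# Stub components so the sketch runs without any model weights.
page = [("Title", "crop-0"), ("Text", "crop-1"), ("Picture", "crop-2")]
report = analyze_page(
    page,
    ocr=lambda crop: f"transcription of {crop}",
    classify_picture=lambda crop: "Illustration",
    describe=lambda crop: f"description of {crop}",
)
```

Passing the components in as arguments keeps the routing logic testable independently of any particular OCR engine or model.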

2025-06-22T15:33:20Z 10 pages, 8 figures; submitted to TPDL 2025; change in v2: updated e-mail address Klaudia Ropel Krzysztof Kutt Luiz do Valle Miranda Grzegorz J. Nalepa http://arxiv.org/abs/2407.06976v3 Rich Interoperable Metadata for Cultural Heritage Projects at Jagiellonian University 2025-06-24T11:12:22Z

The rich metadata created nowadays for objects stored in libraries has nowhere to be stored, because core standards, namely MARC 21 and Dublin Core, are not flexible enough. The aim of this paper is to summarize our work-in-progress on tackling this problem in research on cultural heritage objects at the Jagiellonian University (JU). We compared the objects' metadata currently being collected at the JU (with examples of manuscript, placard, and obituary) with five widespread metadata standards used by the cultural heritage community: Dublin Core, EAD, MODS, EDM and Digital Scriptorium. Our preliminary results showed that mapping between them is indeed problematic, but we identified requirements that should be followed in further work on the JU cultural heritage metadata schema in order to achieve maximum interoperability. As we move forward, based on the successive versions of the conceptual model, we will conduct experiments to validate the practical feasibility of these mappings and the degree to which the proposed model will actually enable integration with data in these various metadata formats.
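
The mapping problem described here can be illustrated with a small crosswalk sketch. The local field names and the example record are invented for illustration and are not taken from the JU schema; only the target element names (title, creator, date, description, language) come from Dublin Core.

```python
# Hypothetical crosswalk from local field names to Dublin Core elements.
CROSSWALK = {
    "title": "title",
    "author": "creator",
    "created": "date",
    "summary": "description",
    "language": "language",
}


def to_dublin_core(record, crosswalk=CROSSWALK):
    """Split a record into mapped Dublin Core values and unmapped leftovers."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            unmapped[field] = value  # no home in the target standard
        else:
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped


# Invented record for an obituary, one of the object types named above.
record = {
    "title": "Obituary notice",
    "author": "unknown",
    "created": "1893",
    "paper_watermark": "crowned eagle",
    "binding_condition": "damaged",
}
mapped, unmapped = to_dublin_core(record)
```

The `unmapped` dictionary is the point of the exercise: it lists the rich, object-specific fields that a flat standard has no place for.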

2024-07-09T15:52:06Z 10 pages; submitted to TPDL 2025; change in v2: heavily rewritten, new content added; change in v3: updated e-mail address Luiz do Valle Miranda Krzysztof Kutt Elżbieta Sroka Grzegorz J. Nalepa http://arxiv.org/abs/2506.18517v1 Cost for research -- how cost data of research can be included in open metadata to be reused and evaluated 2025-06-23T11:18:05Z

The openCost project aims to enhance transparency in research funding by making publication-related costs publicly accessible, following FAIR principles. It introduces a metadata schema for cost data, allowing aggregation and analysis across institutions. The project promotes open access and cost-efficient models, benefiting academic institutions, funders, and policymakers.

2025-06-23T11:18:05Z Julia Bartlewski Christoph Broschinski Gernot Deinzer Cornelia Lang Dirk Pieper Bianca Schweighofer Colin Sippl Lisa-Marie Stein Alexander Wagner Silke Weisheit http://arxiv.org/abs/2401.12739v3 Decoding University Hierarchy and Prestige in China through Domestic Ph.D. Hiring Network 2025-06-22T12:26:53Z

The academic job market in which fresh Ph.D. graduates pursue postdoctoral and junior faculty positions plays a crucial role in shaping the future orientations, developments, and status of the global academic system. In this work, we focus on the domestic Ph.D. hiring network among universities in China by exploring the doctoral education and academic employment of nearly 28,000 scientists across all Ph.D.-granting Chinese universities over three decades. We employ the minimum violation rankings algorithm to decode the rankings for universities based on the Ph.D. hiring network, which offers a deep understanding of the structure and dynamics within the network. Our results uncover a consistent, highly structured hierarchy within this hiring network, indicating imbalances in which a limited number of universities serve as the main sources of fresh Ph.D. graduates across diverse disciplines. Furthermore, over time, it has become increasingly challenging for Chinese Ph.D. graduates to secure positions at institutions more prestigious than their alma maters. This study quantitatively captures the evolving structure of talent circulation in the domestic environment, providing valuable insights to enhance the organization, diversity, and talent distribution in China's academic enterprise.
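
The idea behind a minimum violation ranking can be shown on a toy hiring network: find the ordering of universities under which the fewest hires move "upward", from a lower-ranked doctoral institution to a higher-ranked employer. The sketch below brute-forces all orderings, which only works for a handful of nodes; practical implementations rely on heuristic search, and the data and names here are invented.

```python
from itertools import permutations


def violations(ranking, hires):
    """Count hires whose employer outranks the doctoral institution.

    `ranking` lists universities from most to least prestigious;
    `hires` maps (phd_university, employer) to a number of hires.
    """
    position = {u: i for i, u in enumerate(ranking)}
    return sum(
        count
        for (phd, employer), count in hires.items()
        if position[employer] < position[phd]
    )


def minimum_violation_ranking(universities, hires):
    """Exhaustive search over orderings (feasible only for tiny networks)."""
    return min(
        permutations(sorted(universities)),
        key=lambda ranking: violations(ranking, hires),
    )


# Invented network: most hires flow from A to B to C, a few flow upward.
hires = {
    ("A", "B"): 5,
    ("A", "C"): 3,
    ("B", "C"): 4,
    ("C", "A"): 1,
    ("B", "A"): 1,
}
best = minimum_violation_ranking({"A", "B", "C"}, hires)
upward = violations(best, hires)
```

A small residual violation count under the best ordering is what signals a steep, consistent hierarchy.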

2024-01-23T13:15:43Z Chaolin Tian Xunyi Jiang Yurui Huang Langtian Ma Yifang Ma http://arxiv.org/abs/2411.12180v2 The Innovative Distinctiveness of Prizewinners and their Networks 2025-06-22T12:16:30Z

Science prizes purportedly reward innovation and explorations of new phenomena. Yet, in practice prizes may inadvertently divert resources from similarly impactful but less celebrated scholars. Despite this paradox, knowledge of how prizewinning relates to innovation is nascent even as prizes proliferate widely. Analyzing 2,460 worldwide prizes, we compared the innovativeness of over 23,000 prizewinners and matched non-prizewinners whose performance records were statistically equivalent up to the prize year. First, we find that prizewinners are more innovative. Their research is more likely to combine existing ideas in new ways, integrate a topic's historical and contemporary thinking, and incorporate interdisciplinary perspectives. Second, although prizewinners and matched non-prizewinners have statistically equivalent impact and productivity records up to the prize year, at about five years before the prize, prizewinners' papers become more innovative than their matched peers, a difference that widens each year, peaks during the prize year, and then persists for the remainder of their careers. Third, network embeddedness predicts unusual innovativeness. Compared to non-prizewinners, prizewinners' collaborations are shorter in duration, encompass wider exposure to unfamiliar topics, and involve coauthors whose networks minimally overlap with each other. The implications of the findings for the efficacy of reward systems and innovation in science are discussed.

2024-11-19T02:48:14Z Chaolin Tian Yurui Huang Ching Jin Yifang Ma Brian Uzzi http://arxiv.org/abs/2506.17580v1 Context-Aware Scientific Knowledge Extraction on Linked Open Data using Large Language Models 2025-06-21T04:22:34Z

The exponential growth of scientific literature challenges researchers seeking to extract and synthesize knowledge. Traditional search engines return many sources without direct, detailed answers, while general-purpose LLMs may offer concise responses that lack depth or omit current information. LLMs with search capabilities are also limited by their context window, yielding short, incomplete answers. This paper introduces WISE (Workflow for Intelligent Scientific Knowledge Extraction), a system addressing these limits by using a structured workflow to extract, refine, and rank query-specific knowledge. WISE uses an LLM-powered, tree-based architecture to refine data, focusing on query-aligned, context-aware, and non-redundant information. Dynamic scoring and ranking prioritize unique contributions from each source, and adaptive stopping criteria minimize processing overhead. WISE delivers detailed, organized answers by systematically exploring and synthesizing knowledge from diverse sources. Experiments on HBB gene-associated diseases demonstrate that WISE reduces processed text by over 80% while achieving significantly higher recall than baselines such as search engines and other LLM-based approaches. ROUGE and BLEU metrics reveal that WISE's output is more unique than that of other systems, and a novel level-based metric shows it provides more in-depth information. We also explore how the WISE workflow can be adapted for diverse domains like drug discovery, material science, and social science, enabling efficient knowledge extraction and synthesis from unstructured scientific papers and web sources.
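
Two of the ideas mentioned in this abstract, scoring sources by their unique contribution and stopping adaptively, can be sketched as below. This is not the WISE algorithm: representing extracted knowledge as sets of fact strings, the `min_gain` and `patience` parameters, and the function name are all assumptions made for illustration.

```python
def extract_with_adaptive_stopping(sources, min_gain=1, patience=2):
    """Process sources in order, scoring each by its not-yet-seen facts.

    Stops once `patience` consecutive sources each add fewer than
    `min_gain` new facts. Returns the sources ranked by unique
    contribution and the set of facts collected.
    """
    seen = set()
    scored = []
    low_streak = 0
    for name, facts in sources:
        new_facts = facts - seen
        scored.append((name, len(new_facts)))
        seen |= new_facts
        low_streak = low_streak + 1 if len(new_facts) < min_gain else 0
        if low_streak >= patience:
            break  # later sources are never read
    ranking = sorted(scored, key=lambda item: item[1], reverse=True)
    return ranking, seen


# Toy sources; "s5" is skipped because "s3" and "s4" added nothing new.
sources = [
    ("s1", {"a", "b", "c"}),
    ("s2", {"b", "d"}),
    ("s3", {"a"}),
    ("s4", {"c", "d"}),
    ("s5", {"e", "f"}),
]
ranking, facts = extract_with_adaptive_stopping(sources)
```

The example also shows the trade-off of early stopping: the unseen source "s5" held new facts, so the stopping thresholds control a balance between processing cost and recall.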

2025-06-21T04:22:34Z Sajratul Y. Rubaiat Hasan M. Jamil http://arxiv.org/abs/2506.09530v2 Linking Data Citation to Repository Visibility: An Empirical Study 2025-06-20T13:13:47Z

In today's data-driven research landscape, dataset visibility and accessibility play a crucial role in advancing scientific knowledge. At the same time, data citation is essential for maintaining academic integrity, acknowledging contributions, validating research outcomes, and fostering scientific reproducibility. As a critical link, it connects scholarly publications with the datasets that drive scientific progress. This study investigates whether repository visibility influences data citation rates. We hypothesize that repositories with higher visibility, as measured by search engine metrics, are associated with increased dataset citations. Using OpenAlex data and repository impact indicators (including the visibility index from Sistrix, the h-index of repositories, and citation metrics such as mean and median citations), we analyze datasets in Social Sciences and Economics to explore their relationship. Our findings suggest that datasets hosted on more visible web domains tend to receive more citations, with a positive correlation observed between web domain visibility and dataset citation counts, particularly for datasets with at least one citation. However, when analyzing domain-level citation metrics, such as the h-index, mean, and median citations, the correlations are inconsistent and weaker. While higher visibility domains tend to host datasets with greater citation impact, the distribution of citations across datasets varies significantly. These results suggest that while visibility plays a role in increasing citation counts, it is not the sole factor influencing dataset citation impact. Other elements, such as dataset quality, research trends, and disciplinary norms, can also contribute to citation patterns.
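
The kind of analysis described here, correlating domain visibility with dataset citation counts for datasets cited at least once, can be sketched with a rank correlation. The abstract does not state which correlation coefficient was used, so Spearman's rho is an assumption, and the numbers below are made up rather than drawn from the study.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up (visibility index, citation count) pairs, one per dataset.
datasets = [(0.5, 0), (1.2, 2), (3.4, 5), (8.0, 9), (2.0, 0), (15.0, 30)]
cited = [(v, c) for v, c in datasets if c >= 1]  # at least one citation
rho = spearman([v for v, _ in cited], [c for _, c in cited])
```

A rank-based coefficient is a common choice for citation data because citation counts are heavily skewed.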

2025-06-11T09:00:52Z Fakhri Momeni Janete Saldanha Bach Brigitte Mathiak Peter Mutschke