http://arxiv.org/abs/2506.20225v2
The role of preprints in open science: Accelerating knowledge transfer from science to technology
2025-06-26T06:08:57Z
Preprints have become increasingly essential in the landscape of open science, facilitating not only the exchange of knowledge within the scientific community but also bridging the gap between science and technology. However, the impact of preprints on technological innovation, given their unreviewed nature, remains unclear. This study fills this gap by conducting a comprehensive scientometric analysis of patent citations to bioRxiv preprints submitted between 2013 and 2021, measuring and assessing the contribution of preprints in accelerating knowledge transfer from science to technology. Our findings reveal a growing trend of patent citations to bioRxiv preprints, with a notable surge in 2020, primarily driven by the COVID-19 pandemic. Preprints play a critical role in accelerating innovation: they not only expedite the dissemination of scientific knowledge into technological innovation but also enhance the visibility of early research results in the patenting process, while journals remain essential for academic rigor and reliability. The substantial number of post-online-publication patent citations highlights the critical role of the open science model, particularly the "open access" effect of preprints, in amplifying the impact of science on technological innovation.
This study provides empirical evidence that open science policies encouraging the early sharing of research outputs, such as preprints, contribute to more efficient linkage between science and technology, suggesting an acceleration in the pace of innovation, higher innovation quality, and economic benefits.
2025-06-25T08:13:05Z
Accepted manuscript for publication in Journal of Informetrics. The final version is available at DOI:10.1016/j.joi.2025.101663
Journal of Informetrics (2025)
Zhiqi Wang, Yue Chen, Chun Yang
10.1016/j.joi.2025.101663

http://arxiv.org/abs/2506.20918v1
Metadata Enrichment of Long Text Documents using Large Language Models
2025-06-26T00:55:47Z
In this project, we semantically enriched and enhanced the metadata of long text documents, theses and dissertations, retrieved from the HathiTrust Digital Library in English published from 1920 to 2020 through a combination of manual efforts and large language models. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science. Our paper shows that enriching metadata using LLMs is particularly beneficial for digital repositories by introducing additional metadata access points that may not have originally been foreseen to accommodate various content types. This approach is particularly effective for repositories that have significant missing data in their existing metadata fields, enhancing search results and improving the accessibility of the digital repository.
2025-06-26T00:55:47Z
Manika Lamba, You Peng, Sophie Nikolov, Glen Layne-Worthey, J. Stephen Downie

http://arxiv.org/abs/2506.17508v2
Mapping the Evolution of Research Contributions using KnoVo
2025-06-25T06:22:45Z
This paper presents KnoVo (Knowledge Evolution), an intelligent framework designed for quantifying and analyzing the evolution of research novelty in the scientific literature.
Moving beyond traditional citation analysis, which primarily measures impact, KnoVo determines a paper's novelty relative to both prior and subsequent work within its multilayered citation network. Given a target paper's abstract, KnoVo utilizes Large Language Models (LLMs) to dynamically extract dimensions of comparison (e.g., methodology, application, dataset). The target paper is then compared to related publications along these same extracted dimensions. This comparative analysis, inspired by tournament selection, yields quantitative novelty scores reflecting the relative improvement, equivalence, or inferiority of the target paper in specific aspects. By aggregating these scores and visualizing their progression, for instance, through dynamic evolution graphs and comparative radar charts, KnoVo enables researchers not only to assess originality and identify similar work, but also to track knowledge evolution along specific research dimensions, uncover research gaps, and explore cross-disciplinary connections. We demonstrate these capabilities through a detailed analysis of 20 diverse papers from multiple scientific fields and report on the performance of various open-source LLMs within the KnoVo framework.
2025-06-20T23:17:11Z
Sajratul Y. Rubaiat, Syed N. Sakib, Hasan M. Jamil

http://arxiv.org/abs/2506.20141v1
Accept More, Reject Less: Reducing up to 19% Unnecessary Desk-Rejections over 11 Years of ICLR Data
2025-06-25T05:23:44Z
The explosive growth of AI research has driven paper submissions at flagship AI conferences to unprecedented levels, prompting many venues in 2025 (e.g., CVPR, ICCV, KDD, AAAI, IJCAI, WSDM) to enforce strict per-author submission limits and to desk-reject any excess papers by simple ID order. While this policy helps reduce reviewer workload, it may unintentionally discard valuable papers and penalize authors' efforts.
In this paper, we ask whether it is possible to follow submission limits while minimizing needless rejections. We first formalize the current desk-rejection policies as an optimization problem, and then develop a practical algorithm based on linear programming relaxation and a rounding scheme. Under extensive evaluation on 11 years of real-world ICLR (International Conference on Learning Representations) data, our method preserves up to 19.23% more papers without violating any author limits. Moreover, our algorithm is highly efficient in practice, with all results on ICLR data computed within at most 53.64 seconds. Our work provides a simple and practical desk-rejection strategy that significantly reduces unnecessary rejections, demonstrating strong potential to improve current CS conference submission policies.
2025-06-25T05:23:44Z
Xiaoyu Li, Zhao Song, Jiahao Zhang

http://arxiv.org/abs/2506.22497v1
Peer Review as Structured Commentary: Immutable Identity, Public Dialogue, and Reproducible Scholarship
2025-06-25T03:57:40Z
This paper reconceptualises peer review as structured public commentary. Traditional academic validation is hindered by anonymity, latency, and gatekeeping. We propose a transparent, identity-linked, and reproducible system of scholarly evaluation anchored in open commentary. Leveraging blockchain for immutable audit trails and AI for iterative synthesis, we design a framework that incentivises intellectual contribution, captures epistemic evolution, and enables traceable reputational dynamics.
This model empowers fields from computational science to the humanities, reframing academic knowledge as a living process rather than a static credential.
2025-06-25T03:57:40Z
66 pages, 0 figures, interdisciplinary framework, includes proposed architecture and metadata layer structures
Craig Steven Wright

http://arxiv.org/abs/2501.10326v2
Large language models for automated scholarly paper review: A survey
2025-06-24T16:45:26Z
Large language models (LLMs) have significantly impacted human society, influencing various domains. Among them, academia is not simply a domain affected by LLMs, but it is also the pivotal force in the development of LLMs. In academic publication, this phenomenon is reflected in the incorporation of LLMs into the peer review mechanism for reviewing manuscripts. LLMs hold transformative potential for the full-scale implementation of automated scholarly paper review (ASPR), but they also pose new issues and challenges that need to be addressed. In this survey paper, we aim to provide a holistic view of ASPR in the era of LLMs. We begin with a survey to find out which LLMs are used to conduct ASPR. Then, we review what ASPR-related technological bottlenecks have been solved with the incorporation of LLM technology. After that, we move on to explore new methods, new datasets, new source code, and new online systems that come with LLMs for ASPR. Furthermore, we summarize the performance and issues of LLMs in ASPR, and investigate the attitudes and reactions of publishers and academia to ASPR. Lastly, we discuss the challenges and future directions associated with the development of LLMs for ASPR. This survey serves as an inspirational reference for researchers and can promote the progress of ASPR toward its actual implementation.
2025-01-17T17:56:58Z
Please cite the version of Information Fusion
Information Fusion, Vol. 
124, 103332 (2025)
Zhenzhen Zhuang, Jiandong Chen, Hongfeng Xu, Yuwen Jiang, Jialiang Lin
10.1016/j.inffus.2025.103332

http://arxiv.org/abs/2406.19219v3
Metrics to Detect Small-Scale and Large-Scale Citation Orchestration
2025-06-24T16:16:32Z
Citation counts and related metrics have pervasive uses and misuses in academia and research appraisal, serving as scholarly influence and recognition measures. Hence, comprehending the citation patterns exhibited by authors is essential for assessing their research impact and contributions within their respective fields. Although the h-index, introduced by Hirsch in 2005, has emerged as a popular bibliometric indicator, it fails to account for the intricate relationships between authors and their citation patterns. This limitation becomes particularly relevant in cases where citations are strategically employed to boost the perceived influence of certain individuals or groups, a phenomenon that we term "orchestration". Orchestrated citations can introduce biases in citation rankings and therefore necessitate the identification of such patterns. Here, we use Scopus data to investigate orchestration of citations across all scientific disciplines. Orchestration could be small-scale, when the author him/herself and/or a small number of other authors use citations strategically to boost citation metrics like h-index; or large-scale, where extensive collaborations among many co-authors lead to high h-index for many/all of them. We propose three orchestration indicators: extremely low values in the ratio of citations over the square of the h-index (indicative of small-scale orchestration); extremely small number of authors who can explain at least 50% of an author's total citations (indicative of either small-scale or large-scale orchestration); and extremely large number of co-authors with more than 50 co-authored papers (indicative of large-scale orchestration).
The distributions, potential thresholds based on 1% (and 5%) percentiles, and insights from these indicators are explored and put into perspective across science.
2024-06-27T14:44:43Z
Iakovos Evdaimon, John P. A. Ioannidis, Giannis Nikolentzos, Michail Chatzianastasis, George Panagopoulos, Michalis Vazirgiannis

http://arxiv.org/abs/2506.19656v1
Video Compression for Spatiotemporal Earth System Data
2025-06-24T14:20:05Z
Large-scale Earth system datasets, from high-resolution remote sensing imagery to spatiotemporal climate model outputs, exhibit characteristics analogous to those of standard videos. Their inherent spatial, temporal, and spectral redundancies can thus be readily exploited by established video compression techniques. Here, we present xarrayvideo, a Python library for compressing multichannel spatiotemporal datasets by encoding them as videos. Our approach achieves compression ratios of up to 250x while maintaining high fidelity by leveraging standard, well-optimized video codecs through ffmpeg. We demonstrate the library's effectiveness on four real-world multichannel spatiotemporal datasets: DynamicEarthNet (very high resolution Planet images), DeepExtremeCubes (high resolution Sentinel-2 images), ERA5 (weather reanalysis data), and the SimpleS2 dataset (high resolution multichannel Sentinel-2 images), achieving Peak Signal-to-Noise Ratios (PSNRs) of 55.86, 40.60, 46.58, and 43.23 dB at 0.1 bits per pixel per band (bpppb) and 65.91, 54.28, 62.90, and 55.04 dB at 1 bpppb. We are redistributing two of these datasets, DeepExtremeCubes (2.3 Tb) and DynamicEarthNet (525 Gb), in the machine-learning-ready and cloud-ready TACO format through HuggingFace at significantly reduced sizes (270 Gb and 8.5 Gb, respectively) without compromising quality (PSNR 55.77-56.65 and 60.15).
No performance loss is observed when the compressed versions of these datasets are used in their respective deep learning-based downstream tasks (next step reflectance prediction and landcover segmentation). In conclusion, xarrayvideo presents an efficient solution for handling the rapidly growing size of Earth observation datasets, making advanced compression techniques accessible and practical to the Earth science community. The library is available for use at https://github.com/IPL-UV/xarrayvideo
2025-06-24T14:20:05Z
Oscar J. Pellicer-Valero, Cesar Aybar, Gustau Camps Valls

http://arxiv.org/abs/2506.18069v2
Unfolding the Past: A Comprehensive Deep Learning Approach to Analyzing Incunabula Pages
2025-06-24T11:19:13Z
We developed a proof-of-concept method for the automatic analysis of the structure and content of incunabula pages. A custom dataset comprising 500 annotated pages from five different incunabula was created using resources from the Jagiellonian Digital Library. Each page was manually labeled with five predefined classes: Text, Title, Picture, Table, and Handwriting. Additionally, the publicly available DocLayNet dataset was utilized as supplementary training data. To perform object detection, YOLO11n and YOLO11s models were employed and trained using two strategies: a combined dataset (DocLayNet and the custom dataset) and the custom dataset alone. The highest performance (F1 = 0.94) was achieved by the YOLO11n model trained exclusively on the custom data. Optical character recognition was then conducted on regions classified as Text, using both Tesseract and Kraken OCR, with Tesseract demonstrating superior results. Subsequently, image classification was applied to the Picture class using a ResNet18 model, achieving an accuracy of 98.7% across five subclasses: Decorative_letter, Illustration, Other, Stamp, and Wrong_detection. Furthermore, the CLIP model was utilized to generate semantic descriptions of illustrations.
The results confirm the potential of machine learning in the analysis of early printed books, while emphasizing the need for further advancements in OCR performance and visual content interpretation.
2025-06-22T15:33:20Z
10 pages, 8 figures; submitted to TPDL 2025; change in v2: updated e-mail address
Klaudia Ropel, Krzysztof Kutt, Luiz do Valle Miranda, Grzegorz J. Nalepa

http://arxiv.org/abs/2407.06976v3
Rich Interoperable Metadata for Cultural Heritage Projects at Jagiellonian University
2025-06-24T11:12:22Z
The rich metadata created nowadays for objects stored in libraries has nowhere to be stored, because core standards, namely MARC 21 and Dublin Core, are not flexible enough. The aim of this paper is to summarize our work-in-progress on tackling this problem in research on cultural heritage objects at the Jagiellonian University (JU). We compared the objects' metadata currently being collected at the JU (with examples of manuscript, placard, and obituary) with five widespread metadata standards used by the cultural heritage community: Dublin Core, EAD, MODS, EDM and Digital Scriptorium. Our preliminary results showed that mapping between them is indeed problematic, but we identified requirements that should be followed in further work on the JU cultural heritage metadata schema in order to achieve maximum interoperability. As we move forward, based on the successive versions of the conceptual model, we will conduct experiments to validate the practical feasibility of these mappings and the degree to which the proposed model will actually enable integration with data in these various metadata formats.
2024-07-09T15:52:06Z
10 pages; submitted to TPDL 2025; change in v2: heavily rewritten, new content added; change in v3: updated e-mail address
Luiz do Valle Miranda, Krzysztof Kutt, Elżbieta Sroka, Grzegorz J. 
Nalepa

http://arxiv.org/abs/2506.18517v1
Cost for research -- how cost data of research can be included in open metadata to be reused and evaluated
2025-06-23T11:18:05Z
The openCost project aims to enhance transparency in research funding by making publication-related costs publicly accessible, following FAIR principles. It introduces a metadata schema for cost data, allowing aggregation and analysis across institutions. The project promotes open access and cost-efficient models, benefiting academic institutions, funders, and policymakers.
2025-06-23T11:18:05Z
Julia Bartlewski, Christoph Broschinski, Gernot Deinzer, Cornelia Lang, Dirk Pieper, Bianca Schweighofer, Colin Sippl, Lisa-Marie Stein, Alexander Wagner, Silke Weisheit

http://arxiv.org/abs/2401.12739v3
Decoding University Hierarchy and Prestige in China through Domestic Ph.D. Hiring Network
2025-06-22T12:26:53Z
The academic job market for fresh Ph.D. students to pursue postdoctoral and junior faculty positions plays a crucial role in shaping the future orientations, developments, and status of the global academic system. In this work, we focus on the domestic Ph.D. hiring network among universities in China by exploring the doctoral education and academic employment of nearly 28,000 scientists across all Ph.D.-granting Chinese universities over three decades. We employ the minimum violation rankings algorithm to decode the rankings for universities based on the Ph.D. hiring network, which offers a deep understanding of the structure and dynamics within the network. Our results uncover a consistent, highly structured hierarchy within this hiring network, indicating imbalances wherein a limited number of universities serve as the main sources of fresh Ph.D. graduates across diverse disciplines. Furthermore, over time, it has become increasingly challenging for Chinese Ph.D. graduates to secure positions at institutions more prestigious than their alma maters.
This study quantitatively captures the evolving structure of talent circulation in the domestic environment, providing valuable insights to enhance the organization, diversity, and talent distribution in China's academic enterprise.
2024-01-23T13:15:43Z
Chaolin Tian, Xunyi Jiang, Yurui Huang, Langtian Ma, Yifang Ma

http://arxiv.org/abs/2411.12180v2
The Innovative Distinctiveness of Prizewinners and their Networks
2025-06-22T12:16:30Z
Science prizes purportedly reward innovation and explorations of new phenomena. Yet, in practice prizes may inadvertently divert resources from similarly impactful but less celebrated scholars. Despite this paradox, knowledge of how prizewinning relates to innovation is nascent even as prizes proliferate widely. Analyzing 2,460 worldwide prizes, we compared the innovativeness of over 23,000 prizewinners and matched non-prizewinners whose performance records were statistically equivalent up to the prize year. First, we find that prizewinners are more innovative. Their research is more likely to combine existing ideas in new ways, integrate a topic's historical and contemporary thinking, and incorporate interdisciplinary perspectives. Second, although prizewinners and matched non-prizewinners have statistically equivalent impact and productivity records up to the prize year, at about five years before the prize, prizewinners' papers become more innovative than their matched peers, a difference that widens each year, peaks during the prize year, and then persists for the remainder of their careers. Third, network embeddedness predicts unusual innovativeness. Compared to non-prizewinners, prizewinners' collaborations are shorter in duration, encompass wider exposure to unfamiliar topics, and involve coauthors whose networks minimally overlap with each other.
The implications of the findings for the efficacy of reward systems and innovation in science are discussed.
2024-11-19T02:48:14Z
Chaolin Tian, Yurui Huang, Ching Jin, Yifang Ma, Brian Uzzi

http://arxiv.org/abs/2506.17580v1
Context-Aware Scientific Knowledge Extraction on Linked Open Data using Large Language Models
2025-06-21T04:22:34Z
The exponential growth of scientific literature challenges researchers extracting and synthesizing knowledge. Traditional search engines return many sources without direct, detailed answers, while general-purpose LLMs may offer concise responses that lack depth or omit current information. LLMs with search capabilities are also limited by context window, yielding short, incomplete answers. This paper introduces WISE (Workflow for Intelligent Scientific Knowledge Extraction), a system addressing these limits by using a structured workflow to extract, refine, and rank query-specific knowledge. WISE uses an LLM-powered, tree-based architecture to refine data, focusing on query-aligned, context-aware, and non-redundant information. Dynamic scoring and ranking prioritize unique contributions from each source, and adaptive stopping criteria minimize processing overhead. WISE delivers detailed, organized answers by systematically exploring and synthesizing knowledge from diverse sources. Experiments on HBB gene-associated diseases demonstrate WISE reduces processed text by over 80% while achieving significantly higher recall over baselines like search engines and other LLM-based approaches. ROUGE and BLEU metrics reveal WISE's output is more unique than other systems, and a novel level-based metric shows it provides more in-depth information. We also explore how the WISE workflow can be adapted for diverse domains like drug discovery, material science, and social science, enabling efficient knowledge extraction and synthesis from unstructured scientific papers and web sources.
2025-06-21T04:22:34Z
Sajratul Y. Rubaiat, Hasan M. 
Jamil

http://arxiv.org/abs/2506.09530v2
Linking Data Citation to Repository Visibility: An Empirical Study
2025-06-20T13:13:47Z
In today's data-driven research landscape, dataset visibility and accessibility play a crucial role in advancing scientific knowledge. At the same time, data citation is essential for maintaining academic integrity, acknowledging contributions, validating research outcomes, and fostering scientific reproducibility. As a critical link, it connects scholarly publications with the datasets that drive scientific progress. This study investigates whether repository visibility influences data citation rates. We hypothesize that repositories with higher visibility, as measured by search engine metrics, are associated with increased dataset citations. Using OpenAlex data and repository impact indicators (including the visibility index from Sistrix, the h-index of repositories, and citation metrics such as mean and median citations), we analyze datasets in Social Sciences and Economics to explore their relationship. Our findings suggest that datasets hosted on more visible web domains tend to receive more citations, with a positive correlation observed between web domain visibility and dataset citation counts, particularly for datasets with at least one citation. However, when analyzing domain-level citation metrics, such as the h-index, mean, and median citations, the correlations are inconsistent and weaker. While higher visibility domains tend to host datasets with greater citation impact, the distribution of citations across datasets varies significantly. These results suggest that while visibility plays a role in increasing citation counts, it is not the sole factor influencing dataset citation impact. Other elements, such as dataset quality, research trends, and disciplinary norms, can also contribute to citation patterns.
2025-06-11T09:00:52Z
Fakhri Momeni, Janete Saldanha Bach, Brigitte Mathiak, Peter Mutschke