http://arxiv.org/abs/2512.19675v1 Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918) 2025-12-22T18:53:03Z

We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. About 20 to 50 patent entries are embedded on each page, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of our primary source material suggests to us that multimodal LLMs are a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barriers for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.

2025-12-22T18:53:03Z Niclas Griesshaber Jochen Streb http://arxiv.org/abs/2503.13464v2 Mapping Research Data at the University of Bologna 2025-12-22T14:00:21Z

Research data management (RDM) strategies and practices play a pivotal role in adhering to the paradigms of reproducibility and transparency by enabling research sharing in accordance with the principles of Open Science. Discipline-specificity is an essential factor when understanding RDM declinations, to tailor a comprehensive support service and to enhance interdisciplinarity. In this paper we present the results of a mapping carried out to gather information on research data generated and managed within the University of Bologna (UniBO). The aim is to identify differences and commonalities between disciplines and potential challenges for institutional support. We analyzed the data management plans (DMPs) of European competitive projects drafted by researchers affiliated with UniBO. We applied descriptive statistics to the collected variables to answer three main questions: How diverse is the range of data managed within the University of Bologna? Which trends of problems and patterns in terms of data management can influence/improve data stewardship service? Is there an interdisciplinary approach to data production within the University? The research work evidenced many points of contact between different disciplines in terms of data produced, formats used and modest predilection for data reuse. Hot topics such as data confidentiality, needed either on privacy or intellectual property rights (IPR) premises, and long-term preservation pose challenges to all researchers. These results show an increasing attention to RDM while highlighting the relevance of training and support to face the relatively new challenges posed by this approach.

2025-02-26T11:41:58Z 32 pages, 12 figures Data Science Journal, 24(), p. 38 (2025) C. Basalti G. Caldoni S. Coppini B. Gualandi M. Marino F. Masini S. Peroni 10.5334/dsj-2025-038 http://arxiv.org/abs/2512.18979v1 Research on Novelty Measurement Indicator of Academic Papers Based on the Atypical Recombination of Knowledge 2025-12-22T02:42:48Z

The advancement of science is inherently dependent on the recombination of existing knowledge, and innovative research typically relies on the atypical recombination of established knowledge bases. This study introduces a Knowledge Eccentricity indicator to enable timely assessment of the novelty of research outputs by quantifying their degree of deviation from the existing knowledge system. For empirical analysis, we selected sample data including research articles published in Science and Nature, top 1% highly cited papers, and zero-cited papers for the years 2005, 2010, 2015, 2020, and 2025. We calculated the knowledge eccentricity scores for these papers and examined their potential influencing factors. The results indicate that team size exerts a significant negative effect on paper novelty, meaning that larger team size is less conducive to enhancing the novelty of research outputs. Conversely, the number of references shows a significant positive correlation with paper novelty, which means that a greater number of references is associated with a moderate improvement in a paper's novelty. The proposed indicator offers strong timeliness and operability, allowing for the evaluation of a paper's novelty immediately upon its publication.

2025-12-22T02:42:48Z in Chinese language Liang Guoqiang Sun Jian Lin Gege Zhang Shuo http://arxiv.org/abs/2512.18283v1 Improving Data Reusability in Interactive Information Retrieval: Insights from the Community 2025-12-20T09:12:33Z

In this study, we conducted semi-structured interviews with 21 IIR researchers to investigate their data reuse practices. This study aims to expand upon current findings by exploring IIR researchers' information-obtaining behaviors regarding data reuse. We identified the information about shared data characteristics that IIR researchers need when evaluating data reusability, as well as the sources they typically consult to obtain this information. We consider this work to be an initial step toward revealing IIR researchers' data reuse practices and identifying what the community needs to do to promote data reuse. We hope that this study, as well as future research, will inspire more individuals to contribute to ongoing efforts aimed at designing standards, infrastructures, and policies, as well as fostering a sustainable culture of data sharing and reuse in this field.

2025-12-20T09:12:33Z Accepted by CHIIR 2025 Tianji Jiang Wenqi Li Jiqun Liu http://arxiv.org/abs/2509.08713v2 The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems 2025-12-20T05:26:07Z

AI scientist systems, capable of autonomously executing the full research workflow from hypothesis generation and experimentation to paper writing, hold significant potential for accelerating scientific discovery. However, the internal workflows of these systems have not been closely examined. This lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. In this paper, we identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. To examine these risks, we design controlled experiments that isolate each failure mode while addressing challenges unique to evaluating AI scientist systems. Our assessment of two prominent open-source AI scientist systems reveals the presence of several failures, across a spectrum of severity, which can be easily overlooked in practice. Finally, we demonstrate that access to trace logs and code from the full automated workflow enables far more effective detection of such failures than examining the final paper alone. We thus recommend that journals and conferences evaluating AI-generated research mandate submission of these artifacts alongside the paper to ensure transparency, accountability, and reproducibility.

2025-09-10T16:04:24Z NeurIPS 2025 AI4Science (Spotlight) Ziming Luo Atoosa Kasirzadeh Nihar B. Shah http://arxiv.org/abs/2512.18122v1 Accelerating End-to-End PDF to Markdown Conversion Through Assisted Generation 2025-12-19T23:02:39Z

Converting data from machine-unreadable formats like PDFs into Markdown has the potential to enhance the accessibility of scientific research. Existing end-to-end decoder transformer models can transform screenshots of PDFs into Markdown, offering more flexibility than pipeline-based methods. Yet, decoding text token by token from scratch is inefficient, especially when dense text can be directly copied from the PDF. To address this challenge, this paper modifies Prompt Lookup Decoding (PLD) to extract candidate sequences directly from PDF files, leveraging the high n-gram overlap between PDFs and their Markdown equivalents. A new method, Copy Lookup Decoding (CLD), is introduced here to enhance PLD's candidate generation mechanism. Experiments demonstrate that CLD can accelerate the conversion process by up to 1.70× at original quality. The codebase for this paper is open-source on GitHub (https://github.com/Fireblossom/CopyLookup).
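
The n-gram lookup idea behind this kind of assisted generation can be illustrated with a minimal sketch. This is not the paper's CLD method or code. It is a simplified version of the general lookup step, assuming tokens are plain strings and that the text extracted from the PDF is already available as a token list. The function name `lookup_candidates` and its parameters are hypothetical. In a real decoder the proposed draft tokens would still be verified by the model before being accepted, and that verification step is omitted here.

```python
def lookup_candidates(generated, source, max_ngram=3, num_draft=5):
    """Propose draft tokens by matching the tail of `generated` inside `source`.

    generated: tokens produced so far (list of str)
    source:    tokens extracted from the source document (list of str)
    Returns up to `num_draft` tokens that follow the first match, or [].
    """
    # Try the longest tail n-gram first, then fall back to shorter ones.
    for n in range(min(max_ngram, len(generated)), 0, -1):
        tail = generated[-n:]
        # Scan the source for the first occurrence of the tail n-gram.
        for start in range(len(source) - n + 1):
            if source[start:start + n] == tail:
                continuation = source[start + n:start + n + num_draft]
                if continuation:
                    return continuation
    return []


# Hypothetical usage: the source text stands in for text copied from a PDF.
source_tokens = "the cat sat on the mat and slept".split()
draft = lookup_candidates(["the", "cat", "sat"], source_tokens, num_draft=3)
print(draft)
```

The sketch prefers longer matches because a longer matching context makes it more likely that the copied continuation is what the model would have generated anyway.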

2025-12-19T23:02:39Z Accepted NLDB 2025 Changxu Duan 10.1007/978-3-031-97141-9_3 http://arxiv.org/abs/2512.18115v1 Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown 2025-12-19T22:43:12Z

Academic documents stored in PDF format can be transformed into plain text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable and accessible to diverse usage, such as linguistic corpus compilation. Such documents, typically delivered in PDF format, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely laid-out text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models exhibit significant inefficiencies; their token-by-token decoding from scratch wastes many inference steps in regenerating dense text that could be directly copied from PDF files. To solve this problem, we introduce EditTrans, a hybrid editing-generation model whose features allow identifying a queue of to-be-edited text from a PDF before starting to generate markup language. EditTrans contains a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 pages of documents from arXiv. In our evaluations, EditTrans reduced the transformation latency by up to 44.5% compared to end-to-end decoder transformer models, while maintaining transformation quality. Our code and reproducible dataset production scripts are open-sourced.

2025-12-19T22:43:12Z Accepted ICDAR 2025 Changxu Duan 10.1007/978-3-032-04614-7_13 http://arxiv.org/abs/2504.14512v2 Revisiting the field normalization approaches/practices 2025-12-19T17:12:36Z

Field normalization plays a crucial role in scientometrics to ensure fair comparisons across different disciplines. In this paper, we revisit the effectiveness of several widely used field normalization methods. Our findings indicate that source-side normalization (as employed in SNIP) does not fully eliminate citation bias across different fields, and that imbalanced paper growth rates across fields are a key factor behind this phenomenon. To address the issue of skewness, logarithmic transformation has been applied. Recently, a combination of logarithmic transformation and mean-based normalization, expressed as ln(c+1)/mu, has gained popularity. However, our analysis shows that this approach does not yield satisfactory results. Instead, we find that combining logarithmic transformation (ln(c+1)) with z-score normalization provides a better alternative. Furthermore, our study suggests that better performance is achieved when combining both source-side and target-side field normalization methods.
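
As a rough illustration of the log-plus-z-score idea named in this abstract, the sketch below computes ln(c+1) for each paper and standardizes it within its field. It is a minimal sketch under assumptions, not the authors' implementation: the function name `log_zscore`, the dictionary input format, the use of the population standard deviation, and the handling of zero-variance fields are all choices made here for illustration.

```python
import math
from statistics import mean, pstdev


def log_zscore(citations_by_field):
    """Return {field: [z-scores]} of ln(c + 1), standardized within each field.

    citations_by_field: {field name: list of citation counts c >= 0}
    """
    normalized = {}
    for field, counts in citations_by_field.items():
        # The log transform reduces the skewness of citation counts.
        logs = [math.log(c + 1) for c in counts]
        mu = mean(logs)
        sigma = pstdev(logs)  # assumption: population standard deviation
        if sigma == 0:
            # Assumption: a field with identical counts maps to all zeros.
            normalized[field] = [0.0 for _ in logs]
        else:
            normalized[field] = [(x - mu) / sigma for x in logs]
    return normalized


# Hypothetical toy data: two fields with very different citation levels.
scores = log_zscore({"math": [0, 1, 3], "bio": [10, 20, 40]})
print(scores)
```

Because each field is centred and scaled separately, a score of 1.0 means "one standard deviation above the field's typical log-citation level" in either field, which is what makes cross-field comparison possible in this scheme.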

2025-04-20T07:10:32Z Xinyue Lu Li Li Zhesi Shen http://arxiv.org/abs/2512.22181v1 Interpretable Link Prediction in AI-Driven Cancer Research: Uncovering Co-Authorship Patterns 2025-12-19T16:25:16Z

Artificial intelligence (AI) is transforming cancer diagnosis and treatment. The intricate nature of this disease necessitates the collaboration of diverse stakeholders with varied expertise to ensure the effectiveness of cancer research. Despite its importance, forming effective interdisciplinary research teams remains challenging. Understanding and predicting collaboration patterns can help researchers, organizations, and policymakers optimize resources and foster impactful research. We examined co-authorship networks as a proxy for collaboration within AI-driven cancer research. Using 7,738 publications (2000-2017) from Scopus, we constructed 36 overlapping co-authorship networks representing new, persistent, and discontinued collaborations. We engineered both attribute-based and structure-based features and built four machine learning classifiers. Model interpretability was performed using Shapley Additive Explanations (SHAP). Random forest achieved the highest recall for all three types of examined collaborations. The discipline similarity score emerged as a crucial factor, positively affecting new and persistent patterns while negatively impacting discontinued collaborations. Additionally, high productivity and seniority were positively associated with discontinued links. Our findings can guide the formation of effective research teams, enhance interdisciplinary cooperation, and inform strategic policy decisions.

2025-12-19T16:25:16Z 24 pages Shahab Mosallaie Andrea Schiffauerova Ashkan Ebadi http://arxiv.org/abs/2512.16434v1 OpenAlex: Features, advantages and limitations of an open database for retrieving and analysing scholarly outputs 2025-12-18T11:37:39Z

OpenAlex is an open bibliographic database that has been proposed as an alternative to commercial platforms in a context defined by the aim of transforming science evaluation systems into more transparent sources based on open data. This paper analyses its features, information sources, entities, advantages and limitations. The results reveal numerous records lacking abstracts, affiliations and references; deficiencies in identifying document types and languages; and issues with authority control and versioning. Although OpenAlex has been adopted in important initiatives and has yielded results comparable to those obtained with commercial databases, gaps in its metadata and a lack of consistency point to a need for intensive data cleaning, suggesting it should be used with caution. The study concludes by identifying three lines of action to improve data quality: increasing publishers' commitment to completing metadata in primary sources; creating coordination structures to channel the contributions of institutional users; and endowing the project with sufficient human resources and reliable procedures to address internal quality control tasks and user support requests.

2025-12-18T11:37:39Z Ángel Borrego Cristóbal Urbano http://arxiv.org/abs/2508.00836v5 Rxiv-Maker: an automated template engine for streamlined scientific publications 2025-12-18T09:31:26Z

The rapid growth of preprint servers has accelerated scientific dissemination but has also shifted the technical burden of manuscript preparation to authors. This challenge is particularly acute in computational research, where manuscripts must remain synchronised with evolving data and code. We present Rxiv-Maker, a framework that resolves this by converting simple Markdown files into professionally typeset, publication-ready PDFs. Its core feature is the ability to execute embedded code, creating a self-updating manuscript where figures and statistical values are generated directly from source data during compilation. This ensures that the final document is always current and fully reproducible. By integrating with standard tools like Git and Visual Studio (VS) Code, Rxiv-Maker provides an efficient, transparent, and collaborative authoring experience, applying principles of software engineering to academic writing to foster open and verifiable science.

2025-06-26T03:28:28Z Bruno M. Saraiva António D. Brito Guillaume Jaquemet Ricardo Henriques http://arxiv.org/abs/2512.16339v1 Beyond openness: Inclusiveness and usability of Chinese scholarly data in OpenAlex 2025-12-18T09:27:12Z

OpenAlex, launched in 2022 as a fully open scholarly data source, promises greater inclusiveness compared to traditional proprietary databases. This study evaluates whether OpenAlex delivers on that promise by examining its coverage and metadata quality for Chinese-language journals and their articles. Using the 2023 edition of A Guide to the Core Journals of China (GCJC) and Wanfang Data as a benchmark, we analyze three aspects: (1) journal-level coverage, (2) article-level coverage, and (3) completeness and accuracy of metadata fields. Results show that OpenAlex indexes only 37% of GCJC journals and 24% of their articles, with substantial disciplinary and temporal variation. Metadata quality is uneven: while basic fields such as title and publication year are complete, bibliographic details, author affiliations, and cited references are frequently missing or inaccurate. DOI coverage is limited, and language information is often incorrect, with most Chinese-language articles labeled as English. These findings highlight significant challenges for achieving full inclusiveness and usability in research evaluation and related activities. We conclude with recommendations for improving data aggregation strategies, DOI registration practices, and metadata standardization to enhance the integration of local scholarly outputs into global open infrastructures.

2025-12-18T09:27:12Z Lin Zhang Zhe Cao Jianhua Liu Nees Jan van Eck http://arxiv.org/abs/2409.12177v2 LitFM: A Retrieval Augmented Structure-aware Foundation Model For Citation Graphs 2025-12-18T05:31:34Z

With the advent of large language models (LLMs), managing scientific literature via LLMs has become a promising direction of research. However, existing approaches often overlook the rich structural and semantic relevance among scientific literature, limiting their ability to discern the relationships between pieces of scientific knowledge, and suffer from various types of hallucinations. These methods also focus narrowly on individual downstream tasks, limiting their applicability across use cases. Here we propose LitFM, the first literature foundation model designed for a wide variety of practical downstream tasks on domain-specific literature, with a focus on citation information. At its core, LitFM contains a novel graph retriever to integrate graph structure by navigating citation graphs and extracting relevant literature, thereby enhancing model reliability. LitFM also leverages a knowledge-infused LLM, fine-tuned through a well-developed instruction paradigm. It enables LitFM to extract domain-specific knowledge from literature and reason about the relationships among them. By integrating citation graphs during both training and inference, LitFM can generalize to unseen papers and accurately assess their relevance within existing literature. Additionally, we introduce new large-scale literature citation benchmark datasets on three academic fields, featuring sentence-level citation information and local context. Extensive experiments validate the superiority of LitFM, achieving a 28.1% improvement in precision on the retrieval task, and an average improvement of 7.52% over state-of-the-art across six downstream literature-related tasks.

2024-09-05T22:26:21Z 12 pages, 8 figures Jiasheng Zhang Jialin Chen Ali Maatouk Ngoc Bui Qianqian Xie Leandros Tassiulas Jie Shao Hua Xu Rex Ying http://arxiv.org/abs/2512.22167v1 IANEC: Digital Forensic Investigation of Contemporary Writers' Archives 2025-12-17T10:00:27Z

The IANEC project (Investigation of Digital Archives of Contemporary Writers), led by the GREYC Research Lab and funded by the French Ministry of Culture, aims to develop dedicated digital forensic investigation tools to automate the analysis of archival corpora from the Institut Mémoires de l'Édition Contemporaine (IMEC). The project is based on the observation that born-digital archival materials are increasingly prevalent in contemporary archival institutions, and that digital forensics technologies have become essential for the extraction, identification, processing, and description of natively digital archival corpora.

2025-12-17T10:00:27Z in French language Emmanuel Giguet GREYC http://arxiv.org/abs/2512.22164v1 Artificial Intelligence Applications in Lean Startup Methodology: A Bibliometric Analysis of Research Trends and Future Directions 2025-12-17T06:40:57Z

This study presents a comprehensive bibliometric analysis of the emerging intersection between artificial intelligence (AI) and lean startup methodology. Using the PRISMA 2020 framework, we systematically analyzed 12 peer-reviewed articles published between 2010 and June 2025, sourced from the Scopus database. The analysis employed VOSviewer software to conduct co-authorship, keyword co-occurrence, and citation network analyses. Results reveal three distinct research clusters: operational integration of AI within startup experimentation processes, AI-enhanced learning systems for entrepreneurial contexts, and strategic implications of AI for uncertainty management in startups. The findings indicate a developing research domain characterized by fragmented authorship networks, limited international collaboration, and geographic concentration in developed economies, particularly the United States and Germany. Key research themes include business model innovation, iterative methods, and machine learning applications, with artificial intelligence serving as a bridging concept across thematic clusters. The analysis identifies significant research gaps in ethical considerations, cross-cultural validation, and empirical testing of AI-enabled lean startup frameworks. While current research demonstrates growing interest in AI integration within entrepreneurial experimentation, the field requires enhanced theoretical consolidation, methodological rigor, and interdisciplinary collaboration to achieve practical relevance and academic maturity. This study contributes to the emerging discourse on digital entrepreneurship by providing a systematic overview of research trends and identifying priority areas for future investigation at the intersection of AI and lean startup methodologies.

2025-12-17T06:40:57Z Revised version (19 pages, 10 figures, 4 tables) following minor journal-review comments; currently under journal review Parisa Omidmand Rasam Dorri Alireza Mozaffari Saeid Ataei