http://arxiv.org/abs/2510.04749v1 LLM-Based Information Extraction to Support Scientific Literature Research and Publication Workflows 2025-10-06T12:27:45Z

The increasing volume of scholarly publications requires advanced tools for efficient knowledge discovery and management. This paper introduces ongoing work on a system using Large Language Models (LLMs) for the semantic extraction of key concepts from scientific documents. Our research, conducted within the German National Research Data Infrastructure for and with Computer Science (NFDIxCS) project, seeks to support FAIR (Findable, Accessible, Interoperable, and Reusable) principles in scientific publishing. We outline our exploratory work, which uses in-context learning with various LLMs to extract concepts from papers, initially focusing on the Business Process Management (BPM) domain. A key advantage of this approach is its potential for rapid domain adaptation, often requiring few or even zero examples to define extraction targets for new scientific fields. We conducted technical evaluations to compare the performance of commercial and open-source LLMs and created an online demo application to collect feedback from an initial user study. Additionally, we gathered insights from the computer science research community through user stories collected during a dedicated workshop, actively guiding the ongoing development of our future services. These services aim to support structured literature reviews, concept-based information retrieval, and integration of extracted knowledge into existing knowledge graphs.

2025-10-06T12:27:45Z This PDF is the author-prepared camera-ready version corresponding to the accepted manuscript and supersedes the submitted version that was inadvertently published as the version of record New Trends in Theory and Practice of Digital Libraries. TPDL 2025. Communications in Computer and Information Science, vol 2694. pp 90-99 Samy Ateia Udo Kruschwitz Melanie Scholz Agnes Koschmider Moayad Almohaishi 10.1007/978-3-032-06136-2_9 http://arxiv.org/abs/2508.20115v2 Flexible metadata harvesting for ecology using large language models 2025-10-06T10:07:54Z

Large, open datasets can accelerate ecological research, particularly by enabling researchers to develop new insights by reusing datasets from multiple sources. However, to find the most suitable datasets to combine and integrate, researchers must navigate diverse ecological and environmental data provider platforms with varying metadata availability and standards. To overcome this obstacle, we have developed a large language model (LLM)-based metadata harvester that flexibly extracts metadata from any dataset's landing page, and converts these to a user-defined, unified format using existing metadata standards. We validate that our tool is able to extract both structured and unstructured metadata with equal accuracy, aided by our LLM post-processing protocol. Furthermore, we utilise LLMs to identify links between datasets, both by calculating embedding similarity and by unifying the formats of extracted metadata to enable rule-based processing. Our tool, which flexibly links the metadata of different datasets, can therefore be used for ontology creation or graph-based queries, for example, to find relevant ecological and environmental datasets in a virtual research environment.
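
The embedding-similarity linking mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.8 threshold, and the three-dimensional toy vectors are all assumptions made for illustration, and in practice the vectors would come from an embedding model applied to the harvested metadata.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_datasets(embeddings, threshold=0.8):
    """Return (name_a, name_b, score) for every pair at or above the threshold.

    `embeddings` maps a dataset name to its metadata embedding.
    The threshold value is a hypothetical choice, not one from the paper.
    """
    names = sorted(embeddings)
    links = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = cosine_similarity(embeddings[left], embeddings[right])
            if score >= threshold:
                links.append((left, right, score))
    return links


# Toy vectors standing in for real metadata embeddings (made-up values).
toy_embeddings = {
    "bird_counts": [0.9, 0.1, 0.0],
    "bird_survey": [0.8, 0.2, 0.1],
    "soil_ph": [0.0, 0.1, 0.9],
}

links = link_datasets(toy_embeddings)
for left, right, score in links:
    print(f"{left} <-> {right}: {score:.3f}")
```

With these toy vectors only the two bird datasets are linked; the soil dataset falls well below the threshold against both.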

2025-08-21T10:10:29Z Zehao Lu Thijs L van der Plas Parinaz Rashidi W Daniel Kissling Ioannis N Athanasiadis 10.1007/978-3-032-06136-2_32 http://arxiv.org/abs/2510.02743v1 Bi-National Academic Funding and Collaboration Dynamics: The Case of the German-Israeli Foundation 2025-10-03T06:05:26Z

Academic grant programs are widely used to motivate international research collaboration and boost scientific impact across borders. Among these, bi-national funding schemes, which pair researchers from just two designated countries, are common yet understudied compared with national and multinational funding. In this study, we explore whether bi-national programs genuinely foster new collaborations, high-quality research, and lasting partnerships. To this end, we conducted a bibliometric case study of the German-Israeli Foundation (GIF), covering 642 grants, 2,386 researchers, and 52,847 publications. Our results show that GIF funding catalyzes collaboration during, and even slightly before, the grant period, but rarely produces long-lasting partnerships that persist once the funding concludes. By tracing co-authorship before, during, and after the funding period, clustering collaboration trajectories with temporally aware K-means, and predicting cluster membership with ML models (best: XGBoost, 74% accuracy), we find that 45% of teams with no prior joint work become active while funded, yet activity declines symmetrically post-award; roughly one-third sustain collaboration longer-term, and a small subset achieve high, lasting output. Moreover, the teams' pre-grant scientometric profiles show no clear pattern that predicts long-term collaboration. This refines prior assumptions that international funding generally forges enduring networks. The results suggest that policy levers such as sequential funding, institutional anchoring (centers, shared infrastructure, mobility), and incentives favoring genuinely new pairings have the potential to convert short-term boosts into resilient scientific bridges and inform the design of bi-national science diplomacy instruments.

2025-10-03T06:05:26Z Amit Bengiat Teddy Lazebnik Philipp Mayr Ariel Rosenfeld http://arxiv.org/abs/2508.00827v4 Legal Knowledge Graph Foundations, Part I: URI-Addressable Abstract Works (LRMoo F1 to schema.org) 2025-10-02T15:15:20Z

Building upon a formal, event-centric model for the diachronic evolution of legal norms grounded in the IFLA Library Reference Model (LRMoo), this paper addresses the essential first step of publishing this model's foundational entity, the abstract legal Work (F1), on the Semantic Web. We propose a detailed, property-by-property mapping of the LRMoo F1 Work to the widely adopted schema.org/Legislation vocabulary. Using Brazilian federal legislation from the Normas.leg.br portal as a practical case study, we demonstrate how to create interoperable, machine-readable descriptions via JSON-LD, focusing on stable URN identifiers, core metadata, and norm relationships. This structured mapping establishes a stable, URI-addressable anchor for each legal norm, creating a verifiable "ground truth". It provides the essential, interoperable foundation upon which subsequent layers of the model, such as temporal versions (Expressions) and internal components, can be built. By bridging formal ontology with web-native standards, this work paves the way for building deterministic and reliable Legal Knowledge Graphs (LKGs), overcoming the limitations of purely probabilistic models.
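
To make the idea of a JSON-LD description using the schema.org Legislation vocabulary concrete, here is a minimal sketch. It is not the paper's property-by-property mapping: the helper name, the `https://example.org/norma/` base URL, and every field value (URN, name, type, date) are hypothetical placeholders. Only the schema.org property names (`legislationIdentifier`, `legislationType`, `legislationDate`, `legislationJurisdiction`, `name`) are taken from the public vocabulary.

```python
import json


def legislation_jsonld(urn, name, legislation_type, date, jurisdiction, base_url):
    """Build a schema.org Legislation description as a JSON-LD dictionary.

    The "@id" is formed by appending the URN to a base URL, giving each
    norm one stable, dereferenceable anchor. The base URL is an assumption.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Legislation",
        "@id": base_url + urn,
        "legislationIdentifier": urn,
        "name": name,
        "legislationType": legislation_type,
        "legislationDate": date,
        "legislationJurisdiction": jurisdiction,
    }


# All values below are illustrative placeholders, not data from the paper.
record = legislation_jsonld(
    urn="urn:lex:br:federal:lei:2002-01-10;10406",
    name="Example federal law",
    legislation_type="Lei",
    date="2002-01-10",
    jurisdiction="BR",
    base_url="https://example.org/norma/",
)

print(json.dumps(record, indent=2, ensure_ascii=False))
```

Keeping the URN in both `@id` and `legislationIdentifier` is a design choice of this sketch: the first gives the node a web address, the second preserves the identifier as plain data for consumers that do not resolve URIs.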

2025-05-12T15:11:11Z This version formalizes the LRMoo event-centric model for the legal lifecycle (enactment, publication). This provides a more precise and ontologically-grounded mapping to Schema.org, with a clearer case study and improved diagrams Hudson de Martim http://arxiv.org/abs/2510.01961v1 KTBox: A Modular LaTeX Framework for Semantic Color, Structured Highlighting, and Scholarly Communication 2025-10-02T12:32:01Z

The communication of technical insight in scientific manuscripts often relies on ad-hoc formatting choices, resulting in inconsistent visual emphasis and limited portability across document classes. This paper introduces ktbox, a modular LaTeX framework that unifies semantic color palettes, structured highlight boxes, taxonomy trees, and author metadata utilities into a coherent system for scholarly writing. The framework is distributed as a set of lightweight, namespaced components: ktcolor.sty for semantic palettes, ktbox.sty for structured highlight and takeaway environments, ktlrtree.sty for taxonomy trees with fusion and auxiliary annotations, and ktorcid.sty for ORCID-linked author metadata. Each component is independently usable yet interoperable, ensuring compatibility with major templates such as IEEEtran, acmart, iclr conference, and beamer. Key features include auto-numbered takeaway boxes, wide-format highlights, flexible taxonomy tree visualizations, and multi-column layouts supporting embedded tables, enumerations, and code blocks. By adopting a clear separation of concerns and enforcing a consistent naming convention under the kt namespace, the framework transforms visual styling from cosmetic add-ons into reproducible, extensible building blocks of scientific communication, improving clarity, portability, and authoring efficiency across articles, posters, and presentations.

2025-10-02T12:32:01Z 14 pages, 3 figures. First public release of the KTBox framework as a modular LaTeX package. Source code: https://github.com/mangalbhaskar/ktbox, CTAN: https://ctan.org/tex-archive/macros/latex/contrib/ktbox/. Planned to extend this work into a Q1 journal submission in the near future Bhaskar Mangal Ashutosh Bhatia Yashvardhan Sharma Kamlesh Tiwari Rashmi Verma http://arxiv.org/abs/2510.01593v1 Investigating Industry--Academia Collaboration in Artificial Intelligence: PDF-Based Bibliometric Analysis from Leading Conferences 2025-10-02T02:16:07Z

This study presents a bibliometric analysis of industry--academia collaboration in artificial intelligence (AI) research, focusing on papers from two major international conferences, AAAI and IJCAI, from 2010 to 2023. Most previous studies have relied on publishers and other databases to analyze bibliographic information. However, these databases have problems, such as missing articles and omitted metadata. Therefore, we adopted a novel approach to extract bibliographic information directly from the article PDFs: we examined 20,549 articles and identified the collaborative papers through a classification process of author affiliation. The analysis explores the temporal evolution of collaboration in AI, highlighting significant changes in collaboration patterns over the past decade. In particular, this study examines the role of key academic and industrial institutions in facilitating these collaborations, focusing on emerging global trends. Additionally, a content analysis using document classification was conducted to examine the type of first author in collaborative research articles and explore the potential differences between collaborative and noncollaborative research articles. The results showed that, in terms of publication, collaborations are mainly led by academia, but their content is not significantly different from that of others. The affiliation metadata are available at https://github.com/mm-doshisha/ICADL2024.

2025-10-02T02:16:07Z Accepted at ICADL2024 Kazuhiro Yamauchi Marie Katsurai http://arxiv.org/abs/2506.19065v2 LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR 2025-10-01T18:59:01Z

We propose Legato, a new end-to-end model for optical music recognition (OMR), a task of converting music score images to machine-readable documents. Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits the strong ability to generalize across various typeset scores. We conduct comprehensive experiments on a range of datasets and metrics and demonstrate that Legato outperforms the previous state of the art. On our most realistic dataset, we see a 68% and 47.6% absolute error reduction on the standard metrics TEDn and OMR-NED, respectively.

2025-06-23T19:35:59Z Guang Yang Victoria Ebert Nazif Tamer Brian Siyuan Zheng Luiza Pozzobon Noah A. Smith http://arxiv.org/abs/2510.03307v1 The QIC-Index: A Novel, Data-Centric Metric for Quantifying the Impact of Research Data Sharing 2025-09-30T19:05:31Z

We introduce the QIC-Index, a novel metric to address the failure of publication-centric metrics to value research data sharing. The QIC-Index quantifies the impact of individual data objects by calculating a score based on their Quality (Q), Impact (I), and Collaboration (C). By rewarding the sharing of high-quality, impactful, and collaborative data, our framework aligns individual incentives with the goals of open science and aims to foster a more transparent and efficient research culture.

2025-09-30T19:05:31Z Martin G. Frasch http://arxiv.org/abs/2503.13238v4 Arab Spring's Impact on Science through the Lens of Scholarly Attention, Funding, and Migration 2025-09-30T18:43:18Z

The 2010-2011 Arab Spring reverberated far beyond politics, reshaping how the Middle East and North Africa region (MENA) is studied. Analyzing 3.7 million Scopus-indexed articles published between 2002 and 2019, we find that mentions of ten of these countries in titles or abstracts rose significantly after 2011 relative to the global baseline, with Egypt receiving the greatest attention in the region. We link this surge to two intertwined mechanisms: an increase in research funding directed at the MENA region and the emigration of researchers who continued publishing on their countries of origin. Our analysis reveals that Saudi Arabia has emerged as a regional hub for studying the affected countries, attracting funding and scholars, and thereby playing a significant role in shaping the scientific narrative on the region. These findings demonstrate how political upheaval can reshape global knowledge flows by altering who studies whom, with what resources, and in which disciplines.

2025-03-17T14:51:44Z Yasaman Asgari Hongyu Zhou Ozgur Kadir Ozer Rezvaneh Rezapour Mary Ellen Sloane Alexandre Bovet http://arxiv.org/abs/2509.26001v1 First Workshop on Building Innovative Research Systems for Digital Libraries (BIRDS 2025) 2025-09-30T09:33:11Z

We propose the first workshop on Building Innovative Research Systems for Digital Libraries (BIRDS) to take place at TPDL 2025 as a full-day workshop. BIRDS addresses practitioners working in digital libraries and GLAMs as well as researchers from computational domains such as data science, information retrieval, natural language processing, and data modelling. Our interdisciplinary workshop focuses on connecting members of both worlds. One of today's biggest challenges is the increasing information flood. Large language models like ChatGPT seem to offer good performance for answering questions on the web. So, shall we just build upon that idea and use chatbots in digital libraries? Or do we need to design and develop specialized and effective access paths? Answering these questions requires connecting different communities: practitioners from real digital libraries and researchers in computer science. In brief, our workshop's goal is to support researchers and practitioners in building the next generation of innovative and effective digital library systems.

2025-09-30T09:33:11Z Workshop accepted at and held @ TPDL'25 in Tampere, Finland. Webpage: https://ws-birds.github.io/birds2025.github.io/ Christin Katharina Kreutz Hermann Kroll http://arxiv.org/abs/2509.24511v1 The Landscape of problematic papers in the field of non-coding RNA 2025-09-29T09:25:49Z

In recent years, the surge in retractions has been accompanied by numerous papers receiving comments that raise concerns about their reliability. The prevalence of problematic papers undermines the reliability of scientific research and threatens the foundation of evidence-based medicine. In this study, we focus on the field of non-coding RNA (ncRNA) as a case study to explore the typical characteristics of problematic papers from various perspectives, aiming to provide insights for addressing large-scale fraudulent publications. Research on under-investigated ncRNAs is more likely to yield problematic papers. These problematic papers often exhibit significant textual similarity, and many others sharing this similarity also display suspicious instances of image duplication. Healthcare institutions are particularly prone to publishing problematic papers, especially those with a low publication volume. Most problematic papers are found in a limited number of journals, and many journals inadequately address the commented papers. Our findings suggest that numerous problematic papers may still remain unidentified. The revealed characteristics offer valuable insights for formulating strategies to address the issue of fraudulent papers at scale.

2025-09-29T09:25:49Z 13 pages, 6 figures, 2 tables Ying Lou Zhengyi Zhou Guosheng Wang Zhesi Shen Menghui Li http://arxiv.org/abs/2509.24283v1 Overview of SCIDOCA 2025 Shared Task on Citation Prediction, Discovery, and Placement 2025-09-29T04:55:18Z

We present an overview of the SCIDOCA 2025 Shared Task, which focuses on citation discovery and prediction in scientific documents. The task is divided into three subtasks: (1) Citation Discovery, where systems must identify relevant references for a given paragraph; (2) Masked Citation Prediction, which requires selecting the correct citation for masked citation slots; and (3) Citation Sentence Prediction, where systems must determine the correct reference for each cited sentence. We release a large-scale dataset constructed from the Semantic Scholar Open Research Corpus (S2ORC), containing over 60,000 annotated paragraphs and a curated reference set. The test set consists of 1,000 paragraphs from distinct papers, each annotated with ground-truth citations and distractor candidates. A total of seven teams registered, with three submitting results. We report performance metrics across all subtasks and analyze the effectiveness of submitted systems. This shared task provides a new benchmark for evaluating citation modeling and encourages future research in scientific document understanding. The dataset and task materials are publicly available at https://github.com/daotuanan/scidoca2025-shared-task.

2025-09-29T04:55:18Z 16 pages, SCIDOCA 2025 An Dao Vu Tran Le-Minh Nguyen Yuji Matsumoto http://arxiv.org/abs/2509.22616v1 Metrics Over Merit: The Hidden Costs of Citation Impact in Research 2025-09-26T17:42:41Z

Once upon a time, scientists' worth was measured by their ideas, proofs, and perhaps how eloquently they debated Hilbert's problems at seminars. But now, citation metrics have come to center stage and handed us new masters: FWCI and CNCI. This paper critically, and with a touch of satire, examines how these seemingly objective metrics are shaping, and often distorting, the scientific landscape. Through examples and analysis, we highlight the consequences of relying too heavily on such indicators in evaluating researchers and scientific contributions.

2025-09-26T17:42:41Z 4 pages, 3 references, Math Intelligencer (2025) Vugar Ismailov 10.1007/s00283-025-10444-8 http://arxiv.org/abs/2403.16851v3 Can social media provide early warning of retraction? Evidence from critical tweets identified by human annotation and large language models 2025-09-25T14:58:23Z

Timely detection of problematic research is essential for safeguarding scientific integrity. To explore whether social media commentary can serve as an early indicator of potentially problematic articles, this study analysed 3,815 tweets referencing 604 retracted articles and 3,373 tweets referencing 668 comparable non-retracted articles. Tweets critical of the articles were identified through both human annotation and large language models (LLMs). Human annotation revealed that 8.3% of retracted articles were associated with at least one critical tweet prior to retraction, compared to only 1.5% of non-retracted articles, highlighting the potential of tweets as early warning signals of retraction. However, critical tweets identified by LLMs (GPT-4o mini, Gemini 2.0 Flash-Lite, and Claude 3.5 Haiku) only partially aligned with human annotation, suggesting that fully automated monitoring of post-publication discourse should be applied with caution. A human-AI collaborative approach may offer a more reliable and scalable alternative, with human expertise helping to filter out tweets critical of issues unrelated to the research integrity of the articles. Overall, this study provides insights into how social media signals, combined with generative AI technologies, may support efforts to strengthen research integrity.

2024-03-25T15:15:09Z 27 pages, 5 figures Journal of the Association for Information Science and Technology, 2025 Er-Te Zheng Hui-Zhen Fu Mike Thelwall Zhichao Fang 10.1002/asi.70028 http://arxiv.org/abs/2509.25237v1 Quantum est in Libris: Navigating Archives with GenAI, Uncovering Tension Between Preservation and Innovation 2025-09-25T09:58:09Z

"Quantum est in libris" explores the intersection of the archaic and the modern. On one side, there are manuscript materials from the Estonian National Museum's (ERM) more than century-old archive describing the life experiences of Estonian people; on the other side, there is technology that transforms these materials into a dynamic and interactive experience. Connecting technology and cultural heritage is the visitor, who turns texts into inputs for a screen sculpture. Historical narratives are visually brought to life through the contemporary technological language. Because the video AI models we employed, Runway Gen-3 and Gen-4, have not previously interacted with Estonian heritage, we can observe how machines today "read the world" and create future heritage. "Quantum est in libris" introduces an exciting yet unsettling new dimension to the concept of cultural heritage: in a world where data are fluid and interpretations unstable, heritage status becomes fragile. In the digital environment, heritage issues are no longer just about preservation and transmission, but also about representation of the media, machine creativity, and interpretive error. Who or what shapes memory processes and memory spaces, and how?

2025-09-25T09:58:09Z 5 pages, 4 figures, Mar Canet Sola Varvara Guljajeva