http://arxiv.org/abs/2511.01454v1 "Don't Teach Minerva": Guiding LLMs Through Complex Syntax for Faithful Latin Translation with RAG 2025-11-03T11:11:27Z

Translating a morphology-rich, low-resource language like Latin poses significant challenges. This paper introduces a reproducible draft-based refinement pipeline that elevates open-source Large Language Models (LLMs) to a performance level statistically comparable to top-tier proprietary systems. Our method first uses a fine-tuned NLLB-1.3B model to generate a high-quality, structurally faithful draft. A zero-shot LLM (Llama-3.3 or Qwen3) then polishes this draft, a process that can be further enhanced by augmenting the context with retrieved in-context examples (RAG). We demonstrate the robustness of this approach on two distinct benchmarks: a standard in-domain test set (Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of 12th-century Latin letters (2025). Our central finding is that this open-source RAG system achieves performance statistically comparable to the GPT-5 baseline, without any task-specific LLM fine-tuning. We release the pipeline, the Chartres OOD set, and evaluation scripts and models to facilitate replicability and further research.
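
The draft-then-refine flow described above can be sketched as follows. This is a minimal illustration, not the authors' released pipeline: the function names, the prompt wording, and the token-overlap retriever are all assumptions, and `draft_model` and `llm` are hypothetical callables standing in for the fine-tuned NLLB model and the zero-shot LLM.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (latin, english) example pair; hypothetical format


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased tokens (a stand-in for a real retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(source: str, memory: List[Pair], k: int = 2) -> List[Pair]:
    """Return the k pairs whose Latin side is most similar to the source."""
    ranked = sorted(memory, key=lambda p: token_overlap(source, p[0]), reverse=True)
    return ranked[:k]


def build_prompt(source: str, draft: str, examples: List[Pair]) -> str:
    """Assemble a refinement prompt from retrieved examples, source, and draft."""
    shots = "\n".join(f"Latin: {la}\nEnglish: {en}" for la, en in examples)
    return (
        "Reference translations:\n" + shots + "\n\n"
        f"Latin source: {source}\nDraft translation: {draft}\n"
        "Revise the draft so that it is faithful to the source."
    )


def translate(
    source: str,
    draft_model: Callable[[str], str],  # assumed: returns a draft translation
    llm: Callable[[str], str],          # assumed: returns the polished text
    memory: List[Pair],
    k: int = 2,
) -> str:
    """Stage 1 drafts, stage 2 refines with retrieved in-context examples."""
    draft = draft_model(source)
    examples = retrieve(source, memory, k)
    return llm(build_prompt(source, draft, examples))
```

Keeping the two stages behind plain callables means either model can be swapped without touching the retrieval or prompt code.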

2025-11-03T11:11:27Z Sergio Torres Aguilar http://arxiv.org/abs/2511.01439v1 Reforming research funding: Combining editorial preregistration with grant peer review 2025-11-03T10:47:36Z

Competitive grant funding is associated with high costs and a potential bias to favor conservative research. This comment proposes integrating editorial preregistration, in the form of registered reports, into grant peer review processes as a reform strategy. Linking funding decisions to in-principle accepted study protocols would reduce reviewer burden, strengthen methodological rigor, and provide an institutional foundation for (more) replication, theory-driven research, and high-risk research. Our proposal also minimizes strategic proposal writing and ensures scholarly output through the publication of preregistered protocols, regardless of funding outcomes. Possible implementation models include direct coupling of journal acceptance with funding, co-review mechanisms, voucher systems, and lotteries. While challenges remain in aligning journal and funding agency procedures, the integration of preregistration and funding offers a promising pathway toward a more transparent and efficient research ecosystem.

2025-11-03T10:47:36Z Lutz Bornmann Gerald Schweiger http://arxiv.org/abs/2511.01353v1 AI Literacy in UAE Libraries: Assessing Competencies, Training Needs, and Ethical Considerations for the Digital Age 2025-11-03T09:00:15Z

The study explores the current state of artificial intelligence (AI) literacy among library professionals, using a quantitative approach based on 92 surveys of LIS professionals in the United Arab Emirates (UAE). The findings reveal strong cognitive competencies but gaps in behavioral and normative competencies, especially those related to AI biases, AI-powered learning, and ethical considerations. The study also observed a disconnect between the perceived importance of AI skills and the effectiveness of current training programs.

2025-11-03T09:00:15Z This is the accepted manuscript version. The final published version will appear in College & Research Libraries, November 2026 Zafar Imam Khan http://arxiv.org/abs/2511.01942v1 Towards Defect Phase Diagrams: From Research Data Management to Automated Workflows 2025-11-03T07:39:32Z

Defect phase diagrams provide a unified description of crystal defect states for materials design and are central to the scientific objectives of the Collaborative Research Centre (CRC) 1394. Their construction requires the systematic integration of heterogeneous experimental and simulation data across research groups and locations. In this setting, research data management (RDM) is a key enabler of new scientific insight by linking distributed research activities and making complex data reproducible and reusable. To address the challenge of heterogeneous data sources and formats, a comprehensive RDM infrastructure has been established that links experiment, data, and analysis in a seamless workflow. The system combines: (1) a joint electronic laboratory notebook and laboratory information management system, (2) easy-to-use large-object data storage, (3) automatic metadata extraction from heterogeneous and proprietary file formats, (4) interactive provenance graphs for data exploration and reuse, and (5) automated reporting and analysis workflows. The two key technological elements are the openBIS electronic laboratory notebook and laboratory information management system, and a newly developed companion application that extends openBIS with large-scale data handling, automated metadata capture, and federated access to distributed research data. This integrated approach reduces friction in data capture and curation, enabling traceable and reusable datasets that accelerate the construction of defect phase diagrams across institutions.

2025-11-03T07:39:32Z Khalil Rejiba Sang-Hyeok Lee Christina Gasper Martina Freund Sandra Korte-Kerzel Ulrich Kerzel 10.1002/adem.202502882 http://arxiv.org/abs/2511.01113v1 S2Doc -- Spatial-Semantic Document Format 2025-11-02T23:06:03Z

Documents are a common way to store and share information, with tables being an important part of many documents. However, there is no real common understanding of how to model documents and tables in particular. Because of this lack of standardization, most scientific approaches have their own way of modeling documents and tables, leading to a variety of different data structures and formats that are not directly compatible. Furthermore, most data models focus on either the spatial or the semantic structure of a document, neglecting the other aspect. To address this, we developed S2Doc, a flexible data structure for modeling documents and tables that combines both spatial and semantic information in a single format. It is designed to be easily extendable to new tasks and supports most modeling approaches for documents and tables, including multi-page documents. To the best of our knowledge, it is the first approach of its kind to combine all these aspects in a single format.

2025-11-02T23:06:03Z 8 pages, 2 figures, submitted to LREC2026 Sebastian Kempf Frank Puppe http://arxiv.org/abs/2407.17032v4 Gymnasium: A Standard Interface for Reinforcement Learning Environments 2025-11-02T13:42:19Z

Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at https://github.com/Farama-Foundation/Gymnasium
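
The interface described above can be illustrated with a toy environment that follows Gymnasium's `reset`/`step` conventions: `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. The sketch deliberately does not import the library; `CoinGuessEnv` and `run_episode` are hypothetical names, and the real `Env.reset` also accepts an `options` argument that is omitted here.

```python
import random
from typing import Any, Callable, Dict, Optional, Tuple


class CoinGuessEnv:
    """Toy environment: guess a hidden coin (0 or 1) at every step."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self._rng = random.Random()
        self._steps = 0
        self._coin = 0

    def reset(self, seed: Optional[int] = None) -> Tuple[int, Dict[str, Any]]:
        """Start a new episode; a seed makes the coin sequence reproducible."""
        if seed is not None:
            self._rng = random.Random(seed)
        self._steps = 0
        self._coin = self._rng.randint(0, 1)
        return 0, {}  # observation (uninformative here), info

    def step(self, action: int) -> Tuple[int, float, bool, bool, Dict[str, Any]]:
        """Score the guess, then draw the next coin."""
        reward = 1.0 if action == self._coin else 0.0
        self._steps += 1
        self._coin = self._rng.randint(0, 1)
        terminated = False                         # no natural end state
        truncated = self._steps >= self.max_steps  # time limit reached
        return 0, reward, terminated, truncated, {}


def run_episode(env: Any, policy: Callable[[int], int], seed: Optional[int] = None) -> float:
    """Generic agent loop that works with any environment of this shape."""
    obs, info = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total
```

Because `run_episode` depends only on the two method signatures, the same loop can drive any environment that honors them, which is the interoperability the abstract refers to.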

2024-07-24T06:35:05Z Accepted at NeurIPS Datasets and Benchmarks 2025 Mark Towers Ariel Kwiatkowski Jordan Terry John U. Balis Gianluca De Cola Tristan Deleu Manuel Goulão Andreas Kallinteris Markus Krimmel Arjun KG Rodrigo Perez-Vicente Andrea Pierré Sander Schulhoff Jun Jet Tai Hannah Tan Omar G. Younis http://arxiv.org/abs/2510.27259v1 Research Output of Webology Journal (2013-2017): A Scientometric Analysis 2025-10-31T07:55:16Z

Webology is an international peer-reviewed English-language journal devoted to the field of the World Wide Web. It serves as a forum for discussion, experimentation, and new research in information dissemination and communication processes in general, and in the context of the World Wide Web in particular. This paper presents a scientometric analysis of Webology. It analyses the growth pattern of the research output published in the journal, the pattern of authorship, author productivity, and the subjects covered by the papers over the period 2013-2017. A total of 62 papers were published during the study period. Most articles were collaborative in nature. The journal's subject concentration was Social Networking/Web 2.0/Library 2.0 and Scientometrics or Bibliometrics. Iranian researchers contributed the largest share of articles (37.10%). The study applied standard formulae and statistical tools to derive its results.

2025-10-31T07:55:16Z 13 pages, 3 figures, Research Paper International Journal of Movement Education and Social Science; Volume 7 Issue 3; 2018 Muneer Ahmad M. Sadik Batcha Basharat Ahmad Wani Mohammad Idrees Khan S. Roselin Jahina http://arxiv.org/abs/2308.12883v2 Computational Dating for the Nuzi Cuneiform Archive: The Least Squares Constrained by Family Trees and Synchronisms 2025-10-31T05:33:23Z

We introduce a computational dating method for an archive from ancient Mesopotamia. We use the name index Nuzi Personal Names (NPN), published in 1943. We made an electronic version of NPN and added the kinships of two powerful families to reflect Nuzi studies after 1943. Nuzi was a town in Arrapha, attested for some five generations in the 15th-14th centuries B.C.E. The cuneiform tablets listed in NPN record contracts on land transactions, marriage, loans, slavery, etc. For each person, NPN lists the kinships and the cuneiform tablets (contracts, documents, texts) involved. We reconstruct family trees from the augmented NPN to formulate a least squares problem with constraints: a person's father is at least 22.5 years older than the person, contractors were living at the time of the contract, etc. Our results agree with the Assyriological results of M. P. Maidman on the seniority among siblings of a powerful family. Our method could be applied to other clay tablet archives once a name index in the NPN format is available.
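
To make the constrained least squares idea concrete, the sketch below adjusts rough birth-year estimates as little as possible (in the squared-error sense) so that every child is born at least 22.5 years after the father. It is an illustration under stated assumptions, not the authors' formulation: the objective, the function name, and the use of Dykstra's projection algorithm are choices made here, and only the father-child constraint from the abstract is modeled.

```python
from typing import List, Sequence, Tuple


def constrained_birth_years(
    estimates: Sequence[float],
    parent_child: Sequence[Tuple[int, int]],  # (father index, child index)
    min_gap: float = 22.5,
    sweeps: int = 500,
) -> List[float]:
    """Least squares adjustment of birth years subject to generation gaps.

    Minimizes sum((x[i] - estimates[i])**2) subject to
    x[child] - x[father] >= min_gap, using Dykstra's algorithm
    (cyclic half-space projections with correction terms).
    """
    x = list(estimates)
    corr = [[0.0] * len(x) for _ in parent_child]  # one correction per constraint
    for _ in range(sweeps):
        for i, (p, c) in enumerate(parent_child):
            y = [xj + cj for xj, cj in zip(x, corr[i])]
            violation = min_gap - (y[c] - y[p])
            z = list(y)
            if violation > 0:
                # Project onto the half-space: split the shortfall evenly.
                z[p] -= violation / 2
                z[c] += violation / 2
            corr[i] = [yj - zj for yj, zj in zip(y, z)]
            x = z
    return x
```

For example, with estimates of 0, 10, and 20 for a grandfather, father, and son, both gaps fall short of 22.5 years, and the adjustment pushes the grandfather earlier and the son later while leaving the father in place.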

2023-08-23T07:59:25Z Sumie Ueda Takashi Tsuchiya Yoshiaki Itoh http://arxiv.org/abs/2510.25718v1 Retrieval-Augmented Search for Large-Scale Map Collections with ColPali 2025-10-29T17:27:21Z

Multimodal approaches have shown great promise for searching and navigating digital collections held by libraries, archives, and museums. In this paper, we introduce map-RAS: a retrieval-augmented search system for historic maps. In addition to introducing our framework, we detail our publicly-hosted demo for searching 101,233 map images held by the Library of Congress. With our system, users can multimodally query the map collection via ColPali, summarize search results using Llama 3.2, and upload their own collections to perform inter-collection search. We articulate potential use cases for archivists, curators, and end-users, as well as future work with our system in both machine learning and the digital humanities. Our demo can be viewed at: http://www.mapras.com.
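
ColPali-style retrievers compare a query and a page image through late interaction: each query token embedding is matched to its best-scoring image patch embedding, and the per-token maxima are summed. The abstract does not describe the internals of map-RAS, so the following is a generic sketch of that scoring rule with hand-written vectors; `maxsim_score` and `rank` are hypothetical names.

```python
from typing import Dict, List, Sequence

Vec = Sequence[float]


def dot(a: Vec, b: Vec) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vec], doc_vecs: Sequence[Vec]) -> float:
    """Late-interaction score: sum over query tokens of the best patch match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs: Sequence[Vec], collection: Dict[str, Sequence[Vec]]) -> List[str]:
    """Order document names by descending late-interaction score."""
    return sorted(
        collection,
        key=lambda name: maxsim_score(query_vecs, collection[name]),
        reverse=True,
    )
```

In a real system the vectors would come from the model's text and image encoders, and the collection would be indexed rather than scanned linearly.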

2025-10-29T17:27:21Z 5 pages, 5 figures Jamie Mahowald Benjamin Charles Germain Lee http://arxiv.org/abs/2510.25283v1 Measuring the Research Output and Performance of the University of Ibadan from 2014 to 2023: A Scientometric Analysis 2025-10-29T08:39:36Z

This study employs scientometric methods to assess the research output and performance of the University of Ibadan from 2014 to 2023. By analyzing publication trends, citation patterns, and collaboration networks, the research aims to comprehensively evaluate the university's research productivity, impact, and disciplinary focus. The university's research endeavors are characterized by innovation, interdisciplinary collaboration, and commitment to excellence, making the University of Ibadan a significant hub for cutting-edge research in Nigeria and beyond. The goal of the current study is to ascertain the influence of the university's research output and publication patterns between 2014 and 2023. The study focuses on the departments at the University of Ibadan that contribute the most, the best journals for publishing, the nations that collaborate, the impact of citations both locally and globally, well-known authors and their total production, and the research output broken down by year. According to the university's ten-year publication data, 7159 papers with an h-index of 75 were published between 2014 and 2023, garnering 218572 citations. Furthermore, the VOSviewer software mapping approach is used to illustrate the scientographical mapping of data through graphs. The findings of this study will contribute to understanding the university's research strengths, weaknesses, and potential areas for improvement. Additionally, the results will inform evidence-based decision-making for enhancing research strategies and policies at the University of Ibadan.
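
The h-index reported above is the largest h such that h papers have at least h citations each. A minimal sketch of that standard definition follows; the citation counts in any call are illustrative inputs, not data from the study.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Largest h such that at least h papers have h or more citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break  # counts only decrease from here, so h cannot grow
    return h


def mean_citations(total_citations: int, papers: int) -> float:
    """Average citations per paper; returns 0.0 for an empty set."""
    return total_citations / papers if papers else 0.0
```

With the totals quoted in the abstract, `mean_citations(218572, 7159)` comes to roughly 30.5 citations per paper.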

2025-10-29T08:39:36Z 16 pages, 5 figures, Research Paper Nigerian Libraries; Volume 59, Issue 1; 2025 Muneer Ahmad Undie Felicia Nkatv 10.61955/HFYDJH http://arxiv.org/abs/2510.26824v1 LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature 2025-10-28T17:58:18Z

The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that are challenging for systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from materials science publications, covering text and figures. We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. The extraction quality is rigorously evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source software library designed to support community-driven extension to new corpora and synthesis domains. Altogether, this work provides an extensible infrastructure to transform unstructured literature into machine-readable information. This lays the groundwork for predictive modeling of synthesis procedures as well as modeling synthesis--structure--property relationships.

2025-10-28T17:58:18Z 29 pages, 13 figures, 6 tables Magdalena Lederbauer Siddharth Betala Xiyao Li Ayush Jain Amine Sehaba Georgia Channing Grégoire Germain Anamaria Leonescu Faris Flaifil Alfonso Amayuelas Alexandre Nozadze Stefan P. Schmid Mohd Zaki Sudheesh Kumar Ethirajan Elton Pan Mathilde Franckel Alexandre Duval N. M. Anoop Krishnan Samuel P. Gleason http://arxiv.org/abs/2506.03187v2 Comparing Retrieval Strategies to Capture Interdisciplinary Scientific Research: A Bibliometric Evaluation of the Integration of Neuroscience and Computer Science 2025-10-28T17:05:07Z

Interdisciplinary scientific research is increasingly important in knowledge production, funding policies, and academic discussions on scholarly communication. While many studies focus on interdisciplinary corpora defined a priori -- usually through keyword-based searches within assumed interdisciplinary domains -- few explore interdisciplinarity as an emergent intersection between two distinct fields. Thus, methodological proposals for building databases at the intersection of two fields of knowledge are scarce. The goal of this article is to develop and compare different strategies for defining an interdisciplinary corpus between two bodies of knowledge. As a case study, we focus on the intersection between neuroscience and computer science. To this end, we develop and compare four retrieval strategies, two of them based on keywords and two based on citation and reference patterns. Our results show that the reference-based strategy provides better retrieval, pseudorecall, and F1. While we focus on comparing strategies for the study of the intersection between the fields of neuroscience and computer science, this methodological reflection is applicable to a wide range of interdisciplinary domains.
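
Retrieval strategies such as those compared above are typically scored by checking the retrieved set against a reference set of known relevant items. The sketch below computes precision, recall, and F1 for one strategy. It assumes that "pseudorecall" means recall measured against such a reference sample rather than against all relevant documents; the function name and the set-based inputs are hypothetical.

```python
from typing import Dict, Set


def evaluate(retrieved: Set[str], reference: Set[str]) -> Dict[str, float]:
    """Precision, (pseudo)recall, and F1 of a retrieved set against a reference set."""
    hits = len(retrieved & reference)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "pseudorecall": recall, "f1": f1}
```

Running `evaluate` once per strategy on the same reference set gives directly comparable scores.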

2025-05-30T19:29:18Z Malena Mendez Isla Agustin Mauro Diego Kozlowski http://arxiv.org/abs/2510.24122v1 Comparing Disciplinary Classifications in SSH: Organizational, Channel-Based, and Text-Based Perspectives 2025-10-28T06:49:29Z

This study investigates how different approaches to disciplinary classification represent the Social Sciences and Humanities (SSH) in the Flemish VABB-SHW database. We compare organizational classification (based on author affiliation), channel-based cognitive classification (based on publication venues), and text-based publication-level classification (using channel titles, publication titles, and abstracts, depending on availability). The analysis shows that text-based classification generally aligns more closely with channel-based categories, confirming that the channel choice provides relevant information about publication content. At the same time, it is closer to organizational classification than channel-based categories are, suggesting that textual features capture author affiliations more directly than publishing channels do. Comparison across the three systems highlights cases of convergence and divergence, offering insights into how disciplines such as "Sociology" and "History" extend across fields, while "Law" remains more contained. Publication-level classification also clarifies the disciplinary profiles of multidisciplinary journals in the database, which in VABB-SHW show distinctive profiles with stronger emphases on SSH and health sciences. At the journal level, fewer than half of outlets with more than 50 publications have their channel-level classification fully or partially supported by more than 90% of publications. These results demonstrate the added value of text-based methods for validating classifications and for analysing disciplinary dynamics.

2025-10-28T06:49:29Z Submitted to Scientometrics Cristina Arhiliuc Raf Guns Tim C. E. Engels http://arxiv.org/abs/2510.23146v1 Fake scientific journals are here to stay 2025-10-27T09:22:54Z

Scientific publishing is facing an alarming proliferation of fraudulent practices that threaten the integrity of research communication. The production and dissemination of fake research have become a profitable business, undermining trust in scientific journals and distorting the evaluation processes that depend on them. This brief piece examines the problem of fake journals through a three-level typology. The first level concerns predatory journals, which prioritise financial gain over scholarly quality by charging authors publication fees while providing superficial or fabricated peer review. The second level analyses hijacked journals, in which counterfeit websites impersonate legitimate titles to deceive authors into submitting and paying for publication. The third level addresses hacked journals, where legitimate platforms are compromised through cyberattacks or internal manipulation, enabling the distortion of review and publication processes. Together, these forms of misconduct expose deep vulnerabilities in the scientific communication ecosystem, exacerbated by the pressure to publish and the marketisation of research outputs. The manuscript concludes that combating these practices requires structural reforms in scientific evaluation and governance. Only by reducing the incentives that sustain the business of fraudulent publishing can the scholarly community restore credibility and ensure that scientific communication fulfils the essential purpose of reliable advancement of knowledge.

2025-10-27T09:22:54Z 7 pages, 1 figure. Expanded version of blog post published in Spanish Enrique Orduña-Malea http://arxiv.org/abs/2508.13234v2 The Role of AI in Facilitating Interdisciplinary Collaboration: Evidence from AlphaFold 2025-10-27T07:32:43Z

The accelerating role of artificial intelligence (AI) in science is widely recognized, and many scholars have begun to explore its role in interdisciplinary collaboration. However, the mechanisms and extent of this impact are still unclear. This study, using AlphaFold's impact on structural biologists, examines how AI technologies influence interdisciplinary collaborative patterns. By analyzing 1,247 AlphaFold-related papers and 7,700 authors from Scopus, we employ bibliometric analysis and causal inference to compare interdisciplinary collaboration between AlphaFold adopters and non-adopters. Contrary to the widespread belief that AI facilitates interdisciplinary collaboration, our findings show that AlphaFold increased structural biology-computer science collaborations by just 0.48%, with no measurable effect on other disciplines. Specifically, AI creates interdisciplinary collaboration demands with specific disciplines due to its technical characteristics, but this demand is weakened by technological democratization and other factors. These findings demonstrate that AI alone has limited efficacy in bridging disciplinary divides or fostering meaningful interdisciplinary collaboration.

2025-08-18T00:31:03Z 29 pages, 2 figures Naixuan Zhao Chunli Wei Xinyan Zhang Jiang Li